When I first started my journey into data science in 2015, there were a lot of resources. Today, it’s downright overwhelming. It seems that everyone everywhere has written an article or a book, or is selling a course, telling you how to go from zero to data scientist in <insert ridiculously short time period here>. With that many options, it is easy to overthink the choice and end up in analysis paralysis instead of taking action. This is a feeling I still get because, even after doing data science for several years, I am still learning and developing my skillset. However, out of all of the MOOCs I’ve taken and books I’ve read, there is one resource that stands above the rest. The book “An Introduction to Statistical Learning” should be in every data scientist’s library.

What this book is…
Written by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, a team that includes Stanford professors and leaders in the field of penalized regression, this book provides a strong foundation in the core concepts of statistical learning along with programming examples in R. It strikes the perfect balance of mathematical formalism and intuition, leaving the reader with a real sense of understanding. This understanding can be put into practice through the guided programming labs and exercises at the end of each chapter. The book covers many of the major algorithms every data scientist should know, including several flavors of regression, classification, tree-based methods, resampling, and clustering.
The book also has a freely available PDF version, which makes it easy to look something up when you have a specific question or when you don’t have (or don’t want) a physical copy on hand. Even better, there are lectures by the authors themselves that walk through the chapters in the book and provide cogent examples of the various methods in practice. It’s worth noting that their dry humor and banter with each other on camera are the icing on the cake.
What this book isn’t…
While this book does have an introduction to R section, it only covers the basic syntax and data structures necessary to get started. It assumes you have already installed R and RStudio.
One thing that should be made clear about this book is that it does not cover the entire data science lifecycle. There is very little data exploration, wrangling, or model deployment within its pages. And for good reason: by covering only statistical modeling, the book does the topic justice, thoroughly expanding on the intricacies and intuition you need to use these methods successfully in practice.
Have you read this book? What are your thoughts? Are there any other “must-read” data science books?