Getting started in data science

2022 Update: I believe I wrote this post around 2013 (!) and, while many of my views remain the same, some of them have been updated as the field of data science has rapidly evolved. I will be updating this post in due course to reflect what I believe are contemporary best practices.

There’s no denying that ‘data scientist’ is a hot job title to have right now, and for good reason. It’s a tremendously fun and challenging field to be in, and despite all of the often undeserved hoopla that surrounds it, data scientists are doing some pretty amazing things. So it’s no surprise that many people are clamoring to find out how to become data scientists. Because I used to run a blog that taught some basic data science through sports analytics, I often get emails asking how one gets started in data science and how quickly one can learn the prerequisites for being a data scientist. Instead of replying to each of these individually, I thought I’d write my thoughts up here.

In short, there are lots of great, free resources out there for the motivated autodidact. I’ll list some of them here. The more nuanced take, though, is that I’m highly skeptical that many or even most people can ‘become’ a data scientist through MOOCs and tutorials. And certainly not quickly enough to be qualified to get a job as a data scientist before the data scientist salary market comes crashing back down to earth.

This isn’t me doing boundary work. I just don’t think that being a good data scientist means merely knowing some programming languages and some algorithms. To be fair, there are many well-established algorithms with great and increasingly user-friendly implementations in many programming languages. It’s incredibly easy to estimate a linear or logistic regression.

I’m skeptical because I believe in the science portion of our field’s name. One of the primary things that separates a data scientist from someone just building models is the ability to think carefully about things like endogeneity, causal inference, and experimental and quasi-experimental design. Data scientists must understand data generating processes and reason through how misspecifying them could influence or undermine the inferences they draw from their analyses.

It takes a long time and a lot of training for this to come naturally. I don’t think I gave much thought to selecting on the dependent variable and how endemic it is until I got to grad school. Now it sticks out like a sore thumb everywhere I look. Similarly, thinking carefully about outliers (extreme values) or about the process by which your data came to have missing values often gets swept aside by tutorials showing you how to use R. This isn’t to say you have to go to grad school (you probably shouldn’t) or even to college; it just means that data science is not simply a series of programs and tutorials that automatically make inferences from your data. Oftentimes, what isn’t in your data has significant implications for inference. Your software package isn’t going to tell you what they are.
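To make selecting on the dependent variable concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the true slope of 0.5, the cutoff of 1.0, and the helper names ols_slope and simulate_selection are invented for this example, not taken from any study or library. It fits the same regression twice, once on all of the data and once only on observations whose outcome is above the cutoff.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov_xy / var_x


def simulate_selection(n=20000, true_slope=0.5, cutoff=1.0, seed=0):
    """Compare the slope on the full sample with the slope on y > cutoff."""
    rng = random.Random(seed)
    # Hypothetical data generating process: y = true_slope * x + noise.
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0, 1) for x in xs]

    full_slope = ols_slope(xs, ys)

    # Keep only the cases with a high outcome, as if we had studied
    # nothing but the "successes".
    kept = [(x, y) for x, y in zip(xs, ys) if y > cutoff]
    selected_slope = ols_slope([x for x, _ in kept], [y for _, y in kept])
    return full_slope, selected_slope, len(kept)


full_slope, selected_slope, n_kept = simulate_selection()
print(f"full sample slope:     {full_slope:.3f}")
print(f"selected sample slope: {selected_slope:.3f} (n = {n_kept})")
```

Under these assumptions the estimate from the full sample lands close to the true slope, while the estimate from the selected sample is pulled toward zero. Nothing in the regression output warns you that this happened.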

All this being said, I do think we live in an extremely exciting time for democratizing education. I hope some good comes out of it. Enough doom and gloom; let’s get on to the links.

Math. There is no getting around it. You simply must study math and statistics. I use linear algebra in my work daily. If you have never taken calculus and can only take one math course right now, I highly recommend Gilbert Strang’s linear algebra course on MIT’s OpenCourseWare. edX is also currently offering an introductory linear algebra course that is well put together, but Strang is the gold standard. Of course, you should also have taken a basic multivariate calculus course if you want to read research papers that implement new algorithms, but I use pure calculus far less in my day-to-day work. I don’t recommend it, but many social scientists make it all the way through the PhD and into academic positions having never taken a calculus course.

Statistics. The vast majority of my job revolves around statistical inference. As mentioned above, linear regressions are incredibly simple to estimate, yet there are some core assumptions that, if not met, can render your results sketchy at best and completely invalid at worst. Training in statistics will teach you to know these assumptions, to understand what happens when they’re not met, and to know what to do about it. In fact, training in statistics usually takes the path of “here’s some simple linear models” for a course or two. Nearly every course after that is about estimating models when the assumptions of linear models are violated in one way or another: autocorrelation in time series data, non-independent observations due to time or spatial clustering, dependent variables that are counts with lots of zeros, and so on.
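As one hedged illustration of an unmet assumption, the sketch below simulates observations that share a cluster-level shock, so they are not independent. The setup (20 clusters of 25 observations, unit variances) and the function names are hypothetical choices made for this example. It compares the textbook standard error of the mean, which assumes independence, with how much the sample mean actually varies across repeated simulated samples.

```python
import random
import statistics


def clustered_sample(rng, n_clusters=20, cluster_size=25,
                     cluster_sd=1.0, noise_sd=1.0):
    """Draw observations that share a random shock within each cluster."""
    values = []
    for _ in range(n_clusters):
        shared = rng.gauss(0, cluster_sd)  # same for every unit in the cluster
        values.extend(shared + rng.gauss(0, noise_sd)
                      for _ in range(cluster_size))
    return values


def naive_standard_error(values):
    """Textbook standard error of the mean, valid only under independence."""
    return statistics.stdev(values) / len(values) ** 0.5


def compare_standard_errors(n_sims=300, seed=1):
    """Return (actual SD of the sample mean, average naive standard error)."""
    rng = random.Random(seed)
    means, naive_ses = [], []
    for _ in range(n_sims):
        sample = clustered_sample(rng)
        means.append(statistics.fmean(sample))
        naive_ses.append(naive_standard_error(sample))
    return statistics.stdev(means), statistics.fmean(naive_ses)


actual_sd, naive_se = compare_standard_errors()
print(f"actual spread of the sample mean: {actual_sd:.3f}")
print(f"average naive standard error:     {naive_se:.3f}")
```

With these made-up settings the naive standard error comes out several times too small, so confidence intervals built from it would be far too narrow. This is the kind of problem that models for clustered data are meant to address.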

Coursera recently introduced a data science track that is being taught by some absolute superstars in statistics. This is a great start. But make no mistake, it’s a start. I’d also recommend anything by Andrew Gelman, particularly his book Data Analysis Using Regression and Multilevel/Hierarchical Models. These will get you started thinking about linear models and generalizations of them. Gelman is also an expert at making sure you continually revisit your modeling assumptions and at clearly explaining why they’re important. Once you’re finished with these, you should focus on the next topic that’s most applicable to your work: time series analysis and forecasting, survival analysis, etc.

Experiments and causal inference. You should also be well-versed in thinking about research design. If you’re going to be in charge of your company’s split tests and experiments, you’ll want to master this stuff. Judea Pearl’s Causality is probably the most well-known and referenced work, but it’s not for beginners. You could do really well for yourself by starting with a basic research methods textbook, especially one from the social sciences, as they’re often concerned with doing experiments when you’re not in a laboratory setting. Designing Social Inquiry, referred to by many as “KKV”, is a really good starting point for some of this.

Machine learning. Machine learning and statistics have significant overlap, but while statistics is often concerned with precise and unbiased estimates of parameters, machine learning is usually focused on making accurate predictions on unseen data. Andrew Ng’s course runs routinely on Coursera and is a good first step. The barriers to entry in reading about machine learning are significantly higher than many other topics, as machine learning is an applied subfield of computer science. Since the math requirements for CS majors are often non-trivial, a working knowledge of multivariate calculus is often assumed. In fact, one of the fundamental estimation techniques used in many machine learning algorithms, stochastic gradient descent, assumes you know what a gradient is.
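To show what the gradient is doing in stochastic gradient descent, here is a minimal sketch for a one-variable linear regression with squared-error loss. The simulated data (true slope 2, intercept 1), the learning rate, and the function names make_data and sgd_linear_regression are assumptions chosen for this example; real implementations add refinements such as mini-batches and learning-rate schedules.

```python
import random


def make_data(n=200, true_slope=2.0, true_intercept=1.0, seed=0):
    """Simulate a noisy line; the parameters are hypothetical."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [true_slope * x + true_intercept + rng.gauss(0, 0.1) for x in xs]
    return xs, ys


def sgd_linear_regression(xs, ys, learning_rate=0.01, epochs=50, seed=0):
    """Fit y = slope * x + intercept by stochastic gradient descent."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit observations in a random order
        for i in indices:
            error = slope * xs[i] + intercept - ys[i]
            # Gradient of the loss 0.5 * error**2 for this one observation,
            # taken with respect to each parameter.
            grad_slope = error * xs[i]
            grad_intercept = error
            # Step a small distance against the gradient.
            slope -= learning_rate * grad_slope
            intercept -= learning_rate * grad_intercept
    return slope, intercept


xs, ys = make_data()
slope, intercept = sgd_linear_regression(xs, ys)
print(f"estimated slope: {slope:.2f}, estimated intercept: {intercept:.2f}")
```

Each update nudges the parameters a small step against the gradient of the loss for a single observation, which is why the technique presupposes that you know what a gradient is.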

For a more applied, less math-heavy introduction to the concepts in R, I highly recommend Machine Learning for Hackers by Drew Conway and John Myles White. Full disclosure, they are my friends, but don’t hold that against them. Machine Learning in Action is also good, for those that prefer Python.

Tooling. Ah yes, R vs. Python. Julia. Scala. Clojure. Java. C! There are so many languages out there. Should you learn to program? Yes. Do you need to be a master of a language out of the gate? No. In fact, you can do a tremendous amount in Excel, though I wouldn’t recommend it. John Foreman’s book on doing data science in Excel has some of the most entertaining and enlightening tutorials I’ve ever read.

It doesn’t matter what language you learn first. I’ll repeat that for emphasis and dramatic Fight Club effect. It doesn’t matter what language you learn first. Pick a language and learn it. Write bad code that breaks. Just learn it. Any language can do all of the things that you’ll need as a beginner. By the time you figure out what your language is bad at or can’t do, you’ll already know enough about programming and the available languages to know which one you need to learn next to solve your problem. That being said, do I think it’s a GOOD idea to pick JavaScript or C++ as a first language to do interactive data analysis? No. R and Python are popular for a reason. Programmers and data scientists are a fad-driven bunch, and new programming languages come into vogue and disappear quickly. There’s a reason why, when Apple announced the new Swift programming language, people were joking about receiving recruiter emails requiring five years of Swift experience.

Finally, once you’re well on your way to becoming an expert in all of these different areas, you’ll want to get a job. DANGER! You need to be very careful about finding a job as a data scientist. The same buzz and hype that probably got your attention is getting the attention of recruiters and hiring managers everywhere. “WE NEED A DATA SCIENTIST!” is ringing out from human resources departments across the world. But you need to find an organization that can see beyond the hype, understands what is and isn’t possible for a data scientist to do, and will value your input as well as your caution. Beware job listings that read like they’re copied and pasted from Hacker News.

Whew. I think that scratches the surface. I hope you’ve found this helpful.