Read any business book blurb, walk through any airport bookstore, visit any popular business blog — bonus points if the blog focuses on mental models or becoming a clearer, more rigorous thinker — and you will find that the lessons on offer are frequently united in committing a common error: selecting on the dependent variable. I’m firmly convinced that if this concept were taught more effectively (or had a pithier name), we’d see large portions of the popular non-fiction industry crumble.
What does it mean? Technically speaking, selecting on the dependent variable occurs when you are trying to answer a question and examine (“select”) a group of observations (people, companies, etc.) only if they exhibit the outcome (the “dependent variable”) that you are interested in explaining. Let’s walk through that with a bit of a clichéd example.
Suppose you’re interested in learning how successful entrepreneurs spend their mornings, as you want to structure your morning similarly to put yourself on the path to being a successful entrepreneur. You collect a bunch of biographies of the most productive and inspiring entrepreneurs, take notes on all of their morning routines, and compile a list of commonalities. To your surprise, 75% of the most successful entrepreneurs in your study woke up very early and engaged in some form of contemplative practice first thing in the morning. Aha! A less savvy person would think they’ve identified the hack, dutifully set their alarm for 4:30 the next morning, and download a copy of Headspace so it’s ready to go when they wake up.
You, having read your share of internet arguments, know that it’s not so straightforward. “Correlation isn’t causation”[1], you assert confidently, and disregard the advice, going back to sleeping in and scrolling social media as soon as you wake up. (Admittedly, this is a bit too obvious of an example, given the completely justified ribbing that “5 things the most successful people do in the morning” listicles have received over the years.)
“Correlation isn’t causation” is of course a true statement, but it’s not even wrong in this scenario. To see why, it helps to pin down a few terms:
- The dependent variable is the outcome that ostensibly depends on the intervention you’re interested in; that intervention is also called the independent variable[2]. Your dependent variable is becoming a successful entrepreneur.
- The admittedly poorly specified independent variable is the morning routine.
- The question you’re trying to answer with your research is whether having a certain kind of morning routine increases the chances of being a successful entrepreneur. While it’s true that this is fundamentally a causal question, you’re willing to accept a strong correlation as evidence that suggests or implies a causal relationship or, hey, at least it can’t hurt, right?
- Without diving into formulas or formal definitions, a correlation can be broadly thought of as a measure of how often two things vary together (i.e., whether they covary) and whether they do so in the same direction.
And now the problem with selecting on the dependent variable comes into view: we need the two things to vary in order to know whether they vary together (it’s right there in the name “variable”). If one or both of the things don’t vary, then we can’t answer the question of whether they covary!
To make this clearer, let’s think about it in an even simpler format, where our variables can only take on two values each: entrepreneurs can be successful or unsuccessful, and morning routines can involve waking up early or not. That allows us to create a trusty 2x2 matrix.
Because our dependent variable never takes on the “unsuccessful” value, we can’t say anything about the relationship between the two variables. It’s not the case that we can’t make any causal claims because we only have correlational data — we can’t make any correlational claims either! The most we can say in this case is that 75% of successful entrepreneurs wake up early.
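To make the empty column concrete, here is a minimal sketch in Python. It assumes a hypothetical sample of 100 biographies (only the 75% figure comes from the example above) and uses the phi coefficient, a standard correlation measure for two yes/no variables. The function name and the cell labels are illustrative choices, not anything from the original study.

```python
from math import sqrt

def phi_coefficient(a, b, c, d):
    """Correlation between two yes/no variables, from 2x2 cell counts.

    a: early riser, successful       b: early riser, unsuccessful
    c: not early riser, successful   d: not early riser, unsuccessful
    Returns None when either variable never varies (a row or column total is 0).
    """
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return None  # one variable is constant, so the correlation is undefined
    return (a * d - b * c) / sqrt(denom)

# Hypothetical sample: 100 biographies, all of successful entrepreneurs,
# so the "unsuccessful" column is empty.
selected_sample = phi_coefficient(a=75, b=0, c=25, d=0)
print(selected_sample)
```

The function returns `None` here rather than a number: with nobody in the unsuccessful column, the denominator is zero and there is nothing to compute, which is the point of the paragraph above.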
Ahh, you say, but surely that level of commonality amongst successful entrepreneurs is enough. Such a strong trend is undeniable and you’re just letting imperfect data get in the way of progress, you contend. Fine. Let’s go to the Library of Unsuccessful Entrepreneurs and collect more data to answer the question once and for all.
Uh oh: unsuccessful entrepreneurs wake up early even more often than successful entrepreneurs. And, it turns out, there’s not even a correlation between waking up early and success. The best part about this example is that we could have put any possible values in the unsuccessful column and reached all possible conclusions. The point is that knowing the values of the independent variable after selecting on the dependent variable carries no information about the values of the independent variable if you had not selected on the dependent variable.
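The claim that any values in the unsuccessful column are possible can be sketched in a few lines of Python. The counts below are invented purely for illustration: the successful column keeps the 75/25 split from the example, and the three unsuccessful columns are hypothetical alternatives, each out of 100 people.

```python
from math import sqrt

def phi(a, b, c, d):
    """Phi coefficient for a 2x2 table.

    a: early & successful       b: early & unsuccessful
    c: not early & successful   d: not early & unsuccessful
    """
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return None  # a variable that never varies has no correlation
    return (a * d - b * c) / sqrt(denom)

# What we observed after selecting on success (out of 100 biographies).
successful_early, successful_late = 75, 25

# Three hypothetical worlds for the column we never looked at.
results = {}
for unsuccessful_early in (50, 75, 90):
    unsuccessful_late = 100 - unsuccessful_early
    results[unsuccessful_early] = phi(
        successful_early, unsuccessful_early,
        successful_late, unsuccessful_late,
    )

for early_count, r in results.items():
    print(f"{early_count} of 100 unsuccessful wake early -> phi = {r:.2f}")
```

Under these made-up numbers the same 75% figure is compatible with a positive correlation, no correlation at all, or a negative one, depending entirely on the column that was never collected.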
The important takeaway from all of this is that the standard rebuttal to demands for rigor in reasoning is to not let the perfect be the enemy of the good, relax the rigor, and arrive at a finding anyway — hopefully with the appropriate caveats and uncertainty. This is a good rebuttal and a reasonable course of action! But in some cases, such as this one, it confuses asking for the minimum level of rigor required with the maximum level of rigor required. It’s not that you can arrive at a watered-down finding, it’s that you can’t arrive at a finding at all.
Once you learn how to spot selection on the dependent variable, I promise you’ll see it everywhere[3]. Even some of the most well-respected and intelligent thinkers you’ll encounter fall into this trap sometimes; it’s a very tempting thing to do! And, quite honestly, for many questions we want to answer, it can be difficult or impossible to observe some of the values. After all, you don’t get to read a lot of case studies or biographies of unsuccessful entrepreneurs and, if you do, they’re unlikely to be representative of the broader category of unsuccessful entrepreneurs merely due to the fact that they’re being written about.
Now, you might still not be convinced. You might be arguing that I’m using an unfair “academic” standard. These aren’t life-and-death questions, and we don’t need to be 100% certain or worry about statistical significance or anything; we’re just hoping to be directionally correct. And, you rightly point out, if you end up waking up early and meditating and it has no effect on success, no harm done. You’re probably right (unless you’re chronically sleep-deprived and make poor decisions as a result). However, the point isn’t that this is some idealistic, laboratory-grade standard we’re applying. It’s simply that without seeing all the variables vary, we can’t know if they covary. “Come on,” you say, “75% of these business leaders all do the same thing! You’re ignoring evidence right in front of your face!”
Unfortunately, when something is unknown or unknowable, it’s a ripe target for confirmation bias and for layering our own intuitions on with false certainty. The reason that drawing conclusions from analyses that select on the dependent variable is so seductive is that we can confidently draw our preferred conclusions without being able to be proven wrong.
[1] Despite being the rebuttal most frequently trotted out to win arguments, “correlation isn’t causation” is perhaps an ideal-typical truism: a statement that is true on its face but that adds nothing to the discussion. Entire swaths of scientific literature frame correlational arguments as if they are causal, and correlations are frequently discussed as if they are causal, even with the appropriate hedging that they may not be. The rise in popularity of causal inference over the past ~10 years is a partial response to these errors, but we have a long way to go.
[2] Like many scientific concepts, the name could be better, but this is a strict improvement over some other naming fails such as Type I and Type II errors, System 1 and System 2, and specificity and sensitivity.