The pro-estrogen consensus emerged as a result of numerous observational studies showing that in all kinds of populations on record, patients receiving estrogen therapy had, on average, fewer heart attacks. That is, estrogen therapy is negatively correlated with heart attacks. What nobody realized, or knew how to analyze properly, was that healthier women were more likely to receive estrogen in the first place. As a result, the observational studies amounted to a roundabout way of showing that healthy women are, well, healthy. Correlation, we are reminded, does not imply causation.
David Freedman, professor of statistics at UC Berkeley, who passed away in October, was dedicated to revealing the inadequacies of statistical methods for addressing problems of observational data. He taught that traditionally, causation is demonstrated by intervention. Take two identical objects, administer the treatment of interest to one but not to the other, and see what happens. When the objects involved are people, no two of whom are the same, randomized groups are used instead. Two random groups, if they are sufficiently large, are unlikely to have different average characteristics, and the causal analysis is carried out for the average member of each group.
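The difference between the two kinds of study can be sketched in a small simulation. Everything in it is a made-up assumption for illustration, not data from any real study: a hypothetical latent `health` score, a heart attack risk of `0.3 - 0.2 * health`, and a treatment that has no effect at all. The only thing that changes between the two runs is how women end up in the treated group.

```python
import random


def simulate(n, randomized, seed=0):
    """Return (heart attack rate among treated) - (rate among untreated)."""
    rng = random.Random(seed)
    counts = {True: 0, False: 0}
    attacks = {True: 0, False: 0}
    for _ in range(n):
        # Hypothetical latent health score in [0, 1); higher is healthier.
        health = rng.random()
        # Randomized trial: a fair coin decides who is treated.
        # Observational study: healthier women are more likely to be treated.
        p_treat = 0.5 if randomized else health
        treated = rng.random() < p_treat
        # Assumption: the treatment does nothing; risk depends on health alone.
        p_attack = 0.3 - 0.2 * health
        attack = rng.random() < p_attack
        counts[treated] += 1
        attacks[treated] += attack
    return attacks[True] / counts[True] - attacks[False] / counts[False]


observational_diff = simulate(200_000, randomized=False)
randomized_diff = simulate(200_000, randomized=True)
print(f"observational difference: {observational_diff:+.3f}")
print(f"randomized difference:    {randomized_diff:+.3f}")
```

Under these assumptions the observational comparison should show the treated group having clearly fewer heart attacks (a difference of roughly -0.07 in expectation), even though the treatment is inert by construction. The randomized comparison should land close to zero, because the coin flip leaves both groups with about the same average health.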
As Freedman liked to point out, observational studies assume that this randomization is carried out by Nature: The group getting estrogen therapy is assumed no different on average from the group not receiving it. Except in very limited circumstances, Nature does no such thing. Every semester, Freedman gave students in his graduate statistics course the same assignment: Choose any article containing some statistical model from a leading economics, social science, or political science journal and discuss whether its conclusions are tenable under a rigorous analysis of assumptions and statistical methods. Over some thirty years, fewer than ten articles held up under this scrutiny. It’s not that conclusive analysis of observational data is impossible, just rare.
In the 1880s, Robert Koch and Friedrich Loeffler were struggling to identify the microbes responsible for tuberculosis. Nobody had ever really made a scientific causal argument before, let alone one involving something nobody could see. They formulated Koch’s Postulates, published in 1890. Looking at the postulates now, one is particularly striking: “The micro-organism must be isolated from a diseased organism and grown in pure culture”. To infer that some mysterious invisible entity is causing a disease, you need to find it and grow it on its own until you can see it.
Such a requirement is an attempt to avoid the trickery of confounders: alternative explanations (like the socio-economic status of the women taking estrogen supplements) for the observed phenomenon. Koch was saying that to infer a causal relationship, you need to do more than theorize the relationship; you need to have a firm understanding of the mechanisms at work. Without the means to intervene in nature, the importance of such an understanding cannot be overemphasized. Nobody knows exactly how estrogen is involved in coronary health, so inferring some causal relationship was risky at best.
In honor of Professor Freedman, whose class I took last year, here he is in his own words:
Causal inferences can be drawn from non-experimental data. However, no mechanical rules can be laid down for the activity. Since Hume, that is almost a truism. Instead, causal inference seems to require an enormous investment of skill, intelligence, and hard work. Many convergent lines of evidence must be developed. Natural variation needs to be identified and exploited. Data must be collected. Confounders need to be considered. Alternative explanations have to be exhaustively tested. Before anything else, the right question needs to be framed. Naturally, there is a desire to substitute intellectual capital for labor. That is why investigators try to base causal inference on statistical models. The technology is relatively easy to use, and promises to open a wide variety of questions to the research effort. However, the appearance of methodological rigor can be deceptive. The models themselves demand critical scrutiny. Mathematical equations are used to adjust for confounding and other sources of bias. These equations may appear formidably precise, but they typically derive from many somewhat arbitrary choices. Which variables to enter in the regression? What functional form to use? What assumptions to make about parameters and error terms? These choices are seldom dictated either by data or prior scientific knowledge. That is why judgment is so critical, the opportunity for error so large, and the number of successful applications so limited.