John Ioannidis is a “meta-researcher” who has written several well-known papers about the reliability of research that uses frequentist hypothesis testing. (The Atlantic published a reasonably good article about Ioannidis and his work in November 2010.) One of his most-cited papers, called “Why Most Published Research Findings Are False,” presents a more general version of the example I presented in my last post. Steven Goodman and Sander Greenland wrote a response arguing that Ioannidis’ analysis is overly pessimistic because it does not take into account the observed significance value of research findings. Ioannidis then responded to defend his original position. I’m planning to work through this exchange, with an eye toward the question whether an argument like Ioannidis’ could be used to present the base-rate fallacy objection to error statistics in a way that is more consistent with frequentist scruples than Howson’s original presentation.
Ioannidis’ argument generalizes the example I gave in my previous post: it uses the same kind of reasoning, but with variables instead of constants. Ioannidis idealizes science as consisting of well-defined fields i=1, …, n, each with a characteristic ratio Ri of “true relationships” (for which the null hypothesis is false) to “no relationships” (the null is true) among those relationships it investigates, and with characteristic Type I and Type II error rates αi and βi. Under these conditions, he shows, the probability that a statistically significant result in field i reflects a true relationship is (1 − βi)Ri/(Ri − βiRi + αi).
It’s somewhat difficult to see where this expression comes from in Ioannidis’ presentation. This blog post by Alex Tabarrok presents Ioannidis’ argument in a way that’s easier to follow, including the following diagram:
Here’s the idea. You start with, say, 1000 hypotheses in a given field, at the top of the diagram. The ratio of true hypotheses to false hypotheses in the field is R, so a simple algebraic manipulation (omitting subscripts) shows that the fraction of all hypotheses that are true is R/(1+R), while the fraction that are false is 1/(1+R). That brings us to the second row from the top in the diagram: if R is, say, ¼ (so that there is one true hypothesis investigated for every four false hypotheses investigated), then on average 200 out of 1000 hypotheses will be true and 800 false. Of the 200 true hypotheses, some will generate statistically significant results, while others will not. The probability that an investigation of a true relationship in this field yields a statistically significant result is, by hypothesis, the power 1 − β. If 1 − β is, say, .6, then on average there will be 120 positive results out of the 200 true hypotheses investigated. Similarly, some of the 800 false hypotheses will generate statistically significant results; the probability that any one of them will do so is α. Letting α = .05, then, on average there will be 40 positive results among the 800 false hypotheses investigated. Thus, the process of hypothesis testing in this field yields on average 40 false positives for every 120 true positives, giving it a PPV of 120/160 = .75. Running through the example with variables instead of numbers, one arrives at a PPV of (1 − β)R/(R − βR + α). This result implies that PPV is greater than .5 if and only if (1 − β)R > α.
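The arithmetic in the diagram is easy to check in a few lines of code. Here is my own sketch (the function name and defaults are illustrative, not Ioannidis’):

```python
def ppv(R, alpha, power, n_hypotheses=1000):
    """Positive predictive value of a significant result in an idealized
    field with true/false ratio R, significance level alpha, and power 1 - beta."""
    n_true = n_hypotheses * R / (1 + R)      # e.g. 200 of 1000 when R = 1/4
    n_false = n_hypotheses - n_true          # e.g. 800 of 1000
    true_positives = power * n_true          # e.g. .6 * 200 = 120
    false_positives = alpha * n_false        # e.g. .05 * 800 = 40
    return true_positives / (true_positives + false_positives)

print(ppv(R=0.25, alpha=0.05, power=0.6))   # 0.75, matching the diagram
```

Note that the choice of 1000 hypotheses cancels out: the result agrees with the closed form (1 − β)R/(R − βR + α) for any starting number.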
Ioannidis goes on to model the effects of bias on PPV and to develop a number of “corollaries” about factors that affect the probability that a given research finding is true (e.g. “The hotter a scientific field… the less likely the research findings are to be true”). He argues that for most study designs in most scientific fields, the PPV of a published positive result is less than .5. He then makes some suggestions for raising this value. These elaborations are quite interesting, but for the moment I would like to slow down and examine the idealizations in Ioannidis’ argument.
The idea that science consists of well-defined “fields” with uniform Type I and Type II error rates across all experiments is certainly an idealization, but a benign one so far as I can tell. The assumption that each field has an (often rather small) characteristic ratio R of “true relationships” to “no relationships” is more problematic. First, what do we mean by a “true relationship?” One of the most basic kinds of relationship that researchers investigate is simple probabilistic dependence: there is a “true relationship” of this kind between X and Y if and only if P(X & Y) ≠ P(X)*P(Y). However, if probabilistic dependence is representative of the relationships that scientists investigate, then one might reasonably claim that the ratio of “true relationships” to “no relationships” in a given field is always quite high, because nearly everything is probabilistically relevant to nearly everything else, if only very slightly. In fact, a common objection to null hypothesis testing is that (point) null hypotheses are essentially always false, so that testing them serves no useful purpose.
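To make the “point nulls are essentially always false” worry concrete, here is a rough back-of-the-envelope sketch of my own (not from Ioannidis): even a one-in-a-thousand deviation from a point null is certain to be flagged as significant once the sample is large enough.

```python
import math

def n_to_detect(p_true, p0=0.5, z_crit=1.96):
    """Approximate sample size at which a one-proportion z-test reliably
    rejects H0: p = p0 when the true proportion is p_true."""
    se_unit = math.sqrt(p0 * (1 - p0))   # per-observation SD under the null
    return math.ceil((z_crit * se_unit / abs(p_true - p0)) ** 2)

# A deviation of one part in a thousand is invisible in small samples but
# certain to reach significance with enough data (on the order of 10**6 here):
print(n_to_detect(0.501))
```

The point is not that such tiny dependencies matter scientifically, but that a test of the exact point null will eventually reject it almost regardless of the substantive question.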
One could avoid this objection by replacing “no relationship” with, say, “negligible relationship” and “true relationship” with “non-negligible relationship.” However, for the argument to go through one would then have to reconceive of α as the probability of rejecting the null given that the true discrepancy from the null is negligible, and of β as the probability of failing to reject the null given that the true discrepancy from the null is non-negligible. Fortunately, these probabilities would generally be nearly the same as the nominal α and β.
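As a quick sanity check on that last claim (my own sketch, not from the exchange): one can compute the probability that a two-sided z-test rejects when the true discrepancy is negligible but nonzero, and for moderate sample sizes it barely exceeds the nominal α.

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def reject_prob(delta, sigma, n, z_crit=1.96):
    """P(|Z| > z_crit) for a two-sided z-test of H0: mu = 0
    when the true mean is delta and observations have SD sigma."""
    shift = delta * math.sqrt(n) / sigma   # noncentrality of the test statistic
    return (1 - normal_cdf(z_crit - shift)) + normal_cdf(-z_crit - shift)

# With a truly negligible discrepancy (delta = .01 SD, n = 100), the
# rejection rate is only slightly above the nominal .05:
print(round(reject_prob(delta=0.01, sigma=1.0, n=100), 4))
```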
The assumption that each field has a characteristic R is more problematic than the assumption that each field has a characteristic α and β for a second reason as well: ascribing a value for R to a test requires choosing a reference class for that test. The assumption that each field has a characteristic α and β is simply a computational convenience; α and β are defined for a particular test even though this assumption is false. By contrast, the assumption that each field has a characteristic R is more than a convenience: something like it must be at least approximately true for R to be well defined in a particular case. This point seems to me the Achilles’ heel of Ioannidis’ argument, and of attempts to persuade frequentists to treat PPV as a test operating characteristic on par with α and β. A frequentist could reasonably object that there is no principled basis for choosing a particular reference class to use in a particular case in order to estimate R. And even with a particular reference class, there are significant challenges to obtaining a reasonable estimate for R.