Thursday, February 3, 2011

Positive Predictive Value as an Operating Characteristic

The Mayo and Howson papers I examined in my last few posts came out of a symposium at the 1996 meeting of the Philosophy of Science Association.  In this post, I turn my attention to the third paper that came out of that symposium, this one by Ronald Giere.

Giere takes a somewhat neutral, third-party stance on the debate between Mayo and Howson, although his sympathies seem to lie more with error statistics.  He contrasts Mayo and Howson’s views as follows: Howson attempts to offer a logic of scientific inference, analogous to deductive logic, whereas Mayo aims to describe scientific methods with desirable operating characteristics.

This distinction does seem to capture how Howson and Mayo think about what they are doing.  However, it does not make me any more sympathetic to error statistics, because it seems to me a mistake to try to separate method from logic.  The operating characteristics of a scientific method are desirable to the extent that they allow one to draw reliable inferences, and the extent to which they allow one to draw reliable inferences depends on logical considerations.

Nevertheless, the logic/method distinction is useful for understanding the perspective of frequentists such as Mayo.  In fact, one may be able to use the insight this distinction provides to recast the base-rate fallacy objection in a way that will strike closer to home for a frequentist.  The key is to present the objection in terms of the positive predictive value of a test and to argue that positive predictive value (PPV) is an operating characteristic on par with Type I and Type II error rates.  In fact, a test can have low rates of Type I and Type II error (low α and β), but still have low positive predictive value.  Consider the following (oversimplified) example:

Suppose that in a particular field of research, 9/10 of the null hypotheses tested are true.  For simplicity, I will assume that all of the tests in this field use the same α and β levels: the conventional α=.05, and the lousy but fairly common β=.5.  The following 2x2 table displays the most probable set of outcomes out of 1000 tests:

                 Test rejects H0     Test fails to reject H0
H0 is true              45                    855
H0 is false             50                     50


Intuitively, PPV is the probability that a positive result is genuine.  In more frequentist terms, it is the frequency of false nulls among cases in which the test rejects the null.  In this example, the tests reject 95 nulls, of which 50 are false, so PPV is 50/95 ≈ .53.  As the example shows, a test can have low α and β (desirable) without having high PPV, if the base rate of false nulls is sufficiently low.  (Thanks to Elizabeth Silver for providing me with this example.)
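For readers who want to check the arithmetic, here is a minimal sketch in Python.  The function names are my own and purely illustrative, and the counts are expected values under the simplifying assumptions of the example (every test shares the same α and β).

```python
def expected_table(n_tests, frac_true_nulls, alpha, beta):
    """Expected counts for the 2x2 table of test outcomes."""
    true_nulls = n_tests * frac_true_nulls
    false_nulls = n_tests - true_nulls
    return {
        "false_positives": alpha * true_nulls,        # H0 true, test rejects
        "true_negatives": (1 - alpha) * true_nulls,   # H0 true, test fails to reject
        "true_positives": (1 - beta) * false_nulls,   # H0 false, test rejects
        "false_negatives": beta * false_nulls,        # H0 false, test fails to reject
    }


def ppv(table):
    """Fraction of rejections in which the null really is false."""
    rejections = table["true_positives"] + table["false_positives"]
    return table["true_positives"] / rejections


# The example from the text: 1000 tests, 9/10 of nulls true, alpha = .05, beta = .5
table = expected_table(1000, 0.9, 0.05, 0.5)
print(round(ppv(table), 2))  # 0.53
```

Only 50 of the 95 expected rejections involve a false null, which is where the PPV of roughly .53 comes from.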

Superficially, this way of presenting the base-rate objection appears to be closer to the frequentist framework than Howson’s way of presenting it.  However, a frequentist might object that PPV is not an operating characteristic of a test in the same way that Type I and Type II error rates are.  Type I error rates, one might think, are genuine operating characteristics because they do not depend on any features of the subject matter to which the test is applied.  One simply stipulates a Type I error rate and chooses acceptance and rejection regions for one’s test statistic that yield that error rate.  By contrast, to calculate PPV one has to take into account the fraction of true nulls within the subject area in question.  Thus, PPV is not an intrinsic characteristic of a test, but an extrinsic feature of the test relative to a subject area.
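The dependence on subject area that this objection points to is easy to see numerically.  The following sketch (again with an illustrative function name of my own, and with base rates other than 1/10 chosen only for comparison) holds α and β fixed at the values from the example and varies the fraction of false nulls.

```python
def ppv_from_base_rate(frac_false_nulls, alpha=0.05, beta=0.5):
    """PPV for a field in which a given fraction of tested nulls is false."""
    true_pos = (1 - beta) * frac_false_nulls      # false nulls that get rejected
    false_pos = alpha * (1 - frac_false_nulls)    # true nulls that get rejected
    return true_pos / (true_pos + false_pos)


# Same alpha and beta throughout; only the base rate of false nulls changes.
for frac in (0.01, 0.1, 0.5):
    print(f"fraction of false nulls = {frac:.2f}: PPV = {ppv_from_base_rate(frac):.2f}")
```

With the same error rates, PPV comes out at about .09, .53, and .91 for these three base rates.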

This objection ignores the fact that Type I error rates are calculated on the basis of assumptions about the subject matter under test—most often, assumptions of normality.  As a result, Type I error rates are not intrinsic features of tests either, but of tests as applied to subject areas in which (typically) things are approximately normal.  Normality assumptions may be more widely applicable and more robust than assumptions about base rates, but they are nonetheless features of the subject matter rather than features of the test itself.  Type II error rates are even more obviously features of the test relative to a subject matter, because they are typically calculated for a particular alternative hypothesis that is taken to be plausible or relevant to the case at hand.

A frequentist could respond simply by conceding the point: PPV is an operating characteristic of a test that is relevant to whether one can conclude that the null is false on the basis of a positive result.  To do so, however, would be to abandon the severity requirement and to move closer to the Bayesian camp.

The example given above uses the same kind of reasoning that John Ioannidis uses in his paper “Why Most Published Research Findings Are False.”  It might be useful to move next to that paper and the responses it received.

Before moving on, I'd like to note a couple of other interesting moves Giere makes in his paper.  First, he characterizes Bayesianism and error statistics as extensions of the rival research programs that Carnap and Reichenbach were developing around 1950, but without those programs' foundationalist ambitions.  Second, Giere emphasizes a point that I think is very important: Bayesianism (as it is typically understood in the philosophy of science) is concerned with the probability that propositions are true.  It is not concerned (at least in the first instance) with how close to the truth any false propositions may be.  Yet, in many (if not all) scientific applications, the truth is not an attainable goal.  Even staunch scientific realists admit that our best scientific theories are very probably false.  Where they differ from anti-realists is that they claim that our best theories are close to and/or tending toward the truth.  One might think that the emphasis in real scientific cases on approximate truth rather than probable truth favors error statistics over Bayesianism.  However, when one moves into real scientific cases one should also move into real Bayesian methods, which include Bayesian methods of model building.  These methods are Bayesian in that they involve conditioning on priors, but they differ from the Bayesian methods that philosophers tend to focus on because they aim to produce models that are approximately true rather than models that have a high posterior probability.  Unlike Bayesian philosophers, perhaps, Bayesian statisticians have developed a variety of methods that can handle the notion of approximate truth just as well as error-statistical methods.
