Friday, May 20, 2011

Evans et al. Proof Part 3: The Proof with Cross-Embedding

In my previous post I gave a few examples of cross-embedded models with multiple ancillaries.  In this post, I’ll present the Evans et al. proof that (C) entails (L), which involves constructing a cross-embedded model.

Evans et al. start with an arbitrary experimental outcome (E,x0) and construct by stipulation a hypothetical outcome of a hypothetical Bernoulli experiment (B,h) that has the same likelihood function.  They then build a cross-embedded experiment out of E and B.  To preview where this is going: they invoke (C) twice to establish that the outcome (x0,h) of the cross-embedded experiment has the same evidential meaning as (E,x0) and as (B,h), and thus that (E,x0) and (B,h) have the same evidential meaning as one another.  They then repeat the process with an arbitrary experimental outcome (E’,y0) that has the same likelihood function as (E,x0) to establish that (E’,y0) has the same evidential meaning as (B,h), and thus that it has the same evidential meaning as (E,x0).  The likelihood principle follows immediately.

Let f(x;θ) be the likelihood function for experiment E at sample point x.  Evans et al. construct the following cross-embedded experiment:

V\X   x0             x1          ...   xi         ...
h     ½f(x0;θ)       ½f(x1;θ)    ...   ½f(xi;θ)   ...
t     ½ - ½f(x0;θ)   ½f(x0;θ)    ...   0          ...

The indicator variable for h is ancillary: Pr(V=h) = Σi ½f(xi;θ) = ½, so its distribution is Bernoulli(1/2) independent of θ.  The indicator variable for X=x0 (as opposed to X itself) is likewise ancillary: Pr(X=x0) = ½f(x0;θ) + ½ - ½f(x0;θ) = ½, again Bernoulli(1/2) independent of θ.  Thus, (C) says that one can condition on either of these variables without changing evidential meaning.

Call this hypothetical cross-embedded experiment E*.  By (C) and the ancillarity of the indicator for h, Ev(E*,(h,x0))=Ev(E,x0).  By (C) and the ancillarity of the indicator for x0, Ev(E*,(h,x0))=Ev(B,h).
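
Here is a quick numeric check of the two ancillarity claims, using a made-up likelihood f on a three-point sample space (the particular f below is purely illustrative, not part of the Evans et al. construction):

```python
# The cross-embedded experiment E*: row "h" carries (1/2)f(x; theta);
# row "t" puts 1/2 - (1/2)f(x0; theta) on x0, (1/2)f(x0; theta) on x1,
# and 0 elsewhere.  Both indicator variables come out Bernoulli(1/2).

def f(x, theta):
    """An illustrative sampling distribution on {0, 1, 2}, theta in (0, 1)."""
    probs = [theta / 2, 1 - theta / 2 - theta / 4, theta / 4]
    return probs[x]

def joint(v, x, theta):
    if v == "h":
        return 0.5 * f(x, theta)
    # v == "t": leftover mass on x0, balancing mass on x1, 0 elsewhere
    if x == 0:
        return 0.5 - 0.5 * f(0, theta)
    if x == 1:
        return 0.5 * f(0, theta)
    return 0.0

for theta in (0.1, 0.5, 0.9):
    total = sum(joint(v, x, theta) for v in "ht" for x in range(3))
    p_h = sum(joint("h", x, theta) for x in range(3))   # indicator for h
    p_x0 = sum(joint(v, 0, theta) for v in "ht")        # indicator for X = x0
    assert abs(total - 1.0) < 1e-12
    assert abs(p_h - 0.5) < 1e-12    # Bernoulli(1/2), free of theta
    assert abs(p_x0 - 0.5) < 1e-12   # Bernoulli(1/2), free of theta
```

Whatever f one starts from, the same bookkeeping goes through, which is the point of the construction.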

By the same series of steps, replacing x’s with y’s and Es with E’s,  one can show that Ev(E’,y0)=Ev(B,h) for any (E’,y0) such that y0 has the same likelihood function in E’ as x0 has in E.  The likelihood principle follows immediately.

Evans et al. Proof Part 2: Cross Embedding


The key step in the Evans et al. proof (discussed in the previous post) is to construct a cross-embedded experiment with two ancillary statistics.  In this post I explain how that construction works. 

Let’s start with a simple example (from Evans et al. 1985, p. 3) to illustrate the fact that an experiment can have two ancillary statistics.  Consider an experiment whose sampling distribution is represented by the 2x2 table below:
y\x   1     0
1     θ     1-θ
0     1-θ   θ

relative to the normalizing constant 2.  The variable y is ancillary with respect to x: the unconditional probability distribution of y is Bernoulli with Pr(y=1)=1/2 independent of θ.  Conditioning on y changes the probability distribution of x from Bernoulli with Pr(x=1)=1/2 to Bernoulli with either Pr(x=1|y=1)=θ or Pr(x=1|y=0)=1-θ depending on the value of y.  The 2x2 table is symmetric with respect to interchange between x and y, so exactly the same holds true interchanging x and y.  Thus, y is ancillary with respect to x, and x is ancillary with respect to y.
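
A short numeric check of the double ancillarity (and the conditional Bernoulli(θ) experiment) in this 2x2 table:

```python
# Cells of the 2x2 table (theta on the diagonal, 1 - theta off it),
# relative to the normalizing constant 2.
def cell(y, x, theta):
    return (theta if x == y else 1 - theta) / 2.0

for theta in (0.2, 0.5, 0.8):
    p_y1 = cell(1, 1, theta) + cell(1, 0, theta)   # marginal of y
    p_x1 = cell(1, 1, theta) + cell(0, 1, theta)   # marginal of x
    assert abs(p_y1 - 0.5) < 1e-12                 # Bernoulli(1/2), free of theta
    assert abs(p_x1 - 0.5) < 1e-12                 # Bernoulli(1/2), free of theta
    # conditioning on y = 1 turns x into a Bernoulli(theta) experiment
    assert abs(cell(1, 1, theta) / p_y1 - theta) < 1e-12
```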

Note that this experiment is mathematically equivalent to a mixture experiment that involves first observing x to decide whether to perform the experiment represented by column 1 (renormalized) or the experiment represented by column 2 (renormalized) and then performing that experiment.  It is also mathematically equivalent to a mixture experiment that involves first observing y to decide whether to perform the experiment represented by row 1 (renormalized) or the experiment represented by row 2 (renormalized) and then performing that experiment. However, a physical instantiation of this experiment can’t actually be both types of mixture experiment simultaneously.  If one were to restrict (C) so that it applied only to genuine mixture experiments and not to experiments that are mathematically equivalent to mixture experiments, the Evans et al. proof would not succeed.  Whether this restriction can be made precise and defended remains to be seen.

After constructing the above example, Evans et al. consider scaling down the probabilities in the table and reallocating the deleted probability in a θ-free way.  Such a modification does not affect the ancillarity of x and y.  For instance, consider multiplying each cell probability by 2c/(1+c), where 0≤c≤1, and then inserting the deleted 1-2c/(1+c) probability mass into the lower left cell.  The following table results:
y\x   1      0
1     cθ     c-cθ
0     1-cθ   cθ

relative to the normalizing constant 1+c.  Again, the variable y is ancillary with respect to x and vice versa, because each has an unconditional distribution that is independent of θ: y is Bernoulli with Pr(y=1)=c/(1+c), and x is Bernoulli with Pr(x=1)=1/(1+c). 
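
The marginals and conditionals stated above can be checked numerically (the cell values below follow the scaling-and-reallocation recipe just described):

```python
# Scaled-down table relative to the normalizing constant 1 + c:
# row y=1: c*theta, c - c*theta; row y=0: 1 - c*theta, c*theta.
def cell(y, x, c, theta):
    table = {(1, 1): c * theta, (1, 0): c - c * theta,
             (0, 1): 1 - c * theta, (0, 0): c * theta}
    return table[(y, x)] / (1 + c)

for c in (0.1, 0.5, 1.0):
    for theta in (0.2, 0.7):
        p_y1 = cell(1, 1, c, theta) + cell(1, 0, c, theta)
        p_x1 = cell(1, 1, c, theta) + cell(0, 1, c, theta)
        assert abs(p_y1 - c / (1 + c)) < 1e-12   # marginal of y, free of theta
        assert abs(p_x1 - 1 / (1 + c)) < 1e-12   # marginal of x, free of theta
        # conditional experiments reached from the (1, 1) cell:
        assert abs(cell(1, 1, c, theta) / p_y1 - theta) < 1e-12      # Bernoulli(theta)
        assert abs(cell(1, 1, c, theta) / p_x1 - c * theta) < 1e-12  # Bernoulli(c*theta)
```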

Consider the result (x,y)=(1,1) from the experiment above.  According to (C), this result is evidentially equivalent to the result x=1 from the conditional experiment given by y=1, which is Bernoulli(θ), and to the result y=1 from the conditional experiment given by x=1, which is Bernoulli(cθ) for arbitrary 0≤c≤1. 
I’m not sure how to interpret the consequence that outcome x=1 from Bernoulli(θ) is evidentially equivalent to outcome y=1 from Bernoulli(cθ).  I suppose it would apply if one had a choice between flipping Coin A and Coin B, and all one knew about the biases of the two coins was that the bias of Coin B was some particular fraction c of that of Coin A.  The result purports to show that a flip of Coin A that lands heads tells you the same thing about the bias of Coin A as a flip of Coin B that lands heads.  Intuitively, this result strikes me as wrong.  Make the fraction very, very small: suppose we know that the bias of Coin B is one-trillionth that of Coin A.  It seems that a head on a flip of Coin B would provide very strong evidence that the bias of Coin A is very close to one, while a head on Coin A would not be so telling.  If I’m interpreting correctly the claim that outcome x=1 from Bernoulli(θ) is evidentially equivalent to outcome y=1 from Bernoulli(cθ), and my intuitions in this case are sound, then this example could provide an argument against (C).  Unfortunately, Evans et al. seem to have discussed this result only in an unpublished manuscript.

Evans et al. next move beyond the binary case to construct a more general “discrete embedding model.”  They start with an experiment with probability function f(x; θ).  They then “cross” f(x; θ) with Bernoullis, using a function g with g(x) ≥ f(x; θ) for all θ (so that every entry below is non-negative), to yield the following joint distribution:

y\x   1              2              ...
1     f(1;θ)         f(2;θ)         ...
0     g(1)-f(1;θ)    g(2)-f(2;θ)    ...
relative to the normalizing constant G=Σg(x).  Again, x and y are mutually ancillary: x has unconditional probability distribution Pr(x=i)=g(i)/G, while y has unconditional distribution Bernoulli(1/G) (because Σx f(x;θ)=1), both independent of θ.
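
A numeric check of the discrete embedding model, with an illustrative f and g on a three-point sample space (any g dominating f for all θ would do):

```python
# Discrete embedding model: row y=1 carries f(x; theta), row y=0 carries
# g(x) - f(x; theta), all relative to G = sum of g(x).
def f(x, theta):
    probs = {1: theta / 2, 2: 0.5, 3: 0.5 - theta / 2}
    return probs[x]

g = {1: 1.0, 2: 1.0, 3: 1.0}   # any g with g(x) >= sup over theta of f(x; theta)
G = sum(g.values())

def cell(y, x, theta):
    return (f(x, theta) if y == 1 else g[x] - f(x, theta)) / G

for theta in (0.1, 0.6, 0.9):
    for x in (1, 2, 3):
        p_x = cell(1, x, theta) + cell(0, x, theta)
        assert abs(p_x - g[x] / G) < 1e-12        # marginal of x: g(x)/G, free of theta
    p_y1 = sum(cell(1, x, theta) for x in (1, 2, 3))
    assert abs(p_y1 - 1 / G) < 1e-12              # marginal of y: Bernoulli(1/G)
```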

Evans et al. also consider a continuous embedding model, but that need not concern us here; continuous models are an idealization, so issues that arise only for continuous models are not relevant for practice.

Evans et al. develop a modified version of the discrete embedding model to prove that (C) entails (L).  In my next post, I will discuss that model and the Evans et al. proof.

Thursday, May 19, 2011

Evans et al. Proof Part 1: Overview


Birnbaum proved in 1962 that (S) and (C) entail (L), and in 1964 that (M) and (C) entail (L), where (M) is strictly weaker than (S).  Evans et al. proved in 1986 that (C) alone entails (L).  In this post I’ll explain the Evans et al. proof in a rough, qualitative way.

The clearest formulation of (C) for present purposes is the one Birnbaum provides in his (1972):
                Conditionality (C): If h(x) is an ancillary statistic, then Ev(E,x)=Ev(Eh,x), where h=h(x).
(See the previous post for an explication of the term “ancillary statistic.”)

The Evans et al. proof starts with an arbitrary experimental outcome (E1,x*).  It then constructs a hypothetical outcome (B,h) that has the same likelihood function.  The next step is to construct a cross-embedded experiment that I will call CE(E1,B).  This cross-embedded experiment has two maximal ancillary statistics: the variable of which x* is an instance is ancillary with respect to the variable of which h is an instance, and vice versa.  Thus, it is possible to apply (C) to CE(E1,B) twice to establish that (CE(E1,B),(x*,h)) is evidentially equivalent to both (E1,x*) and (B,h), and thus that (E1,x*) and (B,h) are evidentially equivalent to one another.  The next step is to apply the same trick to an arbitrary experimental outcome (E2,y*) with the same likelihood function as  (E1,x*), using the same hypothetical (B,h).  From the fact that (E1,x*) and (E2,y*) are both evidentially equivalent to (B,h), it follows that they are evidentially equivalent to one another.  Because the only constraint placed on (E1,x*) and (E2,y*) in this construction is that they must have the same likelihood function, the likelihood principle follows immediately.

Allowing text boxes to represent experimental outcomes and lines to represent evidential equivalence (established by (C)), this proof can be represented by the following diagram:
[Diagram not reproduced.]

A Clarification of (C)

Birnbaum proves in his (1962) that (S) and (C) jointly entail (L).  He claims that (C) entails (S), which would mean that (C) entails (L) by itself, but he does not prove it.  In his (1964), he clarifies that he can only prove that (C) entails (S) by using a principle (M) that is strictly weaker than (S) but does not follow from (C).  Thus, the strongest result he can claim is that (C) and (M) jointly entail (L).  In their (1986), however, Evans et al. prove that in fact (C) alone does entail (L).  In this post I begin the task of reconstructing the Evans et al. proof.  First I need to clarify a point of obscurity in Birnbaum’s 1962 formulation of (C).

In 1962, Birnbaum formulates (C) as follows:

The Principle of Conditionality (C): If an experiment E is (mathematically equivalent to) a mixture G of components {Eh}, with possible outcomes (Eh, xh), then Ev(E,(Eh, xh)) = Ev(Eh, xh).

The parenthetical phrase “mathematically equivalent to” here turns out to be essential.  (C) applies to any experiment that contains an ancillary statistic.  In a mixture experiment, the outcome of the random process that determines which component experiment to perform is an ancillary statistic.  However, non-mixture experiments can have ancillary statistics as well.  These experiments are “mathematically equivalent to” mixture experiments, but they do not involve an actual two-stage process that consists of first using a random process to choose a component experiment and then performing that component experiment.

In his (1972), Birnbaum formulates the notion of an ancillary statistic as follows:

h = h(x) is called an ancillary statistic if it admits the factored form f(x;θ) = g(h) f(x|h; θ) where g = g(h) = Prob (h(X)=h) is independent of θ.  

In other words, an ancillary statistic is a statistic whose probability distribution is independent of the parameters and can be factored out of the likelihood function f(x;θ) to yield the conditional likelihood function f(x|h; θ).  A generic mixture experiment yields a simple example.  Suppose you flip a coin to decide whether to perform experiment E1 or E2, where information about the bias of that coin is not informative about the data-generating process in E1 or E2.  There is an unconditional likelihood function f(x;θ) for this mixture experiment as a whole.  However, a frequentist who knows which way the flip turned out, and thus which of E1 or E2 was performed, will typically use the conditional likelihood function f(x|h; θ) that takes this information into account.  He or she will thus neglect the mixture structure of the experiment, acting as if it had been known all along that the experiment actually performed would be performed.
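
A minimal sketch of the factored form f(x;θ) = g(h) f(x|h;θ), using a made-up mixture that flips a fair coin to choose between a Bernoulli(θ) component and a Bernoulli(θ²) component (the components are illustrative, not from Birnbaum):

```python
# h is the coin flip choosing the component; its distribution g(h) = 1/2
# is free of theta, so h is ancillary, and conditioning on h recovers the
# chosen component's likelihood.
def component(h, x, theta):
    p = theta if h == 1 else theta ** 2
    return p if x == 1 else 1 - p

def mixture(h, x, theta):
    return 0.5 * component(h, x, theta)   # joint over (h, x)

for theta in (0.3, 0.7):
    for h in (0, 1):
        g_h = sum(mixture(h, x, theta) for x in (0, 1))
        assert abs(g_h - 0.5) < 1e-12     # ancillary: g(h) free of theta
        for x in (0, 1):
            cond = mixture(h, x, theta) / g_h   # f(x | h; theta)
            assert abs(cond - component(h, x, theta)) < 1e-12
```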

In his (1972), Birnbaum formulates (C) in terms of the notion of an ancillary statistic.  He first defines some notation:

(Eh,x) denotes a model of evidence determined by an outcome x of the experiment Eh: (Ω,Sh,fh), where Sh={x: h(x)=h}.  E may be called a mixture experiment, with components Eh having respective probabilities g(h).

He then reformulates (C):

Conditionality (C): If h(x) is an ancillary statistic, then Ev(E,x)=Ev(Eh,x), where h=h(x).

This formulation is not different in substance from Birnbaum’s 1962 formulation; it is merely more explicit that being mathematically equivalent to a mixture experiment means having an ancillary statistic.

Tuesday, May 17, 2011

A Well-Motivated Frequentist Response to Birnbaum's Theorem

I’m giving a “Works in Progress” talk on Friday to explain my current position on frequentist responses to Birnbaum’s proof.  Here is my abstract:
Frequentists appear to be committed to the sufficiency principle (S) and the conditionality principle (C).  However, Birnbaum (1962) proved that (S) and (C) entail the likelihood principle (L), which frequentist methods violate.  To respond adequately to Birnbaum’s theorem, frequentists must place restrictions on (S) and/or (C) that block Birnbaum’s proof and argue that those restrictions are well motivated.  Restricting (C) alone will not suffice, because (S) by itself implies too much of the content of (L) for frequentists to accept it.  Specifically, frequentists need to restrict (S) so that it does not apply to mixture experiments some of whose components have respective outcomes with the same likelihood function.  Berger and Wolpert (1988, p. 46) claim that such a restriction would be artificial, but in fact it has a strong frequentist motivation: reduction to the minimal sufficient statistic in such an experiment throws away information about what sampling distribution is appropriate for frequentist inference. 
I’ll try to explain the basic argument here.  Start with the claim that (S) by itself implies too much of the content of (L) for frequentists to accept it.  Kalbfleisch makes this point in his (1975) as a criticism of Durbin’s proposal to restrict (C) rather than (S).  Consider two experiments E1 and E2.  E1 involves flipping a coin five times and reporting the number of heads.  E2 involves flipping a coin until it comes up heads and reporting the number of flips required.  Suppose that in both experiments the flips are i.i.d. Bernoulli.  Imagine an instance of E1 and E2 in which one gets one head in E1, and five flips in E2, so that both E1 and E2 consist of flipping a coin five times and getting heads once.  The likelihood principle says that these two outcomes have the same evidential meaning.

The sufficiency principle does not imply that those outcomes of E1 and E2 have the same evidential meaning.  However, it does say that they would have had the same evidential meaning if they had been two outcomes of one experiment rather than outcomes of two different experiments.  So consider a mixture experiment E* that involves first flipping a coin to decide whether to perform E1 or E2 and then performing the selected experiment.  (The coin used to decide which experiment to perform should be distinct from the coin used in E1 or E2, and its bias should be uninformative about θ.)  According to the sufficiency principle, the outcome that consists of performing experiment E1 and getting one head has the same evidential meaning as the outcome that consists of performing experiment E2 and requiring five tosses, when each is performed as part of the mixture experiment E*.  To get the result that they also have the same evidential meaning when performed outside of E*, one needs to appeal to something like (C).  (This is essentially how Birnbaum proves that (S) and (C) entail (L).)  However, to go so far but no farther seems rather unreasonable.  As Kalbfleisch puts it, “In order to reject (L) and accept (S) one must attach great importance to the possibility of choosing randomly between E1 and E2” (p. 252).  To avoid adopting this strange position, someone who rejects (L) should reject (S) as well.

The minimal restriction on (S) that blocks Birnbaum’s proof is to modify (S) so that it does not apply to mixture experiments some of whose components have respective outcomes with the same likelihood function.  Berger and Wolpert say that this restriction “seems artificial, there being no intuitive reason to restrict sufficiency to certain types of experiments” (1988, p. 46).  One can flesh out an argument along these lines as follows.  The following argument for (S) is compelling and completely general:

Conditional on the value of a sufficient statistic, which outcome occurs is independent of the parameters of the experimental model.   Independent variables do not contain information about one another.  Thus, conditional on a sufficient statistic, which outcome occurs does not contain any information about the parameters of the experimental model.  Therefore, the evidential meaning of an experimental outcome is the same as the evidential meaning of an outcome corresponding to the same value of the sufficient statistic. 

Because this argument makes no assumptions about whether the experiment in question is pure or mixed, it would be artificial to restrict (S) to non-mixture experiments.

The problem with this argument (from a frequentist perspective) is that it assumes that the experimental model appropriate for frequentist inference is fixed in advance, regardless of which outcome occurs.  But in a mixture experiment, (C) implies that which experimental model is appropriate for frequentist inference depends on which component experiment is performed.  When the components of the mixture experiment have respective outcomes with the same likelihood function, reduction to the minimal sufficient statistic throws away the information about which component experiment was performed.  Thus, applying (S) to a mixture experiment some of whose components have respective outcomes with the same likelihood function is inappropriate from a frequentist perspective.

I think this is quite a good frequentist response to Berger and Wolpert’s objection.  The challenge for a frequentist is to make the needed restriction on (S) precise in a defensible way.  Berger and Wolpert claim that the distinction between mixture and non-mixture experiments is difficult if not impossible to characterize clearly, suggesting that this challenge will not be easy to meet.  As long as the distinction appears to be real, however, a frequentist need not be bothered too much by difficulties in formulating it precisely.

I have argued that frequentists need to restrict (S).  Fortunately for the frequentist, the needed restriction is well-motivated from a frequentist perspective.  I should note that frequentists may also need to restrict (C).  They certainly do need to do so if Evans, Fraser, and Monette (1986) are correct in their claim that (C) alone implies (L).  I have just begun looking at their paper.  They point out that the fact that seemingly innocuous principles (S) and (C) imply the highly controversial principle (L) should be a clue that there is more to (S) and (C) than meets the eye.  They seem to think that Birnbaum’s way of characterizing experimental models is too simple and that with a more adequate approach (L) would no longer follow from appropriately modified versions of (S) and (C).  It looks like their paper will take some time to digest but will be well worth the effort.  

Wednesday, April 27, 2011

Term Paper on Birnbaum's Proof

Here's a pdf of the term paper described in the previous post.  It has many loose ends, but I think it's a good start on an exciting project.

Tuesday, April 26, 2011

Abstract of a Term Paper on Birnbaum's Proof

I'm writing a term paper that I hope will serve as a preliminary step toward my philosophy comp.  Here's the abstract.

Frequentist methods of statistical inference violate the likelihood principle (L). However, Birnbaum [4] proved that (L) follows from specific versions (S) and (C) of two principles—the sufficiency principle and the conditionality principle, respectively—to which frequentists appear to be committed. In a recent publication [15], Mayo notes that Birnbaum’s proof “has generally been accepted by frequentists, likelihoodists, and Bayesians alike” (p. 307). Nevertheless, she argues that the proof is fallacious (chapter 7(III)). Mayo’s critique involves replacing Birnbaum’s (S) and (C) with different formulations of the principles of sufficiency and conditionality, (S’) and (C’). Mayo shows that (S’) and (C’) do not entail (L) but gives no reason to doubt Birnbaum’s theorem that (S) and (C) entail (L). While Mayo thus fails to show that Birnbaum’s proof is fallacious, her critique does raise the important question whether (S) and (C) or (S’) and (C’) are better formulations of the principles of sufficiency and conditionality. I canvass a few arguments that have been offered on either side of this issue. On balance, these arguments appear to favor Birnbaum’s position. However, they are not sufficiently compelling to declare the issue resolved.

I think it's a good start.  For the comp, I would like to address other responses to Birnbaum's argument in addition to Mayo's and to have something more definite to say about how we should interpret the sufficiency and conditionality principles.

Friday, April 22, 2011

Birnbaum's Proof Part II: The Details

The notation needed to explain Birnbaum's proof in detail outruns the capabilities of Blogger, so I wrote it up as a pdf.  My conclusions are essentially unchanged from my rough sketch of the argument: Birnbaum's proof is valid, but I suspect that he has not formulated the principles of conditionality and sufficiency properly.  If you interpret the principle of conditionality not as a statement about evidential equivalence but instead as a directive about how to analyze experimental results (which seems appropriate to me at this time), then it is incompatible with the principle of sufficiency as formulated by Birnbaum, and indeed with any principle that can do the work that the principle of sufficiency does in Birnbaum's proof.  Another way to undermine Birnbaum's proof would be to insist, as Durbin (1970) does, that a conditional analysis can only condition on a variable that is part of the minimal sufficient statistic, although that move seems less appropriate to me at this time.

I did realize in examining Birnbaum's proof that it is only appropriate for experiments with discrete sample spaces.  However, I do not think that this limitation is serious because the fact that no measurement is completely precise means that all real experiments have discrete sample spaces, the idea of a continuous sample space being only a useful idealization.

Saturday, April 16, 2011

Birnbaum's Proof Part 1: The Rough Idea

The centerpiece of Birnbaum's 1962 paper is his proof that the conditionality and sufficiency principles (as he formulates them) entail the likelihood principle (as he formulates it).  This proof is significant, again, because frequentists generally accept conditionality and sufficiency but do not accept the likelihood principle, which follows from Bayesianism and implies many of the consequences of the Bayesian position that frequentists find objectionable.  In a future post, I will delve into the details of Birnbaum's proof; in this post, I just want to display its overall structure and introduce the objections it has received. 

Birnbaum considers two experiments that have pairs of respective outcomes—call them “star pairs”—that determine proportional likelihood functions.  He then constructs a hypothetical mixture experiment with these two experiments as its components.  By the conditionality principle, an outcome of either component experiment has the same evidential meaning as the corresponding outcome of the mixture experiment.  Now, there is a sufficient statistic that lumps together outcomes of the mixture experiment corresponding to star pair outcomes of the component experiments.  By the sufficiency principle, then, these outcomes of the mixture experiment have the same evidential meaning.  Given that an outcome of the mixture experiment has the same evidential meaning as a corresponding outcome of a component experiment, and that outcomes of the mixture experiment corresponding to star pair outcomes of the two component experiments have the same evidential meaning as one another, it follows that star pair outcomes of the two component experiments have the same evidential meaning as one another.  That is just what the likelihood principle asserts.

Call the two component experiments E and E’, respectively; call the mixture experiment E*; and let (x*, y*) be a "star pair," with x* an outcome of E and y* an outcome of E’.  Then the following diagram depicts the structure of Birnbaum’s proof, using lines to indicate evidential equivalence and denoting above each line which principle is invoked to establish equivalence:
[Diagram not reproduced.]

Several objections to this proof have appeared in the statistics literature (e.g. Durbin 1970, Cox and Hinkley 1974, Kalbfleisch 1974, Joshi 1990) and at least one in the philosophy literature (Mayo 2011), but it is still widely accepted.  I am suspicious of Birnbaum's proof, but I do not yet have confidence in any precise diagnosis of where it goes wrong.

One objection to Birnbaum's proof toward which I am sympathetic says that the conditionality principle should be understood not as a claim about evidential equivalence, but as a directive about how to analyze experimental results: thus, the conditionality principle says, "analyze experimental results conditional on which experiment was actually performed."  Understood in this way, the conditionality principle prohibits the use of the sufficient statistic that lumps together results from experiment E and experiment E', blocking Birnbaum's proof.  (Kalbfleisch develops a version of this idea in his 1974, but the specific way in which he develops it may be problematic.)  Birnbaum denies that the conditionality principle is to be understood as a directive (1962, p. 281 and elsewhere), but it is not clear to me that he has good reasons for doing so.

Friday, April 15, 2011

Birnbaum's Likelihood Principle

In this post, I present Birnbaum’s formulation of the likelihood principle and explain why frequentists reject and Bayesians accept this principle. Again, the big picture: frequentists typically accept conditionality and sufficiency principles while rejecting the likelihood principle. The likelihood principle is a central tenet of Bayesianism that follows directly from using Bayes’ theorem as an update rule. Many of the features of Bayesianism that frequentists find objectionable follow from the likelihood principle alone, so it is a short step from accepting the likelihood principle to becoming a Bayesian. Birnbaum argues that the conditionality and sufficiency principles are equivalent to the likelihood principle, putting significant pressure on frequentists to justify their position.


You should not be surprised to learn that the likelihood principle appeals to the notion of a likelihood; or, more precisely, a likelihood function. Birnbaum models an experiment as having a well-defined probability density f(x; θ) for each x in its sample space and each θ in its parameter space. The likelihood function is this density considered as a function of θ rather than x, defined up to an arbitrary multiplicative constant c: L(θ|x) = c f(x; θ). Roughly speaking, the likelihood function tells you how probable the model makes the data as a function of that model’s parameters.

Most frequentists are happy to use the likelihood function of an experiment in certain specific ways, such as maximum likelihood estimation and likelihood ratio testing. However, they do not accept the likelihood principle, which says, roughly, that all of the information about θ an experiment provides is contained in the likelihood function of θ. As Birnbaum formulates it, the likelihood principle is (like the conditionality and sufficiency principles) a claim about evidential equivalence. Specifically, it asserts that if the two experiments E and E’ with a common parameter space produce respective outcomes x and y that determine proportional likelihood functions, then Ev(E, x)=Ev(E’, y).

Consider two possible experiments. In the first experiment, you decide to spin a coin 12 times, and it comes up heads 3 times. Assuming that the spins are independent and identically distributed Bernoulli trials with the probability of heads on a given trial equal to p, the likelihood function of this outcome (up to an arbitrary multiplicative constant) is L(p|x=3) = (12 choose 3) p^3 (1-p)^9. In the second experiment, you decide to spin the coin until you obtain 3 heads. As it turns out, heads comes up for the third time on the 12th spin. Assuming that again the spins are independent and identically distributed Bernoulli trials with the probability of heads on a given trial equal to p, the likelihood function of this outcome (up to an arbitrary multiplicative constant) is L(p|x=12) = (11 choose 2) p^3 (1-p)^9. The likelihood functions for these two outcomes are proportional, so the likelihood principle says that they have the same evidential meaning.
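
A quick check that the two likelihood functions really are proportional in p, with constant of proportionality (12 choose 3)/(11 choose 2) = 4:

```python
from math import comb

# Binomial: 3 heads in 12 spins.
def lik_binomial(p):
    return comb(12, 3) * p ** 3 * (1 - p) ** 9

# Negative binomial: 3rd head arrives on the 12th spin.
def lik_negbinomial(p):
    return comb(11, 2) * p ** 3 * (1 - p) ** 9

# The ratio is the same constant at every value of p.
for p in (0.1, 0.3, 0.5, 0.9):
    assert abs(lik_binomial(p) / lik_negbinomial(p) - 4.0) < 1e-9
```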

Standard frequentist methods say, contrary to the likelihood principle, that these two experiments do not have the same evidential meaning. (In fact, the second experiment but not the first allows one to reject the null hypothesis p=.5 at the .05 level in a one-sided test.) Frequentist methods are based on P values, where a P value is, roughly, the probability of a result at least as extreme as the observed result. The likelihood principle implies that only the outcome actually obtained in an experiment is relevant to the evidential interpretation of that experiment. Because P values refer to results other than the result that actually occurred (namely, the unrealized results that are at least as extreme as the observed result), they are incompatible with the likelihood principle.
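
The parenthetical claim about the one-sided test can be verified directly: under p=.5, "3 or fewer heads in 12 spins" has probability above .05, while "3rd head on the 12th spin or later" (i.e., at most 2 heads in the first 11 spins) has probability below .05:

```python
from math import comb

# One-sided P value for the fixed-sample experiment: 0, 1, 2, or 3 heads
# in 12 fair spins.
p_binom = sum(comb(12, k) for k in range(4)) / 2 ** 12

# One-sided P value for the inverse-sampling experiment: at most 2 heads
# in the first 11 fair spins (so the 3rd head needs spin 12 or later).
p_negbinom = sum(comb(11, k) for k in range(3)) / 2 ** 11

assert p_binom > 0.05      # ~0.073: cannot reject p = .5 at the .05 level
assert p_negbinom < 0.05   # ~0.033: rejects p = .5 at the .05 level
```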

This conflict between frequentist methods and the likelihood principle becomes particularly stark when one considers “try and try again” stopping rules, which direct one to continue sampling until one achieves a particular result, such as a specific P value or posterior probability. For instance, a possible stopping rule is to continue collecting data until one achieves a nominally .05 significant result; that is, a result that appears to be significant at the P=.05 level if one analyzes the data as if a fixed-sample-size stopping rule had been used. A frequentist would insist that data gathered according to this stopping rule do not have the same evidential meaning as the same data gathered according to a fixed-sample stopping rule. After all, the experiment with the “try and try again” stopping rule is guaranteed to generate a nominally .05 significant result, so its real P value is not .05, but 1.

Bayesians argue on the contrary that it is absurd to make the evidential meaning of an experiment sensitive to the stopping rule used. Why should the evidential meaning of a result depend on an experimenter’s intentions, which are after all “inside his or her head?”

There is much more that could be said about frequentist-Bayesian disputes about the relevance of stopping rules to inference. For present purposes, it is enough to note that the relevance of stopping rules for frequentist tests violates the likelihood principle.

The likelihood principle, while unacceptable to frequentists, is a simple consequence of the use of Bayes’ theorem as an update rule. According to Bayes’ theorem, the posterior probability of a hypothesis is equal to its prior probability times its likelihood, divided by the average of the prior probabilities of all hypotheses in the hypothesis space weighted by their likelihoods. Thus, given a prior distribution over the hypothesis space, the posterior probability of a hypothesis depends only on the likelihood function. (The arbitrary multiplicative constant included in the likelihood can be factored out of both the top and the bottom of Bayes’ theorem, so it cancels out.) Thus, Bayesianism implies the likelihood principle.
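
The cancellation argument can be illustrated with a toy grid-based posterior (the flat prior and the coin-spinning data from the earlier example are purely illustrative): the posterior is identical whichever multiplicative constant the likelihood carries.

```python
from math import comb

grid = [i / 10 for i in range(1, 10)]    # candidate values p = 0.1, ..., 0.9
prior = [1 / len(grid)] * len(grid)      # flat prior, for illustration

def posterior(lik_const):
    """Posterior over the grid for likelihood lik_const * p^3 * (1-p)^9."""
    unnorm = [pr * lik_const * p ** 3 * (1 - p) ** 9
              for pr, p in zip(prior, grid)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Binomial constant (12 choose 3) vs negative binomial constant (11 choose 2):
post_binom = posterior(comb(12, 3))
post_negbinom = posterior(comb(11, 2))

# The constant cancels in the normalization, so the posteriors agree.
assert all(abs(a - b) < 1e-12 for a, b in zip(post_binom, post_negbinom))
```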

Thursday, April 14, 2011

Birnbaum's Sufficiency Principle

In my previous post, I gave an example that motivates the conditionality principle and presented Birnbaum’s formulation of that principle.  In this post, I do likewise for the sufficiency principle.  To recap the big picture: frequentists typically accept conditionality and sufficiency principles while rejecting the likelihood principle.  The likelihood principle is a central tenet of Bayesianism that follows directly from using Bayes’ theorem as an update rule.  Many of the features of Bayesianism that frequentists find objectionable follow from the likelihood principle alone, so it is a short step from accepting the likelihood principle to accepting Bayesianism.  Birnbaum argues that the conditionality and sufficiency principles are equivalent to the likelihood principle, putting significant pressure on frequentists to justify their position.

The sufficiency principle appeals to the notion of a sufficient statistic.  Roughly speaking, a sufficient statistic lumps together some outcomes of an experiment that have the following property: given that some outcome in the lumped-together set occurred, which one of those outcomes occurred is independent of the parameters of the experiment.  For instance, consider the outcome of a series of two coin tosses, where the tosses are assumed to be independent and identically distributed Bernoulli trials with probability p of heads.  The outcome space for this experiment is the set of possible sequences of outcomes of two coin tosses: HH, HT, TH, TT.  A sufficient statistic for this experiment is the number of heads.  This statistic is sufficient because, given the number of heads, the exact sequence of heads and tails is independent of p.

The sufficiency principle says, roughly, that a sufficient statistic summarizes the results of an experiment with no loss of information.  In other words, given the value t(x) of a statistic T(X) that is sufficient for θ, you don’t learn any more about θ by learning x.  This claim is very widely accepted and appears to be well-motivated.  x is independent of θ conditional on t(x), and it’s hard to see how one quantity could provide information about another quantity of which it is independent.  For instance, the sufficiency principle says that, given the number of heads obtained in a sequence of n tosses, you don’t learn any more about p by learning the exact sequence of heads and tails.  (This application of the sufficiency principle requires the assumption that the tosses are independent and identically distributed Bernoulli trials; if the possibility that the tosses are non-independent were on the table, for instance, then information about sequence would be relevant and the number of heads would not be a sufficient statistic.)
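
The two-toss example can be checked by direct enumeration. The sketch below (my own illustration) computes the conditional probability of each exact sequence given the number of heads and confirms that it does not depend on p:

```python
# Number of heads is sufficient for p in two i.i.d. Bernoulli tosses:
# the conditional distribution of the exact sequence, given the number
# of heads, is the same for every value of p.
from itertools import product

def seq_prob(seq, p):
    prob = 1.0
    for toss in seq:
        prob *= p if toss == "H" else (1 - p)
    return prob

def conditional_given_heads(k, p):
    """P(sequence | number of heads = k) for two tosses with heads probability p."""
    seqs = ["".join(s) for s in product("HT", repeat=2)]
    matching = [s for s in seqs if s.count("H") == k]
    total = sum(seq_prob(s, p) for s in matching)
    return {s: seq_prob(s, p) / total for s in matching}

print(conditional_given_heads(1, 0.3))  # HT and TH each get probability 1/2
print(conditional_given_heads(1, 0.9))  # same answer: independent of p
```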

Birnbaum formulates the sufficiency principle, like the conditionality principle, as a claim about evidential equivalence.  Take an experiment E with outcome x and a derived experiment E’ with outcome t=t(x), where T(X) is a sufficient statistic; then Ev(E, x)=Ev(E’,t).  In other words, reduction to a sufficient statistic does not change the evidential meaning of an experiment.

My comments at the end of the previous post are also appropriate here: it is a good idea to be wary of general principles insofar as they are motivated merely by the fact that they seem to capture the intuitions at play in simple examples.  However, it is worth keeping in mind that the sufficiency principle is not motivated only (or even, I think, primarily) by its intuitive appeal in simple cases, but also by the general claim that one quantity cannot provide information about another quantity of which it is independent.  Similarly, the conditionality principle is motivated by the general claim that experiments that were not performed are irrelevant to the interpretation of the experiment that was performed.  However, the conditionality principle is formulated to apply to experiments that are “mathematically equivalent” to mixture experiments, so it is not clear that this general claim is general enough to warrant the principle.

Birnbaum's Conditionality Principle

Allan Birnbaum’s 1962 paper “On the Foundations of Statistical Inference” purports to prove that the conditionality and sufficiency principles—which frequentists typically accept—jointly entail the likelihood principle—which frequentists typically reject.  The likelihood principle is an important consequence of Bayesianism.  Moreover, many of the consequences of Bayesianism that frequentists typically find objectionable (e.g., the stopping rule principle) follow from the likelihood principle alone.  Thus, once one accepts the likelihood principle and its consequences, there is little to stop one from becoming a Bayesian.  The prominent Bayesian L. J. Savage said that he began to take Bayesianism seriously “only through the recognition of the likelihood principle.”  As a result, he called the initial presentation of Birnbaum’s paper “really a historic occasion.”

In this post, I will discuss why both frequentists and Bayesians find the conditionality principle attractive, and I will provide Birnbaum’s formulation of that principle.

In rough intuitive terms, the conditionality principle says that only the experiment that was actually performed is relevant for interpreting that experiment’s results.  Stated this way, the principle seems rather obvious.  For instance, suppose your lab contains two thermometers, one of which is more precise than the other.  You share your lab with another researcher, and you both want to use the more precise thermometer for today’s experiments.  You decide to settle the matter by tossing a fair coin.  Once you have received your thermometer and run your experiment, there are two kinds of methods you could use to analyze your results.  One is an unconditional approach, which assigns margins of error in light of the fact that you had a 50/50 chance of using either thermometer, without taking into account which thermometer you actually used.  The other is a conditional approach, which assigns margins of error in light of the thermometer you actually used, ignoring the fact that you might have used the other one.  Most statisticians find it highly counterintuitive to interpret your data in light of the fact that you might have used a thermometer other than the one you actually used.  Thus, most statisticians favor the conditional approach in this case.  The conditionality principle is designed to capture this intuition.

To express the conditionality principle precisely, it will be necessary to introduce some notation.  Birnbaum models an experiment as having a parameter space Ω of vectors θ, a sample space S of vectors x, and a probability distribution f(x, θ) defined for all x and θ.  He writes the outcome x of experiment E as (E, x), and the “evidential meaning” of that outcome as Ev(E, x).  He does not attempt to characterize the notion of evidential meaning beyond the constraints given by the conditionality, sufficiency, and likelihood principles.  Each of those principles states conditions under which two outcomes of experiments have the same evidential meaning.

Birnbaum expresses the conditionality principle in terms of the notion of a mixture experiment.  A mixture experiment E involves first choosing which of a number of possible “component experiments” to perform by observing the value of a random variable h with a known distribution independent of θ, and then taking an observation xh from the selected component experiment Eh.  One can then represent the outcome of this experiment as either (h, xh) or, equivalently, as (Eh, xh).  The conditionality principle says that Ev(E, (Eh, xh))=Ev(Eh, xh).  In words, the evidential meaning of the outcome of a mixture experiment is the same as the evidential meaning of the corresponding outcome of the component experiment that was actually performed.
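
Putting numbers on the thermometer example makes the contrast between the two approaches concrete. In the sketch below (the error distributions and coverage level are invented for illustration), measurement errors are normal with standard deviation 1 for the precise thermometer and 3 for the imprecise one, a fair coin picks the thermometer, and the 95% margin of error is computed both conditionally and unconditionally:

```python
import math

def normal_cdf(x, sd):
    """CDF of a mean-zero normal with standard deviation sd."""
    return 0.5 * (1 + math.erf(x / (sd * math.sqrt(2))))

def mixture_cdf(x):
    """CDF of the error under the mixture: fair coin picks sd 1 or sd 3."""
    return 0.5 * normal_cdf(x, 1.0) + 0.5 * normal_cdf(x, 3.0)

def margin(cdf, coverage=0.95):
    """Find m with P(|error| <= m) = coverage by bisection (symmetric errors)."""
    target = 0.5 + coverage / 2          # cdf(m) for a symmetric distribution
    lo, hi = 0.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(margin(lambda x: normal_cdf(x, 1.0)))  # conditional, precise thermometer: ~1.96
print(margin(lambda x: normal_cdf(x, 3.0)))  # conditional, imprecise thermometer: ~5.88
print(margin(mixture_cdf))                   # unconditional: ~4.9
```

The unconditional margin (about 4.9) overstates the error of the precise thermometer (about 2.0) and understates that of the imprecise one (about 5.9), which is just the intuition the conditionality principle is meant to capture.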

The above discussion of the conditionality principle follows a pattern that is common in philosophy: start with an intuition-pumping example, then state a principle that seems to capture the source of the intuition at work in that example.  It takes only a little experience with philosophical disputes to become suspicious of this pattern of reasoning.  There are always many general principles that can be used to license judgments about particular cases, and there are typically counterexamples to whatever happens to be the most “obvious” or “natural” general principle.  Take, for instance, theories of causation.  It is easy to give examples to motivate, say, a David Lewis-style counterfactual analysis of causation.  For instance, the Titanic sank because it struck an iceberg.  Analysis: the Titanic struck an iceberg and sank, and if it hadn’t struck that iceberg then it wouldn’t have sunk.  In general: c causes e if and only if c and e both occur, and if c hadn’t occurred then e wouldn’t have occurred.  This analysis seems to capture what’s going on in the Titanic example, but counterexamples abound.  For instance, suppose that (counterfactually, so far as I know) there had been a terrorist on board the Titanic who would have sabotaged it and caused it to sink the next day if it hadn’t struck the iceberg and sunk.  Presumably, one still wants to say in this scenario that the iceberg caused the Titanic to sink.  Nevertheless, the Titanic would have sunk even if it hadn’t struck the iceberg.  Typically in philosophical debates, a counterexample like this one leads to a revision of the original analysis that blocks the counterexample; that revised analysis is then subjected to another counterexample, which leads to further revision; and this counterexample-revision-counterexample cycle iterates until the analysis becomes so complex that the core idea that motivated the original analysis starts to seem hopeless.
That idea is abandoned, a new idea is proposed, and the process is repeated with that new idea.

In short, my training in philosophy inclines me to be suspicious of Birnbaum’s conditionality principle, even though it seems to capture what’s going on in the simple thermometer example.  Because of the conditionality principle’s technical and specialized nature, however, it is not as easy to think of potential counterexamples.  I will table this concern for now; in future posts, I will discuss counterexamples to the conditionality principle that statisticians have proposed, and revisions to that principle they have suggested.

Revised Topics

My philosophy comp topic has evolved gradually, while my history comp topic has changed drastically.


My current philosophy comp project begins with a 1962 paper in which Allan Birnbaum argues that two principles frequentist statisticians typically accept—the conditionality and sufficiency principles—imply a principle they typically reject—the likelihood principle. The likelihood principle is a consequence of Bayes’ theorem, and Bayes’ theorem provides perhaps the simplest way to implement the likelihood principle in statistical inference, so Birnbaum’s argument tends to push frequentists toward Bayesianism.

Birnbaum’s argument is famous among those interested in the philosophy of statistics, but it has been criticized. Several statisticians have argued that Birnbaum’s formulation of either the conditionality principle or the sufficiency principle is too strong, and that replacing it with a suitably weakened principle would not allow Birnbaum’s argument to go through. However, these statisticians disagree among themselves about how Birnbaum’s principles should be weakened, and their specific proposals have been criticized. Joshi and Mayo have raised stronger objections to Birnbaum’s argument, arguing that there is a flaw in Birnbaum’s logic rather than in his premises.

I do not yet know what to say about Birnbaum’s argument, but I think that with enough work I am bound to find something interesting. Whether Birnbaum is right or not, there is work to be done in pinpointing exactly where either he or his critics go wrong, and the results of such an analysis are likely to have significant implications for the foundations of statistics.

I am abandoning my history comp project based on the Millikan oil-drop experiment. Millikan’s notebooks have already received careful scrutiny, and after some preliminary work it is not clear that an experimental approach will yield any significant new insights in time for the comp deadline. Moreover, it has come to my attention that there is a researcher in Germany who is way ahead of me in tracking down and investigating extant versions of Millikan’s apparatus.

Instead, I am planning to write my history comp on a puzzling passage in Darwin’s Origin of Species. I wrote a paper on this topic last year and received encouraging comments on it and suggestions for expanding it. In particular, I am planning to investigate how this passage changed through subsequent editions of the Origin and to look for evidence that might indicate why Darwin made the particular changes he did.

Friday, February 25, 2011

Is Spectrum Bias a Problem for Error Statistics?

A phenomenon called spectrum bias might help my argument that advocates of error statistics should take the positive predictive value (PPV) and negative predictive value (NPV) of their tests seriously. Spectrum bias is typically discussed in the context of medical diagnostic tests. Such tests are characterized by their sensitivity and specificity, where a test’s sensitivity is the probability that it yields a positive result if the condition in question is present, and its specificity is the probability that it yields a negative result if the condition in question is absent. PPV and NPV (the probability that the condition is present given a positive result, and absent given a negative result, respectively) are more clinically relevant than sensitivity and specificity. However, sensitivity and specificity are more popular measures of a test’s performance because, unlike PPV and NPV, they are generally taken to be intrinsic properties of the test, independent of the prevalence of the condition in the population.
Spectrum bias is the phenomenon that sensitivity and specificity are not, in fact, intrinsic properties of medical tests. Like PPV and NPV, they vary across the populations to which a test is applied. There are both theoretical and empirical studies supporting the claim that spectrum bias exists. At least one study I have looked at purports to show that sensitivity and specificity vary with features of the population almost as much as PPV and NPV do. At least part of the explanation for this phenomenon is that medical conditions typically are not truly dichotomous; they can be present to varying extents. Misclassification is more likely for individuals who are close to the classification cutoff. As a result, sensitivity and specificity are lower for populations in which many individuals are close to the cutoff than they are for populations without this feature.
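
The cutoff mechanism just described is easy to simulate. The sketch below is my own construction, not one of the studies mentioned above: severity is continuous, “diseased” means severity above a cutoff, and the test measures severity with noise. Sensitivity then comes out lower in a population whose diseased members cluster near the cutoff, even though the test itself is unchanged.

```python
import random

CUTOFF = 1.0     # severity above this counts as "diseased" (invented numbers)
NOISE_SD = 0.5   # measurement noise of the test

def sensitivity(severities, rng, trials=20000):
    """Fraction of truly diseased individuals the noisy test flags positive."""
    diseased = [s for s in severities if s > CUTOFF]
    hits = 0
    for _ in range(trials):
        s = rng.choice(diseased)
        hits += (s + rng.gauss(0, NOISE_SD)) > CUTOFF   # noisy measurement
    return hits / trials

rng = random.Random(1)
severe_pop = [rng.uniform(1.5, 3.0) for _ in range(1000)]      # well above the cutoff
borderline_pop = [rng.uniform(1.0, 1.3) for _ in range(1000)]  # clustered near the cutoff

print(sensitivity(severe_pop, rng))      # high: these cases are easy to detect
print(sensitivity(borderline_pop, rng))  # lower: same test, different population
```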

If spectrum bias afflicts error statistical tests generally, then an advocate of error statistics cannot deny the relevance of PPV and NPV on the grounds that they are not intrinsic properties of tests without also impugning their preferred error rates α and β.
I need to find out more about spectrum bias and its prevalence and severity before I can be confident that this argument is a good one. However, it does seem promising and is not likely to have been considered before within the philosophy of science, where spectrum bias seems to be largely unknown.
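
For comparison, the familiar reason PPV is taken not to be intrinsic can be shown in a few lines: holding sensitivity and specificity fixed, PPV swings dramatically with prevalence (the numbers are invented for illustration):

```python
# PPV from sensitivity, specificity, and prevalence, via Bayes' theorem.
def ppv(sens, spec, prevalence):
    true_pos = sens * prevalence                 # P(positive and diseased)
    false_pos = (1 - spec) * (1 - prevalence)    # P(positive and healthy)
    return true_pos / (true_pos + false_pos)

for prev in (0.5, 0.1, 0.01):
    print(prev, ppv(0.9, 0.9, prev))   # PPV falls from 0.9 toward 0.08 as prevalence drops
```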

Wednesday, February 23, 2011

A Refinement of Ioannidis' Argument

I've briefly written up in Word an idea for a refinement of Ioannidis' argument that would yield results that are relevant to error statistics and the base-rate fallacy.  You can download the file here.  (Unfortunately, the figures don't show up properly in the Google Docs viewer that the link brings up--you need to download the file and view it in Word.)

Monday, February 21, 2011

Teaching Reflection (Baranger Award Application Materials III)

The third and final portion of the Baranger Award Application is the Teaching Reflection:
Please submit a brief (300 words) description of how your sample teaching material (submitted below) reflects your teaching philosophy. You may wish to address how this material is useful to students and/or how it contributes to student learning. Additionally, you might consider how you might revise the materials now that you have had an opportunity to use them in the classroom.


The sample teaching material I am providing is here.  My description of it is as follows:

I have provided a handout for a writing lesson that I gave in the course Introduction to the Philosophy of Science.  I believe that students should receive writing instruction throughout their studies.  At the same time, I cannot let a philosophy course turn into a writing course.  To balance those demands, I developed a lesson that aims to help students improve their writing as much as possible while only devoting a single session explicitly to writing.  This lesson focuses on three simple but powerful tips and gives those tips names so that I can refer to them for the rest of the term.

This lesson reflects lessons I learned while teaching test-prep courses, as I explain in my description of a teaching challenge I faced.  In particular, the lesson is structured so that the ideas are bite-sized and uncluttered.  In each of the three main sections, I begin by introducing the core idea of that section through an example.  I then use a series of additional examples to introduce a few wrinkles into that core idea.  Each example is there to make one simple point, and I resist the temptation to comment on an example beyond that simple point.  After the examples, I provide a few notes that sum up the points they are meant to illustrate, and then I restate the main point of the section.  The lesson is highly interactive, with students reading examples and suggesting revisions throughout.  It ends with a drill that gives students a chance to practice improving some bad passages drawn from actual academic writing.  The drill is essential because it allows students to start trying to apply the lesson to realistic cases while their peers and I are there to help them when they run into trouble.

Thursday, February 17, 2011

Teaching Philosophy Revised

Here's a new draft of my statement of teaching philosophy, which I am planning to revise and use as part of my application for the Elizabeth Baranger teaching award. Feedback welcome!

Statement of Teaching Philosophy

            Philosophy courses provide excellent opportunities to teach skills and habits of mind that are central to a liberal arts education.  Those skills include abilities to analyze and evaluate arguments, to formulate reasonable views about complex issues, and to articulate and defend those views both orally and in writing.  Such skills help students become responsible citizens, valuable employees, and thoughtful human beings.
            The ability to analyze and evaluate arguments is fundamental for many fields, including not only philosophy but also career fields such as science, medicine, and law.  It is also essential for formulating reasonable, nuanced beliefs in a time of extremist commentary.  Philosophy courses provide excellent opportunities to teach these skills both because philosophy is a highly contentious discipline and because philosophers attend explicitly to the norms of argumentation.  I help my students acquire these critical thinking skills in several ways.  For instance, early on in a course I teach a lesson about how to analyze and evaluate arguments.  That lesson establishes a framework for talking about arguments that I continue to use throughout the term.  I also require students to write a number of Reading Responses in which they choose an argument from one of their readings to analyze and evaluate.  I use a peer-review system for these assignments in which students receive frequent feedback on their writing from one another in addition to the feedback they receive from me.
            Critical thinking skills are essential, but students should also learn to think synthetically and constructively.  Philosophy courses are well suited for teaching those skills because they give students opportunities to present their own views both in written work and in class discussions.  I prefer essay topics that are related to but not identical to topics we discuss in class, so that students can use ideas that we have discussed but cannot simply restate them.  When students receive their first essay assignment, I teach them a few simple ways to improve the clarity of their writing and give those tips simple names so that I can refer to them throughout the term.  I also work to ensure that students feel comfortable sharing their ideas in class while at the same time helping them to improve their oral presentation skills.  I tell them from the beginning that it is okay to be wrong in a difficult field such as philosophy, and that it is generally more productive to try out a view and see where it leads than to remain forever sitting on the fence.  I reinforce this message by taking students’ ideas seriously, pointing out their merits and raising concerns without shooting them down.  At the same time, I ask students to avoid selling their ideas short; for instance, I ask them to avoid the weak phrase “I feel like...” in favor of the more forceful “I think that...” and to avoid expressing statements as if they were questions.  I work to build student participation into my lessons as much as possible so that students come to class expecting to speak.
Many topics of debate in our society have at their heart philosophical issues.  For instance, debates about whether alternatives to the theory of evolution should be taught in public school science classes often turn on the question of what distinguishes science from non-science, which philosophers of science call the problem of demarcation.  I aim to help my students develop a more sophisticated perspective on those debates and a greater appreciation for the importance of philosophy by highlighting such connections.  In one case, I gave my students a New York Times op-ed piece by Deborah Tannen and asked them to comment on it in light of Karl Popper’s philosophy of science.  I was thrilled to see that they were able to identify what appears to be an ad-hoc maneuver by Tannen to save her favored theory--a no-no according to Popper.  I then gave them the option to write a Reading Response in which they applied Popper’s philosophy to Tannen’s article and gave their own view about whether they agree with what Popper’s theory says about this case.
I aim to persuade my students that they need philosophy to think about issues they care about.  In addition, I aim to give them skills that will allow them to think clearly and carefully about those issues and to be eloquent in sharing their thoughts with others.  Such skills are vital not only in the workplace, but also in private life and democratic society.