Friday, May 20, 2011

Evans et al. Proof Part 3: The Proof with Cross-Embedding

In my previous post I gave a few examples of cross-embedded models with multiple ancillaries.  In this post, I’ll present the Evans et al. proof that (C) entails (L), which involves constructing a cross-embedded model.

Evans et al. start with an arbitrary experimental outcome (E,x0) and construct by stipulation a hypothetical outcome (B,h) of a hypothetical Bernoulli experiment B that has the same likelihood function.  They then build a cross-embedded experiment out of E and B.  To preview where this is going: they invoke (C) twice to establish that the outcome (x0,h) of the cross-embedded experiment has the same evidential meaning as (E,x0) and as (B,h), and thus that (E,x0) and (B,h) have the same evidential meaning as one another.  They then repeat the process with an arbitrary experimental outcome (E’,y0) that has the same likelihood function as (E,x0) to establish that (E’,y0) has the same evidential meaning as (B,h), and thus the same evidential meaning as (E,x0).  The likelihood principle follows immediately.

Let f(x;θ) be the likelihood function for experiment E at sample point x.  Evans et al. construct the following cross-embedded experiment:
V\X   x0             x1          x2          ...
h     ½f(x0;θ)       ½f(x1;θ)    ½f(x2;θ)    ...
t     ½ - ½f(x0;θ)   ½f(x0;θ)    0           ...

The indicator variable for h is ancillary: its distribution is Bernoulli(1/2) independent of θ.  The indicator variable for X=x0 (as opposed to X itself) is ancillary: its distribution is also Bernoulli(1/2) independent of θ.  Thus, (C) says that one can conditionalize on these variables without changing evidential meaning.

Call this hypothetical cross-embedded experiment E*.  By (C) and the ancillarity of the indicator for h, Ev(E*,(h,x0))=Ev(E,x0).  By (C) and the ancillarity of the indicator for x0, Ev(E*,(h,x0))=Ev(B,h).
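
To see concretely that the construction behaves as advertised, here is a minimal Python sketch of my own (not from Evans et al.).  It builds E* with a Binomial(4,θ) likelihood standing in for an arbitrary f, takes x0 = 0, and checks that both indicators are Bernoulli(1/2) whatever θ is, and that conditioning on V = h recovers E while conditioning on X = x0 recovers B:

```python
from math import comb

def f(x, theta, n=4):
    """Binomial(n, theta) pmf, a stand-in for an arbitrary discrete f(x; theta)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def cross_embedded(theta, x0=0, n=4):
    """Joint pmf of (V, X) for the cross-embedded experiment E*.
    Row h carries (1/2)f(x; theta); row t puts 1/2 - (1/2)f(x0; theta) on x0,
    the leftover (1/2)f(x0; theta) on one other sample point, and 0 elsewhere."""
    xs = range(n + 1)
    x1 = next(x for x in xs if x != x0)  # any sample point other than x0
    joint = {('h', x): 0.5 * f(x, theta, n) for x in xs}
    joint.update({('t', x): 0.0 for x in xs})
    joint[('t', x0)] = 0.5 - 0.5 * f(x0, theta, n)
    joint[('t', x1)] += 0.5 * f(x0, theta, n)
    return joint

for theta in (0.2, 0.5, 0.9):
    j = cross_embedded(theta)
    p_h = sum(p for (v, _), p in j.items() if v == 'h')   # Pr(V = h): always 1/2
    p_x0 = sum(p for (_, x), p in j.items() if x == 0)    # Pr(X = x0): always 1/2
    cond_E = j[('h', 0)] / p_h    # Pr(X = x0 | V = h): recovers f(x0; theta)
    cond_B = j[('h', 0)] / p_x0   # Pr(V = h | X = x0): the Bernoulli experiment B
    print(theta, p_h, p_x0, cond_E, cond_B, f(0, theta))
```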

By the same series of steps, replacing x’s with y’s and E with E’, one can show that Ev(E’,y0)=Ev(B,h) for any (E’,y0) such that y0 has the same likelihood function in E’ as x0 has in E.  The likelihood principle follows immediately.

Evans et al. Proof Part 2: Cross Embedding


The key step in the Evans et al. proof (discussed in the previous post) is to construct a cross-embedded experiment with two ancillary statistics.  In this post I explain how that construction works. 

Let’s start with a simple example (from Evans et al. 1985, p. 3) to illustrate the fact that an experiment can have two ancillary statistics.  Consider an experiment whose sampling distribution is represented by the 2x2 table below:
y\x   1      0
1     θ      1-θ
0     1-θ    θ

relative to the normalizing constant 2.  The variable y is ancillary with respect to x: the unconditional probability distribution of y is Bernoulli with Pr(y=1)=1/2 independent of θ.  Conditioning on y changes the probability distribution of x from Bernoulli with Pr(x=1)=1/2 to Bernoulli with either Pr(x=1|y=1)=θ or Pr(x=1|y=0)=1-θ depending on the value of y.  The 2x2 table is symmetric with respect to interchange between x and y, so exactly the same holds true interchanging x and y.  Thus, y is ancillary with respect to x, and x is ancillary with respect to y.
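
A quick numerical check of this double ancillarity (a sketch of my own, not from Evans et al.):

```python
# The 2x2 table with normalizing constant 2: Pr(x, y) = theta/2 if x == y
# and (1 - theta)/2 otherwise.
def joint(x, y, theta):
    return (theta if x == y else 1 - theta) / 2

for theta in (0.1, 0.5, 0.8):
    p_y1 = joint(0, 1, theta) + joint(1, 1, theta)  # marginal of y: 1/2, free of theta
    p_x1 = joint(1, 0, theta) + joint(1, 1, theta)  # marginal of x: 1/2, free of theta
    p_x1_given_y1 = joint(1, 1, theta) / p_y1       # equals theta
    p_x1_given_y0 = joint(1, 0, theta) / (joint(0, 0, theta) + joint(1, 0, theta))  # 1 - theta
    print(theta, p_y1, p_x1, p_x1_given_y1, p_x1_given_y0)
```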

Note that this experiment is mathematically equivalent to a mixture experiment that involves first observing x to decide whether to perform the experiment represented by column 1 (renormalized) or the experiment represented by column 2 (renormalized) and then performing that experiment.  It is also mathematically equivalent to a mixture experiment that involves first observing y to decide whether to perform the experiment represented by row 1 (renormalized) or the experiment represented by row 2 (renormalized) and then performing that experiment. However, a physical instantiation of this experiment can’t actually be both types of mixture experiment simultaneously.  If one were to restrict (C) so that it applied only to genuine mixture experiments and not to experiments that are mathematically equivalent to mixture experiments, the Evans et al. proof would not succeed.  Whether this restriction can be made precise and defended remains to be seen.

After constructing the above example, Evans et al. consider scaling down the probabilities in the table and reallocating the deleted probability in a θ-free way.  Such a modification does not affect the ancillarity of x and y.  For instance, consider multiplying each cell probability by 2c/(1+c), where 0≤c≤1, and then inserting the deleted 1-2c/(1+c) probability mass into the lower left cell.  The following table results:
y\x   1       0
1     cθ      c-cθ
0     1-cθ    cθ

relative to the normalizing constant 1+c.  Again, the variable y is ancillary with respect to x and vice versa, because each has an unconditional distribution that is independent of θ: y is Bernoulli with Pr(y=1)=c/(1+c), and x is Bernoulli with Pr(x=1)=1/(1+c). 

Consider the result (x,y)=(1,1) from the experiment above.  According to (C), this result is evidentially equivalent to the result x=1 from the conditional experiment given by y=1, which is Bernoulli(θ), and to the result y=1 from the conditional experiment given by x=1, which is Bernoulli(cθ) for arbitrary 0≤c≤1. 
I’m not sure how to interpret the consequence that outcome x=1 from Bernoulli(θ) is evidentially equivalent to outcome y=1 from Bernoulli(cθ).  I suppose it would apply if one had a choice between flipping Coin A and Coin B, and all one knew about the biases of the two coins was that the bias of Coin B was some particular fraction of that of Coin A.  The result purports to show that a flip of Coin A that lands heads tells you the same thing about the bias of Coin A as a flip of Coin B that lands heads.  Intuitively, this result strikes me as wrong.  Make the fraction very, very small: suppose we know that the bias of Coin B is one trillionth that of Coin A.  It seems that a head on a flip of Coin B would provide very strong evidence that the bias of Coin A is very close to one, while a head on Coin A would not be so telling.  If I’m interpreting correctly the claim that outcome x=1 from Bernoulli(θ) is evidentially equivalent to outcome y=1 from Bernoulli(cθ), and my intuitions in this case are sound, then this example could provide an argument against (C).  Unfortunately, Evans et al. seem to have discussed this result only in an unpublished manuscript.
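
To see what the equivalence claim amounts to in likelihood terms, here is a small sketch of my own with c set to one trillionth.  After heads, the likelihood functions θ and cθ differ only by the constant factor c, so all likelihood ratios, and hence all posteriors computed from a common prior, coincide:

```python
import numpy as np

c = 1e-12                          # Coin B's bias is one trillionth of Coin A's
thetas = np.linspace(0.01, 1.0, 100)

lik_A = thetas                     # likelihood of heads on Coin A: Bernoulli(theta)
lik_B = c * thetas                 # likelihood of heads on Coin B: Bernoulli(c * theta)

# The two likelihood functions differ only by the constant c, so every
# likelihood ratio between parameter values is the same for the two outcomes...
print(np.allclose(lik_A / lik_A[50], lik_B / lik_B[50]))   # True

# ...and a Bayesian updating any common prior gets identical posteriors:
prior = np.full_like(thetas, 1.0 / len(thetas))
post_A = prior * lik_A / np.sum(prior * lik_A)
post_B = prior * lik_B / np.sum(prior * lik_B)
print(np.allclose(post_A, post_B))                         # True
```

The equivalence is thus baked into any likelihood-based account; the intuitive difference in the trillionth-bias case, if real, must lie in something the likelihood function does not capture.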

Evans et al. next move beyond the binary case to construct a more general “discrete embedding model.”  They start with an experiment with probability function f(x; θ).  They then “cross” f(x; θ) with Bernoullis to yield the following joint distribution:
y\x   1              2              ...
1     f(1;θ)         f(2;θ)         ...
0     g(1)-f(1;θ)    g(2)-f(2;θ)    ...

relative to the normalizing constant G=Σg(x), where g(x) must satisfy g(x) ≥ f(x;θ) for all x and θ so that the bottom row is nonnegative.  Again, x and y are mutually ancillary: x has unconditional probability distribution Pr(x=i)=g(i)/G, while y has unconditional distribution Bernoulli(1/G), since the top row sums to Σf(x;θ)/G = 1/G.
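
Here is a sketch of my own of the discrete embedding model for a concrete choice of f and g.  I take f to be Binomial(3,θ) and g ≡ 1 (any g with g(x) ≥ f(x;θ) for all θ would do), and confirm that both marginals are free of θ:

```python
from math import comb

n = 3
xs = range(n + 1)
f = lambda x, theta: comb(n, x) * theta**x * (1 - theta)**(n - x)  # Binomial(3, theta)
g = lambda x: 1.0          # any g with g(x) >= f(x; theta) for all theta works
G = sum(g(x) for x in xs)  # normalizing constant G = 4

for theta in (0.25, 0.7):
    top = {x: f(x, theta) / G for x in xs}               # y = 1 row
    bottom = {x: (g(x) - f(x, theta)) / G for x in xs}   # y = 0 row
    p_x = [top[x] + bottom[x] for x in xs]  # marginal of x: g(x)/G, free of theta
    p_y1 = sum(top.values())                # marginal of y: 1/G, free of theta
    print(theta, [round(p, 4) for p in p_x], round(p_y1, 4))
```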

Evans et al. also consider a continuous embedding model, but that need not concern us here; continuous models are an idealization, so issues that arise only for continuous models are not relevant for practice.

Evans et al. develop a modified version of the discrete embedding model to prove that (C) entails (L).  In my next post, I will discuss that model and the Evans et al. proof.

Thursday, May 19, 2011

Evans et al. Proof Part 1: Overview


Birnbaum proved in 1962 that (S) and (C) entail (L), and in 1964 that (M) and (C) entail (L), where (M) is strictly weaker than (S).  Evans et al. proved in 1986 that (C) alone entails (L).  In this post I’ll explain the Evans et al. proof in a rough, qualitative way.

The clearest formulation of (C) for present purposes is the one Birnbaum provides in his (1972):
                Conditionality (C): If h(x) is an ancillary statistic, then Ev(E,x)=Ev(Eh,x), where h=h(x).
(See the previous post for an explication of the term “ancillary statistic.”)

The Evans et al. proof starts with an arbitrary experimental outcome (E1,x*).  It then constructs a hypothetical outcome (B,h) that has the same likelihood function.  The next step is to construct a cross-embedded experiment that I will call CE(E1,B).  This cross-embedded experiment has two maximal ancillary statistics: the variable of which x* is an instance is ancillary with respect to the variable of which h is an instance, and vice versa.  Thus, it is possible to apply (C) to CE(E1,B) twice to establish that (CE(E1,B),(x*,h)) is evidentially equivalent to both (E1,x*) and (B,h), and thus that (E1,x*) and (B,h) are evidentially equivalent to one another.  The next step is to apply the same trick to an arbitrary experimental outcome (E2,y*) with the same likelihood function as  (E1,x*), using the same hypothetical (B,h).  From the fact that (E1,x*) and (E2,y*) are both evidentially equivalent to (B,h), it follows that they are evidentially equivalent to one another.  Because the only constraint placed on (E1,x*) and (E2,y*) in this construction is that they must have the same likelihood function, the likelihood principle follows immediately.

Allowing text boxes to represent experimental outcomes and lines to represent evidential equivalence (established by (C)), this proof can be represented by the following diagram:

(E1,x*) --- (CE(E1,B),(x*,h)) --- (B,h) --- (CE(E2,B),(y*,h)) --- (E2,y*)

A Clarification of (C)

Birnbaum proves in his (1962) that (S) and (C) jointly entail (L).  He claims that (C) entails (S), which would mean that (C) entails (L) by itself, but he does not prove it.  In his (1964), he clarifies that he can only prove that (C) entails (S) by using a principle (M) that is strictly weaker than (S) but does not follow from (C).  Thus, the strongest result he can claim is that (C) and (M) jointly entail (L).  In their (1986), however, Evans et al. prove that in fact (C) alone does entail (L).  In this post I begin the task of reconstructing the Evans et al. proof.  First I need to clarify a point of obscurity in Birnbaum’s 1962 formulation of (C).

In 1962, Birnbaum formulates (C) as follows:

The Principle of Conditionality (C): If an experiment E is (mathematically equivalent to) a mixture G of components {Eh}, with possible outcomes (Eh, xh), then Ev(E,(Eh, xh)) = Ev(Eh, xh).

The parenthetical phrase “mathematically equivalent to” here turns out to be essential.  (C) applies to any experiment that contains an ancillary statistic.  In a mixture experiment, the outcome of the random process that determines which component experiment to perform is an ancillary statistic.  However, non-mixture experiments can have ancillary statistics as well.  These experiments are “mathematically equivalent to” mixture experiments, but they do not involve an actual two-stage process that consists of first using a random process to choose a component experiment and then performing that component experiment.

In his (1972), Birnbaum formulates the notion of an ancillary statistic as follows:

h = h(x) is called an ancillary statistic if it admits the factored form f(x;θ) = g(h) f(x|h; θ) where g = g(h) = Prob (h(X)=h) is independent of θ.  

In other words, an ancillary statistic is a statistic whose probability distribution is independent of the parameters and that can be factored out of the likelihood function f(x;θ) to yield the conditional likelihood function f(x|h; θ).  A generic mixture experiment yields a simple example.  Suppose you flip a coin to decide whether to perform experiment E1 or E2, where information about the bias of that coin is not informative about the data-generating process in E1 or E2.  There is an unconditional likelihood function f(x;θ) for this mixture experiment as a whole.  However, if a frequentist knows which way the flip turns out, and thus which of E1 or E2 is performed, he or she will typically use the conditional likelihood function f(x|h; θ) that takes this information into account.  He or she will thus neglect the mixture structure of the experiment, acting as if it had been known all along that the experiment actually performed would be performed.
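
Here is a minimal sketch of my own illustrating the factored form for a generic mixture.  A fair coin chooses between a binomial experiment E1 (heads in five flips) and a geometric experiment E2 (flips to the first head); g(h) = 1/2 is free of θ, and the joint likelihood factors as g(h)·f(x|h;θ):

```python
from math import comb, isclose

def f_E1(x, theta):   # E1: number of heads in five flips (binomial)
    return comb(5, x) * theta**x * (1 - theta)**(5 - x)

def f_E2(x, theta):   # E2: number of flips to the first head (geometric)
    return (1 - theta)**(x - 1) * theta

def f_mix(h, x, theta):
    """Joint pmf of the mixture outcome (h, x): a fair coin picks the component."""
    return 0.5 * (f_E1(x, theta) if h == 1 else f_E2(x, theta))

for theta in (0.2, 0.6):
    # g(h) = Prob(h(X) = h) = 1/2 whatever theta is, so h is ancillary:
    g1 = sum(f_mix(1, x, theta) for x in range(6))
    g0 = sum(f_mix(0, x, theta) for x in range(1, 1000))  # geometric tail truncated
    # and the likelihood factors as f((h, x); theta) = g(h) * f(x | h; theta):
    print(theta, round(g1, 6), round(g0, 6),
          isclose(f_mix(1, 2, theta), g1 * f_E1(2, theta)))
```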

In his (1972), Birnbaum formulates (C) in terms of the notion of an ancillary statistic.  He first defines some notation:

(Eh,x) denotes a model of evidence determined by an outcome x of the experiment Eh: (Ω,Sh,fh) where Sh={x: h(x)=h}.  E may be called a mixture experiment, with components Eh having respective probabilities g(h).

He then reformulates (C):

Conditionality (C): If h(x) is an ancillary statistic, then Ev(E,x)=Ev(Eh,x), where h=h(x).

This formulation is not different in substance from Birnbaum’s 1962 formulation; it is merely more explicit that being mathematically equivalent to a mixture experiment means having an ancillary statistic.

Tuesday, May 17, 2011

A Well-Motivated Frequentist Response to Birnbaum's Theorem

I’m giving a “Works in Progress” talk on Friday to explain my current position on frequentist responses to Birnbaum’s proof.  Here is my abstract:
Frequentists appear to be committed to the sufficiency principle (S) and the conditionality principle (C).  However, Birnbaum (1962) proved that (S) and (C) entail the likelihood principle (L), which frequentist methods violate.  To respond adequately to Birnbaum’s theorem, frequentists must place restrictions on (S) and/or (C) that block Birnbaum’s proof and argue that those restrictions are well motivated.  Restricting (C) alone will not suffice, because (S) by itself implies too much of the content of (L) for frequentists to accept it.  Specifically, frequentists need to restrict (S) so that it does not apply to mixture experiments some of whose components have respective outcomes with the same likelihood function.  Berger and Wolpert (1988, p. 46) claim that such a restriction would be artificial, but in fact it has a strong frequentist motivation: reduction to the minimal sufficient statistic in such an experiment throws away information about what sampling distribution is appropriate for frequentist inference. 
I’ll try to explain the basic argument here.  Start with the claim that (S) by itself implies too much of the content of (L) for frequentists to accept it.  Kalbfleisch makes this point in his (1975) as a criticism of Durbin’s proposal to restrict (C) rather than (S).  Consider two experiments E1 and E2.  E1 involves flipping a coin five times and reporting the number of heads.  E2 involves flipping a coin until it comes up heads and reporting the number of flips required.  Suppose that in both experiments the flips are i.i.d. Bernoulli.  Imagine an instance of E1 and E2 in which one gets one head in E1, and five flips in E2, so that both E1 and E2 consist of flipping a coin five times and getting heads once.  The likelihood principle says that these two outcomes have the same evidential meaning.
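
For concreteness, here is a small sketch of my own verifying the proportionality behind that claim: the two likelihood functions, C(5,1)θ(1-θ)^4 and θ(1-θ)^4, stand in the constant ratio 5 at every value of θ:

```python
from math import comb

def lik_binomial(theta):           # E1: one head in five flips
    return comb(5, 1) * theta * (1 - theta)**4

def lik_negative_binomial(theta):  # E2: first head on the fifth flip
    return theta * (1 - theta)**4

for theta in (0.1, 0.3, 0.5, 0.9):
    print(theta, lik_binomial(theta) / lik_negative_binomial(theta))  # always 5.0
```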

The sufficiency principle does not imply that those outcomes of E1 and E2 have the same evidential meaning.  However, it does say that they would have had the same evidential meaning if they had been two outcomes of one experiment rather than outcomes of two different experiments.  So consider a mixture experiment E* that involves first flipping a coin to decide whether to perform E1 or E2 and then performing the selected experiment.  (The coin used to decide which experiment to perform should be distinct from the coin flipped in E1 and E2, with a bias that carries no information about θ.)  According to the sufficiency principle, the outcome that consists of performing experiment E1 and getting one head has the same evidential meaning as the outcome that consists of performing experiment E2 and requiring five tosses when each is performed as part of the mixture experiment E*: within E*, the two outcomes have proportional likelihood functions, so the minimal sufficient statistic maps them to the same value.  To get the result that they also have the same evidential meaning when performed outside of E*, one needs to appeal to something like (C).  (This is essentially how Birnbaum proves that (S) and (C) entail (L).)  However, to go so far but no farther seems rather unreasonable.  As Kalbfleisch puts it, “In order to reject (L) and accept (S) one must attach great importance to the possibility of choosing randomly between E1 and E2” (p. 252).  To avoid adopting this strange position, someone who rejects (L) should reject (S) as well.

The minimal restriction on (S) that blocks Birnbaum’s proof is to modify (S) so that it does not apply to mixture experiments some of whose components have respective outcomes with the same likelihood function.  Berger and Wolpert say that this restriction “seems artificial, there being no intuitive reason to restrict sufficiency to certain types of experiments” (1988, p. 46).  One can flesh out an argument along these lines.  The following argument for (S) seems compelling and completely general:

Conditional on the value of a sufficient statistic, which outcome occurs is independent of the parameters of the experimental model.   Independent variables do not contain information about one another.  Thus, conditional on a sufficient statistic, which outcome occurs does not contain any information about the parameters of the experimental model.  Therefore, the evidential meaning of an experimental outcome is the same as the evidential meaning of an outcome corresponding to the same value of the sufficient statistic. 

Because this argument makes no assumptions about whether the experiment in question is pure or mixed, it would be artificial to restrict (S) to non-mixture experiments.

The problem with this argument (from a frequentist perspective) is that it assumes that the experimental model appropriate for frequentist inference is fixed in advance, regardless of which outcome occurs.  But in a mixture experiment, (C) implies that which experimental model is appropriate for frequentist inference depends on which component experiment is performed.  When the components of the mixture experiment have respective outcomes with the same likelihood function, reduction to the minimal sufficient statistic throws away the information about which component experiment was performed.  Thus, applying (S) to a mixture experiment some of whose components have respective outcomes with the same likelihood function is inappropriate from a frequentist perspective.
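
Here is a sketch of my own illustrating that information loss.  In a discrete experiment, the minimal sufficient statistic identifies outcomes whose likelihood functions are proportional; within E*, the outcome (E1, one head) and the outcome (E2, five flips) have proportional likelihood functions, so the reduction maps them to the same value and forgets which component, and hence which sampling distribution, was realized.  (The θ grid is my own stand-in for the parameter space.)

```python
import numpy as np
from math import comb

thetas = np.linspace(0.05, 0.95, 19)   # a grid standing in for the parameter space

def lik(outcome):
    """Likelihood function (on the theta grid) of one outcome of the mixture E*."""
    component, value = outcome
    if component == 'E1':              # one of 0..5 heads in five flips
        return 0.5 * comb(5, value) * thetas**value * (1 - thetas)**(5 - value)
    else:                              # 'E2': first head on flip number `value`
        return 0.5 * thetas * (1 - thetas)**(value - 1)

def msuf(outcome):
    """Minimal sufficient reduction: outcomes with proportional likelihood
    functions get the same value, so normalize away the constant factor."""
    v = lik(outcome)
    return tuple(np.round(v / v.sum(), 10))

# (E1, one head) and (E2, five flips) are merged by the reduction:
print(msuf(('E1', 1)) == msuf(('E2', 5)))   # True: component identity is lost
```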

I think this is quite a good frequentist response to Berger and Wolpert’s objection.  The challenge for a frequentist is to make the needed restriction on (S) precise in a defensible way.  Berger and Wolpert claim that the distinction between mixture and non-mixture experiments is difficult if not impossible to characterize clearly, suggesting that this challenge will not be easy to meet.  As long as the distinction appears to be real, however, a frequentist need not be bothered too much by difficulties in formulating it precisely.

I have argued that frequentists need to restrict (S).  Fortunately for the frequentist, the needed restriction is well-motivated from a frequentist perspective.  I should note that frequentists may also need to restrict (C).  They certainly do need to do so if Evans, Fraser, and Monette (1986) are correct in their claim that (C) alone implies (L).  I have just begun looking at their paper.  They point out that the fact that seemingly innocuous principles (S) and (C) imply the highly controversial principle (L) should be a clue that there is more to (S) and (C) than meets the eye.  They seem to think that Birnbaum’s way of characterizing experimental models is too simple and that with a more adequate approach (L) would no longer follow from appropriately modified versions of (S) and (C).  It looks like their paper will take some time to digest but will be well worth the effort.