Friday, July 6, 2012

Evidential Equivalence and Truth


In my previous post, I discussed a concern I have about the project I had been planning to pursue in my dissertation.  Briefly, I was planning to defend the Likelihood Principle, which says that certain sets of experimental outcomes are evidentially equivalent.  The Likelihood Principle seems interesting because frequentist methods, the statistical methods most widely used in science, fail to respect those equivalences if the principle is true.  Here's the problem: it's not obvious to me that we ought to insist on using statistical methods that respect evidential equivalence.


One might claim that respecting evidential equivalence is an end in itself.  All I can say in response to this claim is that it is not an end I personally care about in itself.  I care about epistemic goals that involve some sort of correspondence between a proposition and the world, such as truth, truthlikeness, approximate truth, and empirical adequacy.  I care about respecting evidential equivalence, and about other epistemic virtues like coherence that concern only the internal ordering of a set of propositions, only insofar as they help in achieving those goals.


To my mind, then, we ought to insist on using statistical methods that respect evidential equivalence only to the extent that doing so would help us achieve what we might loosely call correspondence goals.


Would insisting on using statistical methods that respect evidential equivalence help us achieve correspondence goals?  As far as I can see, there is no way to answer this question across the board.  Frequentist methods (which fail to respect evidential equivalence) have probability one of achieving a certain level of performance with respect to correspondence goals in the indefinite long run when their assumptions are satisfied, while Bayesian methods (which respect evidential equivalence) provide no such guarantees but rather may perform better or worse than frequentist methods in a given respect on a given problem depending on the priors used and the actual state of affairs.  This comparison suggests that we ought to use Bayesian methods when we are reasonably confident that our priors are good, and use frequentist methods otherwise (or something like that).  That conclusion is not very exciting, and it doesn't give us any interesting story to tell about how respecting evidential equivalence helps us get to the truth.
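
To make this comparison concrete, here is a toy simulation sketch.  It is my own illustration, not anything from the statistics literature: the model is a normal distribution with known variance, and the priors and numbers are invented.  It shows the pattern I have in mind: the frequentist interval delivers its advertised long-run coverage whenever the model's assumptions hold, while the Bayesian credible interval does better or worse depending on how well-placed the prior is.

```python
# A minimal sketch, assuming a normal model with known variance and a conjugate
# normal prior.  The numbers and priors below are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n, trials = 1.0, 1.0, 10, 20_000
z = 1.96  # two-sided 95%

freq_hits = bayes_good_hits = bayes_bad_hits = 0
for _ in range(trials):
    x = rng.normal(true_mu, sigma, n)
    xbar, se = x.mean(), sigma / np.sqrt(n)

    # Frequentist 95% interval: covers true_mu about 95% of the time
    # in the long run whenever the model is correct.
    freq_hits += abs(xbar - true_mu) < z * se

    def credible_hit(m0, tau):
        # 95% credible interval from a N(m0, tau^2) prior (conjugate update).
        post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
        post_mean = post_var * (m0 / tau**2 + n * xbar / sigma**2)
        return abs(post_mean - true_mu) < z * np.sqrt(post_var)

    bayes_good_hits += credible_hit(m0=1.0, tau=0.5)   # prior centered near the truth
    bayes_bad_hits += credible_hit(m0=-3.0, tau=0.5)   # prior centered far from the truth

print("frequentist coverage:        ", freq_hits / trials)        # ~0.95
print("Bayesian, well-placed prior: ", bayes_good_hits / trials)  # above 0.95 here
print("Bayesian, badly placed prior:", bayes_bad_hits / trials)   # far below 0.95
```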

17 comments:

  1. "Frequentist methods (which fail to respect evidential equivalence) have probability one of achieving a certain level of performance with respect to correspondence goals in the indefinite long run when their assumptions are satisfied ... This comparison suggests that we ought to use Bayesian methods when we are reasonably confident that our priors are good, and use frequentist methods otherwise (or something like that)."

    Sure it's true that Frequentist methods have probability 1 of achieving something or other, depending on exactly how their assumptions are specified, but I think that's actually not true on most ways of setting up Frequentist procedures and, on the readings on which it is true, it's really not saying much. For example, it's not true (contra Neyman) that a Frequentist statistician will get 95% (or whatever) of their results right over their lifetime unless one of their assumptions is that they have no "prior" (I hate that word) information.

    So I guess I'd put a lot of emphasis on the "(or something like that)" and not take much notice of the formal properties of Frequentist methods as an argument for using them when Bayesian methods fail.

    Personally I don't know WHAT we should do when Bayesian methods fail, but that's OK. Epistemic humility FTW.


    "That conclusion is not very exciting, and it doesn't give us any interesting story to tell about how respecting evidential equivalence helps us get to the truth."

    I'm not sure exactly what you want here. When the Bayesian framework applies, it's guaranteed to get you the most rational answer, right? Unlike some other LP enthusiasts, I agree with your implication that the Bayesian framework doesn't always apply. But it's important not to be overambitious in a PhD thesis! I'd strongly advise against trying to simultaneously write a good PhD on the LP *AND* solve the very general problem of what to do when it's not applicable. I'm not saying you should support the LP if you don't believe in it (we'll talk about that another time); I'm saying that you should just say whatever you think about the LP, get your PhD, and then solve all the problems of induction later!

    Jason

  2. "Sure it's true that Frequentist methods have probability 1 of achieving something or other, depending on exactly how their assumptions are specified, but I think that's actually not true on most ways of setting up Frequentist procedures and, on the readings on which it is true, it's really not saying much."

    I would like to know what you mean by "that's actually not true on most ways of setting up Frequentist procedures," but I don’t, so I'll focus on your point that when it is true that frequentist methods achieve something or other with respect to performance with probability one, it's really not saying much. This is a very good and important point that I intended to mention but didn’t.

    At one point I had written something like the following: It’s not the case that frequentist methods are “safe” in the sense that they are bound to do pretty well on average, while Bayesian methods are “risky.” There are situations in which frequentist methods perform very, very badly. For instance, in typical cases the supremum of the probability of getting a wrong answer from a frequentist null hypothesis significance test is 100(1-α)%--far worse than flipping a coin. (Of course, one could point out that well-designed tests only perform badly in this sense for deviations from the null hypothesis that are small enough not to matter much. My point is just that the comparison between the performance characteristics of frequentist methods and those of Bayesian methods is not as simple as “frequentist methods are guaranteed to perform pretty well, while Bayesian methods might perform better or worse depending on how good the priors are.”)
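
    To put a toy number on that worst case (my own illustration, with invented parameter values): for a two-sided z-test of a point null at level α = 0.05, the probability of the wrong answer, i.e. of failing to reject a false null, climbs toward 1 - α = 0.95 as the true effect shrinks toward zero.

```python
# Power of the two-sided level-alpha z-test, computed analytically; the
# parameter values here are illustrative only.
import numpy as np
from scipy.stats import norm

alpha, sigma, n = 0.05, 1.0, 25
z_crit = norm.ppf(1 - alpha / 2)

for true_mu in [0.5, 0.1, 0.01, 0.001]:
    shift = true_mu * np.sqrt(n) / sigma
    power = norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)
    # "Wrong answer" = failing to reject the (false) null hypothesis mu = 0.
    print(f"true mu = {true_mu:>6}: P(wrong answer) = {1 - power:.3f}")
```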

    "For example, it's not true (contra Neyman) that a Frequentist statistician will get 95% (or whatever) of their results right over their lifetime unless one of their assumptions is that they have no "prior" (I hate that word) information."

    What did you have in mind when you wrote "unless one of their assumptions is that they have no 'prior'...information"?

    Certainly it's not the case that Frequentists who always do level α tests are expected to get 100(1-α)% of their results right over their lifetimes. If they study small effects, then their expected success rate might be very low indeed. It is the case that they are expected to reject true nulls only 100α% of the time and accept nulls only 100β% of the time when the alternative hypothesis used to calculate β is true.

    I agree with you that that's not saying much, especially when one considers that null hypotheses are perhaps never literally true, many studies are underpowered, many (most?) scientists violate predesignation and other rules that are important on a behavioristic view of frequentist methods, and the publication process functions to a large extent as a filter for statistical significance.

    It is at least saying something that we can agree on. But maybe it’s saying so little that we shouldn’t really care about it even if we all agree on it.

    I'm no great fan of frequentist methods. The point I intend to be making is simply that it's not clear to me that respecting evidential equivalence is related in any straightforward way to performance, and performance is what I care about. Thus, it’s not clear to me that the fact that frequentist methods violate the Likelihood Principle is a good objection to those methods.

    “When the Bayesian framework applies, it's guaranteed to get you the most rational answer, right?”

    I have yet to be convinced that Bayesian updating is uniquely rational in any sense. I’m with de Finetti here--I don’t see why it would be irrational to “repent” of one’s previous degrees of belief and start fresh with new ones.

    In addition, it’s not clear to me why we should be rational unless being rational will help us get closer to the truth (or approximate truth, etc.).

  3. “But it's important not to be overambitious in a PhD thesis! I'd strongly advise against trying to simultaneously write a good PhD on the LP *AND* solve the very general problem of what to do when it's not applicable. I'm not saying you should support the LP if you don't believe in it (we'll talk about that another time); I'm saying that you should just say whatever you think about the LP, get your PhD, and then solve all the problems of induction later!”

    Point taken. Right now, though, I’m trying to work through a question that a dissertation on the LP should surely answer once it is seen as a live question: is the LP a good argument against frequentist methods?

  4. "is the LP a good argument against frequentist methods?"

    I guess it seems clear to me, through examples like Cox's example and a whole bunch of other examples from Berger and Wolpert, that evidential considerations are plenty to support the LP, at least in my formulation, which as you know assumes that we have to use a roughly standard statistical framework (hypothesis space and sample space) and also makes a bunch of other assumptions. You may simply not agree with me about the force of the examples and the associated arguments. Or you may think, as I do, that there may be even better methods which we haven't discovered yet, which go outside the framework I set up for the LP. I'd be very happy with the latter of those two alternatives! I think it's useful to keep them separate. I don't claim to have given any completely general argument for evidentialism; but I do think I've shown (and not for the first time!) that it's better than any of the existing alternatives.

    More about this in my other reply which blogspot rejected for being too long!

    Replies
    1. This comment has been removed by the author.

    2. I also find examples like Cox's and Berger and Wolpert's persuasive. But I'm trying to resist the urge to settle for persuasive and instead to focus on the question of how following the Likelihood Principle would help us get correct answers to our scientific questions. I don't see how it would do so except in cases in which Bayesian optimality results really mean something, i.e. when priors are well-grounded in some sense as opposed to merely reflecting someone's fair betting odds.

      Frequentist methods at least can be proven to have certain truth-related virtues without imposing a prior probability distribution where none is given. Now, those virtues are rather weak. Are they weak enough that we shouldn't care about them even when our only alternative is to impose a prior probability distribution where none is given? I'm not sure about that.

      Consider, for instance, the search for the Higgs boson. That kind of case seems to me the best candidate for the use of frequentist methods. It doesn't seem to me at all reasonable to use subjective priors to settle questions in fundamental physics, even with a robustness analysis. See Larry Wasserman's interesting post on this topic (http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/).

  5. Jason sent me additional comments by email that were too long to post here. I'm going to break them up and post them here with my replies.


    Jason:
    'I would like to know what you mean by "that's actually not true on most ways of setting up Frequentist procedures," but I don’t'

    Sorry for not being clear.

    I said this "depending on exactly how their assumptions are specified". I often see the assumptions under-specified, so that the Frequentist claim (often, not always) is that e.g. 95% of a statistician's confidence intervals will contain the true value, in the long run. This is clearly not true, for many reasons including the fact that it omits "prior" knowledge that they may have.

    Me:
    Sorry, I'm still not following. A 100(1-α)% confidence interval for θ is an interval constructed by a method that has probability 100(1-α)% of containing the true value of θ when its assumptions are satisfied. So by the strong law of large numbers, 100(1-α)% of the 100(1-α)% confidence intervals a statistician constructs will contain the true value of the relevant parameter in the long run, assuming that the assumptions of the methods used to construct those confidence intervals are satisfied. Is there something wrong with what I've just said, or are you making a different point?

    One should not claim that the probability is 100(1-α)% that a particular 100(1-α)% confidence interval contains the true value of the parameter in question, among other reasons because one might have information about the parameter not taken into account in constructing the confidence interval. But that doesn't affect the claim that 100(1-α)% of the 100(1-α)% confidence intervals a statistician constructs will contain the true value of the relevant parameter in the long run as far as I can see. Of course, "in the long run" is doing a lot of work here.
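
    For what it's worth, here is the kind of quick check I have in mind (a sketch under my own invented assumptions: i.i.d. normal data, unknown variance, the standard 95% t-interval). The long-run fraction of intervals containing the true mean settles at the nominal level, whatever anyone happens to know about the parameter from elsewhere.

```python
# Empirical long-run coverage of the textbook 95% t-interval; numbers invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sd, n, reps = 2.0, 3.0, 15, 50_000
crit = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sd, n)
    half = crit * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half < mu < x.mean() + half)

print(covered / reps)  # close to 0.95, as the strong law of large numbers predicts
```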

  6. Jason:
    'It is the case that they are expected to reject true nulls only 100α% of the time and accept nulls only 100β% of the time when the alternative hypothesis used to calculate β is true.'

    Yes, we agree, that's true, if you just look at null hypothesis testing with a point null and a point alternative. As you go on to say, there are many reasons why that's often a bad idea, including all the reasons you give and also Lindley's paradox. And if you look at confidence intervals instead then you get the situation I discuss above, where I can't see any useful claim the Frequentist can make in the presence of "prior" information.

    Me:
    I agree that null hypothesis testing with a point null and a point alternative is usually a bad idea for many reasons.

    I need to think more about Lindley's paradox. Could a frequentist claim that it begs the question by presupposing an evidentialist perspective?

    I'm not sure what you have in mind about confidence intervals. Maybe your point is this: the fact that 100(1-α)% of the 100(1-α)% confidence intervals a statistician constructs will contain the true value of the relevant parameter in the long run is not a useful statement when we are evaluating some particular confidence interval(s) that statistician has constructed so far, and it is clearly misleading when Bayesian methods apply.

    This is a very difficult point that's absolutely vital to the point of view I'm tempted to take, according to which we ought to think about the epistemology of science in terms of the reliability of the methods used rather than in terms of an evidential relation among data, hypotheses, and background knowledge, with no reference to methods. I won't try to address it here, but I will discuss it in the future.

    Replies
    1. I should say that one of the main reasons that null hypothesis testing with a point null and a point alternative is usually a bad idea is that it is a response to a question to which we usually already know the answer, namely "Is the point null hypothesis literally true?" Almost always, we can be sure in advance that the answer is no. (Many would say "always" instead of "almost always." But what about an ESP study? Or a test for stellar parallax at a time when Tycho Brahe's astronomical theory was still in play?)

    2. Follow-up from Jason which wouldn't post correctly:

      > I need to think more about Lindley's paradox. Could a frequentist claim that it begs the question by presupposing an evidentialist perspective?

      If you're NOT worried about Lindley's paradox, think about real examples, e.g. very large drug trials using point null hypotheses. If they're large enough (which these days they tend not to be, because of stopping rules, but there are exceptions) they're always going to come out positive. You don't really see any important questions being begged there, do you?


      > A 100(1-α)% confidence interval for θ is an interval constructed by a method that has probability 100(1-α)% of containing the true value of θ when its assumptions are satisfied. So by the strong law of large numbers, 100(1-α)% of the 100(1-α)% confidence intervals a statistician constructs will contain the true value of the relevant parameter in the long run, assuming that the assumptions of the methods used to construct those confidence intervals are satisfied.

      When you talk about "assuming that the assumptions of the methods used to construct those confidence intervals are satisfied", are those assumptions meant to include the idea of having no "prior" information, i.e. information not captured by the mathematical model? If yes, we're not arguing (but then they're not often very useful assumptions, are they?). If no, great! You can generate a lot of confidence intervals using any model you like, but with me having some information not captured in your model, we can place bets which would be fair if 95% of your intervals contained the true value of a parameter, and I can win a lot of money!

      Again, think about real cases, e.g. drug trials analysed using confidence intervals which ignore what we actually know about the probable effectiveness of the drugs (as they always are, apart from sample size calculations). It may be sensible to use such models, but they're certainly not going to generate confidence intervals which contain the true value 95% of the time.

    3. My replies:

      > If you're NOT worried about Lindley's paradox, think about real examples, e.g. very large drug trials using point null hypotheses. If they're large enough (which these days they tend not to be, because of stopping rules, but there are exceptions) they're always going to come out positive. You don't really see any important questions being begged there, do you?

      I'm not sure I understand what you're getting at here. Is it that a large enough trial will generate a statistically significant result even if the true effect size is tiny? That's certainly true, and is a good reason to emphasize estimates of effect sizes rather than outcomes of significance tests.
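
      Here is a toy calculation of that point (a sketch with invented numbers, not any actual trial): for a two-arm comparison with a negligible true difference, the probability of a statistically significant result goes to one as the sample size grows.

```python
# Probability of a significant two-sided z-test result for a tiny true
# difference between two arms, as a function of sample size.  Illustrative only.
import numpy as np
from scipy.stats import norm

sigma, true_diff, alpha = 1.0, 0.01, 0.05   # a practically negligible difference
z_crit = norm.ppf(1 - alpha / 2)

for n_per_arm in [1_000, 100_000, 10_000_000]:
    se = sigma * np.sqrt(2 / n_per_arm)
    power = norm.sf(z_crit - true_diff / se) + norm.cdf(-z_crit - true_diff / se)
    print(f"n per arm = {n_per_arm:>10,}: P(significant result) = {power:.3f}")
```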

      >When you talk about "assuming that the assumptions of the methods used to construct those confidence intervals are satisfied", are those assumptions meant to include the idea of having no "prior" information, i.e. information not captured by the mathematical model? If yes, we're not arguing (but then they're not often very useful assumptions, are they?). If no, great! You can generate a lot of confidence intervals using any model you like, but with me having some information not captured in your model, we can place bets which would be fair if 95% of your intervals contained the true value of a parameter, and I can win a lot of money!

      >Again, think about real cases, e.g. drug trials analysed using confidence intervals which ignore what we actually know about the probable effectiveness of the drugs (as they always are, apart from sample size calculations). It may be sensible to use such models, but they're certainly not going to generate confidence intervals which contain the true value 95% of the time.

      I don't mean to argue that we should ignore genuine prior information! My present concern is simply whether the fact that frequentist methods violate the Likelihood Principle is a good argument against those methods. It doesn't seem relevant to Neyman-style views according to which we should think of frequentist procedures as decision rules justified by their long-run operating characteristics. I'm tempted to think that those views aren't crazy despite being rather unpopular. That being said, I absolutely would use a Bayesian method if the priors were more or less given.

      I don't see (maybe out of ignorance) why models that omit prior information wouldn't generate confidence intervals that contain the true value of the mean 95% of the time. I certainly wouldn't want to consider a bet on a particular confidence interval as if I had a personal probability of 95% that that confidence interval contains the true mean, but that's just because of the familiar point that Pr(A(X)<μ<B(X)|μ) is not the same as Pr(A(x)<μ<B(x)|μ) where x is a particular value of the random variable X.
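
      A standard textbook toy example, not one either of us has mentioned, may make the pre-data/post-data distinction vivid: take two observations from Uniform(θ - 0.5, θ + 0.5) and report the interval from the smaller to the larger. Its pre-data coverage probability is exactly 50%, yet someone who conditions on how far apart the two observations are can bet against that 50% rate and win.

```python
# Pre-data coverage vs. coverage within recognizable subsets of the data.
# Classic uniform-location example; theta and the number of replications
# are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(2)
theta, reps = 0.0, 200_000
x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, 2))
lo, hi = x.min(axis=1), x.max(axis=1)
covers = (lo < theta) & (theta < hi)
wide = (hi - lo) > 0.5

print("unconditional coverage:              ", covers.mean())         # ~0.50
print("coverage when the interval is wide:  ", covers[wide].mean())   # exactly 1
print("coverage when the interval is narrow:", covers[~wide].mean())  # ~1/3
```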

    4. Another reply from Jason:

      > what about an ESP study? Or a test for stellar parallax at a time when Tycho Brahe's astronomical theory was still in play?

      In those cases, if we have to use a point null hypothesis (and I'm ambivalent about whether we should), we should compare it to a Bayesian model with a lump of prior on the null. And then we get some results which are rather shockingly bad for the Frequentists ... admittedly from an evidential point of view, so this is off the topic you've been pursuing, but still the results are fascinating in their own right. See the attached if you don't already know it. [Refers to Sellke, Bayarri, and Berger 2001] (http://www.stat.duke.edu/courses/Spring10/sta122/Labs/Lab6.pdf)

      My reply:
      I'll need to re-read this paper and get back to you. Thanks!

  7. Jason:
    '[That null hypothesis significance tests are expected to reject true nulls only 100α% of the time and accept nulls only 100β% of the time when the alternative hypothesis used to calculate β is true] is at least saying something that we can agree on. But maybe it’s saying so little that we shouldn’t really care about it even if we all agree on it.'

    That's exactly my view too, in most cases. I think there are at least SOME cases in which we should care about Frequentist analyses, but they're not the typical ones. Hacking gives some examples in his 1975 (?) book, and I suspect there are many more cases. As far as I can see so far, these are cases in which Frequentist analyses and Bayesian analyses more or less agree anyway. Unfortunately, the maths gets too complicated for me to be able to nicely delineate exactly which cases these are. One day maybe.

    Me:
    I'll need to think more about these points to have anything interesting to say about them.

  8. Jason:
    'I'm no great fan of frequentist methods. The point I intend to be making is simply that it's not clear to me that respecting evidential equivalence is related in any straightforward way to performance, and performance is what I care about.'

    Bayesians and Frequentists define 'performance' differently. That's meant to be the main message of my manuscript, and I do say this in the prologue, although I probably don't make it clear enough. It's fine with me if you want to reject both definitions. If you can give a plausible alternative definition, that will be great (really, not sarcastic here! - you'd be a great person to do this). But in the meantime, I think Bayesian or evidential definitions of performance are much more important than Frequentist definitions. There is no neutral definition that I know of. Is there?

    Me:
    Do you have in mind the distinction between Frequentism and factualism? Speaking loosely, Frequentists choose among possible methods by considering the performance of those methods when certain hypotheses are true, averaging over possible data; whereas factualists hold the data fixed and try to come up with a method to compare hypotheses.

    This distinction by itself doesn't seem to give us two different definitions of performance. Some notion of performance needs to be fed into the frequentist account before it can be used. The factualist account doesn't refer directly to performance at all.

    The most straightforward notion of performance is expected gain/loss as measured by a loss function appropriate to the problem at hand. The problem with this notion is that it requires a probability distribution over the hypothesis space, which frequentists don’t allow themselves.

    Frequentists use rather complicated alternative notions of performance that involve some admittedly ad hoc elements. Cases in which a uniformly most powerful test is available are the simplest. In those cases, the notion of performance frequentists use reduces to rejecting the null hypothesis with as high a probability as possible when it is false, subject to a given upper bound on the probability of rejecting it when it is true.
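
    To illustrate that notion with a toy case (my own example, assuming normal data with known variance, where the one-sided z-test is uniformly most powerful for testing μ ≤ 0 against μ > 0): both tests below have size α, but the one that uses all of the data rejects the false null more often at every alternative.

```python
# Comparing the power of the level-alpha one-sided z-test using all n
# observations with a valid but wasteful level-alpha test using only half of
# them.  All numbers are illustrative.
import numpy as np
from scipy.stats import norm

alpha, sigma, n = 0.05, 1.0, 20
crit = norm.ppf(1 - alpha)

def power_ztest(mu, m):
    # Power at alternative mu of the one-sided z-test based on m observations.
    return norm.sf(crit - mu * np.sqrt(m) / sigma)

for mu in [0.2, 0.5, 0.8]:
    print(f"mu = {mu}: power(all data) = {power_ztest(mu, n):.3f}, "
          f"power(half the data) = {power_ztest(mu, n // 2):.3f}")
```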

    Once we fix a factualist method we can ask about its performance according to various notions of performance. All of the factualist methods of which I am aware are either likelihoodist or Bayesian. Likelihoodist “methods” are just proposed conceptual analyses of notions like “data D favors hypothesis H_1 over hypothesis H_2 to degree X” for which I have no use. There doesn’t seem to be any sensible way to measure the performance of such methods. If they are correct, then they output tautologies. If not, then they output contradictions. I’m inclined to say that there is no fact of the matter about whether they are correct or not. In any case, I don’t see what they’re good for even if they are correct.

    One could try to reinterpret likelihoodist methods, e.g. by using the principle that one ought to believe the claim that is best supported by one’s data. That’s just maximum likelihood estimation, which sometimes yields sensible results and sometimes yields crazy results.

    A Bayesian procedure just is the procedure that has the lowest expected loss as measured by a specified loss function under a given prior probability distribution. I agree that Bayesian procedures are the way to go when an unproblematic prior probability distribution is available. But an unproblematic prior probability distribution usually is not available in science.
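
    To spell out that sense of "lowest expected loss" with a toy conjugate example (my own invented numbers): when parameters really are drawn from the assumed prior, the posterior mean beats the maximum likelihood estimate on average under squared-error loss.

```python
# Average squared-error loss of the Bayes estimator (posterior mean) versus the
# MLE when the prior used by the Bayesian is in fact correct.  Illustrative only.
import numpy as np

rng = np.random.default_rng(3)
sigma, tau, n, reps = 1.0, 0.5, 5, 50_000    # data sd, prior sd, sample size

mu = rng.normal(0.0, tau, reps)              # parameters drawn from the prior N(0, tau^2)
xbar = rng.normal(mu, sigma / np.sqrt(n))    # sample mean for each simulated data set

w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
bayes = w * xbar                             # posterior mean (shrinks toward the prior mean, 0)
mle = xbar                                   # maximum likelihood estimate

print("average loss, posterior mean:", np.mean((bayes - mu) ** 2))  # lower
print("average loss, MLE:           ", np.mean((mle - mu) ** 2))    # higher (= sigma^2/n)
```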

    Replies
    1. Continued from the previous comment:

      The big question is what to do when an unproblematic prior probability distribution is not available. The smaller question I am currently considering is whether arguments for the Likelihood Principle are good arguments against the use of frequentist methods in such cases. I don’t see how they could be because they beg the question when formulated in terms of what methods one ought to use and are too weak when formulated in terms of evidential equivalence.

      I’m afraid I may have gotten away from the point you wanted to make, which I don’t quite understand. It seems to me that most of the parties to this debate agree that expected loss is typically the most useful notion of performance when it applies. According to extreme Bayesians, it always applies. According to frequentists, it applies only rarely in science. When it doesn’t apply, frequentists have to use other notions, which admittedly end up being somewhat ad hoc. But what else is one to do once one has chosen not to use prior probabilities?

  9. Jason:
    'I have yet to be convinced that Bayesian updating is uniquely rational in any sense.'

    I think it's rational in its own terms. Sounds like we don't quite agree about that, but it doesn't matter because we DO agree that there may be even better forms of rationality. I don't know what they are, though.

    Me:
    What do you mean by "rational in its own terms"?

    I regard Bayesian updating as a method to be evaluated according to its usefulness in helping us find the truth. The usual arguments for it proceed from axioms about rationality or evidential meaning that are admittedly highly plausible intuitively but have nothing directly to do with finding the truth.

  10. Jason:
    'In addition, it’s not clear to me why we should be rational unless being rational will help us get closer to the truth (or approximate truth, etc.)'

    Sorry, but I don't think the 'etc.' is good enough! The point of statistics is to operationalise this sort of idea, and it's only once you've picked an operationalisation that you're being really clear about what your standard of rationality is. I agree that the best standard may not be Bayesianism (because I think it may not be any of the existing ones). I think that if you cash out fully what you're trying to get at by talking about truth, you'll have a theory of statistics. Go for it! It will be interesting to see whether it's Bayesian or not. It may not be. But I don't think you'll end up with a non-statistical standard with which to compare statistical theories. If it's non-statistical then it's too vague. Or so I predict. :-)

    Me:
    I think this is a great comment. I'm not sure right now how to respond.

    I can elaborate a bit on what I was trying to say. All of the epistemic virtues I care about for their own sakes are "external" virtues that involve the achievement of some kind of "correctness" relation between a belief state and the world. Truth is the strongest such relation, involving exact matching between the belief state and the world. (I don't want to get into deep questions about the nature of truth here, which I don't think would add anything to this discussion.) I also care about approximate truth, truthlikeness, empirical adequacy, approximate empirical adequacy, and so on. I'm not inclined to care about "internal" epistemic virtues that only concern the relationships among one's beliefs either at a time or over time, such as logical consistency; synchronic and diachronic probabilistic coherence; and respecting evidential equivalence, except insofar as these virtues aid in the pursuit of an epistemic virtue involving a correctness relation. Many philosophers are very concerned with internal epistemic virtues, so perhaps I'm either idiosyncratic in this respect or there are strong connections between internal epistemic virtues and external epistemic virtues of which I am not aware.

    Obviously this is all rather hand-wavey and unsatisfactory as serious philosophy. That's ok, I think---it's just a blog!
