Friday, July 6, 2012

Evidential Equivalence and Truth


In my previous post, I discussed a concern I have about the project I had been planning to pursue in my dissertation.  Briefly, I was planning to defend the Likelihood Principle, which says that certain sets of experimental outcomes are evidentially equivalent.  The Likelihood Principle seems interesting because frequentist methods, the statistical methods most widely used in science, fail to respect those equivalences if the principle is true.  Here's the problem: it's not obvious to me that we ought to insist on using statistical methods that respect evidential equivalence.


One might claim that respecting evidential equivalence is an end in itself.  All I can say in response to this claim is that it is not an end I personally care about in itself.  I care about epistemic goals that involve some sort of correspondence between a proposition and the world, such as truth, truthlikeness, approximate truth, and empirical adequacy.  I care about respecting evidential equivalence, and about other epistemic virtues like coherence that concern only the internal ordering of a set of propositions, only insofar as they help in achieving those goals.


To my mind, then, we ought to insist on using statistical methods that respect evidential equivalence only to the extent that doing so would help us achieve what we might loosely call correspondence goals.


Would insisting on using statistical methods that respect evidential equivalence help us achieve correspondence goals?  As far as I can see, there is no way to answer this question across the board.  Frequentist methods (which fail to respect evidential equivalence) have probability one of achieving a certain level of performance with respect to correspondence goals in the indefinite long run when their assumptions are satisfied, while Bayesian methods (which respect evidential equivalence) provide no such guarantees but rather may perform better or worse than frequentist methods in a given respect on a given problem depending on the priors used and the actual state of affairs.  This comparison suggests that we ought to use Bayesian methods when we are reasonably confident that our priors are good, and use frequentist methods otherwise (or something like that).  That conclusion is not very exciting, and it doesn't give us any interesting story to tell about how respecting evidential equivalence helps us get to the truth.
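
To make this comparison concrete, here is a toy simulation sketch.  It is my own illustration, not anything from the statistics literature: the model is a normal distribution with known variance, and the priors and numbers are invented.  It shows the pattern I have in mind: the frequentist interval delivers its advertised long-run coverage whenever the model's assumptions hold, while the Bayesian credible interval does better or worse depending on how well-placed the prior is.

```python
# A minimal sketch, assuming a normal model with known variance and a conjugate
# normal prior.  The numbers and priors below are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)
true_mu, sigma, n, trials = 1.0, 1.0, 10, 20_000
z = 1.96  # two-sided 95%

freq_hits = bayes_good_hits = bayes_bad_hits = 0
for _ in range(trials):
    x = rng.normal(true_mu, sigma, n)
    xbar, se = x.mean(), sigma / np.sqrt(n)

    # Frequentist 95% interval: covers true_mu about 95% of the time
    # in the long run whenever the model is correct.
    freq_hits += abs(xbar - true_mu) < z * se

    def credible_hit(m0, tau):
        # 95% credible interval from a N(m0, tau^2) prior (conjugate update).
        post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
        post_mean = post_var * (m0 / tau**2 + n * xbar / sigma**2)
        return abs(post_mean - true_mu) < z * np.sqrt(post_var)

    bayes_good_hits += credible_hit(m0=1.0, tau=0.5)   # prior centered near the truth
    bayes_bad_hits += credible_hit(m0=-3.0, tau=0.5)   # prior centered far from the truth

print("frequentist coverage:        ", freq_hits / trials)        # ~0.95
print("Bayesian, well-placed prior: ", bayes_good_hits / trials)  # above 0.95 here
print("Bayesian, badly placed prior:", bayes_bad_hits / trials)   # far below 0.95
```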

17 comments:

  1. "Frequentist methods (which fail to respect evidential equivalence) have probability one of achieving a certain level of performance with respect to correspondence goals in the indefinite long run when their assumptions are satisfied ... This comparison suggests that we ought to use Bayesian methods when we are reasonably confident that our priors are good, and use frequentist methods otherwise (or something like that)."

    Sure it's true that Frequentist methods have probability 1 of achieving something or other, depending on exactly how their assumptions are specified, but I think that's actually not true on most ways of setting up Frequentist procedures and, on the readings on which it is true, it's really not saying much. For example, it's not true (contra Neyman) that a Frequentist statistician will get 95% (or whatever) of their results right over their lifetime unless one of their assumptions is that they have no "prior" (I hate that word) information.

    So I guess I'd put a lot of emphasis on the "(or something like that)" and not take much notice of the formal properties of Frequentist methods as an argument for using them when Bayesian methods fail.

    Personally I don't know WHAT we should do when Bayesian methods fail, but that's OK. Epistemic humility FTW.


    "That conclusion is not very exciting, and it doesn't give us any interesting story to tell about how respecting evidential equivalence helps us get to the truth."

    I'm not sure exactly what you want here. When the Bayesian framework applies, it's guaranteed to get you the most rational answer, right? Unlike some other LP enthusiasts, I agree with your implication that the Bayesian framework doesn't always apply. But it's important not to be overambitious in a PhD thesis! I'd strongly advise against trying to simultaneously write a good PhD on the LP *AND* solve the very general problem of what to do when it's not applicable. I'm not saying you should support the LP if you don't believe in it (we'll talk about that another time); I'm saying that you should just say whatever you think about the LP, get your PhD, and then solve all the problems of induction later!

    Jason

  2. "Sure it's true that Frequentist methods have probability 1 of achieving something or other, depending on exactly how their assumptions are specified, but I think that's actually not true on most ways of setting up Frequentist procedures and, on the readings on which it is true, it's really not saying much."

    I would like to know what you mean by "that's actually not true on most ways of setting up Frequentist procedures," but I don’t, so I'll focus on your point that when it is true that frequentist methods achieve something or other with respect to performance with probability one, it's really not saying much. This is a very good and important point that I intended to mention but didn’t.

    At one point I had written something like the following: It’s not the case that frequentist methods are “safe” in the sense that they are bound to do pretty well on average, while Bayesian methods are “risky.” There are situations in which frequentist methods perform very, very badly. For instance, in typical cases the supremum of the probability of getting a wrong answer from a frequentist null hypothesis significance test is 100(1-α)%--far worse than flipping a coin. (Of course, one could point out that well-designed tests only perform badly in this sense for deviations from the null hypothesis that are small enough not to matter much. My point is just that the comparison between the performance characteristics of frequentist methods and those of Bayesian methods is not as simple as “frequentist methods are guaranteed to perform pretty well, while Bayesian methods might perform better or worse depending on how good the priors are.”)
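
    To put a toy number on that worst case (my own illustration, with invented parameter values): for a two-sided z-test of a point null at level α = 0.05, the probability of the wrong answer, i.e. of failing to reject a false null, climbs toward 1 - α = 0.95 as the true effect shrinks toward zero.

```python
# Power of the two-sided level-alpha z-test, computed analytically; the
# parameter values here are illustrative only.
import numpy as np
from scipy.stats import norm

alpha, sigma, n = 0.05, 1.0, 25
z_crit = norm.ppf(1 - alpha / 2)

for true_mu in [0.5, 0.1, 0.01, 0.001]:
    shift = true_mu * np.sqrt(n) / sigma
    power = norm.sf(z_crit - shift) + norm.cdf(-z_crit - shift)
    # "Wrong answer" = failing to reject the (false) null hypothesis mu = 0.
    print(f"true mu = {true_mu:>6}: P(wrong answer) = {1 - power:.3f}")
```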

    "For example, it's not true (contra Neyman) that a Frequentist statistician will get 95% (or whatever) of their results right over their lifetime unless one of their assumptions is that they have no "prior" (I hate that word) information."

    What did you have in mind when you wrote "unless one of their assumptions is that they have no 'prior'...information"?

    Certainly it's not the case that Frequentists who always do level α tests are expected to get 100(1-α)% of their results right over their lifetimes. If they study small effects, then their expected success rate might be very low indeed. It is the case that they are expected to reject true nulls only 100α% of the time and accept nulls only 100β% of the time when the alternative hypothesis used to calculate β is true.

    I agree with you that that's not saying much, especially when one considers that null hypotheses are perhaps never literally true, many studies are underpowered, many (most?) scientists violate predesignation and other rules that are important on a behavioristic view of frequentist methods, and the publication process functions to a large extent as a filter for statistical significance.

    It is at least saying something that we can agree on. But maybe it’s saying so little that we shouldn’t really care about it even if we all agree on it.

    I'm no great fan of frequentist methods. The point I intend to be making is simply that it's not clear to me that respecting evidential equivalence is related in any straightforward way to performance, and performance is what I care about. Thus, it’s not clear to me that the fact that frequentist methods violate the Likelihood Principle is a good objection to those methods.

    “When the Bayesian framework applies, it's guaranteed to get you the most rational answer, right?”

    I have yet to be convinced that Bayesian updating is uniquely rational in any sense. I’m with de Finetti here--I don’t see why it would be irrational to “repent” of one’s previous degrees of belief and start fresh with new ones.

    In addition, it’s not clear to me why we should be rational unless being rational will help us get closer to the truth (or approximate truth, etc.).

  3. “But it's important not to be overambitious in a PhD thesis! I'd strongly advise against trying to simultaneously write a good PhD on the LP *AND* solve the very general problem of what to do when it's not applicable. I'm not saying you should support the LP if you don't believe in it (we'll talk about that another time); I'm saying that you should just say whatever you think about the LP, get your PhD, and then solve all the problems of induction later!”

    Point taken. Right now, though, I’m trying to work through a question that a dissertation on the LP should surely answer once it is seen as a live question: is the LP a good argument against frequentist methods?

  4. "is the LP a good argument against frequentist methods?"

    I guess it seems clear to me, through examples like Cox's example and a whole bunch of other examples from Berger and Wolpert, that evidential considerations are plenty to support the LP, at least in my formulation, which as you know assumes that we have to use a roughly standard statistical framework (hypothesis space and sample space) and also makes a bunch of other assumptions. You may simply not agree with me about the force of the examples and the associated arguments. Or you may think, as I do, that there may be even better methods which we haven't discovered yet, which go outside the framework I set up for the LP. I'd be very happy with the latter of those two alternatives! I think it's useful to keep them separate. I don't claim to have given any completely general argument for evidentialism; but I do think I've shown (and not for the first time!) that it's better than any of the existing alternatives.

    More about this in my other reply which blogspot rejected for being too long!

    Replies
    1. This comment has been removed by the author.

    2. I also find examples like Cox's and Berger and Wolpert's persuasive. But I'm trying to resist the urge to settle for persuasive and instead to focus on the question of how following the Likelihood Principle would help us get correct answers to our scientific questions. I don't see how it would do so except in cases in which Bayesian optimality results really mean something, i.e. when priors are well-grounded in some sense as opposed to merely reflecting someone's fair betting odds.

      Frequentist methods at least can be proven to have certain truth-related virtues without imposing a prior probability distribution where none is given. Now, those virtues are rather weak. Are they weak enough that we shouldn't care about them even when our only alternative is to impose a prior probability distribution where none is given? I'm not sure about that.

      Consider, for instance, the search for the Higgs boson. That kind of case seems to me the best candidate for the use of frequentist methods. It doesn't seem to me at all reasonable to use subjective priors to settle questions in fundamental physics, even with a robustness analysis. See Larry Wasserman's interesting post on this topic (http://normaldeviate.wordpress.com/2012/07/11/the-higgs-boson-and-the-p-value-police/).

  5. Jason sent me additional comments by email that were too long to post here. I'm going to break them up and post them here with my replies.


    Jason:
    'I would like to know what you mean by "that's actually not true on most ways of setting up Frequentist procedures," but I don’t'

    Sorry for not being clear.

    I said this "depending on exactly how their assumptions are specified". I often see the assumptions under-specified, so that the Frequentist claim (often, not always) is that e.g. 95% of a statistician's confidence intervals will contain the true value, in the long run. This is clearly not true, for many reasons including the fact that it omits "prior" knowledge that they may have.

    Me:
    Sorry, I'm still not following. A 100(1-α)% confidence interval for θ is an interval constructed by a method that has probability 100(1-α)% of containing the true value of θ when its assumptions are satisfied. So by the strong law of large numbers, 100(1-α)% of the 100(1-α)% confidence intervals a statistician constructs will contain the true value of the relevant parameter in the long run, assuming that the assumptions of the methods used to construct those confidence intervals are satisfied. Is there something wrong with what I've just said, or are you making a different point?

    One should not claim that the probability is 100(1-α)% that a particular 100(1-α)% confidence interval contains the true value of the parameter in question, among other reasons because one might have information about the parameter not taken into account in constructing the confidence interval. But that doesn't affect the claim that 100(1-α)% of the 100(1-α)% confidence intervals a statistician constructs will contain the true value of the relevant parameter in the long run as far as I can see. Of course, "in the long run" is doing a lot of work here.
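
    For what it's worth, here is the kind of quick check I have in mind (a sketch under my own invented assumptions: i.i.d. normal data, unknown variance, the standard 95% t-interval). The long-run fraction of intervals containing the true mean settles at the nominal level, whatever anyone happens to know about the parameter from elsewhere.

```python
# Empirical long-run coverage of the textbook 95% t-interval; numbers invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sd, n, reps = 2.0, 3.0, 15, 50_000
crit = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(reps):
    x = rng.normal(mu, sd, n)
    half = crit * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half < mu < x.mean() + half)

print(covered / reps)  # close to 0.95, as the strong law of large numbers predicts
```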

  6. Jason:
    'It is the case that they are expected to reject true nulls only 100α% of the time and accept nulls only 100β% of the time when the alternative hypothesis used to calculate β is true.'

    Yes, we agree, that's true, if you just look at null hypothesis testing with a point null and a point alternative. As you go on to say, there are many reasons why that's often a bad idea, including all the reasons you give and also Lindley's paradox. And if you look at confidence intervals instead then you get the situation I discuss above, where I can't see any useful claim the Frequentist can make in the presence of "prior" information.

    Me:
    I agree that null hypothesis testing with a point null and a point alternative is usually a bad idea for many reasons.

    I need to think more about Lindley's paradox. Could a frequentist claim that it begs the question by presupposing an evidentialist perspective?

    I'm not sure what you have in mind about confidence intervals. Maybe your point is this: the fact that 100(1-α)% of the 100(1-α)% confidence intervals a statistician constructs will contain the true value of the relevant parameter in the long run is not a useful statement when we are evaluating some particular confidence interval(s) that statistician has constructed so far, and it is clearly misleading when Bayesian methods apply.

    This is a very difficult point that's absolutely vital to the point of view I'm tempted to take, according to which we ought to think about the epistemology of science in terms of the reliability of the methods used rather than in terms of an evidential relation among data, hypotheses, and background knowledge, with no reference to methods. I won't try to address it here, but I will discuss it in the future.

    Replies
    1. I should say that one of the main reasons that null hypothesis testing with a point null and a point alternative is usually a bad idea is that it is a response to a question to which we usually already know the answer, namely "Is the point null hypothesis literally true?" Almost always, we can be sure in advance that the answer is no. (Many would say "always" instead of "almost always." But what about an ESP study? Or a test for stellar parallax at a time when Tycho Brahe's astronomical theory was still in play?)

    2. Follow-up from Jason which wouldn't post correctly:

      > I need to think more about Lindley's paradox. Could a frequentist claim that it begs the question by presupposing an evidentialist perspective?

      If you're NOT worried about Lindley's paradox, think about real examples, e.g. very large drug trials using point null hypotheses. If they're large enough (which these days they tend not to be, because of stopping rules, but there are exceptions) they're always going to come out positive. You don't really see any important questions being begged there, do you?


      > A 100(1-α)% confidence interval for θ is an interval constructed by a method that has probability 100(1-α)% of containing the true value of θ when its assumptions are satisfied. So by the strong law of large numbers, 100(1-α)% of the 100(1-α)% confidence intervals a statistician constructs will contain the true value of the relevant parameter in the long run, assuming that the assumptions of the methods used to construct those confidence intervals are satisfied.

      When you talk about "assuming that the assumptions of the methods used to construct those confidence intervals are satisfied", are those assumptions meant to include the idea of having no "prior" information, i.e. information not captured by the mathematical model? If yes, we're not arguing (but then they're not often very useful assumptions, are they?). If no, great! You can generate a lot of confidence intervals using any model you like, but with me having some information not captured in your model, we can place bets which would be fair if 95% of your intervals contained the true value of a parameter, and I can win a lot of money!

      Again, think about real cases, e.g. drug trials analysed using confidence intervals which ignore what we actually know about the probable effectiveness of the drugs (as they always are, apart from sample size calculations). It may be sensible to use such models, but they're certainly not going to generate confidence intervals which contain the true value 95% of the time.

    3. My replies:

      > If you're NOT worried about Lindley's paradox, think about real examples, e.g. very large drug trials using point null hypotheses. If they're large enough (which these days they tend not to be, because of stopping rules, but there are exceptions) they're always going to come out positive. You don't really see any important questions being begged there, do you?

      I'm not sure I understand what you're getting at here. Is it that a large enough trial will generate a statistically significant result even if the true effect size is tiny? That's certainly true, and is a good reason to emphasize estimates of effect sizes rather than outcomes of significance tests.
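
      Here is a toy calculation of that point (a sketch with invented numbers, not any actual trial): for a two-arm comparison with a negligible true difference, the probability of a statistically significant result goes to one as the sample size grows.

```python
# Probability of a significant two-sided z-test result for a tiny true
# difference between two arms, as a function of sample size.  Illustrative only.
import numpy as np
from scipy.stats import norm

sigma, true_diff, alpha = 1.0, 0.01, 0.05   # a practically negligible difference
z_crit = norm.ppf(1 - alpha / 2)

for n_per_arm in [1_000, 100_000, 10_000_000]:
    se = sigma * np.sqrt(2 / n_per_arm)
    power = norm.sf(z_crit - true_diff / se) + norm.cdf(-z_crit - true_diff / se)
    print(f"n per arm = {n_per_arm:>10,}: P(significant result) = {power:.3f}")
```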

      >When you talk about "assuming that the assumptions of the methods used to construct those confidence intervals are satisfied", are those assumptions meant to include the idea of having no "prior" information, i.e. information not captured by the mathematical model? If yes, we're not arguing (but then they're not often very useful assumptions, are they?). If no, great! You can generate a lot of confidence intervals using any model you like, but with me having some information not captured in your model, we can place bets which would be fair if 95% of your intervals contained the true value of a parameter, and I can win a lot of money!

      >Again, think about real cases, e.g. drug trials analysed using confidence intervals which ignore what we actually know about the probable effectiveness of the drugs (as they always are, apart from sample size calculations). It may be sensible to use such models, but they're certainly not going to generate confidence intervals which contain the true value 95% of the time.

      I don't mean to argue that we should ignore genuine prior information! My present concern is simply whether the fact that frequentist methods violate the Likelihood Principle is a good argument against those methods. It doesn't seem relevant to Neyman-style views according to which we should think of frequentist procedures as decision rules justified by their long-run operating characteristics. I'm tempted to think that those views aren't crazy despite being rather unpopular. That being said, I absolutely would use a Bayesian method if the priors were more or less given.

      I don't see (maybe out of ignorance) why models that omit prior information wouldn't generate confidence intervals that contain the true value of the mean 95% of the time. I certainly wouldn't want to consider a bet on a particular confidence interval as if I had a personal probability of 95% that that confidence interval contains the true mean, but that's just because of the familiar point that Pr(A(X)<μ<B(X)|μ) is not the same as Pr(A(x)<μ<B(x)|μ) where x is a particular value of the random variable X.
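
      A standard textbook toy example, not one either of us has mentioned, may make the pre-data/post-data distinction vivid: take two observations from Uniform(θ - 0.5, θ + 0.5) and report the interval from the smaller to the larger. Its pre-data coverage probability is exactly 50%, yet someone who conditions on how far apart the two observations are can bet against that 50% rate and win.

```python
# Pre-data coverage vs. coverage within recognizable subsets of the data.
# Classic uniform-location example; theta and the number of replications
# are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(2)
theta, reps = 0.0, 200_000
x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, 2))
lo, hi = x.min(axis=1), x.max(axis=1)
covers = (lo < theta) & (theta < hi)
wide = (hi - lo) > 0.5

print("unconditional coverage:              ", covers.mean())         # ~0.50
print("coverage when the interval is wide:  ", covers[wide].mean())   # exactly 1
print("coverage when the interval is narrow:", covers[~wide].mean())  # ~1/3
```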

    4. Another reply from Jason:

      > what about an ESP study? Or a test for stellar parallax at a time when Tycho Brahe's astronomical theory was still in play?

      In those cases, if we have to use a point null hypothesis (and I'm ambivalent about whether we should), we should compare it to a Bayesian model with a lump of prior on the null. And then we get some results which are rather shockingly bad for the Frequentists ... admittedly from an evidential point of view, so this is off the topic you've been pursuing, but still the results are fascinating in their own right. See the attached if you don't already know it. [Refers to Sellke, Bayarri, and Berger 2001] (http://www.stat.duke.edu/courses/Spring10/sta122/Labs/Lab6.pdf)

      My reply:
      I'll need to re-read this paper and get back to you. Thanks!

  7. Jason:
    '[That null hypothesis significance tests are expected to reject true nulls only 100α% of the time and accept nulls only 100β% of the time when the alternative hypothesis used to calculate β is true] is at least saying something that we can agree on. But maybe it’s saying so little that we shouldn’t really care about it even if we all agree on it.'

    That's exactly my view too, in most cases. I think there are at least SOME cases in which we should care about Frequentist analyses, but they're not the typical ones. Hacking gives some examples in his 1975 (?) book, and I suspect there are many more cases. As far as I can see so far, these are cases in which Frequentist analyses and Bayesian analyses more or less agree anyway. Unfortunately, the maths gets too complicated for me to be able to nicely delineate exactly which cases these are. One day maybe.

    Me:
    I'll need to think more about these points to have anything interesting to say about them.

  8. Jason:
    'I'm no great fan of frequentist methods. The point I intend to be making is simply that it's not clear to me that respecting evidential equivalence is related in any straightforward way to performance, and performance is what I care about.'

    Bayesians and Frequentists define 'performance' differently. That's meant to be the main message of my manuscript, and I do say this in the prologue, although I probably don't make it clear enough. It's fine with me if you want to reject both definitions. If you can give a plausible alternative definition, that will be great (really, not sarcastic here! - you'd be a great person to do this). But in the meantime, I think Bayesian or evidential definitions of performance are much more important than Frequentist definitions. There is no neutral definition that I know of. Is there?

    Me:
    Do you have in mind the distinction between Frequentism and factualism? Speaking loosely, Frequentists choose among possible methods by considering the performance of those methods when certain hypotheses are true, averaging over possible data; whereas factualists hold the data fixed and try to come up with a method to compare hypotheses.

    This distinction by itself doesn't seem to give us two different definitions of performance. Some notion of performance needs to be fed into the frequentist account before it can be used. The factualist account doesn't refer directly to performance at all.

    The most straightforward notion of performance is expected gain/loss as measured by a loss function appropriate to the problem at hand. The problem with this notion is that it requires a probability distribution over the hypothesis space, which frequentists don’t allow themselves.

    Frequentists use rather complicated alternative notions of performance that involve some admittedly ad hoc elements. Cases in which a uniformly most powerful test is available are the simplest. In those cases, the notion of performance frequentists use reduces to rejecting the null hypothesis with as high a probability as possible when it is false, subject to a given upper bound on the probability of rejecting it when it is true.
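
    To illustrate that notion with a toy case (my own example, assuming normal data with known variance, where the one-sided z-test is uniformly most powerful for testing μ ≤ 0 against μ > 0): both tests below have size α, but the one that uses all of the data rejects the false null more often at every alternative.

```python
# Comparing the power of the level-alpha one-sided z-test using all n
# observations with a valid but wasteful level-alpha test using only half of
# them.  All numbers are illustrative.
import numpy as np
from scipy.stats import norm

alpha, sigma, n = 0.05, 1.0, 20
crit = norm.ppf(1 - alpha)

def power_ztest(mu, m):
    # Power at alternative mu of the one-sided z-test based on m observations.
    return norm.sf(crit - mu * np.sqrt(m) / sigma)

for mu in [0.2, 0.5, 0.8]:
    print(f"mu = {mu}: power(all data) = {power_ztest(mu, n):.3f}, "
          f"power(half the data) = {power_ztest(mu, n // 2):.3f}")
```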

    Once we fix a factualist method we can ask about its performance according to various notions of performance. All of the factualist methods of which I am aware are either likelihoodist or Bayesian. Likelihoodist “methods” are just proposed conceptual analyses of notions like “data D favors hypothesis H_1 over hypothesis H_2 to degree X” for which I have no use. There doesn’t seem to be any sensible way to measure the performance of such methods. If they are correct, then they output tautologies. If not, then they output contradictions. I’m inclined to say that there is no fact of the matter about whether they are correct or not. In any case, I don’t see what they’re good for even if they are correct.

    One could try to reinterpret likelihoodist methods, e.g. by using the principle that one ought to believe the claim that is best supported by one’s data. That’s just maximum likelihood estimation, which sometimes yields sensible results and sometimes yields crazy results.

    A Bayesian procedure just is the procedure that has the lowest expected loss as measured by a specified loss function under a given prior probability distribution. I agree that Bayesian procedures are the way to go when an unproblematic prior probability distribution is available. But an unproblematic prior probability distribution usually is not available in science.
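
    To spell out that sense of "lowest expected loss" with a toy conjugate example (my own invented numbers): when parameters really are drawn from the assumed prior, the posterior mean beats the maximum likelihood estimate on average under squared-error loss.

```python
# Average squared-error loss of the Bayes estimator (posterior mean) versus the
# MLE when the prior used by the Bayesian is in fact correct.  Illustrative only.
import numpy as np

rng = np.random.default_rng(3)
sigma, tau, n, reps = 1.0, 0.5, 5, 50_000    # data sd, prior sd, sample size

mu = rng.normal(0.0, tau, reps)              # parameters drawn from the prior N(0, tau^2)
xbar = rng.normal(mu, sigma / np.sqrt(n))    # sample mean for each simulated data set

w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
bayes = w * xbar                             # posterior mean (shrinks toward the prior mean, 0)
mle = xbar                                   # maximum likelihood estimate

print("average loss, posterior mean:", np.mean((bayes - mu) ** 2))  # lower
print("average loss, MLE:           ", np.mean((mle - mu) ** 2))    # higher (= sigma^2/n)
```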

    Replies
    1. Continued from the previous comment:

      The big question is what to do when an unproblematic prior probability distribution is not available. The smaller question I am currently considering is whether arguments for the Likelihood Principle are good arguments against the use of frequentist methods in such cases. I don’t see how they could be because they beg the question when formulated in terms of what methods one ought to use and are too weak when formulated in terms of evidential equivalence.

      I’m afraid I may have gotten away from the point you wanted to make, which I don’t quite understand. It seems to me that most of the parties to this debate agree that expected loss is typically the most useful notion of performance when it applies. According to extreme Bayesians, it always applies. According to frequentists, it applies only rarely in science. When it doesn’t apply, frequentists have to use other notions, which admittedly end up being somewhat ad hoc. But what else is one to do once one has chosen not to use prior probabilities?

  9. Jason:
    'I have yet to be convinced that Bayesian updating is uniquely rational in any sense.'

    I think it's rational in its own terms. Sounds like we don't quite agree about that, but it doesn't matter because we DO agree that there may be even better forms of rationality. I don't know what they are, though.

    Me:
    What do you mean by "rational in its own terms"?

    I regard Bayesian updating as a method to be evaluated according to its usefulness in helping us find the truth. The usual arguments for it proceed from axioms about rationality or evidential meaning that are admittedly highly plausible intuitively but have nothing directly to do with finding the truth.

  10. Jason:
    'In addition, it’s not clear to me why we should be rational unless being rational will help us get closer to the truth (or approximate truth, etc.)'

    Sorry, but I don't think the 'etc.' is good enough! The point of statistics is to operationalise this sort of idea, and it's only once you've picked an operationalisation that you're being really clear about what your standard of rationality is. I agree that the best standard may not be Bayesianism (because I think it may not be any of the existing ones). I think that if you cash out fully what you're trying to get at by talking about truth, you'll have a theory of statistics. Go for it! It will be interesting to see whether it's Bayesian or not. It may not be. But I don't think you'll end up with a non-statistical standard with which to compare statistical theories. If it's non-statistical then it's too vague. Or so I predict. :-)

    Me:
    I think this is a great comment. I'm not sure right now how to respond.

    I can elaborate a bit on what I was trying to say. All of the epistemic virtues I care about for their own sakes are "external" virtues that involve the achievement of some kind of "correctness" relation between a belief state and the world. Truth is the strongest such relation, involving exact matching between the belief state and the world. (I don't want to get into deep questions about the nature of truth here, which I don't think would add anything to this discussion.) I also care about approximate truth, truthlikeness, empirical adequacy, approximate empirical adequacy, and so on. I'm not inclined to care about "internal" epistemic virtues that only concern the relationships among one's beliefs either at a time or over time, such as logical consistency; synchronic and diachronic probabilistic coherence; and respecting evidential equivalence, except insofar as these virtues aid in the pursuit of an epistemic virtue involving a correctness relation. Many philosophers are very concerned with internal epistemic virtues, so perhaps I'm either idiosyncratic in this respect or there are strong connections between internal epistemic virtues and external epistemic virtues of which I am not aware.

    Obviously this is all rather hand-wavey and unsatisfactory as serious philosophy. That's ok, I think---it's just a blog!
