**Update:** Since this update I have moved to what I plan to be a long-term home at gandenberger.org.

# Science and Statistics

A blog about my research at the intersection of the philosophy of science and statistics.

## Sunday, August 26, 2012

### Reboot

I'm migrating over to a new blog site because of problems with the commenting function here and in response to a new opportunity, which you can read about here.

## Monday, July 16, 2012

### Methodism and Ockham's Razor

In my previous post I drew a rough-and-ready distinction between evidentialists and methodists in the philosophy of science. Evidentialists focus on evidential relations between data and hypotheses, whereas methodists think that these relations are illusory or unimportant and instead focus on the reliability of our methods.

In this post I want to motivate methodism by telling what I consider to be one of its success stories: justifying Ockham's razor. This story belongs to Kevin Kelly and his students (see their project site here.)

Ockham's razor is very popular among scientists, philosophers of science, and popular science writers. It says, roughly, that we ought to be biased in favor of simple theories. It is easy to motivate Ockham's razor by appealing to historical episodes: Copernicus' heliocentric theory is to be favored over Ptolemy's because it posits fewer circles; Newton's unification of terrestrial and celestial physics is to be favored over Aristotelian alternatives because it posits fewer fundamental laws; Einstein's special relativity is to be favored over its Newtonian predecessors because it does away with the ether and a favored rest state. And so on. Appeals to Ockham's razor are everywhere in popular discussions of physics and sometimes crop up in other areas of science as well.

The apparent importance of Ockham's razor to science is difficult for philosophers of science to explain because it is hard either to explain what "simplicity" is in science or to see how a fixed bias toward simplicity could help guide us toward the truth. As Kelly puts it, if we already know that the truth is simple, then we don't need Ockham's razor. And if we don't already know that the truth is simple, what entitles us to assume that it is?

Kelly and students have proven that Ockham's razor in some sense converges to the truth more efficiently than other methods. Here is a simple example to illustrate their idea. Imagine a machine that spits out either a green ball or a blue ball once a second for all eternity. Your job is to watch the balls that come out of the machine and to produce a conjecture about how many times the color of the balls it spits out will change. You have no background knowledge about how the machine works beyond the fact that it always spits out a green ball or a blue ball and that the number of color changes is finite.

Suppose that the first ten balls that come out of the machine are green. You might be tempted to conjecture that the color never changes. But suppose that the stakes are high and you decide to wait. After one minute, you still have only seen green balls. Likewise after a day, a week, a month, and a year.

At what point do you give in and conjecture that the color never changes? The urge to do so seems irresistible, but there is a problem: with no background knowledge beyond that the machine always spits out a green ball or a blue ball, you have no basis for saying that the probability that the next ball will be green is higher than the probability that the next ball will be blue.

This is where Kelly's justification of Ockham's razor comes in. Kelly and his students propose that simplicity is a matter of asymmetric underdetermination: the data will never refute a complex theory if a simpler theory is true, but they will eventually refute a simpler theory if a more complex theory is true. For instance, the data will never refute "one color change" if "zero color changes" is true, but they will eventually refute "zero color changes" if "one color change" is true. Similarly, "one color change" is simpler than "two color changes," which is simpler than "three color changes," and so on.

Why follow Ockham's razor by giving the simplest answer compatible with the data? Let M be a method of producing conjectures for this problem that converges to the truth in the limit of inquiry on any possible data sequence. M *must* eventually take the bait and settle on the answer "zero color changes" given a long enough data sequence with no color changes; otherwise, it would fail to converge to the truth. Suppose that it takes the bait after seeing 1,324,235 consecutive green balls. We have no assurance whatsoever that it won't be forced to retract its answer by a blue ball showing up at any later time. Suppose that the 1,543,213th ball is blue. Then "zero color changes" is refuted. M might then abstain from giving an answer for a long time, having been burned once before. But given a long enough sequence of blue balls, it must eventually take the bait and settle on the answer "one color change," again on pain of failing to converge to the truth. There is no assurance whatsoever that it will not be forced to retract that answer later on.

M must eventually posit the simplest answer compatible with experience, on pain of failing to converge to the truth in the limit. It can then be forced to retract if that answer is refuted. In this way, any method that converges to the truth in the limit can be forced to retract "up the complexity hierarchy." By contrast, only methods that violate Ockham's razor can be forced to retract "down the complexity hierarchy." Nature simply has to wait such methods out.

Thus, by never positing an answer other than the simplest one compatible with experience, a method that conforms to Ockham's razor minimizes the number of retractions it makes in the worst case. That's Kelly et al.'s justification of Ockham's razor.
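The retraction-counting argument can be made concrete with a minimal Python sketch. It is only an illustration under assumptions that are not part of Kelly et al.'s formal framework: the names `ockham`, `leaper`, and `retractions` are hypothetical, the data stream is a finite string of `G` and `B`, and the non-Ockham method is one arbitrary example (it guesses one more color change than it has seen until five balls in a row show the same color).

```python
from typing import Callable


def changes_so_far(balls: str) -> int:
    """Count the color changes in a sequence such as 'GGGBB'."""
    return sum(1 for a, b in zip(balls, balls[1:]) if a != b)


def ockham(balls: str) -> int:
    """Always conjecture the simplest answer compatible with the data."""
    return changes_so_far(balls)


def leaper(balls: str) -> int:
    """Hypothetical non-Ockham method: posits one extra color change until
    the current run of same-colored balls reaches length 5."""
    seen = changes_so_far(balls)
    run = len(balls) - len(balls.rstrip(balls[-1]))  # length of the final run
    return seen if run >= 5 else seen + 1


def retractions(method: Callable[[str], int], stream: str) -> int:
    """Count how often the method changes its conjecture along the stream."""
    count = 0
    previous = None
    for t in range(1, len(stream) + 1):
        current = method(stream[:t])
        if previous is not None and current != previous:
            count += 1
        previous = current
    return count


stream = "G" * 10 + "B" * 10 + "G" * 10  # a stream with two color changes
print(retractions(ockham, stream), retractions(leaper, stream))
```

On this stream the Ockham method retracts twice, once per color change, while the leaper retracts five times: nature waits it out, forcing a retraction down to the simplest answer, and then forces it back up with each new color change. Both methods settle on the true answer once the changes stop, so they differ only in how many retractions they suffer along the way.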

This justification has two important features. First, it says only that Ockham methods are better than non-Ockham methods with respect to *worst-case* retraction loss. One can't say anything about expected-case retraction loss without imposing a probability distribution on the problem; but if one is in a position to impose a probability distribution on the problem, then one can just perform Bayesian inference based on the posterior distribution without worrying about Ockham's razor. Best-case retraction loss doesn't favor Ockham's razor because a method that starts with a complex answer could get lucky: nature could refute all of the simpler theories in succession and then decline to refute that one for eternity. In addition to the fact that it only concerns worst-case retraction loss, Kelly et al.'s justification is limited in that it provides *no* assurance *at all* that a simpler answer is more likely to be correct than a more complex answer. It speaks only about the reliability of methods, and not about the intrinsic belief-worthiness of some hypotheses given some data. It is a methodist justification only, not an evidentialist one.

Despite the weakness of this justification, I consider it a success story for methodism. Kelly et al.'s justification for Ockham's razor may not give us everything we want, but it does seem to give us everything we need. The point is not to get us excited about Ockham's razor. It is, rather, to investigate what if anything can be said in favor of this rule that we are already naturally inclined to follow. And it does give us something that does seem worth taking into consideration, at least when prior probabilities are unavailable.

It is interesting that frequentist null hypothesis significance tests can be thought of as instances of Ockham's razor if we loosen the notion of refutation a bit so that an answer is considered refuted if it is outside a confidence interval with a specified confidence level. In this sense, you can never refute a complex alternative hypothesis if a point null hypothesis is true, but you can eventually refute the null if the alternative is true; thus, the null is simpler than the alternative. The orthodox frequentist practice of acting as if the null were true until it is refuted thus conforms to Ockham's razor.

### Nice Quote about Bayesian Methods

I am not a Bayesian because I lack self-confidence. That is, you have to have enough self-confidence to have a specific prior on things, and I do not think I know enough about things to have a specific prior. I may have a general prior on some things. I think that a good Bayesian, that is, a Bayesian who picks a prior that has some value to it, is better than a non-Bayesian. And a bad Bayesian who has a prior that is wrong is worse than a non-Bayesian, and I have seen examples of both. What I do not know is how do I know which is which before we evaluate the outcome.

--Clive Granger

### Methodism and Frequentism

Some philosophers of science focus on questions like "What are the necessary and sufficient conditions for data D to count as evidence for hypothesis H?" A few think that questions of this kind are misguided and instead focus on questions like "How do the methods used in science help us get closer to the truth?" I will call the first group *evidentialists* and the second group *methodists*. (With apologies to mainstream epistemologists, who use these terms in slightly different ways, as well as to the relevant group of mainline Protestants. And also to my readers for the fact that this distinction may not be entirely clear and my terminology across posts may not be entirely uniform. This research is a work in progress!)

Frequentist methods tend to appeal to methodists because they are typically taken to derive their warrant not primarily from intuitions about evidential relations but instead from their long-run operating characteristics. For instance, a uniformly most powerful level α null hypothesis significance test is typically taken to be a good procedure (when it is so taken) because it has the highest probability of rejecting the null hypothesis when it is false among all tests that have probability α or less of rejecting the null hypothesis when it is true. By contrast, Bayesian methods are typically taken to derive their warrant from considerations of coherence or consistency with plausible axioms about rational inference or behavior.

Although the typical way of motivating frequentist methods has a methodist flavor while the typical way of motivating Bayesian methods does not, a methodist need not be a frequentist. The frequentist methods in common use are appealing because they can be *proven* to have certain desirable operating characteristics when their assumptions are satisfied. However, their desirable operating characteristics are typically much weaker than they seem at first glance and are compatible with terrible performance in important respects. And Bayesian methods often perform better; they just can't be *proven* to have any particular truth-related virtues (except from the standpoint of the degrees of belief of a particular agent) because their performance depends on the relationship between their priors and the true state of affairs. Methodists are not committed to preferring methods with weak but provable performance characteristics over methods that may or may not perform better depending on the circumstances.

## Wednesday, July 11, 2012

### A Lesson from Glymour's Theory and Evidence

In his *Theory and Evidence*, Clark Glymour presents his "bootstrap" theory of confirmation, which is essentially a conceptual analysis of the notion of confirmation (i.e. positive evidential relevance) that attempts to extend the hypothetico-deductive account to allow a theory to be confirmed by evidence that is derived from the very theory in question, in just the cases in which that kind of circularity is unproblematic. Glymour's theory is extremely ingenious but faces at least one decisive objection (given here). Attempts to patch the theory to address that objection did not get very far, and the theory was more or less abandoned.

I suspect that the most important passage in *Theory and Evidence* is not one in which Glymour presents or defends his theory, but the final paragraph of the book in which he reflects on what he has accomplished. His insight and his candor are remarkable:

There is nothing in this book that corresponds to an attempt to show that the methods I have described are justified or uniquely rational. There are arguments for the methods, arguments that purport to show that the strategy achieves our intuitive demands on confirmation relations better than do competing strategies, but these arguments do not show that the bootstrap strategy will lead us to the truth in the short run or the long run, or will lead us to the truth if anything can, or is required by some more primitive canon of rationality. There are such arguments for other confirmation theories, although none of them are wholly good arguments; perhaps it would be better to have a bad argument of one of these kinds for the bootstrap strategy than to have none at all. But the strategy does not depend on frequencies, and there is as yet no framework of degrees of belief within which to place it, and so none of the usual lines of argument seems relevant at all. One has only the sense that unless the world is perversely complex, the strategy will help us to locate false hypotheses and separate them from true ones. A sense is not an argument. I am partly consoled by the thought that one cannot do everything; sometimes, alas, one cannot do very much at all.

To me the message of this passage and of the fate of *Theory and Evidence* and many similar projects is that conceptual analysis is an ineffective way to arrive at normative conclusions, even if the concept being analyzed (e.g. confirmation/evidence) has normative import. Showing that a normative claim follows from more basic normative claims is perhaps a more promising approach, but to my mind the best approach of all is to show that abiding by a normative claim will help us achieve some end we care about. If it doesn't do that, then why should we regard it as normative?

## Saturday, July 7, 2012

### Unfortunately, "Axiomatic" Doesn't Imply "Useful"

But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. Here, for example, would be such a "rule of behaviour": to decide whether a hypothesis, H, of a given type be rejected or not, calculate a specified character, x, of the observed facts; if x>x_0, reject H, if x≤x_0, accept H. Such a rule tells us nothing as to whether in a particular case H is true when x≤x_0, or false when x>x_0. But it may often be proved that if we behave according to such a rule, then in the long run we shall reject H when it is true not more, say, than once in a hundred times, and in addition we may have evidence that we shall reject H sufficiently often when it is false. —Jerzy Neyman and Egon Pearson, 1933
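The "rule of behaviour" in the quote above can be sketched as a small simulation. The setup is an assumption chosen for illustration, not something Neyman and Pearson specify: H says that normally distributed data with known standard deviation 1 have mean 0, the "specified character" x is the usual z statistic, and x_0 is set so that a true H is rejected about once in a hundred times. The function names are hypothetical.

```python
import random
from statistics import NormalDist


def reject(sample, x0):
    """Rule of behaviour: reject H when the test statistic x exceeds x0."""
    n = len(sample)
    x = (sum(sample) / n) * n ** 0.5  # z statistic for H: mean 0, sd 1 known
    return x > x0


def rejection_rate(true_mu, n=25, trials=10000, alpha=0.01, seed=0):
    """Long-run frequency with which the rule rejects H when the mean is true_mu."""
    rng = random.Random(seed)
    x0 = NormalDist().inv_cdf(1 - alpha)  # cutoff giving error rate alpha under H
    hits = sum(
        reject([rng.gauss(true_mu, 1.0) for _ in range(n)], x0)
        for _ in range(trials)
    )
    return hits / trials


print("rejection rate when H is true: ", rejection_rate(true_mu=0.0))
print("rejection rate when H is false:", rejection_rate(true_mu=1.0))
```

The first rate should come out close to 0.01 and the second close to 1. Neither number says anything about whether H is true in any particular run, which is exactly the point of the quote: the guarantee attaches to the rule over the long run, not to individual conclusions.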

It's typically easy to make a conceptual analysis task sound interesting and important, at least to the philosophically inclined. Here's a way to motivate the question "What is evidence?" that I have used in teaching. Start with the question "What is science?" or more specifically, "How is science different from (and better than?) other ways of attempting to learn about the world, e.g. astrology, reading sacred texts, etc.?" This question is easy to motivate. Take debates about whether or not intelligent design should be taught in public school science classrooms. Typically, these debates center on whether intelligent design is science or not. In the Kitzmiller vs. Dover Area School District case, for instance, philosophers of science were brought in to testify on this issue. Thus, the question of what constitutes science is central to a big debate in our society that a lot of us already care about.

According to many (most?) philosophers of science today, the question "What is science?" is a red herring. The views of the judge in the Kitzmiller case notwithstanding, the important question is not whether, e.g., intelligent design is science or not, but something more like whether intelligent design is well supported by our evidence or not. Thus, rather than asking "What is science?" we should be asking "What is evidence?"

In general, it is easy to motivate the task of conceptual analysis by pointing out that the concept being analyzed does some important work for us but is not entirely clear. However, conceptual analysis has many problems. For one, it almost never works. Typically, a conceptual analysis is produced; then a counterexample is presented; then the analysis is reworked to get around that counterexample; then a new counterexample is produced; then the analysis is reworked again to get around the new counterexample; and so on until the analysis becomes so ugly and messy that nobody finds it very appealing any more. At some point in this process, a rival analysis is produced which then undergoes the same process. The result is a collection of ugly and messy analyses none of which really work but each of which has vociferous defenders who engage in interminable debates that to most of us seem boring and pointless, even though the question being debated can be made to seem interesting and important.

Perhaps there is a deeper problem with conceptual analysis, namely that questions of the form "what is x?" are not actually interesting and important after all. Recall the motivation I gave for the question "what is evidence?" I said that many philosophers believe that the question of whether intelligent design should be taught in public school science classes should be focused on the question of whether intelligent design is well supported by evidence or not, rather than the question of whether intelligent design is science or not. Thus, the important question is not, "what is science?" but rather "what is evidence?" However, it's not obvious that we need an answer to the question "what is evidence?" in order to judge whether intelligent design is well supported by evidence or not. In fact, many philosophers call acts of reasoning that assume that we cannot know whether something is an x until we have a definition of x instances of the "Socratic fallacy." We can tell whether something is green or not just by looking at it. We don't have to know what *green* is. Evidence seems to work the same way, to some extent at least.

Perhaps we don't *need* a conceptual analysis of "evidence" in order to judge whether intelligent design is well supported by evidence or not, but such an account would nevertheless be *useful* in clarifying our intuitive judgments and allowing us to articulate and defend them. That seems right, but it's a moot point if attempts to give a conceptual analysis of "evidence" are bound to fail as nearly all attempts at conceptual analysis fail.

If not conceptual analysis, then what should philosophers be doing? One option is to forget about *defining* terms like "evidence" and instead to develop *axiomatic theories* of such notions. Philosophers can operate in "Euclidean mode" rather than "Socratic mode," as Glymour puts it. Euclid did give "definitions" of terms like "point" and "line," but they were more like dictionary entries than conceptual analyses. They don't hold up under philosophical scrutiny. But so what? Euclid developed a beautiful and powerful theory about points and lines and such. It's not clear what rigorous definitions of "point" and "line" would add to his theory, practically speaking.

The same is true of probability theory and the theory of causation. Philosophers still debate about what "probability" and "causation" mean, but these questions have little or no importance for working statisticians. On the other hand, the theories based on Kolmogorov's axioms and on the Causal Markov and Faithfulness Conditions (the latter developed by philosophers) are of great importance to practitioners.

Philosophers typically like thinking for its own sake. But they should at least try to do work that matters. Conceptual analysis typically does not matter, especially after the obvious options have been laid out and explored and the field has more or less reached a stalemate.

However, it is not a foregone conclusion that an axiomatic theory will matter either. In fact, I suspect that the Likelihood Principle is part of an axiomatic theory that doesn't matter. Hence my dissatisfaction with my project.

Specifically, the Likelihood Principle is part of an axiomatic theory of evidence, or rather, equivalence of evidential meaning. Allan Birnbaum initiated the development of this theory when he showed that the Likelihood Principle follows from the conjunction of the Sufficiency Principle and the Weak Conditionality Principle. I have attempted to improve it by showing that the Likelihood Principle follows from the conjunction of the Weak Evidential Equivalence Principle and the Experimental Conditionality Principle. I think this proof makes it harder for frequentists to deny that their methods fail to respect equivalence of evidential meaning. But it only follows that we ought not use frequentist methods on the assumption that we ought not use methods that fail to respect equivalence of evidential meaning. That assumption begs the question against Neyman's view of tests as rules of behavior, as given in the quote at the top of this post.

My view is that restrictions on methods are justified only insofar as they help us achieve our practical and epistemic goals. Respecting equivalence of evidential meaning is not for me a basic epistemic goal, nor does it seem to be related in any straightforward way to success in achieving my basic practical and epistemic goals (see my previous post). Thus, I at least seem to have no use for an axiomatic theory of evidential equivalence, and thus, it seems, no use for the Likelihood Principle. Bummer.

## Friday, July 6, 2012

### Evidential Equivalence and Truth

In my previous post, I discussed a concern I have about the project I had been planning to pursue in my dissertation. Briefly, I was planning to defend the Likelihood Principle, which says that certain sets of experimental outcomes are evidentially equivalent. This Likelihood Principle seems interesting because frequentist methods, which are the statistical methods most widely used in science, do not respect evidential equivalence if the Likelihood Principle is true. Here's the problem: it's not obvious to me that we ought to insist on using statistical methods that respect evidential equivalence.

One might claim that respecting evidential equivalence is an end in itself. All I can say in response to this claim is that it is not an end I personally care about in itself. I care about epistemic goals that involve some sort of correspondence between a proposition and the world, such as truth, truthlikeness, approximate truth, and empirical adequacy. I care about respecting evidential equivalence and other epistemic virtues like coherence that concern the internal ordering of a set of propositions only insofar as they help in achieving those goals.

To my mind, then, we ought to insist on using statistical methods that respect evidential equivalence only to the extent that doing so would help us achieve what we might loosely call correspondence goals.

Would insisting on using statistical methods that respect evidential equivalence help us achieve correspondence goals? There doesn't seem to be any way to answer this question across the board as far as I can see. Frequentist methods (which fail to respect evidential equivalence) have probability one of achieving a certain level of performance with respect to correspondence goals in the indefinite long run when their assumptions are satisfied, while Bayesian methods (which respect evidential equivalence) provide no such guarantees but rather may perform better or worse than frequentist methods in a given respect on a given problem depending on the priors used and the actual state of affairs. This comparison suggests that we ought to use Bayesian methods when we are reasonably confident that our priors are good, and use frequentist methods otherwise (or something like that). That conclusion is not very exciting, and it doesn't give us any interesting story to tell about how respecting evidential equivalence helps us get to the truth.
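A toy calculation shows how the comparison above can go either way. The problem and names here are assumptions chosen for illustration, not taken from the post: the task is estimating a binomial proportion p from n trials, the frequentist procedure is the sample proportion, the Bayesian procedure is the posterior mean under a Beta(a, b) prior, and performance is measured by exact mean squared error at the true p.

```python
def mse_sample_proportion(p: float, n: int) -> float:
    """Exact MSE of the sample proportion X/n, where X ~ Binomial(n, p)."""
    return p * (1 - p) / n


def mse_posterior_mean(p: float, n: int, a: float, b: float) -> float:
    """Exact MSE of the Beta(a, b) posterior mean (X + a) / (n + a + b)."""
    denom = n + a + b
    variance = n * p * (1 - p) / denom ** 2
    bias = (n * p + a) / denom - p
    return variance + bias ** 2


true_p, n = 0.5, 20
print("sample proportion:       ", mse_sample_proportion(true_p, n))
print("prior centered on truth: ", mse_posterior_mean(true_p, n, a=10, b=10))
print("prior far from the truth:", mse_posterior_mean(true_p, n, a=18, b=2))
```

With a true p of 0.5, the prior centered on the truth beats the sample proportion, while the prior concentrated near 0.9 does considerably worse. The sample proportion's error does not depend on any prior, which is the sense in which its performance is guaranteed; the Bayesian estimate's error depends on how the prior relates to the actual state of affairs.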