Monday, February 14, 2011

Teaching Philosophy (Baranger Award Application Materials II)

I solicited and received nominations for the Elizabeth Baranger Excellence in Teaching Award, which aims to "recognize and reward outstanding teaching by graduate students at Pitt."  I'll be posting here drafts of my application materials.  Feedback is welcome!


The application requires a "Statement of Teaching Philosophy," which the application website describes as follows:


For purposes of the A&S GSO award, you should consider the statement to be an explication of your pedagogical goals, methods, and theories. Although you may reference established pedagogical theories, what we as a committee are most interested in is your own understanding and how you put those ideas into practice, not your knowledge of the current jargon of teaching theory. For the purposes of this award, you do not have to focus exclusively on concrete examples in this component, because your teaching philosophy should be complementary to your example of and reflection on your teaching materials and the other application materials.
One of the primary goals of this statement is for you to demonstrate, or gain, a consciousness of the processes of learning in and out of your teaching environment. However you choose to do it, let your readers know how you think learning happens, what the best ways to facilitate this are, and how you put these ideas into practice. You may choose to address some of the following questions in your philosophy:
  • What are your objectives as a teacher? What methods do you use to achieve these goals? How do you assess and evaluate your effectiveness in achieving your objectives?
  • What do you believe about teaching? About learning? How is this played out in your classroom?
  • Why is teaching important to you?
  • Focus on how you go about teaching, with concrete examples where necessary or appropriate, and a reflection on how students react(ed) to concepts and/or innovations. You may reference other materials you have submitted.
  • Share insights about teaching in your specific discipline (importance of the field, theoretical grounding when necessary).
  • It is acceptable to talk about your mistakes in order to demonstrate what you learned from them.
We recognize that teachers at different levels of teaching will have differing amounts and quality of experiences from which to draw. We are more interested in how well you were able to work within the parameters you were given. Your Teaching Philosophy should not exceed 750 words.


The following is a statement of teaching philosophy that I wrote for a different occasion.  I am not entirely happy with it, and I think that it is too focused on the subject of philosophy for the Baranger Award application (as opposed to an application for a job in a philosophy department), but it is a start:


College courses in many disciplines primarily present stories of intellectual triumphs, such as ingenious methods, surprising discoveries, and successful theories.  By contrast, a typical philosophy course primarily presents stories of intellectual failures: everyday concepts that resist analysis, simple paradoxes that resist resolution, and compelling questions that resist definitive answers.  From a pedagogical perspective, this feature of a typical philosophy course generates both challenges and opportunities.  One major challenge is to avoid giving students the impression that philosophy is pointless because it never makes any progress.  One major opportunity is to help students become more reflective and critical about their beliefs.


I sympathize with students who complain that philosophers seem unable to solve any of the major problems they set themselves.  I used to respond to students who came to me with this complaint by pointing out that philosophy does make progress of a sort---we now know that many seemingly plausible positions cannot be made to work.  However, they often found this answer unsatisfying, because all of the progress philosophers make seems to be negative: in many cases we know a lot about which answers will not work, but have learned little about the answer that will.  A better response, I now think, is to point out that this feature of philosophy tells us something important: it is exceedingly difficult to develop defensible views about many very basic issues.  As a former professor of mine put it, philosophy really teaches you that you can't just say any old thing.


Philosophy is less a body of information than a set of skills and habits of mind.  Students in philosophy courses should learn to read a document, understand its author's position, identify his or her argument for that position, and evaluate the strength of that argument.  They should also learn to develop their own positions on complex issues and to present cogent arguments for those positions.  These skills are fundamental to critical thinking and thus are useful not only in philosophy, but also in many professions and in everyday life.  Because philosophy is a highly contentious discipline, a philosophy course provides excellent opportunities for teaching these skills.  To some extent, students will pick these skills up naturally through the process of learning about and doing philosophy.  However, making explicit what these skills involve can accelerate this process.  In addition, it can help to give students opportunities to practice these skills and to receive feedback on their performance.


As a philosophy teacher, my primary goal is to help my students acquire the argumentative skills of a good philosopher.  In addition, I aim to help my students understand and internalize the specific content of the course.  Educational research suggests that testing students on a given body of material helps them internalize it more effectively than simply reviewing it with them, so I give frequent small quizzes.  Educational research also suggests that teaching something to someone else is one of the most effective ways to internalize it, so I have my students review their answers with one another before we discuss them together.  This peer review method has the additional benefit that weaker students can receive more personalized attention than I alone could give them.  Of course, I also review the material with them myself to prevent misconceptions from taking root and to answer questions that the peer review process fails to resolve.


Another distinctive feature of my teaching, besides peer review, is my interactive style of lecturing.  My basic approach is not to provide any information that I could instead elicit from the students.  As a result, in a one-hour recitation with twenty students, nearly every student speaks in every session.  I believe that this practice reinforces what the students have learned better than reciting all of the information to them myself would.  It also keeps the students alert and engaged, which is crucial for their learning.


Teaching philosophy is an excellent opportunity to help students acquire both critical thinking skills and a reflective habit of mind.  I have developed some techniques to try to make the most of this opportunity, and I look forward to continuing to improve my pedagogical skills.

Teaching Challenge (Baranger Award Application Materials I)

I solicited and received nominations for the Elizabeth Baranger Excellence in Teaching Award, which aims to "recognize and reward outstanding teaching by graduate students at Pitt."  I'll be posting here drafts of my application materials.  Feedback is welcome!

The first item I have written is a response to the following prompt:


In an essay of no more than 500 words, please (a) describe a challenge you faced and overcame as a teacher, explaining (b) how you dealt with it and (c) what you learned from it.
With regards to (a), consider answering some of the following questions: Was the challenge one of how you relayed information to students, how you assessed students, how you organized material, or perhaps with what kind of an attitude you approached the course? Were any particular aspects of your teaching philosophy put under scrutiny as a result of facing the challenge? Do you think this type of a challenge is commonly faced by graduate student teachers?
With regards to (b), consider answering some of the following questions: Was there any previous planning (for example, a well-made syllabus or a comprehensive teaching philosophy), which prepared you for the challenge that arose? Did you seek help from other graduate students or faculty members? Did you attempt to overcome the challenge the first time you encountered it, or was it only after realizing that the challenge was an on-going element that you decided to address it?
With regards to (c), consider answering some of the following questions: Is the challenge something to be prevented from semester to semester, or do you look forward to facing it again? Have you rethought your teaching philosophy in light of the experience? How has experiencing the challenge forced you to rethink the attitude you take into the classroom?

Here is my response:



The first course I ever taught aimed to prepare high school students for the ACT college entrance exam.  I loved my students and spent hours crafting each lesson.  My students liked me too and thought that I was very smart.  There was only one problem: from the beginning of the course to the end, my students’ practice test scores barely improved.  I was teaching, but my students were not learning.  One of my students summed up her experience in the course as follows: “I understand your lessons, and I feel like I’ve learned a lot.  But my scores aren’t getting any better, and I can’t figure out why.”

Those results bothered me.  I refined my lessons, and in subsequent courses my students did slightly better.  The improvements were not dramatic, however, until I received additional training to teach a second test type.  My trainer noticed a general problem with my teaching and called me on it repeatedly: I was talking too much.  Students would tune out during my elaborate explanations.  I needed to identify the most important points of the lesson and punch those points one at a time.

In general, I realized, I was focusing too much on developing thorough, logically correct lessons and not enough on presenting material in a way that helped students learn.  I streamlined my lessons, breaking them down into bite-sized pieces, each focused on one major point.  I read about pedagogical techniques and worked to incorporate them into my teaching.  (Many of the techniques I use are described in my statement of teaching philosophy.)  And I continued to monitor my students’ test results to see what was working and what was not.

This experience taught me, among other things, the importance of getting objective feedback about what my students are actually learning.  If I had only received feedback in the form of student evaluations, then I would have thought that the first course I taught was going rather well.  However, the data I received from their test results showed otherwise.  It is difficult in my field (History and Philosophy of Science) to get data on student learning as clear as the data I got from my test-prep courses, but it is possible to get useful information from student responses in class and on written assignments by paying attention to deficiencies in those responses and reflecting on how one’s teaching might have contributed to those problems.

Academic tradition says that a teacher’s responsibility is to present the material, and a student’s responsibility is to learn it.  That attitude lets teachers off the hook too easily.  Students bear some responsibility for their own learning, of course, but a good educator helps to bridge the gap between where a student is and where he or she should be by the end of a course through skillful pedagogy and sensitivity to the student’s point of view.

Friday, February 11, 2011

Oil and Water

Between 1896 and 1906, J. J. Thomson and his students performed a series of experiments that led to the "cloud method" for measuring the charge on a gaseous ion.  Around 1908, Millikan began working to improve upon the cloud method.  He first discovered that increasing the voltage he used allowed him to experiment on single droplets rather than an entire cloud, using what he called the "balanced water-drop" method.  Depending on whose account you believe, either Millikan or his graduate student Harvey Fletcher thought to use watch oil instead of water because watch oil would evaporate only very slowly.

Today I learned why water drops were so unsatisfactory.  It is impossible to experiment on them with an atomizer!  I spent several hours trying to figure out why I wasn't getting any water drops in the chamber only to realize that I was getting some, but they disappeared as soon as they arrived.  I knew that evaporation was an issue, but I didn't know that it happened so fast.

Sunday, February 6, 2011

Ioannidis' Argument

John Ioannidis is a “meta-researcher” who has written several well-known papers about the reliability of research that uses frequentist hypothesis testing.  (The Atlantic published a reasonably good article about Ioannidis and his work in November 2010.)  One of his most-cited papers, called “Why Most Published Research Findings Are False,” presents a more general version of the example I presented in my last post.  Steven Goodman and Sander Greenland wrote a response arguing that Ioannidis’ analysis is overly pessimistic because it does not take into account the observed significance value of research findings.  Ioannidis then responded to defend his original position.  I’m planning to work through this exchange, with an eye toward the question whether an argument like Ioannidis’ could be used to present the base-rate fallacy objection to error statistics in a way that is more consistent with frequentist scruples than Howson’s original presentation.

Ioannidis’ argument generalizes the example I gave in my previous post: it uses the same kind of reasoning, but with variables instead of constants.  Ioannidis idealizes science as consisting of well-defined fields i=1, …, n, each with a characteristic ratio Ri of “true relationships” (for which the null hypothesis is false) to “no relationships” (the null is true) among those relationships it investigates, and with characteristic Type I and Type II error rates αi and βi.  Under these conditions, he shows, the probability that a statistically significant result in field i reflects a true relationship is (1−βi)Ri/(Ri − βiRi + αi).

It’s somewhat difficult to see where this expression comes from in Ioannidis’ presentation.  This blog post by Alex Tabarrok presents Ioannidis’ argument in a way that’s easier to follow, including the following diagram:
[Tabarrok's diagram: 1000 hypotheses divided into true and false, with each group then divided into statistically significant and insignificant test results.]
Here’s the idea.  You start with, say, 1000 hypotheses in a given field, at the top of the diagram.  The ratio of true hypotheses to false hypotheses in the field is R, so a simple algebraic manipulation (omitting subscripts) shows that the fraction of all hypotheses that are true is R/(1+R), while the fraction that are false is 1/(1+R).  That brings us to the second row from the top in the diagram: if R is, say, ¼ (so that there is one true hypothesis investigated for every four false hypotheses investigated), then on average 200 out of 1000 hypotheses will be true and 800 false.  Of the 200 true hypotheses, some will generate statistically significant results, while others will not.  The probability that an investigation of a true relationship in this field yields a statistically significant result is the power of the test, 1−β.  If 1−β is, say, .6, then on average there will be 120 positive results out of the 200 true hypotheses investigated.  Similarly, of the 800 false hypotheses, some will generate statistically significant results; the probability that any one hypothesis will do so is α.  Letting α=.05, then, on average there will be 40 positive results for the 800 false hypotheses investigated.  Thus, the process of hypothesis testing in this field yields on average 40 false positives for every 120 true positives, giving it a PPV of 120/160 = .75.  Running through the example with variables instead of numbers, one arrives at a PPV of (1−β)R/(R − βR + α).  This result implies that PPV is greater than .5 if and only if (1−β)R > α.
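
To make the arithmetic concrete, here is a minimal Python sketch of the calculation (the function and variable names are mine, not Ioannidis’):

def ppv(R, alpha, power):
    """Positive predictive value under Ioannidis' idealization: R is the
    ratio of true to false relationships investigated in the field, alpha
    is the Type I error rate, and power = 1 - beta."""
    true_frac = R / (1 + R)         # fraction of investigated hypotheses that are true
    false_frac = 1 / (1 + R)        # fraction that are false
    true_pos = power * true_frac    # true relationships flagged as significant
    false_pos = alpha * false_frac  # false relationships flagged as significant
    return true_pos / (true_pos + false_pos)

# The worked example above: R = 1/4, power = .6, alpha = .05
print(ppv(0.25, 0.05, 0.6))  # 0.75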

Ioannidis goes on to model the effects of bias on PPV and to develop a number of “corollaries” about factors that affect the probability that a given research finding is true (e.g., “The hotter a scientific field… the less likely the research findings are to be true”).  He argues that for most study designs in most scientific fields, the PPV of a published positive result is less than .5.  He then makes some suggestions for raising this value.  These elaborations are quite interesting, but for the moment I would like to slow down and examine the idealizations in Ioannidis’ argument.

The idea that science consists of well-defined “fields” with uniform Type I and Type II error rates across all experiments is certainly an idealization, but a benign one so far as I can tell.  The assumption that each field has an (often rather small) characteristic ratio R of “true relationships” to “no relationships” is more problematic.  First, what do we mean by a “true relationship?”  One of the most basic kinds of relationship that researchers investigate is simple probabilistic dependence: there is a “true relationship” of this kind between X and Y if and only if P(X & Y) ≠ P(X)*P(Y).  However, if probabilistic dependence is representative of the relationships that scientists investigate, then one might reasonably claim that the ratio of “true relationships” to “no relationships” in a given field is always quite high, because nearly everything is probabilistically relevant to nearly everything else, if only very slightly.  In fact, a common objection to null hypothesis testing is that (point) null hypotheses are essentially always false, so that testing them serves no useful purpose.

One could avoid this objection by replacing “no relationship” with, say, “negligible relationship” and “true relationship” with “non-negligible relationship.”  However, for the argument to go through one would then have to reconceive of α as the probability of rejecting the null given that the true discrepancy from the null is negligible, and of β as the probability of failing to reject the null given that the true discrepancy from the null is non-negligible.  Fortunately, these probabilities would generally be nearly the same as the nominal α and β.

The assumption that each field has a characteristic R is more problematic than the assumption that each field has a characteristic α and β for a second reason as well: ascribing a value for R to a test requires choosing a reference class for that test.  The assumption that each field has a characteristic α and β is simply a computational convenience; α and β are defined for a particular test even though this assumption is false.  By contrast, the assumption that each field has a characteristic R is more than a convenience: something like it must be at least approximately true for R to be well defined in a particular case.  This point seems to me the Achilles’ heel of Ioannidis’ argument, and of attempts to persuade frequentists to treat PPV as a test operating characteristic on par with α and β.  A frequentist could reasonably object that there is no principled basis for choosing a particular reference class to use in a particular case in order to estimate R.  And even with a particular reference class, there are significant challenges to obtaining a reasonable estimate for R.

Thursday, February 3, 2011

Positive Predictive Value as an Operating Characteristic

The Mayo and Howson papers I examined in my last few posts came out of a symposium at the 1996 meeting of the Philosophy of Science Association.  In this post, I turn my attention to the third paper that came out of that symposium, this one by Ronald Giere.

Giere takes a somewhat neutral, third-party stance on the debate between Mayo and Howson, although his sympathies seem to lie more with error statistics.  He contrasts Mayo and Howson’s views as follows: Howson attempts to offer a logic of scientific inference, analogous to deductive logic, whereas Mayo aims to describe scientific methods with desirable operating characteristics.

This distinction does seem to capture how Howson and Mayo think about what they are doing.  However, it does not make me any more sympathetic to error statistics, because it seems to me a mistake to try to separate method from logic.  The operating characteristics of a scientific method are desirable to the extent that they allow one to draw reliable inferences, and the extent to which they allow one to draw reliable inferences depends on logical considerations.

Nevertheless, the logic/method distinction is useful for understanding the perspective of frequentists such as Mayo.  In fact, one may be able to use the insight this distinction provides to recast the base-rate fallacy objection in a way that will strike closer to home for a frequentist.  The key is to present the objection in terms of the positive predictive value of a test and to argue that positive predictive value (PPV) is an operating characteristic on par with Type I and Type II error rates.  In fact, a test can have low rates of Type I and Type II error (low α and β), but still have low positive predictive value.  Consider the following (oversimplified) example:

Suppose that in a particular field of research, 9/10 of the null hypotheses tested are true.  For simplicity, I will assume that all of the tests in this field use the same α and β levels: the conventional α=.05, and the lousy but fairly common β=.5.  The following 2x2 table displays the most probable set of outcomes out of 1000 tests:

                 Test rejects H0    Test fails to reject H0    Total
H0 is true             45                    855                900
H0 is false            50                     50                100
Total                  95                    905               1000

Intuitively, PPV is the probability that a positive result is genuine.  In more frequentist terms, it is the frequency of false nulls among cases in which the test rejects the null.  In this example, PPV is 50/95 ≈ .53.  As the example shows, a test can have low α and β (desirable) without having high PPV, if the base rate of false nulls is sufficiently low.  (Thanks to Elizabeth Silver for providing me with this example.)
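
Since the table is just arithmetic on the base rate and the error rates, a short Python sketch (my own, purely for illustration) reproduces it:

n = 1000                 # hypothetical tests in the field
true_nulls = 900         # 9/10 of the nulls tested are true
false_nulls = n - true_nulls
alpha, beta = 0.05, 0.5

false_pos = alpha * true_nulls        # 45: true nulls wrongly rejected
true_pos = (1 - beta) * false_nulls   # 50: false nulls correctly rejected

ppv = true_pos / (true_pos + false_pos)
print(ppv)  # 50/95, about 0.53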

Superficially, this way of presenting the base-rate objection appears to be closer to the frequentist framework than Howson’s way of presenting it.  However, a frequentist might object that PPV is not an operating characteristic of a test in the same way that Type I and Type II error rates are.  Type I error rates, one might think, are genuine operating characteristics because they do not depend on any features of the subject matter to which the test is applied.  One simply stipulates a Type I error rate and chooses acceptance and rejection regions for one’s test statistic that yield that error rate.  By contrast, to calculate PPV one has to take into account the fraction of true nulls within the subject area in question.  Thus, PPV is not an intrinsic characteristic of a test, but an extrinsic feature of the test relative to a subject area.

This objection ignores the fact that Type I error rates are calculated on the basis of assumptions about the subject matter under test—most often, assumptions of normality.  As a result, Type I error rates are not intrinsic features of tests either, but of tests as applied to subject areas in which (typically) things are approximately normal.  Normality assumptions may be more widely applicable and more robust than assumptions about base rates, but they are nonetheless features of the subject matter rather than features of the test itself.  Type II error rates are even more obviously features of the test relative to a subject matter, because they are typically calculated for a particular alternative hypothesis that is taken to be plausible or relevant to the case at hand.
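
The dependence of actual Type I error rates on distributional assumptions can be checked by simulation.  The following sketch (the sample size, distribution, and number of trials are arbitrary choices of mine) runs a nominal α = .05 one-sample t-test on skewed data for which the null is in fact true:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials, alpha = 10, 20000, 0.05
true_mean = 1.0  # the mean of an Exponential(1) distribution, so the null is true

rejections = 0
for _ in range(trials):
    sample = rng.exponential(scale=1.0, size=n)  # skewed, non-normal data
    t_stat, p_value = stats.ttest_1samp(sample, popmean=true_mean)
    if p_value < alpha:
        rejections += 1

print(rejections / trials)  # for data this skewed and n this small, typically well above the nominal .05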

A frequentist could respond simply by conceding the point: PPV is an operating characteristic of a test that is relevant to whether one can conclude that the null is false on the basis of a positive result.  To do so, however, would be to abandon the severity requirement and to move closer to the Bayesian camp.

The example given above uses the same kind of reasoning that John Ioannidis uses in his paper “Why Most Published Research Findings are False.”  It might be useful to move next to that paper and the responses it received.


Before moving on, I'd like to note a couple of other interesting moves Giere makes in his paper.  First, he characterizes Bayesianism and error statistics as extensions of the rival research programs that Carnap and Reichenbach were developing around 1950, but without those programs' foundationalist ambitions.  Second, Giere emphasizes a point that I think is very important: Bayesianism (as it is typically understood in the philosophy of science) is concerned with the probability that propositions are true.  It is not concerned (at least in the first instance) with how close to the truth any false propositions may be.  Yet in many (if not all) scientific applications, the truth is not an attainable goal.  Even staunch scientific realists admit that our best scientific theories are very probably false.  Where they differ from anti-realists is in claiming that our best theories are close to and/or tending toward the truth.  One might think that the emphasis in real scientific cases on approximate truth rather than probable truth favors error statistics over Bayesianism.  However, when one moves to real scientific cases one should also move to real Bayesian methods, including Bayesian methods of model building.  Those methods are Bayesian in that they involve conditioning on priors, but unlike the Bayesian methods philosophers tend to focus on, they aim to produce models that are approximately true rather than models that have a high posterior probability.  Unlike Bayesian philosophers, perhaps, Bayesian statisticians have developed a variety of methods that can handle the notion of approximate truth just as well as error-statistical methods can.

Saturday, January 29, 2011

Mayo’s Reasons for Rejecting Premise 4 of Howson’s Argument

I realize now that I misunderstood Mayo’s use of J in place of ~H.  ~H says that breast cancer is absent, whereas J says that breast disease (inclusive of breast cancer) is absent.  The idea seems to be that a non-cancerous breast disease is likely to trigger a false-positive result in a test for breast cancer, and that this possibility makes it the case that a positive test result does not pass H severely.

This point does not affect the upshot of my analysis.  Howson can simply stipulate a hypothetical case in which there is no state corresponding to J in which a false positive test is likely.  That is enough to show that the severity requirement is unsound in principle.

Moreover, Mayo grants that ~J (which says that breast disease is present) does pass a severe test despite having (we can assume) a low posterior probability.  Thus, she allows that a hypothesis can meet the severity requirement despite having a low posterior, effectively granting premise 2 of Howson’s argument, and turns her attention to premise 4.

Here is Mayo’s reconstruction of Howson’s argument modified to reflect the fact that Mayo denies that H passes a severe test but allows that ~J does so:

  1. An abnormal result is taken as failing to reject H (i.e., as “accepting H”); while rejecting J, that no breast disease exists.
  2. ~J passes a severe test and thus ~J is indicated according to (*). (Modified)
  3. But the disease is so rare in the population (from which the patient was randomly sampled) that the posterior probability of ~J given e is still very low (and that of ~H is still very high).  (Modified)
  4. Therefore, “intuitively,” ~J is not indicated but rather ~H is.  (Modified)
  5. Therefore, (*) is unsound.
Mayo’s argument against premise 4 is interesting, but an orthodox Bayesian has an easy response.  She points out that both error-statistical and Bayesian tests involve probabilistic calculations that are themselves deductive.  The error-statistical framework only becomes ampliative with the introduction of the severity requirement (*), which goes beyond those calculations to make an assertion about which claims are well supported by tests.  She demands that (*) be compared not against the deductive probabilistic calculations that Bayesians perform, but against a truly ampliative Bayesian rule.  What she is demanding, in effect, is a rule of detachment, which tells a Bayesian when to infer from a statement of the form “the probability of H is p” to the statement “H.”

A Bayesian has at least two possible responses to this maneuver.  First, it is not clear that Bayesian updating is a deductive method of inference.  It uses a rule—Bayes’ theorem—that follows from the axioms of probability; but those axioms are not dictated by classical logic, and neither is the normative claim that, for all propositions H and E, the right way to update one’s degree of belief in H upon an experience whose only direct epistemic import is to raise one’s degree of belief in E to 1 is to condition on E.  Second, Mayo has not given any reason why Bayesians should adopt a rule of detachment rather than remaining strict probabilists.  The demand that orange-selling Bayesians provide an apple to compare with her apple would be unfair if part of the Bayesian position were that oranges can do everything apples can do at least as well as apples do it.  (A Bayesian can still treat high-probability beliefs as full beliefs as a useful heuristic when doing so is not likely to lead to trouble.)

Having demanded (unfairly) that Bayesians adopt a rule of detachment, Mayo ascribes to Howson the following implicit rule:

  • There is a good indication or strong evidence for the correctness of hypothesis H just to the extent that it has a high posterior probability.

She then turns Howson’s example against him, ascribing to him this rule of detachment.  She notes that for a woman in her forties, the posterior probability of breast cancer given an abnormal mammogram is about 2.5%, which makes it very close to Howson’s hypothetical example.  Under an error-statistics approach, the hypothesis that such a woman does not have breast cancer does not pass a severe test with a positive result; nor does the hypothesis that such a woman does have breast cancer.  To provide strong evidence one way or the other, follow-up tests are needed.  Under a Bayesian approach with a rule of detachment, the fact that the posterior probability that the woman has breast cancer is small provides a good indication that breast cancer is absent, “so the follow-up that discovered these cancers would not have been warranted.”

This argument is grossly unfair to the Bayesian position.  For an orthodox Bayesian, whether follow-up tests are warranted (and whether the initial test was warranted) for a given individual depends on that individual’s expected utilities.  Rounding a probability of 2.5% down to 0% in an expected utility calculation is likely to lead to errors when the utility of the unlikely event is extremely high or extremely low, as in this case.  This fact speaks not against Bayesianism, but against simple rules of detachment.
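
A toy expected-utility calculation illustrates the point; the utility values below are invented solely for illustration:

p_cancer = 0.025  # posterior probability of breast cancer given the abnormal result

# Hypothetical utilities (arbitrary units): a follow-up test has a small
# cost; failing to follow up on an actual cancer is catastrophic.
u_followup = -1
u_missed_cancer = -1000

eu_followup = u_followup                     # -1, whatever the disease status
eu_no_followup = p_cancer * u_missed_cancer  # -25

print(eu_followup > eu_no_followup)  # True: the follow-up is warranted
# Rounding p_cancer down to 0 would set eu_no_followup to 0 and reverse the verdict.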

In summary, Mayo has shown that error statistics is sometimes more sensible than Bayesianism with a simple-minded rule of detachment.  But a sensible Bayesian would not use such a rule of detachment, so this conclusion has no force against Bayesianism.

Mayo’s Reasons for Rejecting Premise 2 of Howson’s Argument

In my previous post, I presented Mayo’s reconstruction of Howson’s argument against error statistics:
  1. An abnormal result is taken as failing to reject H (i.e., as “accepting H”); while rejecting J, that no breast disease exists.
  2. H passes a severe test and thus H is indicated according to (*).
  3. But the disease is so rare in the population (from which the patient was randomly sampled) that the posterior probability of H given e is still very low (and that of J is still very high).
  4. Therefore, “intuitively,” H is not indicated but rather J is.
  5. Therefore (*) is unsound.
In this post, I will examine the reasons Mayo gives for rejecting premise 2.

Again, in the paper I am presently considering* Mayo expresses her severity requirement as follows:

  • (*): e is a good indication of H to the extent that H has passed a severe test with e.
where a test is “severe” with respect to H if and only if that test has a very low probability of passing H if H is false.

Howson gives an example of a medical test in which the hypothesis that a given patient has the disease in question (which in Mayo’s version of the example is breast cancer) appears to pass a severe test with a positive result, yet the posterior probability of the hypothesis that the patient has the disease conditional on the positive result is low.  He takes this case to be a counterexample which shows that Mayo’s severity requirement is unsound.

Mayo responds in part by denying that the severity requirement has been met in this case.  That is, she rejects premise 2 in her reconstruction of Howson’s argument.  What reasons does she give for doing so?

First, after protesting a bit about the idealized nature of Howson’s example (which seems to me irrelevant to the point at issue), Mayo says that she will try to apply her severity requirement to it.  She does so as follows (p. S208):
  a. An abnormal result e is a poor indication of the presence of disease more extensive than d if such an abnormal result is probable even with the presence of disease no more extensive than d.
  b. An abnormal result e is a good indication of the presence of disease as extensive as d if it is very improbable that such an abnormal result would have occurred if a lesser extent of disease were present.

Mayo has in mind a more realistic case than Howson’s, in which a disease can be present to varying extents.  However, if her account aspires to provide a general theory of evidence, then it should apply to binary cases as well.  Thus, in the context of this debate it seems unfair of Mayo to change the example.  Sticking with Howson’s actual example, Mayo’s (a) and (b) reduce to the claim that a positive test result indicates the presence of disease to the extent that a positive result is improbable if the disease is absent.

At times, Mayo seems to be claiming something weaker for her severity requirement than that it provides a general theory of evidence—for instance, that it provides a reasonable guide for inductive inference when we do not have a strong evidential basis for assigning prior probabilities.  Moreover, she claims that this kind of situation is very common in science, which makes understanding her severity requirement and the error-statistical techniques that conform to it quite important for understanding scientific practice.  It seems to me that Mayo is on firm ground here.  Moreover, I suspect that there is a lot of room here for reconciling her approach with Bayesianism by showing, for example, that frequentist techniques provide reasonably good approximations to Bayesian methods when priors are not known with precision but are known to be not too extreme. 

However, Mayo goes further by presenting her account as a rival to Bayesianism, and by arguing not only that Bayesian techniques are hard to apply in many cases, but also that they are vitiated by their frequent dependence on epistemic probabilities.  As Clark Glymour points out in his paper “Instrumental Probability,” the claim that Bayesianism and the use of epistemic probabilities are “too subjective” is often motivated by confusing justification with content.  An epistemic probability is “subjective” in the sense that it is a property of an individual’s (idealized) belief state (content), but it may nevertheless have a strong “objective” evidential basis (justification).  When the “objective” justification for an epistemic probability is strong, I see no reason to object to it on the grounds that its content is subjective.

Returning to Howson’s example, it certainly appears that a positive result satisfies Mayo’s severity requirement, given that a positive result is quite improbable if the disease is absent (P(+|~H) = .05 in Howson’s example, but that number can be made as small as one likes so long as the incidence rate/prior probability is adjusted downward to compensate).  But Mayo denies that a positive result satisfies the severity requirement.
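
The numbers can be checked directly with Bayes’ theorem.  In the sketch below, the false-positive rate is Howson’s .05; the prior and sensitivity are illustrative values of my own choosing:

def posterior(prior, sens, fpr):
    """P(H | +) by Bayes' theorem, where sens = P(+ | H) and fpr = P(+ | ~H)."""
    return prior * sens / (prior * sens + (1 - prior) * fpr)

# With P(+|~H) = .05 and a prior of .001, even a highly sensitive test
# yields a low posterior:
print(posterior(0.001, 0.95, 0.05))    # ~0.019

# Shrinking the false-positive rate does not help if the prior shrinks
# proportionally, as noted above:
print(posterior(0.0001, 0.95, 0.005))  # ~0.019 again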

In support of this claim, Mayo points out that, unlike the Neyman-Pearson framework for statistical tests, error statistics does not use automatic accept/reject rules; for instance, within the error-statistical framework one generally would not infer that a point null hypothesis is true from the fact that one fails to reject that null hypothesis at a pre-specified alpha level.  The reason for this restraint is clear within the error-statistical framework: significance tests are unlikely to reject the null if the true value of the parameter of interest is close to the null value relative to the power of the test.  As a result, one cannot infer with severity from a failure to reject a point null hypothesis that that null hypothesis is true; one can at most infer with severity that the true value of the parameter is close to the null.  (For instance, one might estimate a (1-alpha)% confidence interval for the parameter value, which would contain the null value.)
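
As a concrete illustration of the confidence-interval move (the numbers are hypothetical, not Mayo’s): a sample whose mean lies near the null value yields an interval that contains the null value without licensing the inference that the null is exactly true.

import numpy as np
from scipy import stats

null_value = 0.0
n, sample_mean, sample_sd = 30, 0.1, 1.0  # hypothetical summary statistics
sem = sample_sd / np.sqrt(n)

ci = stats.t.interval(0.95, df=n - 1, loc=sample_mean, scale=sem)
print(ci)  # roughly (-0.27, 0.47): contains the null value 0.0, plus values well away from it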

This move of Mayo’s seems to me a significant improvement in the Neyman-Pearson framework.  However, it does not help in the case at hand, in which we are not considering a point null hypothesis about a continuous variable, but rather a hypothesis about a binary variable.  On the other hand, understanding this aspect of Mayo’s account does help in understanding Mayo’s next move: Mayo claims that a failure to reject the hypothesis that the patient has breast cancer given a positive test result does not indicate that the patient has breast cancer so long as there are alternatives to this hypothesis that would very often produce the positive result.

Notice the analogy with a test of a hypothesized value for a parameter, which is presumably motivating Mayo’s claim here: one generally cannot conclude with severity that a point null hypothesis is true, because slight deviations from that point null are effectively indistinguishable from the null in a hypothesis test.  In the same way, one cannot conclude with severity that a patient has breast cancer if there are other possible situations that would make a positive test outcome likely.

This requirement is surely too strong.  Suppose that there is a non-diseased condition that mimics whatever sign or symptom of breast cancer the test picks up, generating false positives, but that this condition is extremely rare—as rare as one likes.  Then there would be an alternative to the hypothesis that the patient in question has breast cancer that would very often produce a positive result, but that one would have no great need to rule out before concluding that the patient has breast cancer.  Obviously one needs to rule out all possible alternatives to make a deductive inference, but one does not need to rule out incredibly rare/improbable alternatives to make a solid inductive inference.

There is a sound motivation behind Mayo’s claim that failure to reject H with a particular result does not indicate that H is true as long as there are alternatives to H that would make that result probable: in order to be telling in favor of a hypothesis, evidence must not only agree with that hypothesis but also speak against alternative hypotheses.  However, the requirement goes too far.  It seems that the sensible approach is to bring in prior probabilities and to require that any alternative hypotheses that would make the test outcome probable be themselves sufficiently improbable that the probability that any one of them is true is very small.  Bayesianism implements this approach in a precise and well-motivated way, but in situations in which Bayes’ theorem is difficult to apply one could combine informal considerations of prior probability with the severity requirement to approximate Bayesian reasoning fairly well.

Getting back to the main point, does Mayo have a good argument against premise 2?  I think not.  In a realistic case, there could be many ways in which the hypothesis that a given person has a given disease could be false, some of which might make it probable that the person would test positive for the disease despite not having it.  Mayo would require ruling out such possibilities before declaring that the claim that the person has the disease is well supported.  As long as we're considering her account as a theory of evidence, however, we need not be constrained by what would happen in most realistic cases.  We can simply stipulate a hypothetical case in which the probability that someone who does not have the disease gets a positive result is .05, and in which there are no further facts that would allow us to partition the set of people who do not have the disease into some who would get a positive result with high probability and some who would not.  There are certainly many realistic cases in which we do not know any such facts, even if they exist, so this case is not so far removed from practice as to be wholly uninteresting.  In such a case, I do not see how Mayo's objections have any force against premise 2.

*The article I am considering is Deborah Mayo’s 1997 “Error Statistics and Learning from Error: Making a Virtue of Necessity.”  It appeared in Philosophy of Science Vol. 64, Supplement: Proceedings of the 1996 Biennial Meeting of the Philosophy of Science Association, Part II: Symposia Papers (Dec., 1997), pp. S195-S212.  It is a response to Colin Howson’s “Error Statistics in Error,” pp. S185-S194 in the same issue.