In this post I continue my review of Michael Marder’s book, Research Methods for Science (Cambridge University Press, 2009). Last time I discussed the author’s failure to address randomization in experimental design. A second lost opportunity is the lack of a systematic discussion of sources of variability. Although chapter 3 is dedicated to statistics, chapter 2 introduces a number of statistical concepts and methods in the context of “error analysis.” The author spends much of the chapter discussing measurement error (both random and systematic), which is a good thing. However, he fails to acknowledge that other sources of variability, not just measurement error, need to be accounted for in experiments in the life and social sciences. Biologists use the terms “biological variability” and “technical variability” to distinguish between natural biological variation in a population and measurement error, respectively. Chapter 2’s example of fish spine length measurements presents a good (but lost) opportunity to make this important distinction. Statistical analysis in this example is required not so much because of measurement error (which is not assessed at all in the study as Marder describes it) but because of the natural variation of fish spine lengths in the two lakes. Partitioning the sources of variability is one of the most important concepts in applied statistics, and although the machinery of analysis of variance is needed to cope with it fully, the author could at least have introduced it conceptually. Measurement error in the fish spine study could have been assessed by taking multiple, independent measurements on each fish, and its extent would likely be much smaller than the variation between fish. A study incorporating such measurements also provides an example of structured data, which cannot all be treated as independent and identically distributed. Structured data is too ubiquitous in science and ordinary life to leave unmentioned.
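To make the idea concrete, here is a minimal sketch (in Python, with invented numbers rather than Marder’s data) of how multiple independent measurements per fish would let one separate technical from biological variability, using the variance-component estimates of a balanced one-way random-effects ANOVA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: 20 fish, 3 independent spine-length measurements each.
# Biological (between-fish) spread is assumed much larger than measurement error.
n_fish, n_meas = 20, 3
true_lengths = rng.normal(loc=4.0, scale=0.5, size=n_fish)  # mm; biological variation
data = true_lengths[:, None] + rng.normal(scale=0.05, size=(n_fish, n_meas))  # + measurement error

fish_means = data.mean(axis=1)
grand_mean = data.mean()

# Balanced one-way random-effects ANOVA decomposition (fish = random factor)
ss_within = ((data - fish_means[:, None]) ** 2).sum()
ss_between = n_meas * ((fish_means - grand_mean) ** 2).sum()
ms_within = ss_within / (n_fish * (n_meas - 1))    # estimates measurement-error (technical) variance
ms_between = ss_between / (n_fish - 1)
var_biological = max((ms_between - ms_within) / n_meas, 0.0)  # estimates between-fish variance

print(f"technical (measurement) variance ~ {ms_within:.4f}")
print(f"biological (between-fish) variance ~ {var_biological:.4f}")
```

In a design like this, the within-fish mean square estimates the measurement-error variance, and the excess of the between-fish mean square over it estimates the biological variance.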
A third lost opportunity is the distinction between statistical and practical/clinical significance. The author pays lip service to this distinction (p. 54) but does not walk the walk. Earlier in the book (p. 18) he presents, as an example of hypothesis-driven research, the question of whether one aspirin cures at a faster rate than another. An even earlier example is “finding which of two medical treatments is more effective” (p. 4). As John Tukey (1991) would have argued, the cure rates of the two aspirins will always differ at some decimal place. The more important questions are by how much the cure rates differ, and how precisely we know it. The answers are provided by point and interval estimation, not hypothesis testing (e.g., Gardner & Altman, 1986). (In fact, reporting the confidence interval tacitly implies a hypothesis test: if the confidence interval for the difference in cure rates covers the null hypothesis value, typically zero, statistical significance was not demonstrated.) In the author’s defense, the obsession with hypothesis testing at the expense of estimation is rampant in the scientific community, and statisticians themselves are partly to blame for this misplaced emphasis. The resulting attitude can have perverse consequences. A practically or clinically negligible difference between medical treatments can always be found statistically significant if a sufficiently large number of patients is enrolled in a clinical trial. This has led to large oncology trials of treatments with small therapeutic effects (and non-negligibly harmful side effects), run in order to chase statistical significance (Horrobin, 2003). This is an abuse of the altruism asked of patients in such trials, who must suffer before they die in order to secure knowledge of a statistically significant but clinically questionable outcome.
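As an illustration of the difference between the two questions, here is a small sketch (hypothetical counts, not from any trial) showing how a confidence interval for the difference in cure rates can reveal a clinically negligible effect even when a hypothesis test declares it statistically significant:

```python
import numpy as np
from scipy import stats

# Hypothetical cure counts for two aspirin formulations (invented numbers):
# with very large n, a clinically negligible difference becomes "significant".
n1, cured1 = 200_000, 140_400   # 70.2% cure rate
n2, cured2 = 200_000, 139_600   # 69.8% cure rate

p1, p2 = cured1 / n1, cured2 / n2
diff = p1 - p2
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se   # 95% Wald CI for the difference

# Conventional two-sided z-test of "no difference" (pooled standard error)
pooled = (cured1 + cured2) / (n1 + n2)
se0 = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
p_value = 2 * stats.norm.sf(abs(diff) / se0)

print(f"difference in cure rates = {diff:.4f}")
print(f"95% CI = ({ci_low:.4f}, {ci_high:.4f}), p = {p_value:.3g}")
# The test says "significant", but the interval says the difference is under
# one percentage point, which a clinician may well regard as negligible.
```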
The examples of statistical results reported in Sec. 5.3.5 (p. 159) are dominated by hypothesis testing; no confidence intervals for quantities of scientific interest are provided. In the lichen example, Marder does provide standard errors for meaningful quantities (lichen sizes in two regions), but not a confidence interval for their difference (which appears to be the pivotal question of the study); only a p-value is given. In the stickleback example, he again provides estimates of scientifically meaningful quantities, but without any uncertainty statements attached to them. In both examples, p-values are reported to more than two decimal places; such numerical precision is almost never meaningful.
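Producing the missing interval is trivial once the standard errors are in hand; a sketch with invented numbers standing in for the lichen sizes:

```python
import numpy as np

# Invented summary statistics standing in for the lichen example:
# mean size and standard error of the mean in each of two regions.
mean_a, se_a = 42.0, 1.5
mean_b, se_b = 37.0, 1.8

diff = mean_a - mean_b
se_diff = np.sqrt(se_a**2 + se_b**2)   # SEs of independent samples combine in quadrature
ci_low, ci_high = diff - 1.96 * se_diff, diff + 1.96 * se_diff

print(f"estimated difference = {diff:.1f}, approximate 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```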
Another example of the weakness of the hypothesis testing approach is Marder’s bottle rocket height example, discussed in Sec. 3.8 on the chi-square goodness-of-fit test. Here, a parabolic curve is fitted to trajectory data, and the chi-square test is used to evaluate the quality of the fit. This is common practice, but questionable when unaccompanied by additional work. Goodness of fit should not be boiled down to a single number, the p-value of a statistical test. The quality of the fit may be good in one region of the curve but poor in another; plotting the data (Fig. 3.13) helps to evaluate where and how the fit may be poor. The root mean square error (RMSE) of the curve fit provides a quantitative estimate of how scattered the data are around the fitted curve. An estimate like the RMSE can help evaluate whether the goodness of fit is practically acceptable, regardless of statistical significance, which is driven as much by sample size as by the magnitude of error. The chi-square p-value alone cannot address practical significance.
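A sketch of what such additional work might look like, with simulated trajectory data standing in for the book’s measurements and an assumed per-point uncertainty:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated bottle-rocket data: time (s) and height (m), with an assumed
# measurement uncertainty sigma on each height. Not the book's numbers.
t = np.linspace(0.0, 3.0, 16)
sigma = 0.5
h = 1.0 + 15.0 * t - 4.9 * t**2 + rng.normal(scale=sigma, size=t.size)

# Fit the parabola, then look at the fit in more than one way.
coeffs = np.polyfit(t, h, deg=2)
fitted = np.polyval(coeffs, t)
resid = h - fitted

chi2_stat = np.sum((resid / sigma) ** 2)
dof = t.size - 3                          # three fitted parameters
p_value = stats.chi2.sf(chi2_stat, dof)   # goodness-of-fit p-value

rmse = np.sqrt(np.mean(resid ** 2))       # typical scatter about the curve, in meters

print(f"chi-square = {chi2_stat:.1f} (dof = {dof}), p = {p_value:.2f}")
print(f"RMSE = {rmse:.2f} m")
# Plotting resid against t (e.g., with matplotlib) shows *where* the fit is
# poor, something neither the p-value nor the RMSE alone can reveal.
```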
In Sec. 2.1.1, Marder discusses null and alternative hypotheses. However, he fails to make a distinction between conventional null hypotheses of no effect (innocent until proven guilty) and equivalence hypotheses, where one is trying to demonstrate that two effects are (by pre-specified criteria) equivalent, and the default position is that they are not. (Such hypotheses are useful, for instance, in research on generic drugs.) Occasionally I’ve seen scientists use conventional null hypothesis testing where equivalence testing is more appropriate, usually because they know no other way. Marder has one example that resembles the setup of an equivalence hypothesis: in Table 2.1 (p. 19), he lists the alternative hypothesis that “A Toyota Camry weighs exactly 1000 kg” against the null hypothesis that it does not. This pair of hypotheses is structured like a “one sample” version of an equivalence hypothesis, but he does not comment on how its structure differs from the more traditional hypothesis pairs in Table 2.1. In Table 1.1 (p. 5), he says that the hypothesis that a Toyota Camry weighs exactly 1000 kg is “silly if left as a hypothesis. There is no reason that the weight of a car should come out to such a neat round number.” He does not seem to realize that the same “silliness” exists in “two sample” problems: as mentioned above, two medical treatments always differ at some decimal place, possibly negligibly so (Tukey, 1991).
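For readers unfamiliar with equivalence testing, here is a minimal sketch of the standard two one-sided tests (TOST) procedure applied to a one-sample version of the Camry example, with invented weights and an arbitrary equivalence margin:

```python
import numpy as np
from scipy import stats

# Invented Camry weights (kg) and an arbitrary equivalence margin of +/- 25 kg
# around the 1000 kg target.
weights = np.array([1012.0, 998.0, 1005.0, 1010.0, 1001.0, 1008.0, 995.0, 1003.0])
target, margin = 1000.0, 25.0

n = weights.size
mean = weights.mean()
se = weights.std(ddof=1) / np.sqrt(n)

# Two one-sided t-tests (TOST): reject both "mean <= target - margin"
# and "mean >= target + margin" to conclude equivalence within the margin.
t_lower = (mean - (target - margin)) / se
t_upper = (mean - (target + margin)) / se
p_lower = stats.t.sf(t_lower, df=n - 1)
p_upper = stats.t.cdf(t_upper, df=n - 1)
p_tost = max(p_lower, p_upper)   # equivalence is demonstrated if this is small

print(f"mean = {mean:.1f} kg, TOST p-value = {p_tost:.3g}")
```

Here the burden of proof is reversed: the default position is that the weight is not within the margin of 1000 kg, and the data must demonstrate that it is.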
Addressing the complaints outlined in this and the previous posts would not require lengthening the book by much. Randomization could be dealt with in less than a page in the main text, with another page in the spreadsheet appendix showing how to actually carry it out. Sources of variation could be addressed in a single paragraph in the fish spine length example, with a reference to a more advanced text. Addressing practical/clinical significance vis-à-vis hypothesis testing would require rewriting a number of passages, but without making them much longer. In short, it should have been easy to make Research Methods for Science a much better book while keeping it short. I will continue to examine the book's deficiencies in the next two posts.
References
M.J. Gardner and D.G. Altman (1986): Confidence intervals rather than p values: estimation rather than hypothesis testing. British Medical Journal, 292: 746-750.

D.F. Horrobin (2003): Are large clinical trials in rapidly lethal diseases usually unethical? Lancet, 361: 695-697. [I hesitate to cite the work of a figure as controversial as Horrobin, but I think the ideas expressed in this particular paper, published a few months before his death from cancer, may actually have some merit.]

J.W. Tukey (1991): The philosophy of multiple comparisons. Statistical Science, 6: 100-116.