In this post I continue my review of Michael Marder's book, Research Methods for Science (Cambridge University Press, 2009). In my last post I began to take issue with the statistical thinking illustrated in the book; I will continue doing so here.
The coin flipping problem
As mentioned in Part 1 of this review, coin flipping is one of Marder's main examples for illustrating hypothesis-driven research. However, his analysis of coin flipping data is inconsistent, sometimes treating the resulting data (correctly) as binary and other times (incorrectly) applying statistical methods intended for floating-point data. Marder pays lip service to matching the statistical method to the data type, most emphatically in his discussion of the chi-square test (p. 101), but he doesn't follow this principle when dealing with his coin flipping data.
Let me be specific. The coin flipping experiment is introduced in Sec. 2.1, where heads are represented by 1's and tails by 0's (p. 23). In Sec. 3.3 he correctly uses a binomial probability model for such data, and he notes there that “Assigning 0 and 1 to heads and tails is an arbitrary choice; any two integers would do” (p. 63). However, in most other places in the book he takes these integers literally: he calculates means and standard deviations on such data, interprets these quantities, and even carries out a Z test (pp. 28-29, 71-73, 88-89)! None of these arithmetical operations is meaningful for categorical data arbitrarily coded as integers; at the very least, the interpretation of such calculations depends heavily on the special choice of the integers 0 and 1. The approach does not generalize to multinomial non-numeric data, and students must be taught not to assign arbitrary integers to categorical data if the analysis will take those integers literally (e.g., by applying methods designed for floating-point numbers). I would argue that much of Marder's analysis of the coin flipping problem is technically incorrect, but the wider point is that his treatment illustrates and blesses poor practices and bad habits.
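To make the contrast concrete, here is a minimal sketch of treating the data on its own terms, as binary outcomes under a binomial model, rather than as numbers to be averaged and fed into a Z test. The counts are illustrative values of my own, not Marder's, and the exact test shown is just one standard way to do this.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_two_sided_pvalue(k, n, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes
    no more likely than the observed count k, under the null value of p."""
    p_obs = binom_pmf(k, n, p)
    return sum(binom_pmf(j, n, p) for j in range(n + 1)
               if binom_pmf(j, n, p) <= p_obs + 1e-12)

# Illustrative data: 60 heads in 100 flips of a putatively fair coin.
print(exact_two_sided_pvalue(60, 100))  # about 0.057
```

The result depends only on the observed count of heads, not on which integers happen to label heads and tails.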
Correlation and regression
These topics are discussed in Sec. 4.3. Marder describes the least squares linear regression line as the best fit line, but this description is not always valid. The calculations he describes in this section are based on conditioning on the independent (predictor) variable, treating it as fixed. When such conditioning is appropriate, we can call the resulting line a best fit line. However, it is common in science to encounter cases where it is not appropriate, such as when the scientific interest lies in estimating the relation between the two variables rather than in predicting one given knowledge of the other. Moreover, when both variables are random and/or measured with error, the least squares line may provide a biased estimate of the slope. Better solutions in these situations have been around for years (e.g., orthogonal distance regression, geometric mean regression, measurement error regression), but they are not typically discussed in introductory books, including Marder's. Consequently, many scientists proceed with simple linear regression in cases where it is not appropriate, because they don't know any other way.
Marder also describes the correlation coefficient as “a measure of how well or poorly the points...fall on a straight line” (p. 127). This is true but not the whole truth. The correlation coefficient is also sensitive to the slope of the fitted line, and thus it is not a pure measure of the clustering of data about the best fit line (e.g., Loh, 1987).
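To illustrate the bias, here is a small simulation sketch of my own (not from the book), in which both variables carry measurement error; the true slope, noise levels, and random seed are all assumptions chosen only for illustration. Ordinary least squares attenuates the slope, while geometric mean (reduced major axis) regression, one of the simple alternatives mentioned above, is less affected in this symmetric-error setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_slope = 1000, 2.0

x_true = rng.normal(0.0, 1.0, n)
y = true_slope * x_true + rng.normal(0.0, 0.5, n)  # noise in y
x = x_true + rng.normal(0.0, 0.5, n)               # noise in x as well

# Ordinary least squares slope: conditions on x and ignores its error.
ols_slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Geometric mean (reduced major axis) slope: sign(r) * sd(y) / sd(x).
r = np.corrcoef(x, y)[0, 1]
gmr_slope = np.sign(r) * np.std(y, ddof=1) / np.std(x, ddof=1)

print(f"OLS slope: {ols_slope:.2f}")  # attenuated, roughly 1.6 here
print(f"GMR slope: {gmr_slope:.2f}")  # roughly 1.8, closer to the true 2.0
```

Which alternative is appropriate depends on the error structure; the point is simply that the default least squares line is not automatically the best fit line for every scientific question.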
Sensitivity and specificity
Sec. 3.6.6, on Type 1 errors, ends with an example: mammogram error rates. Marder describes the false positive and false negative rates of mammograms, with a little context. However, the discussion is very cursory and fails to acknowledge an important distinction: the false positive and false negative rates are conditional on knowing the true disease status. When diagnostic results are reported to an individual patient, however, the probabilities that the patient actually is or is not diseased, given the reported diagnosis, can be quite different from the test's sensitivity and specificity, since the former probabilities also depend on the prevalence of the disease. These quantities (the positive predictive value and negative predictive value) are conditioned in the opposite direction from the sensitivity and specificity discussed by Marder, and the two sets of error rates are frequently confused with each other, even by medical professionals. (In the legal profession, a similar confusion about the direction of conditioning is known as the “prosecutor's fallacy.”) The routine use of mammograms in certain age groups has been a politically and medically controversial topic in recent years, and many readers of the book know, or will come to know, a woman receiving a positive diagnosis from a mammogram. The cursory and incomplete discussion of mammogram error rates in Marder's text, I fear, may be harmful to readers if left without further explanation. Either he should omit the example or give a fully nuanced discussion of it, as can be found for instance in Gigerenzer (2002).
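A short worked sketch makes the distinction plain; the sensitivity, specificity, and prevalence below are illustrative numbers of my own, not figures from Marder or from Gigerenzer.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)

# Hypothetical screening test: 90% sensitivity, 91% specificity,
# applied in a population with 1% disease prevalence.
print(positive_predictive_value(0.90, 0.91, 0.01))  # about 0.09
```

Even for a test with impressive-looking error rates, a positive result in a low-prevalence population can correspond to a fairly small probability of actually having the disease; that is precisely the distinction a cursory treatment glosses over.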
Precision and accuracy
Marder's discussion of precision and accuracy (Sec. 2.2.3) is an example of muddled writing. He never actually provides definitions for the two terms, and seems to conflate precision with numerical precision and significant digits (a distinct topic otherwise absent in the book). Accuracy refers to the systematic error of a measurement process; it is quantified by the deviation between a true value (for instance, of a standard sample assigned a reference value) and the mean of replicated measured values. Precision is the degree of scatter of replicated measured values about their mean, regardless of accuracy. Measurements can be accurate and not precise, and vice versa. Examples of clearly written discussions of precision and accuracy may be found in Dunn (2010, Sec. 7.5) and at greater length in Mandel (1984, chapter 6).
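Here is a minimal simulation sketch of the distinction, with a made-up true value, bias, and scatter: the deviation of the mean of replicates from the true value speaks to accuracy, the scatter of the replicates about their mean to precision.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value = 10.0

# Hypothetical replicate measurements from a biased but precise instrument.
measurements = true_value + 0.5 + rng.normal(0.0, 0.05, size=20)

bias = measurements.mean() - true_value   # systematic error -> accuracy
scatter = measurements.std(ddof=1)        # spread about the mean -> precision

print(f"bias = {bias:.3f}, scatter = {scatter:.3f}")
# Large bias with small scatter: precise but not accurate.
```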
Model assumptions
In the discussion of U.S. population growth models (Sec. 4.6.1), an opportunity is lost to illustrate checking the plausibility of model assumptions. Although Marder concedes that his model's prediction “is only as good as the assumptions that led to” the model (p. 140), he makes no attempt to evaluate the plausibility of these assumptions. The model assumes that there is a fixed maximum population and a fixed growth rate. The growth rate is assigned the value of a historical average (for 1950-2000), and the maximum population is arbitrarily set to 6.4 billion. The author's discussion seeks mainly to illustrate how to simulate an iterated map model in spreadsheet software, but he could also have given students some guidance on comparing a model's forecasts with, say, historical data or data from other countries. Such comparisons are an essential component of model validation, a process commonly discussed in engineering. In "pure" science, however, model validation is rarely discussed in any formal way, except in certain fields where forecasting is of central interest (e.g., weather and climate).
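For readers who want to experiment outside a spreadsheet, here is a minimal sketch of a logistic-type iterated map; I am assuming a recursion of the form p_{n+1} = p_n + r p_n (1 - p_n/K), and the starting population, growth rate, and carrying capacity below are placeholders of mine, not necessarily the values Marder uses.

```python
def iterate_population(p0, r, K, n_steps):
    """Iterate p_{n+1} = p_n + r * p_n * (1 - p_n / K)."""
    trajectory = [p0]
    for _ in range(n_steps):
        p = trajectory[-1]
        trajectory.append(p + r * p * (1.0 - p / K))
    return trajectory

# Placeholder values (billions of people, annual steps), for illustration only.
p0 = 0.28   # rough U.S. population around 2000
r = 0.01    # assumed fixed annual growth rate
K = 6.4     # the fixed maximum population used in the book's example
forecast = iterate_population(p0, r, K, n_steps=50)
print(forecast[-1])
```

Lining such a trajectory up against subsequent census figures, or against other countries' histories, is exactly the kind of plausibility check that the text passes over.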
Error propagation
In Sec. 2.2.4, Marder discusses the rules for propagation of error when calculating quantities that are arithmetic combinations of other quantities measured with error. The rule for adding two quantities (their uncertainties add in quadrature) is justified by citing a probabilistic argument given later in the book (p. 80), which shows that this holds for standard deviations when adding two independent random variables. This is a subtle sleight of hand. The rules for propagation of error apply equally to deterministic uncertainties, as justified by approximation theory (namely, first-order Taylor approximation); see Taylor (1997). It is reassuring that the approximation-theoretic and probabilistic rules for propagating error are consistent. However, an opportunity is lost to distinguish between a deterministic uncertainty, such as the finite resolution of a measurement system, which can be attached to a single measurement, and a probabilistic uncertainty, which can only be evaluated from replicated, independent measurements. Incidentally, the latter is preferred, since the uncertainty of a single measurement based on finite resolution is almost always an underestimate of the full uncertainty of the measurement process, which is best characterized by replicated, independent measurements, preferably carried out on multiple days by multiple technicians, under varying conditions. The uncertainty of a measurement instrument is always less than the uncertainty of the measurement process as a whole -- this is another arch principle of statistical thinking often lost on physicists (of which I am one). Ironically, it was the great physicist-turned-statistician W. Edwards Deming, among others, who promoted a process-focused view of variability. (I hope to explore this issue in greater depth in a future post.)
The next and final post of this series will collect some miscellaneous criticisms of the book.
References
P.F. Dunn (2010): Measurement and Data Analysis for Engineering and Science, 2d ed. CRC Press.
G. Gigerenzer (2002): Calculated Risks: How to Know When Numbers Deceive You. Simon & Schuster.
W.-Y. Loh (1987): Does the correlation coefficient really measure the degree of clustering around a line? J. Educ. Stat., 12: 235-239.
J. Mandel (1984): The Statistical Analysis of Experimental Data, corrected reprint. Dover.
J.R. Taylor (1997): An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements, 2d ed. University Science Books.