Wednesday, November 30, 2016

"To err is human, but so often?" - David Freedman

Nature's editorial this week discusses the unleashing of the "statcheck" computer program on psychology journal articles.  Evidently it is an automated mechanism to detect errors in the calculation of p-values reported in published papers.
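For the curious, the core of such a check is simple enough to sketch.  The fragment below is my own minimal illustration, not the actual statcheck tool (an R package whose real work lies in parsing the reported statistics out of article text): it recomputes the p-value implied by a reported t statistic and its degrees of freedom, and flags a mismatch with the p-value stated in the paper.  The function name, tolerance, and example numbers are hypothetical.

```python
from scipy import stats

def check_t_report(t_value, df, reported_p, tol=0.01, two_sided=True):
    """Recompute the p-value implied by a reported t statistic and compare it
    with the p-value stated in the paper (a crude statcheck-like consistency check)."""
    p = stats.t.sf(abs(t_value), df)   # one-sided tail probability
    if two_sided:
        p *= 2
    return abs(p - reported_p) <= tol, p

# Example report: "t(28) = 2.20, p = .04".  The recomputed p is about .036,
# so the reported value is consistent within the (arbitrary) tolerance.
consistent, recomputed = check_t_report(2.20, 28, 0.04)
print(consistent, round(recomputed, 3))
```

The hard part in practice is extracting the reported statistics from the text of thousands of articles, which is presumably where the disputes over statcheck's own error rate arise.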

While I do not object to anything said in the editorial, my concern is that there is very little penalty for carelessness in scientific research.  P-values are actually the least of my worries; of greater concern are errors, or even sub-optimal practices, in the design, execution, and reporting of research.  Statistical inference is of no value when these other problems are present, and even when they are absent, statistical inference remains of limited value compared with a descriptive presentation of the data.  There are several reasons for this, such as:
  • Statistical inference presumes some kind of generalization, usually to a larger, stable population of which the data in the study can be thought of as representative.  This is rarely justified.
  • The statistical analysis adds information to the data in the form of an assumed probability model.  This model's assumptions may well influence the outcome more than the data does.
  • Statistical inference is an inherently confirmatory activity, while most research is exploratory.  Statistical models in this context are overfitted to the data, and the generalization implied by statistical inference is invalid.
Nonetheless, sloppy calculation is a sign of carelessness, and for this reason the "statcheck" episode has certainly done a service if it disincentivizes future carelessness.  On the other hand, legitimate criticisms of "statcheck's" own error rate have been raised.  I see this as the needed back-and-forth in the ongoing discussion of reproducible research, and the accusations of "harassment" leveled at "statcheck's" creators strike me as over-sensitive and unwarranted.



Saturday, October 15, 2016

Book review: The Five Ages of the Universe, by F. Adams and G. Laughlin

Adams and Laughlin (1999) propose that the history of the Universe (past and future) be divided into five periods, not unlike geologic eras, based on what we've learned from physics and cosmology so far. These eras are the primordial era, the stelliferous era (in which we now live), the degenerate era, the black hole era, and the dark era. Much of this is an extrapolation of our physics into the far future of the Universe, and is thus somewhat speculative. The authors propose a Copernican time principle, which states that the era in which we humans find ourselves is not a privileged one in the history of the Universe. (Copernicus earlier showed that our location, the Earth, is not a special one in the solar system.)

Reference


Fred Adams and Greg Laughlin, 1999: The Five Ages of the Universe: Inside the Physics of Eternity. New York: Free Press.

Book review: The Trouble with Physics, by Lee Smolin

This controversial book (Smolin, 2006) is an attack on string theory and its dominance in theoretical physics. The author has worked in string theory himself, as well as in a major competing approach, loop quantum gravity. He argues that string theory has led nowhere despite being the dominant approach in the field for a long time. His scientific criticisms of string theory are heavily disputed in the community, and he may have overstated the case. However, he also makes a sociological criticism, that (in the U.S.) string theory has suffocated funding and employment opportunities for physicists who pursue alternative theories. Here Smolin's case seems more compelling.

The book has four parts. The first, "The Unfinished Revolution", is an enjoyable capsule history of unification in theoretical physics. Here the author proposes his list of the five great unsolved problems in physics: (1) combining quantum theory with general relativity, (2) resolving the difficulties in the foundations of quantum theory, perhaps by replacing it, (3) finding a theory unifying particles and forces, (4) explaining the values of the free constants in the Standard Model of Particle Physics, and (5) explaining dark matter and dark energy, or alternatively explaining the values of the constants in the Standard Model of Cosmology. The second part of the book, "A Brief History of String Theory", is precisely that. Here Smolin presents his assessment of the successes and alleged failures of string theory; I found this the toughest going and least enjoyable part of the book. The third part, "Beyond String Theory", has three chapters. The first discusses experimental and observational anomalies--for me, this was the most exciting part of the book. The other two chapters discuss speculative theories of physics, alternatives both to currently established theory and to string theory. The final part of the book, "Learning from Experience", delves into the philosophy and sociology of physics. This is perhaps the most important part of the book. Smolin feels that theoretical physics has run aground and is ripe for a paradigm shift: the "shut up and calculate" mentality that has been successful for the last 60 years has run its course, and it may be time for radical new ideas, yet the structure and sociology of the physics community is currently an obstacle to any such radicalism. Although I cannot follow Smolin all the way in this section, I too am a critic of the academic tenure system and the funding mechanisms for science in the U.S.

Smolin has thought a great deal about the history, philosophy, and sociology of the physics profession. This is unusual for a physicist nowadays. His book presents an opportunity for the rest of us to do so too.

Reference


Lee Smolin, 2006: The Trouble with Physics: The Rise of String Theory, the Fall of a Science, and What Comes Next. Houghton Mifflin.

Book review: Calculated Risks, by Gerd Gigerenzer

Gigerenzer (2003) addresses statistical thinking, and the lack thereof, in medical and legal contexts, focusing on handling probabilities (risks). He identifies several issues and correctives:
  • The illusion of certainty. For instance, most patients are not told that diagnostic medical tests can make mistakes, and are not informed of the error rates (false positives and false negatives).
  • Ignorance of risk. Even if uncertainty is acknowledged, laymen and experts often do not know how great the level of risk is.
  • Miscommunication of risk. Because of the peculiarities of human psychology, the way that risk information is usually communicated (for example, as single-event probabilities or relative risks rather than natural frequencies) can be misleading. For instance, absolute risk reduction, relative risk reduction, and number needed to treat are all mathematically equivalent ways to express the efficacy of a treatment, yet relative risk reduction is usually the framing that leaves the most favorable impression on the untutored mind (a small worked example follows this list).
  • Clouded thinking. Even when risks are communicated properly, both experts and laypeople may not know how to reason with them. Expressing probabilities as natural frequencies forces attention onto the reference class, and it allows people with little training to carry out Bayes' rule calculations easily.
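To make the last two points concrete, here is a small sketch of my own (the numbers are made up for illustration, not taken from the book): it computes the three risk measures for a hypothetical trial and then works a diagnostic-test problem in the natural-frequency style Gigerenzer recommends.

```python
def risk_measures(control_event_rate, treated_event_rate):
    """Absolute risk reduction, relative risk reduction, number needed to treat."""
    arr = control_event_rate - treated_event_rate
    rrr = arr / control_event_rate
    nnt = 1.0 / arr
    return arr, rrr, nnt

# Hypothetical trial: the event rate drops from 2 in 1,000 to 1 in 1,000.
arr, rrr, nnt = risk_measures(0.002, 0.001)
print(f"ARR = {arr:.3%}, RRR = {rrr:.0%}, NNT = {nnt:.0f}")
# ARR = 0.100%, RRR = 50%, NNT = 1000 -- the same fact, very different impressions.

# Natural frequencies for a diagnostic test (again, hypothetical numbers):
# of 1,000 people, 10 have the disease and 9 of them test positive;
# of the 990 without it, about 89 also test positive.
true_positives, false_positives = 9, 89
ppv = true_positives / (true_positives + false_positives)
print(f"P(disease | positive test) = {ppv:.0%}")   # about 9%
```

The same treatment effect reads as a dramatic "50% reduction" or as one event prevented per 1,000 patients treated, which is precisely the miscommunication the book warns about; and the natural-frequency bookkeeping makes the Bayes' rule answer (roughly 9%) almost self-evident.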
The book provides a number of interesting case studies: those on breast cancer screening and AIDS counseling are particularly dramatic, and should be required reading for anyone taking a diagnostic medical test. Several other decision-making heuristics that can lead to misleading conclusions are discussed, such as the "category effect". The book is not a complete treatment of the psychology of judgment and decision making as applied to statistical thinking, but the author identifies a number of sub-optimal medical and legal practices in everyday life.

Reference


Gerd Gigerenzer, 2003: Calculated Risks: How to Know When the Numbers Deceive You. Simon & Schuster.

Saturday, September 24, 2016

The Atlantic weighs in on reproducible research

DTLR has dwelled on the reproducibility crisis since its birth.  Most of the discussions I have cited are from the scientific literature, but some have appeared in venues intended for a general audience.  One of the best I've seen has just appeared in The Atlantic, in a post by Ed Yong.  It appropriately places the focus on the incentive system for scientists.  A one-sentence summary is provided by its quotation of Richard Horton, editor of The Lancet, who said:  "No one is incentivized to be right.  Instead, scientists are incentivized to be productive."

Please take a look at Yong's post.


Sunday, August 14, 2016

Book Review: Stephen Stigler's "The Seven Pillars of Statistical Wisdom"



The Seven Pillars of Statistical Wisdom, by Stephen M. Stigler (Harvard University Press, Cambridge, Mass., 2016).

The book presents seven principles that the author believes support the core of statistics as a unified science of data, “the original and still preeminent data science” (p. 195).  It is intended for both professional statisticians and the “interested layperson,” though I suspect the latter would struggle a bit, as the author does not shy from formulae, calculations, and even name-drops of advanced statistical methods and concepts.  The author is a distinguished professor of statistics at the University of Chicago, and a leading historian of the field.  Each of the seven main chapters discusses one of the “pillars,” illustrated with historical examples (as opposed to contemporary ones) and often accompanied by discussions of the pitfalls involved with each principle.  The author states that “I will try to convince you that each of these was revolutionary when introduced, and each remains a deep and important conceptual advance” (p. 2).

The first principle is titled “Aggregation” or “the combination of observations,” of which the arithmetic mean is the chief example discussed.  The author implies that the method of least squares, and more general smoothing methods, also fall under aggregation, broadly understood.  The concept of aggregation is radical because it implies that individual observations can be discarded in favor of sufficient statistics.  Prior to the general acceptance of averages, scientists would often simply choose the “best” of a set of observations, or perhaps take a midrange (the average of the highest and lowest values).  The concept carries other dangers as well, as the author illustrates with Quetelet’s notion of the Average Man.

The second principle is titled “Information:  its measurement and rate of change”, and focuses on the Central Limit Theorem and the root-N law (which roughly states that the precision of an estimate increases only with the square root of the amount of data used to calculate it).  The author acknowledges the contrast between statisticians’ usage of the term “Information” (specifically, Fisher information) and its more general use in signal processing and information theory (specifically, Shannon information).  Again, pitfalls are discussed, including a case where randomly selecting one of two data points is better than using their average:  cannonballs of two different calibers are reported by different spies, and a cannon whose caliber equals their average would not exist.  “The measurement of information clearly required attention to the goal of the investigation” (p. 59).  (In my view, one could write an entire chapter on that last point, and it would be more important than most of the seven principles selected by the author for this book.)
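For the record, the root-N law amounts to the familiar standard-error formula.  In symbols (assuming n independent observations with common standard deviation sigma; my notation, not the book's):

```latex
\operatorname{SE}(\bar{x}_n) \;=\; \frac{\sigma}{\sqrt{n}},
\qquad\text{so halving the standard error requires quadrupling } n.
```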

The third principle is titled “Likelihood:  Calibration on a Probability Scale”.  Here the concept of a statistical significance test is introduced, along with p-values, Bayesian induction, and the theory of maximum likelihood.  The fourth principle is titled “Intercomparison:  within-sample variation as the standard.”  The author illustrates it with Student’s distribution and t-test, and with the analysis of variance.  Pitfalls are illustrated with an example of data dredging in the hands of economist William Stanley Jevons.  The author acknowledges further pitfalls, “for the lack of appeal to an outside standard can remove our conclusions from all relevance” (p. 198).  (In my view this concern is understated:  statisticians are fond of standardizing data, but this prevents multiple data sets from being compared using an external standard.  Dimensional analysis offers an alternative approach.)

The fifth principle is titled “Regression:  Multivariate Analysis, Bayesian Inference, and Causal Inference”.  This principle warrants the longest chapter of the book, which begins by focusing on regression to the mean, a discovery made by Francis Galton.  This discovery resolved a paradox Galton had noticed in Darwin’s theory of evolution:  if each generation passed heritable variation in traits on to its offspring, why did the aggregate variation in those traits remain stable over time?  Later in the chapter, Stein’s paradox is discussed, and shrinkage estimation is presented as a version of regression.  The correlation-causation fallacy is also discussed, including spurious correlation and Austin Bradford Hill’s principles for epidemiological inference.  The chapter also covers multivariate analysis, Bayesian statistics, and path analysis -- a real hodgepodge.
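Galton's resolution is easy to reproduce numerically.  The toy simulation below is my own illustration (with an assumed parent-offspring correlation of 0.5, not Galton's data); it shows both halves of the paradox at once: offspring of extreme parents fall back toward the mean on average, yet the overall spread of the trait is the same in each generation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: a standardized heritable trait with correlation rho between
# parent and offspring (illustrative numbers, not Galton's data).
rho, n = 0.5, 100_000
parent = rng.standard_normal(n)
offspring = rho * parent + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Offspring of extreme parents are, on average, closer to the mean ...
extreme = parent > 2.0
print(parent[extreme].mean(), offspring[extreme].mean())   # about 2.4 vs about 1.2

# ... yet the aggregate variation is stable from one generation to the next.
print(parent.std(), offspring.std())                       # both about 1.0
```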

The sixth principle is “Design:  Experimental Planning and the Role of Randomization”.  Fisher’s demolition of one-factor-at-a-time experimentation is discussed, as are Peirce’s pioneering use of randomization in experimental psychology studies and, later, Neyman’s treatment of random sampling in social science.  The chapter ends with a brief discussion of clinical trials and a lengthier discussion of the French lottery of the 18th and 19th centuries.

The seventh principle is titled “Residual”, by which the author means both residual analysis, a commonly used approach to statistical model criticism, and formal comparison of nested models using a significance test.  The author also detours into the history of data graphics.  The chapter is marred by infelicities in its history of physics and astronomy.  At one point the author states that “We are still looking for that [luminiferous] aether” (p. 172).  Rest assured, most physicists are not worried about that.  The author then describes Laplace’s resolution of an apparent discrepancy in the orbits of Jupiter and Saturn; Laplace showed that the motions could be explained by treating Jupiter, Saturn, and the Sun as a mutual three-body problem.  Using an exaggeration worthy of our current Presidential candidates, the author observes that “A residual analysis had saved the solar system.”  Finally, in the Conclusion, the author speculates about the possibility of an (as yet unknown) eighth pillar to accommodate the era of big data.

At this point, readers should be warned that I have an unconventional and dissident view of statistical ideology.  For instance, where the author states about statistical significance tests that “misleading uses have been paraded as if they were evidence to damn the entire enterprise rather than the particular use” (p. 197), I would number myself among those who would damn the entire enterprise.  (This is a topic of current controversy, as evidenced by Wasserstein, 2016.)  There is some value in distilling the ideas of statistics into a set of principles; similar exercises are undertaken from time to time, and Kass et al. (2016) is another example published in the same year.  Were I to write such an account, it would differ from both Stigler’s and others’, and present my own statistical ideology.  That will have to wait for another day.  Suffice it to say that my selection of pillars would differ, and any discussion I offered of Stigler’s would dwell far more on the pitfalls and hazards than he does.

In my view, this book’s chapter on “Design” is the best (except for the digression on the French lottery), while the topics discussed in the other chapters are so fraught with difficulties that the concepts described may be as harmful as they are helpful to the serious data analyst.  I found the book disappointing and less enlightening than I had hoped.  While it is not as bad as Salsburg's The Lady Tasting Tea, I would find it difficult to recommend this book to readers at any level of statistical sophistication.


References


Kass, R.E., et al., 2016:  Ten simple rules for effective statistical practice.  PLoS Computational Biology, vol. 12 (6), e1004961.

Wasserstein, R. L. (ed.), 2016:  ASA statement on statistical significance and P-values.  The American Statistician, vol. 70, pp. 129-133.

Friday, July 22, 2016

A less dismal science

This blog rarely strays into the behavioral sciences, and for good reasons.  Some of these reasons are outlined in an article in last week's The Economist, in a special insert, "The World If".  This particular piece ponders the scenario, What if economists reformed themselves?  One of the criticisms identified in the article is "model mania"; the author writes, "problems arise when they mistake the map for the territory."  Frankly, I think this is a criticism that applies more broadly, to any area of mathematical modeling where the contact between model and reality is very loose or non-existent.  This occurs when mathematical models are not validated by comparison with actual data; the ultimate validation regime is to predict new phenomena or future data, and to compare such predictions with experimental or observational data.  Theories of physics are usually test-driven in this way, as are engineering models and many of those in data science.  Such validation is often lacking in both economics and inferential (as opposed to predictive) statistical modeling in general.  The author of the Economist piece recommends that economists repeat the mantra, "My model is a model, not the model."  DTLR advises all other users of mathematical and statistical models to do the same.

For further reading, see The Financial Modelers' Manifesto by Paul Wilmott and Emanuel Derman (2009).






Exploratory or confirmatory?

In last week's issue of Science, outgoing editor Marcia McNutt was interviewed (Shell, 2016) on the occasion of beginning a term as President of the National Academy of Sciences.  I am going to reproduce a lengthy quote from the interview.

At Science, the paradigm is changing.  We're talking about asking authors, 'Is this hypothesis testing or exploratory?'  An exploratory study explores new questions rather than tests an existing hypothesis.  But scientists have felt that they had to disguise an exploratory study as hypothesis testing, and that is totally dishonest.  I have no problem with true exploratory science.  That is what I did most of my career.  But it is important that scientists call it as such and not try to pass it off as something else.  If the result is important and exciting, we want to publish exploratory studies, but at the same time make clear that they are generally statistically underpowered, and need to be reproduced.

Bravo, Dr. McNutt!  DTLR agrees completely with the sentiment here.  It matters because the statistical dressing that accompanies much scientific research is appropriate only for confirmatory studies (what McNutt calls hypothesis testing), not for hypothesis-finding (exploratory) ones.  It is rare to find the editor of a major scientific journal express this view in such a crisp, precise manner.  DTLR hopes that her successor, and other editors and referees of scientific journals, follow the lead set by McNutt.  DTLR also recommends that all readers of this blog take a look at Tukey (1980).

References


Ellen Ruppel Shell, 2016:  Hurdling obstacles:  Meet Marcia McNutt, scientist, administrator, editor, and now National Academy of Sciences president.  Science, vol. 353, pp. 116-119.

John W. Tukey, 1980:  We need both exploratory and confirmatory.  The American Statistician, vol. 34, pp. 23-25.

Friday, June 17, 2016

Randomized clinical trials, defended

Medical blogger Vinay Prasad has posted a vigorous defense of randomized clinical trials, responding to a recent paper in the New England Journal of Medicine.  It's worth a look.

H/T:  In the Pipeline by Derek Lowe (discussion here).

Wednesday, May 25, 2016

Nature keeps the heat up on reproducible research

This week's issue of Nature has a good article by Monya Baker on a wide-ranging survey of scientists about reproducible research, and a related editorial.  DTLR is most encouraged by the final table in Baker's article, which rates the factors most likely to improve reproducibility.  "More robust experimental design" received the most combined "likely" and "very likely" ratings.  I think that this is the right answer.  Also highly ranked were "better mentoring/supervision" and "better understanding of statistics".  The latter is a tough call, as statisticians themselves seem not to have reached a consensus on how to move forward, as evidenced by the extensive discussion items published along with the American Statistical Association's Statement on Statistical Significance and P-values, posted in early March.

DTLR expresses thanks to Nature for keeping the drums beating on reproducible research.  The issue is very visible right now, and the community should strike while the iron is hot, in terms of reforming the infrastructure of our community (laboratory practices, publication standards, and incentives for grant funding, promotion, and tenure).  Mis-aligned incentives are ultimately the cause of non-reproducibility, though methodological issues (poor study design and execution, inappropriate use of statistical methods, etc.) are key enablers.



Sunday, April 10, 2016

The water watchdog

DTLR supports the views of Prof. Marc Edwards, expressed in interviews with the Chronicle of Higher Education (with Steve Kolowich, here) and Science Magazine's Working Life (with Rachel Bernstein, here), regarding the mis-aligned incentives for academic scientists, among other topics.  He is one of the experts who worked to "uncover and address the elevated lead levels in drinking water in Flint, Michigan" (as Bernstein wrote).


Saturday, March 5, 2016

Gravitational waves and colliding black holes

DTLR has been dormant for nearly half a year.  However, a significant discovery has recently been reported that is worthy of note.  LIGO (the Laser Interferometer Gravitational-Wave Observatory) detected gravitational radiation from a source event that appears to be the merger of two black holes.  The event marks the dawn of the age of gravitational-wave astrophysics.  We are again grateful to be witnesses to the making of physics history.  Congratulations to the LIGO and Virgo collaborations.

Reference:


B.P. Abbott, et al., 2016:  Observation of gravitational waves from a binary black hole merger.  Physical Review Letters, vol. 116, 061102.