A few weeks ago, the Economist's Technology Quarterly had an excellent profile of biostatistician Dr. Susan Ellenberg in its 'Brain Scan' column. The article describes her long and influential career in clinical trials, spanning the NIH, the FDA, and academia. Examples of her contributions include her insistence that patients who did not follow the trial protocol be tracked, her championing of surrogate endpoints in cancer and HIV trials, her involvement of patient groups in planning clinical trials, and her work on interim analysis and vaccine safety.
The article brings us to the present, and the benefits and hazards of using big data, which is typically observational data found in massive health care records databases, such as those owned by healthcare and insurance organizations. Ellenberg summarizes the issues as follows: "The more people you have the richer your database will be but also the more ways there are to be misled by the data." The article concludes, "We've got all this data...The answer isn't to ignore it. The answer is to figure out how to limit the number of mistakes we make."
The article does not give examples of such mistakes, but readers steeped in statistical thinking can come up with examples of their own. Many such mistakes stem from multiplicity, the problem of making many comparisons at once, which can lead to identifying spurious correlations. For instance, a blind search for correlated variables in such databases, perhaps assisted by subsetting and subgrouping, is bound to find many spurious correlations by chance alone. Few of these would be reproduced in other data sets or in future data. For those that are reproduced, the direction of causality, if there is one, may be unclear; alternatively, a lurking variable (one not measured or captured in the database) may hold the causal insights. One way to protect ourselves from such mistakes is to use findings from big data only as hypotheses, to be confirmed by prospective, randomized, and blinded trials. The short article goes into neither the nature of the mistakes nor possible remedies. Nonetheless, the article plays a valuable role in tamping down expectations of big data, a term that has received a great deal of hype in recent years.
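The multiplicity point can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not anything taken from the article: it generates 40 variables of pure, independent noise for 200 hypothetical subjects, searches all 780 pairs for "significant" correlations, and then repeats the search on a fresh data set to see how many findings are reproduced. The sample sizes, the seed, and the large-sample cutoff |r| > 1.96/sqrt(n) (an approximation to a two-sided 5% test) are all choices made for this example.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flagged_pairs(n_subjects, n_vars, rng):
    """Simulate independent noise variables and return the pairs whose
    sample correlation exceeds the approximate 5% significance cutoff."""
    data = [[rng.gauss(0, 1) for _ in range(n_subjects)] for _ in range(n_vars)]
    # Large-sample approximation: under no true correlation, r is roughly
    # normal with standard deviation 1/sqrt(n).
    threshold = 1.96 / math.sqrt(n_subjects)
    return {
        (i, j)
        for i, j in combinations(range(n_vars), 2)
        if abs(pearson(data[i], data[j])) > threshold
    }


rng = random.Random(2024)          # fixed seed so the run is repeatable
n_subjects, n_vars = 200, 40       # illustrative sizes, not from the article
n_pairs = n_vars * (n_vars - 1) // 2

first = flagged_pairs(n_subjects, n_vars, rng)    # the "blind search"
second = flagged_pairs(n_subjects, n_vars, rng)   # an independent replication
replicated = first & second

print(f"pairs searched:               {n_pairs}")
print(f"flagged in first data set:    {len(first)}")
print(f"also flagged in replication:  {len(replicated)}")
```

Because no variable is truly related to any other, every flagged pair is a false positive. With a 5% cutoff and 780 pairs, something on the order of 39 such pairs is expected in each data set, and only a small fraction of them should reappear in the replication. Real health care databases add further complications (correlated variables, subgroup searches, lurking variables) that this sketch deliberately leaves out.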