If journalism is a first draft of history, Physics World writer Edwin Cartlidge has done a superb job this month of reporting on the "Proton Radius Puzzle" and the pitfalls of combining data from multiple studies. Cartlidge's piece is also an excellent case study of a phenomenon described by David Bailey a few years ago.
Over time, a number of experiments around the world, using different physical principles, have attempted to measure the radius of the proton. An international group, CODATA, has the task of compiling all such data and reporting the community's best estimate. This is done by first voting on which studies should be included; the data are then simply averaged, with weights presumably determined by the error bars reported by the individual studies. The combined estimate, it is hoped, will have a more accurate point estimate, and narrower error bars, than any of the individual findings. The former chair of the CODATA group working on the proton radius is quoted in Cartlidge's article arguing that this process incorporates all "individually credible results" but passes no judgment on whether each of those results is "right or wrong", a task that "would require superhuman powers".
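For concreteness, here is a minimal sketch of what such an inverse-variance weighted average looks like, assuming the weights really are derived from the reported error bars (the numbers below are purely illustrative, not actual proton-radius measurements or CODATA inputs):

```python
import numpy as np

# Hypothetical measurements of the same quantity, each with its reported
# 1-sigma error bar (illustrative numbers only, not real proton-radius data).
values = np.array([0.8770, 0.8758, 0.8764])
errors = np.array([0.0045, 0.0077, 0.0089])

# Inverse-variance weights: more precise studies count for more.
weights = 1.0 / errors**2

# Weighted mean and its standard error, under the usual assumption that
# the studies are independent and free of systematic error.
combined = np.sum(weights * values) / np.sum(weights)
combined_error = np.sqrt(1.0 / np.sum(weights))

print(f"combined estimate: {combined:.4f} +/- {combined_error:.4f}")
```

Note that the combined error bar comes out smaller than any individual one; that shrinkage is exactly what the procedure is designed to deliver, and, as we shall see, exactly where the trouble arises.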
Well, for about eight years, the measurement by the CREMA experiment, which used a unique muon-based technique, was excluded from the average, as it was an outlier relative to the other reports (in exactly the sense that Bailey described). In the interim, however, other groups using more conventional approaches began to obtain results comparable to the lower value given by CREMA. In 2018 CODATA finally incorporated the outlying results, though, unsurprisingly, the stated error bars for the combined estimate increased. See Cartlidge's article for the twists and turns of the story.
DTLR's interest here is in the whole concept of combining data. Something like this is widely practiced in statistics, under the name "meta-analysis". I consider this poor practice, because it sweeps under the rug potential systematic errors in the individual results. In the proton radius case, Cartlidge even seems to suggest that groupthink might have been at play in the CODATA decisions.
Here is DTLR's opinion about combining data from multiple studies. Don't do it. Instead of meta-analysis, the individual study results, with their error bars, should simply be displayed together. Users should be directed to critically review the study design, execution, analysis, and reporting of the individual studies, seeking out differences among them. Authors of systematic reviews should use their judgment and discuss the similarities and differences, without blindly pooling all the data together. Cartlidge writes, "the CREMA result was not really at odds with individual spectroscopy experiments – all but one differed by no more than 1.5 standard deviations, or σ. The only significant disparity – of at least 5 σ – arose when the conventional data were averaged and the error bars shrunk. But that disparity could only be maintained if the muon result itself was kept out of the fitting process – given how much it would otherwise shift the CODATA average towards itself." My interpretation is that the artificial task of combining the data was itself the source of the confusion; it would have been avoided by simply presenting all the individual study results separately. The field has clearly not reached sufficient maturity for a combined "best" estimate to be meaningful, in my opinion, and this is probably even more true of the meta-analyses often reported in the medical and public health literature.
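To see how averaging can manufacture a disparity, consider the following toy calculation (the numbers are invented to mimic the structure of the situation, not the actual proton-radius data): an outlier that is in only mild tension with each individual study can end up several σ away from their pooled average, purely because pooling shrinks the error bar.

```python
import numpy as np

# Purely illustrative numbers: twelve "conventional" results scattered
# around 10.0, each with error bar 1.0, plus one precise outlier at 8.5.
conv_values = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 9.7,
                        10.3, 10.0, 9.9, 10.1, 10.2, 9.8])
conv_errors = np.full(12, 1.0)
outlier, outlier_error = 8.5, 0.1

# Tension between the outlier and each individual study, in sigma.
per_study_sigma = np.abs(conv_values - outlier) / np.hypot(conv_errors, outlier_error)

# Inverse-variance pooled average of the conventional results.
w = 1.0 / conv_errors**2
pooled = np.sum(w * conv_values) / np.sum(w)
pooled_error = np.sqrt(1.0 / np.sum(w))

# Tension between the outlier and the pooled average.
pooled_sigma = abs(pooled - outlier) / np.hypot(pooled_error, outlier_error)

print("tension with individual studies (sigma):", np.round(per_study_sigma, 1))
print(f"tension with pooled average (sigma): {pooled_sigma:.1f}")
```

In this toy example the outlier disagrees with each individual study by well under 2σ, yet sits nearly 5σ from the pooled average, which is the same pattern Cartlidge describes.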
Just days after Cartlidge's article came out, another piece by him was published that also has combining data at its heart. This one was about gravitational waves, but the story is complicated even further by the waveform modeling required to interpret gravitational-wave signals.