Estimating the extent of pathological science
For the past few years, the science media and the research community have been up in arms about a “reproducibility crisis.” Some unacceptably large proportion of the published research (at least in specific disciplines, most notably psychology and drug development) is irreproducible. Give a second group of researchers the necessary instructions and tools to reproduce the results of a published paper and they will typically get a very different answer. This implies that had the original researchers done the experiment right and/or interpreted it correctly, they would have had nothing of interest to publish.
This does not necessarily mean, however, that the second group got the right answer or, for that matter, that either got it right. As Irving Langmuir noted, examples of pathological science — the “science of things that aren’t so” — are often replicated, multiple times, as was the case with cold fusion. That’s what keeps these phenomena going.
Any time we’re confronted by a scientific controversy — two or more research factions promoting competing answers to a question of interest — the implication is that one or more of these research factions is wrong. This, in turn, implies that the researchers who believe the wrong hypothesis are deluding themselves and doing so on the basis of evidence that is misinterpreted or incorrect — i.e., pathological. The research community will often try to avoid this conclusion by arguing for a compromise in which everyone involved can be at least partly right. That’s always possible, but the smart money would bet against it.
A very likely explanation is that all scientific research can simply be placed on a distribution of health and pathology (a bell-shaped curve) running from the very healthy to the outright pathological, and it’s only on the very healthy side of the curve that the right answers and reliable knowledge can be obtained. We could assume the same curve for any other occupation, from accounting to zookeeping. Some working in these occupations are exceedingly good at what they do, some are hopelessly inept (but got hired and have kept their jobs so far anyway), and the great proportion fall somewhere in the middle.
Why not assume the same for science? The caveat is that achieving scientific excellence and elucidating reliable knowledge is likely to take a unique skill set and is, almost by definition, harder than all these other occupations. In science, after all, the easy problems have de facto already been solved; meaningful scientific research always targets the most difficult. Even brain surgeons and rocket scientists will be commonly called in to do routine jobs; not so the best research scientists who will always be working on the hardest problems, at the limits of their available technology to investigate. As such, a smaller percentage of the relevant research community is doing excellent or even acceptable work in science than what might be the case with other occupations.
The error rate in science — the proportion of results or conclusions that have yet to be corrected and constitute unreliable knowledge — is almost assuredly discipline specific. Informed estimates can be found, though, that are high enough to give us pause even about the hardest of hard sciences. One of the first numerical estimates I ever encountered was John Ziman’s in his book Reliable Knowledge. Ziman had been a working physicist before becoming a philosopher of science. He described the front line of scientific research (the stuff of the latest journal articles and the media reporting that accompanies it) as “the place where controversy, conjecture, contradiction, and confusion are rife.”
“The physics of undergraduate text-books is 90% true,” Ziman wrote. “The contents of the primary research journals of physics is 90% false,” he added. He then described the “scientific system” as involved as much in “distilling the former out of the latter as it is in creating and transferring more and more bits of data and pieces of ‘information.’” Ziman may have been engaged in hyperbole with that 90 percent number to make a point. If so, however, it was in the context of a discipline, physics, that would be expected to have fewer uncorrected errors than others. It is “a very special type of science,” Ziman wrote, because the “subject matter is deliberately chosen so as to be amenable to quantitative analysis” (p. 9).
Ziman’s estimate is also in line with others. In 2009, for instance, two authorities on what has come to be called evidence-based medicine (Iain Chalmers, a founder of the Cochrane Collaboration, and Paul Glasziou) estimated that 85 percent of all research dollars are “wasted.” In these cases, they considered the reasons correctable, from addressing “low priority questions” to failing to control for bias in the experiments to not publishing negative evidence. High as it is, Chalmers and Glasziou’s 85 percent estimate did not take into account research that simply gets the wrong answer because the problem the researchers are attempting to solve is still too difficult. These errors would mostly be uncorrectable until after the fact, and whether the money spent on such research counts as wasted is open to debate.
In a 2005 essay that has since become legendary (now cited almost 4,000 times), Stanford epidemiologist John Ioannidis argued that “for most study designs and settings, it is more likely for a research claim to be false than true.” He did not give a numerical estimate, but his analysis concluded that “the majority of modern biomedical research is operating in areas with very low pre- and post-study probability for true findings.”
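Ioannidis’ essay does supply a formula behind that conclusion. In the simplest case it considers, with no bias and a single research team, the post-study probability that a statistically significant finding is true is PPV = (1 − β)R / (R − βR + α), where R is the ratio of true to false relationships among those tested in a field, α is the false-positive rate, and 1 − β is the statistical power. The sketch below simply evaluates that expression. The function name and the particular values of R, α, and power are illustrative assumptions, not figures taken from the essay.

```python
def post_study_probability(prior_odds, alpha=0.05, power=0.8):
    """Probability that a statistically significant claim is true.

    prior_odds: R, the ratio of true to false relationships among those tested.
    alpha:      false-positive rate of the test (assumed value).
    power:      1 - beta, the chance of detecting a true relationship (assumed).
    """
    true_positives = power * prior_odds   # (1 - beta) * R
    false_positives = alpha               # per unit of false relationships
    return true_positives / (true_positives + false_positives)


# Hypothetical fields, from even odds down to one true relationship per hundred.
for r in (1.0, 0.1, 0.01):
    print(f"R = {r:<5} PPV = {post_study_probability(r):.2f}")
```

With these assumed inputs the probability falls from about 0.94 at even odds to about 0.62 at R = 0.1 and about 0.14 at R = 0.01. In this bias-free version of the formula, a claimed finding is more likely false than true whenever (1 − β)R is smaller than α.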
Ioannidis’ argument was similar to a point made by philosopher of science Karl Popper, which is that there are always an infinite number of wrong interpretations of any scientific evidence and ultimately only one that is meaningfully correct. Our capacity to come to the wrong conclusions is bounded only by our imaginations: The right conclusion, as mathematicians might say, is severely bounded by reality. As such, the odds are always against any result or conclusion being the right one. The harder the challenge, which means the more interesting the result and the more likely it is to be newsworthy, the less likely it is to be correct.
Two of Ioannidis’ six “corollaries” about the probability of a research finding being true spoke to precisely this point. Corollary 5 stated, “The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.” And Corollary 6 claimed, “The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.” (This was the case with the research pursued in my first two books. In a hot, fashionable field of science, multiple teams are competing to get the right answer, and so at least one of them is likely to jump the gun and publish the result they hope is correct, right or wrong, on the basis of premature and inconclusive evidence.)
Ultimately, there’s simply no doing science without making errors and allowing for errors to be made. The difference between pathological science and healthy, functional science is in how those errors or very likely errors are treated. The process of science, as Ziman suggested, is the business of error correction: establishing what’s reliable knowledge from the published literature and what’s not while maintaining awareness that the unreliable is the great bulk of it. Healthy science is not error-free science. Such a situation is nonsensical. It’s a science in which the awareness of how easy it is to make mistakes, to fool ourselves, remains first and foremost, the primary principle.
“Science thrives on errors, cutting them away one by one,” is how the Cornell University author, astronomer, and public communicator of science Carl Sagan, writing with his wife, Emmy and Peabody Award-winning writer Ann Druyan, described this in the 1996 book The Demon-Haunted World: Science as a Candle in the Dark. “False conclusions are drawn all the time, but they are drawn tentatively. Hypotheses are framed so they are capable of being disproved. A succession of alternative hypotheses is confronted by experiment and observation. Science gropes and staggers toward improved understanding.”
By saying conclusions are “drawn tentatively” and “hypotheses are framed so they are capable of being disproved,” Sagan is accepting that a critical characteristic of a healthy science is skepticism, the awareness by the researchers of how easy it is to fool themselves and be led astray by wishful thinking. A conclusion that is drawn tentatively is a conclusion from which the researcher can (more easily) back away when it turns out to be clearly wrong, as is very likely to be the case. A hypothesis capable of being disproved is one that can be tested and found wanting.
The role of (rigorous) experiment
If science gropes and staggers toward improved understanding, to borrow from Sagan, it does so through the process of hypothesizing and testing. It’s self-correcting because experiments can be done or observations made that can correct errors. The hypotheses generate predictions about what should be observed in experiments or in nature, and then the experiments and the observations are done to see if those predictions pan out. The predictions are capable of being disproved. And then those experiments and observations have to be independently replicated by others to assess whether they were actually done and reported correctly the first time.
This is why independent replication is also an essential part of the scientific process. It’s not enough that the same investigators repeatedly reproduce their experimental results (although ideally they’ve done so prior to publishing), because those investigators might be making the same mistakes repeatedly. Hence, reliable knowledge requires that other investigators step in to replicate the results. And they should do so for a very simple reason: If the results are sufficiently important to be the basis of further research, then anyone who wants to do that further research should replicate the initial experiments first to make sure they’re really right. One of the unwritten rules of experimental research that I learned in reporting my first two books was: “Trust the scientists to be reporting truthfully what they did, be skeptical of the evidence and the interpretation.”
Independent replication also provides a reason for researchers to do their very best work. Anything less and they’re likely to be embarrassed when someone else does the same job better. “We’ve learned from experience that the truth will out,” as Richard Feynman said. “Other experimenters will repeat your experiment and find out whether you were wrong or right. Nature’s phenomena will agree or they’ll disagree with your theory. And, although you may gain some temporary fame and excitement, you will not gain a good reputation as a scientist if you haven’t tried to be very careful in this kind of work.”
When philosopher of science Karl Popper famously argued hypotheses had to be “falsifiable” to be meaningful, he meant what Sagan did by the idea that hypotheses have to be framed such that they can be disproved. They have to make firm predictions that can be tested. Without such predictions, they are of little use. A hypothesis that only explains what we already know and makes no predictions about the future or other related phenomena as yet unexplained is both useless and unverifiable.
We can create a computer program, for instance, a mathematical model that purports to explain why the last time I flipped a coin 100 times I got precisely 60 heads (maybe based on assumptions about coin density, barometric pressure, and the way I hold the coin for the flipping), but without correctly predicting the number of heads in my next 100 coin flips and the next 100 after that, how can I have confidence that the hypothesis and assumptions upon which it’s based are right? What use is it?
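The coin-flip point can be made concrete with a toy simulation. It compares a hypothetical “model” that was fitted to the one observed run (and so predicts 60 heads per 100 flips) against the dull hypothesis of a fair coin, judging both only on flips neither has seen. The assumption that the coin is in fact fair, the seed, and all names here are illustrative.

```python
import random


def count_heads(rng, flips=100, p_heads=0.5):
    """Number of heads in one batch of flips (the coin is assumed fair)."""
    return sum(rng.random() < p_heads for _ in range(flips))


def mean_abs_error(prediction, outcomes):
    """Average distance between a predicted head count and the observed counts."""
    return sum(abs(prediction - x) for x in outcomes) / len(outcomes)


rng = random.Random(0)  # fixed seed so the run is repeatable
future_batches = [count_heads(rng) for _ in range(1000)]

# The fitted "model" explains the past run perfectly: it says 60 heads.
fitted_model_error = mean_abs_error(60, future_batches)
# The fair-coin hypothesis explains the past run less well: it says 50 heads.
fair_coin_error = mean_abs_error(50, future_batches)

print(f"fitted model (60 heads): mean error {fitted_model_error:.1f}")
print(f"fair coin    (50 heads): mean error {fair_coin_error:.1f}")
```

The model that accounts perfectly for the 60 heads already observed does worse on new batches than the hypothesis that accounted for them less well. Only the test on new flips reveals that.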
The advantage of the episodes of pathological science that Langmuir had discussed and that have been unambiguously identified since (cold fusion, polywater, and an early, 1960s-era erroneous “discovery” of gravitational waves, among the more obvious) is that they were readily amenable to experimental verification. Other independent researchers could replicate the work and see if they got the same result. Build the correct experimental apparatus and then do this and that and you will either see these effects or you won’t. They could also knowledgeably assess whether the apparatus used to do the research initially was capable of producing or discerning what the researchers were reporting.
As Langmuir noted, one reason the supposed effects he was discussing were threshold effects, barely observable with the existing technology of the time, is that this allowed researchers to think they saw them when they thought they should. This allowed the research to remain scientifically viable for a decade or two, because nobody could demonstrate definitively what did or did not exist at these thresholds of observation. Ultimately, though, these effects were verifiable by experiment, and as the experimental technology improved, they eventually failed the test at a level of certainty that was no longer possible to ignore.
Cold fusion had a short half-life as a serious subject of scientific investigation (just a few months) largely because the initial reports of the cold fusion effect were so dramatic. There was nothing threshold, subjective, or subtle about what the purported discoverers had claimed. When better scientists then tried to replicate the experiments using rigorous techniques, they saw nothing like what had initially been predicted. These better scientists also found it easy to understand why those initial claims might have been erroneous. Enough obvious mistakes had been made in the experimental procedures that anyone who had anything interesting to do prior to the cold fusion announcement could safely dismiss cold fusion as pathological and go back to their day jobs. It was not a coincidence that researchers who claimed to have generated cold fusion in their experiments (despite its nonexistence) were researchers whose careers were mostly going nowhere. Cold fusion represented a route to being relevant again, at the forefront of a revolution. Hence, these researchers reported that they saw precisely what they hoped to see.
A characteristic of “rigorous” experiments is that the researchers must be blinded to the conditions of the experiment. They can have no expectation of when the experimental conditions are tuned correctly, such that they should see what they hope to see. The logic of such blinding goes back to Francis Bacon: If the researchers expect to see a meaningful signal standing out from the background of noise, then they will, regardless of whether it’s real. They will see whatever it is that can justify the hard work they’ve put into their careers. This is, as Bacon said, human nature.
Here’s Feynman describing how this tendency to see what’s expected plays out in the historical record of physics:
One example: [Robert] Millikan measured the charge on an electron by an experiment with falling oil drops and got an answer which we now know not to be quite right. It’s a little bit off, because he had the incorrect value for the viscosity of air. It’s interesting to look at the history of measurements of the charge of the electron, after Millikan. If you plot them as a function of time, you find that one is a little bigger than Millikan’s, and the next one’s a little bit bigger than that, and the next one’s a little bit bigger than that, until finally they settle down to a number which is higher.
Why didn’t they discover that the new number was higher right away? It’s a thing that scientists are ashamed of — this history — because it’s apparent that people did things like this: When they got a number that was too high above Millikan’s, they thought something must be wrong — and they would look for and find a reason why something might be wrong. When they got a number closer to Millikan’s value they didn’t look so hard. And so they eliminated the numbers that were too far off, and did other things like that. We’ve learned those tricks nowadays, and now we don’t have that kind of a disease.
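The drift Feynman describes can be reproduced in a toy simulation. Each lab measures the true value with some noise. An unblinded lab that lands too far from the currently accepted number goes looking for something wrong, which is modeled here, as a simplifying assumption, by simply re-measuring until the gap looks acceptable. A blinded lab publishes its first measurement. All names, units, and parameter values are hypothetical and are not taken from the history of the electron-charge measurements.

```python
import random


def published_values(rng, blinded, true_value=1.0, first_report=0.0,
                     labs=8, noise=0.5, tolerance=0.4, max_rechecks=50):
    """Sequence of published values, each lab anchoring on the previous one."""
    accepted = first_report  # the flawed early value everyone trusts
    published = []
    for _ in range(labs):
        result = rng.gauss(true_value, noise)
        rechecks = 0
        # Unblinded labs distrust results far from the accepted value and keep
        # "finding errors" (re-measuring) until the result is close enough.
        while (not blinded and abs(result - accepted) > tolerance
               and rechecks < max_rechecks):
            result = rng.gauss(true_value, noise)
            rechecks += 1
        published.append(result)
        accepted = result  # the newest publication becomes the reference
    return published


for blinded in (False, True):
    values = published_values(random.Random(3), blinded)
    label = "blinded  " if blinded else "unblinded"
    print(label, " ".join(f"{v:5.2f}" for v in values))
```

In the unblinded runs, the early published values tend to sit near the flawed first report and creep toward the true value one tolerance-sized step at a time. The blinded runs scatter around the true value from the first publication on.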
When the Nobel laureate physicist Luis Alvarez discussed this problem in his memoir (Adventures of a Physicist, 1987), he called it “intellectual phase lock” and suggested it “occurs partly because nobody likes to stand alone.” He then described the most successful example he knew of someone avoiding the problem — in this case, measuring the charge-to-mass ratio of the electron — and how the researcher Frank Dunnington “had to devise a scheme to avoid tilting the answer to an anticipated value.” He did so “by deliberately obscuring a crucial piece of information.”
Dunnington then wrote the paper and left a blank for the value obtained for the measurement. When he finished the experiment and the analysis, he finally looked to see what number he got. He unblinded himself, filled in the blank in the paper, and submitted. He didn’t allow himself to second guess. That value, wrote Alvarez, was the best ever achieved. “Dunnington’s care to avoid intellectual phase lock illustrates one major difference between scientists and most other people,” Alvarez wrote, echoing both Feynman and Robert Merton. “Most people are concerned that someone might cheat them; the scientist is even more concerned that he might cheat himself.”
When Langmuir described his own experience investigating one such pathological scientific phenomenon (the Davis-Barnes effect) and blinding one of the researchers, Barnes, to the experimental conditions, he described it as “play[ing] a dirty trick” on Barnes. Without the awareness of when he could expect to see an effect, Barnes did no better than chance in claiming he had seen what he had hoped to see. “You’re through,” Langmuir recalled saying to him at that moment. “You’re not measuring anything at all. You never have measured anything at all.”
Ioannidis also discussed this intellectual phase lock in his 2005 essay:
Let us suppose that in a research field there are no true findings at all to be discovered. History of science teaches us that scientific endeavor has often in the past wasted effort in fields with absolutely no yield of true scientific information, at least based on our current understanding. In such a “null field,” one would ideally expect all observed effect sizes to vary by chance around the null in the absence of bias. The extent that observed findings deviate from what is expected by chance alone would be simply a pure measure of the prevailing bias.
In short, the researchers would see what they expected to see, even when it’s not there. And the only way to avoid the effect of this bias is to do the experiments or clinical trials in such a way that the researchers are blinded. If so, the result they publish will not have been influenced in the course of the analysis by the result they expect.
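Ioannidis’ “null field” can be sketched as a simulation in which the true effect is exactly zero. A blinded study reports its one planned comparison. An unblinded study is modeled, as a deliberate simplification, by letting the researchers try several analyses and report the most favorable one, with each analysis treated as an independent draw. The sample sizes, the number of analyses tried, and all names are assumptions for illustration only.

```python
import random
import statistics


def observed_effect(rng, n_per_arm=50):
    """Treated-minus-control difference in means when no real effect exists."""
    treated = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    return statistics.fmean(treated) - statistics.fmean(control)


def mean_reported_effect(rng, analyses_tried, studies=400):
    """Average published effect when each study reports its best analysis."""
    reported = [max(observed_effect(rng) for _ in range(analyses_tried))
                for _ in range(studies)]
    return statistics.fmean(reported)


rng = random.Random(11)  # fixed seed so the run is repeatable
blinded_mean = mean_reported_effect(rng, analyses_tried=1)
unblinded_mean = mean_reported_effect(rng, analyses_tried=5)

print(f"blinded, one planned analysis: mean effect {blinded_mean:+.3f}")
print(f"unblinded, best of five:       mean effect {unblinded_mean:+.3f}")
```

Because nothing real is there to be found, the gap between the two averages is produced entirely by the freedom to choose which result to report. That gap is the “pure measure of the prevailing bias” in this toy field.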
There are two immediate implications of this bias problem for the kind of public health, preventive medicine research in which we’re interested. First, this is why clinical trials in medicine need to be done double-blind and placebo-controlled. Neither the researchers (or physicians) involved nor the patients can be aware of who is getting a supposedly active medication and who’s getting a placebo. The assumption is that any such knowledge on the part of either will bias the response, presumably in the direction that the researcher/physician or patient expects, and that the effects of that bias will then be misinterpreted as effects of the medication.
Researchers and journalists will often refer to placebo-controlled, double-blind, randomized-controlled trials as “the gold standard” of scientific evidence, but that underplays their importance and suggests a lack of understanding of the scientific endeavor. Depending on the question being asked, these trials are simply what’s necessary to establish reliable knowledge. It’s that simple. Anything else is insufficient; the results and conclusions are, by definition, tentative. They cannot be definitive. Further experiments/trials will always be necessary.
The second implication is for the use of what are called “meta-analyses” to establish reliable knowledge. The idea is that when studies can’t be done definitively, the researchers can establish guidelines in advance to determine which relevant studies to include in an analysis and which to exclude, and then combine the results to get an average measure that can be trusted. If intellectual phase lock exists in medical and public health research as it does in physics, though, then the very fact that a meta-analysis has been done tells us we have a problem. It implies that the results all existed at the threshold of the technology or methodology to observe; that none of the experiments were definitive or rigorous; and, most important, that the results would be biased by the expectations. While an average result could be obtained, it would not be reliable. As Ioannidis implied, such a result will represent the prevailing bias, not necessarily the truth.
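A small sketch shows why averaging does not rescue biased studies. It uses fixed-effect inverse-variance pooling, a standard textbook way of combining estimates and not necessarily the method of any particular review. Twelve hypothetical studies of an intervention with no true effect all share the same expectation-driven bias. The bias size, the standard errors, and all names are assumptions, and no real trial data are involved.

```python
import math
import random


def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted average and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * x for w, x in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


rng = random.Random(7)    # fixed seed so the run is repeatable
true_effect = 0.0         # the intervention does nothing (assumed)
shared_bias = 0.1         # hypothetical tilt from unblinded expectations
std_errors = [0.15] * 12  # twelve equally imprecise hypothetical studies

estimates = [rng.gauss(true_effect + shared_bias, se) for se in std_errors]
pooled, pooled_se = pool_fixed_effect(estimates, std_errors)

print(f"single-study standard error: {std_errors[0]:.3f}")
print(f"pooled estimate: {pooled:+.3f}, pooled standard error: {pooled_se:.3f}")
```

Pooling shrinks the standard error by the square root of the number of studies, so the combined result looks far more precise than any one study. The shared bias is untouched by the averaging, which means the pooled number homes in on the bias rather than on the true effect of zero.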
Take, for instance, a meta-analysis published in 2015 by the Cochrane Collaboration, an organization founded to do unbiased meta-analyses (“systematic reviews,” per the Cochrane Collaboration) on “reduction in saturated fat intake for cardiovascular disease.” The meta-analysis concluded that there seemed to be a slight benefit in terms of cardiovascular events when subjects of trials replaced saturated fats with unsaturated fats. But of the 15 trials included in the analysis, only one, dating to the 1960s and done in a Los Angeles VA hospital, appeared to “have adequate participant and study personnel blinding.”
The Cochrane researchers then decided to trust the result of their meta-analysis because the average result of the 14 unblinded studies was similar to that obtained in this one blinded trial. What they apparently did not consider is the possibility that the one blinded trial was not really blinded and/or got the wrong answer. As the Los Angeles VA researchers noted in their original article, the “experimental diet was prepared to simulate conventional food” using corn oil and other polyunsaturated fats instead of butter and other animal fats. But that doesn’t mean it did. Desserts, for instance, made with corn oil instead of butter are likely to taste different and, to most of us, not as good. If so, then this trial too was unblinded. If the researchers really wanted to know whether their assumption was true, they would have to redo the trial. And then they would have to replicate it, multiple times. Rather than do so, they assumed the convenient answer and, if Merton, Feynman, Alvarez, et al. are right, ceased to function as scientists.
Among the many problems with this kind of public health-related research (e.g., human nutrition, exercise physiology, disease physiology) is how exceedingly difficult it is, if not impossible, to ever blind subjects to the intervention. Eating a low-fat diet, for instance, rather than a low-carb diet, or becoming physically active rather than remaining sedentary, cannot be blinded. As a result, the researchers in these disciplines have allowed themselves to believe such blinding is unnecessary. In some cases, that may be true.
It’s always possible that effects of an intervention can be so obvious that no reliable alternative explanations can be imagined — the effects of cigarette smoking, for instance, on lung cancer rates. These are not threshold interactions. They’re obvious, and no reasonable alternative hypotheses can be imagined that might explain them. But the history of science once again suggests that as the effect size gets close to the threshold of what the technology/methodology can achieve — a threshold that is a priori unknown and can only be established by experimental tests — the conclusions have to be drawn ever more tentatively.
A pathological discipline is one in which the researchers respond to these challenges by lowering their standards for what they think is required to establish reliable knowledge rather than lowering the confidence with which they present their interpretation of the results.
Gary Taubes is co-founder of the Nutrition Science Initiative (NuSI) and an investigative science and health journalist. He is the author of The Case Against Sugar (2016), Why We Get Fat (2011), and Good Calories, Bad Calories (2007). Taubes was a contributing correspondent for the journal Science and a staff writer for Discover. As a freelancer, he has contributed articles to The Atlantic Monthly, The New York Times Magazine, Esquire, Slate, and many other publications. His work has been included in numerous “Best of” anthologies including The Best of the Best American Science Writing (2010). He is the first print journalist to be a three-time winner of the National Association of Science Writers Science-in-Society Journalism Award and the recipient of a Robert Wood Johnson Foundation Investigator Award in Health Policy Research. Taubes received his B.S. in physics from Harvard University, his M.S. in engineering from Stanford University, and his M.S. in journalism from Columbia University.