Deming, Data and Observational Studies

By CrossFit, January 18, 2020

In this 2011 editorial, S. Stanley Young and Alan Karr of the National Institute of Statistical Sciences present evidence that the results of observational trials (1) are more likely to be false than true.

The validity of observational evidence has been questioned since at least 1988, when L. C. Mayes et al. found that across 56 topic areas, an average of 2.4 studies could be found to support a particular observational claim alongside 2.3 studies that did not (2). In the case of the antihypertensive reserpine, for example, three studies indicated the drug increased cancer risk, while eight did not.

More recently, an analysis of 49 claims from highly cited studies found that five of the six claims drawn from observational studies — 83% — failed to replicate (3). Young and Karr themselves gathered 12 randomized controlled trials (RCTs) that collectively tested 52 claims from observational studies. None of the 52 observational claims was supported by the subsequent RCTs, and in five of the 52 cases the RCTs found significant effects in the opposite direction of the observational evidence (i.e., an RCT indicated harm where the observational study indicated benefit, or vice versa). Young and Karr conclude:

There is now enough evidence to say what many have long thought: that any claim coming from an observational study is most likely to be wrong — wrong in the sense that it will not replicate if tested rigorously.

The authors invoke W. Edwards Deming, a pioneer of statistical quality control, to argue that failure on this scale suggests the process by which observational studies are generated is fundamentally flawed. They point to three factors that consistently lead to unreliable observational evidence:

Multiple testing – A single observational study can test many associations simultaneously. For example, as mocked by the well-known xkcd jelly bean comic, a single study can test for correlations between dozens of dietary and demographic inputs and clinical outcomes. Chance alone dictates that some of these correlations will appear statistically significant even when they lack any biological basis (see the simulation sketch below).

Bias – Observational data are often distorted by unmeasured confounders. For example, the HIV drug abacavir was preferentially given to patients at high risk of cardiovascular disease. Observational evidence concluded that abacavir increased the risk of heart disease when, in fact, the drug was simply being given to patients who were more likely to have a heart attack with or without it (4).

Multiple modeling – As with multiple testing, the regression models used to uncover correlations between variables in observational data can be manipulated until significant associations fall out. Here, the authors note the supposed harms of BPA were derived from an observational study that tested the urine of 1,000 people for 275 chemicals, then compared urine levels of these chemicals to 32 medical outcomes and 10 demographic variables. From this data set, 9 million possible regression models can be constructed, many of which will show significant associations as a result of chance alone (5).
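One plausible way to arrive at the roughly 9 million figure (an assumed reconstruction of the arithmetic, not a derivation quoted from the editorial) is to pair every chemical with every outcome and allow each of the 10 demographic variables to be either included in or left out of a given regression model:

```python
# Assumed reconstruction of the ~9 million figure, not the authors' published
# derivation: every chemical paired with every outcome, with each of the 10
# demographic covariates either included in or excluded from a given model.
chemicals = 275        # urine chemicals measured
outcomes = 32          # medical outcomes
demographics = 10      # demographic covariates

models = chemicals * outcomes * 2 ** demographics
print(f"{models:,} possible models")   # 9,011,200, i.e., roughly 9 million
```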

These problems are exacerbated by the widespread reliance on p-values (i.e., statistical significance) to identify meaningful findings. Authors, journal editors, and those who read and interpret science treat statistically significant associations as meaningful and real, even when these significant associations may entirely result from the distortions above.
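To make the multiple-testing and p-value problems concrete, here is a minimal simulation sketch. The subject and variable counts are illustrative, not taken from any particular study: when dozens of associations are tested on pure noise, roughly 5% of them will clear the conventional p < 0.05 threshold by chance alone.

```python
# Minimal sketch: test many input/outcome pairs drawn from pure noise and
# count how many look "statistically significant" by chance alone.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_subjects, n_inputs = 1000, 60   # illustrative sizes, not from any real study

outcome = rng.normal(size=n_subjects)             # random "clinical outcome"
inputs = rng.normal(size=(n_subjects, n_inputs))  # random "dietary/demographic inputs"

p_values = []
for i in range(n_inputs):
    r, p = pearsonr(inputs[:, i], outcome)
    p_values.append(p)

false_positives = sum(p < 0.05 for p in p_values)
print(f"{false_positives} of {n_inputs} null associations reach p < 0.05")
# On average, about 0.05 * 60 = 3 spurious "findings" emerge from pure noise.
```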

The authors propose a fix: researchers would split their data before analysis, analyze one portion according to a pre-specified, published statistical methodology, publish the paper using only that portion, and then apply the same methodology to the remaining data in an addendum. This provides a form of internal replication: if the addendum data support the same conclusions as the main body of data, those conclusions are more likely to be valid; if not, the authors' analytical methods can be questioned.
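A rough sketch of what this split-data procedure could look like in practice appears below. The split ratio, the logistic regression model, and the function and variable names are illustrative assumptions, not prescriptions from the editorial.

```python
# Sketch of split-sample "internal replication," assuming a simple logistic
# regression as the pre-specified analysis plan (illustrative only).
import pandas as pd
import statsmodels.api as sm

def split_sample_analysis(df, predictors, outcome, holdout_frac=0.5, seed=42):
    # 1. Split the data before any analysis, per the pre-registered plan.
    holdout = df.sample(frac=holdout_frac, random_state=seed)
    primary = df.drop(holdout.index)

    results = {}
    # 2. Run the identical, pre-specified model on each portion.
    for name, part in [("primary (paper)", primary), ("holdout (addendum)", holdout)]:
        X = sm.add_constant(part[predictors])
        fit = sm.Logit(part[outcome], X).fit(disp=0)
        results[name] = fit.pvalues[predictors]

    # 3. Conclusions that hold in both halves are more credible; associations
    #    that vanish in the holdout flag the analysis for scrutiny.
    return pd.DataFrame(results)
```

Applied to a data frame with a binary outcome column and numeric predictors, this returns the p-values from both analyses side by side, so readers can see whether the addendum data tell the same story as the published portion.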

The takeaway: This paper highlights the extreme degree to which observational evidence is unreliable. Based on the data these authors present, observational evidence is no more likely to be right than wrong; in fact, it may be more likely to be wrong than right. In other words, any health claim based on observational evidence is only as valuable as its ability to inform subsequent controlled trials that test its validity.


Notes

  1. That is, trials in which the authors observe whether particular inputs (i.e., patterns in diet, behavior, demography, etc.) correlate with increased or decreased risk of disease, without attempting to affect those inputs.
  2. A collection of 56 topics with contradictory results in case-control research; Scientific standards in epidemiologic studies of the menace of daily life
  3. Contradicted and initially stronger effects in highly cited clinical research
  4. Use of nucleoside reverse transcriptase inhibitors and risk of myocardial infarction in HIV-infected patients enrolled in the D:A:D study: a multi-cohort collaboration
  5. Data presented in a letter to the editor, JAMA issue 301, pp. 720-721.

Comments on Deming, Data and Observational Studies

8 Comments

Greg Glassman
January 23rd, 2020 at 4:15 pm

Below are seven observations received from Jeff Glassman as part of an ongoing conversation on the Young and Karr editorial. They’re worth a read.


In the commentary below, NP refers to statisticians Jerzy Neyman and Egon Pearson, known for the Neyman-Pearson lemma, while RAF refers to Ronald Aylmer Fisher.



First, an observational study (or any other scientific study) should not be graded true or false. If it is, the task of measuring has failed. The measurement job requires estimates of the probability distribution (more than any single statistic). That was the core of Neyman-Pearson and was abrogated in full by Fisher. Science should be considered as the scientific method PLUS a statistical model. The issue today is principally Modern Science/NP vs. Post Modern Science/RAF. Fisher’s method inserted the subjective and ruined the scientific value.


That argument goes further thanks to the work of Claude Shannon. We can measure the residue of predictions, and when the probability distribution of that residue reaches the level of maximum entropy (a coordinate-dependent function), the parameters in our model have been exhausted. Typically, we can measure not exhaustion but statistical exhaustion. E.g., only a couple of percent remains to be accounted for in our model, so we are more likely to be successful with a different set of parameters in our experiment.


Second, the 5% significance level is outside Modern Science. It’s bullshit. It’s Fisher. It’s the result of an experiment, technically a statistic, measured (by the p-value) against a hopelessly subjective probability distribution. Sometimes, as in dealing with gambling games, that probability distribution (known as the null distribution) can be guessed quite well. But that’s the misleading tip of the science iceberg.


Third, we don’t (or shouldn’t) care a whit about the null hypothesis. It is not measured. It is not science. What we care about on the null side is the result of our experiment (study) without any candidate CAUSES. It’s hard to do, but the experiment needs to be designed both ways.


Fourth, Fisher stated that all experiments are tests of a null hypothesis. Dead wrong. Here lies the greatest CAUSE of the gross failure of academic science, the so-called Replication Crisis. The true CAUSE of these outcomes is the presence of even an iota of something subjective in the logical sequence leading to the measured outcome. This hit me like an avalanche listening to the speakers at last year’s DDC. Of course! The p-value is a pure mathematical concept, an algorithm of various skills, except it’s defined on the null hypothesis! Bad guess. “Statistical tests virtually always focus on Type I errors (misunderstood), falsely rejecting a true null hypothesis.” That is true only of academic science, the Popper/Fisher axis (should be Fisher/Popper, according to who published first), the failed Post Modern Science. I can’t say it never occurs in industrial science, along the Bacon/NP/Kolmogorov/Shannon axis; I just have never seen it nor allowed it.


Significance is a mental concept, not a scientific one. Best to rule it out in talking about Modern Science.


Fifth, what is needed are the empirical probability distributions (not densities) of heart attacks (i.e., Sheehan’s example) both with and without the various conditions of interest: things leading to meds, environment, race, location, etc. What happens instead is that a team of n scientists publishes a study, each getting credit for one paper (it should be 1/n each), where n includes 1 or 2 statisticians. Those statisticians do some magic and come up with a probability density for the null hypothesis, assumptions not published, assuming Lord knows what, apply it to the true scientists’ stats, and provide some good-enough p-values (out of scores possible). The ASA took a stand disapproving of that standard procedure. The ASA, however, chickened out when it came time to prescribe what should be done beyond “don’t publish that crap.”


Sixth, statisticians since they came into existence have sought tests of significance, of robustness. NP found the key, but their work is seen only in special schools, like engineering and (rarely) physics.


Seventh, I may have mentioned on the phone something worse than the failed academic studies being pumped out by the thousands: the meta-analyses. Those studies don’t even have Fisher statistics.

Collin Donahue-Oponski
January 19th, 2020 at 6:25 pm

I almost always appreciate the articles and positions of CrossFit.com. In this case, I feel that you are taking a step backward. Probably the correct conclusion is that observational data can be used incorrectly — but so can the data from RCTs. I highly recommend reading “The Book of Why.”

Clarke Read
January 19th, 2020 at 5:37 pm

CrossFit.com has featured plenty of research and writing questioning the value of observational research. This is for good reason - there are deep, fundamental issues with drawing conclusions from observational data, which no amount of statistical manipulation may be able to fix.


As I see it, there are three possible responses to these issues:


1. Observational evidence is flawed, but it is the best we can do. It is infeasible to study the link between lifestyle and chronic disease (e.g., between diet and heart disease) by any other method, so we ought to use whatever statistical tools we have to draw as much value out of observational evidence as we can. For all its flaws, the field has taken great pains to clarify the signal and dissipate the noise in the data, and it is generally trustworthy.


2. Observational evidence is flawed, but still has value as “hypothesis-generating” evidence. Observational evidence should inform neither clinical practice nor individual behavior nor public-facing recommendations, but it does tell us where to look in future controlled trials.


3. Observational evidence has no predictive value at all, and cannot be trusted even to reliably generate hypotheses. Both the public and the research community would be better off ignoring any conclusions derived from observational research.


The most prominent and influential forces in nutrition research generally offer response #1. Young and Karr’s paper supports response #3. Response #3 implies that, barring substantial changes in how observational research is done and/or interpreted, we ought to stop funding it, as it provides nothing worth paying for, and that time and money would be better spent elsewhere, except perhaps in rare cases where interventional studies would be unambiguously unethical.


If Young and Karr drew from a representative sample of observational studies, and their analysis is valid - and I see no reason to argue against either - then designing controlled trials on the basis of observational evidence is no better than designing them based on an informed understanding of biological mechanisms. There are, of course, infinite hypotheses, only a few of which are right, but if observational evidence were adding any value, we would expect a much larger share of observationally informed clinical trials to reinforce the conclusions of that previous research. Combine this with the fact that observational research remains slow, expensive, and complex, and it is difficult to argue that the continued funding of observational research is bringing us any closer to effective treatments for chronic disease. It looks like a field absorbing large amounts of private and public funding and delivering little in return.


Young and Karr’s editorial was published nearly a decade ago and seems to have had little impact on the field. I wonder whether a larger, more prominent replication - which could easily be done by a magazine or newspaper - would. The results put the burden of proof on the field of observational research to justify its continued existence and support. More importantly, they clearly implicate over-reliance on observational evidence as a probable cause of the failure of nutrition research to deliver meaningful clinical benefit, and they fundamentally undermine the basis of many of the field’s most well-known findings. We’d be fools to expect things to improve without change.

Greg Glassman
January 23rd, 2020 at 5:15 pm

And further from Jeff!



Moving into Read II: “Observational evidence has no predictive value at all … .” In fact, no evidence does. Prediction comes from Cause & Effect relationships, presumed experiment by experiment, and melded into the core of models of the Real World. C&E relationships are not to be proved, but measured – measured statistically.


For the most part, C&E relationships are statistical, not observational. Sometimes C&E is obvious, as in beheading causes death. More often it’s like Jacci Oddino’s surgical shortening of her esophagus to remove a late stage cancer, a procedure with a 5% survival rate over x years. And as we discussed on the phone, perhaps not as good as negligence.


A concluding observation for Read: the p-value is not the culprit in the Replication Crisis. Neither are observational studies. The problem is the absence of Modern Science, fully measured and subjectivity-free.

Clarke Read
January 23rd, 2020 at 11:55 pm

First, I want to thank Jeff for taking the time to provide that feedback (both comments). It is extremely helpful, both independent of and with respect to the perspective he brings.


Jeff, you’re taking a step back, and upon reflection you’re right to do so. I treated p-values as a second-order problem here (with observational research being the first-order problem) when they are (consistent with Collin’s comment below, also appreciated) a third-order problem. The first-order problem is the scientific environment that permits or even encourages the standards, processes, and methods that allow this sort of science to be produced, with little demonstrable benefit, without consequence. I’d posit this reflects distortions in the individuals involved, the incentive structure they operate under, and the biases pervasive throughout the current nutrition science community.


Following on Collin’s comment, this same environment would predictably produce not just observational science that fails to track reality (a failure revealed through failed replication) but non-observational research that is equally flawed. Targeting observational research specifically is, in a strict theoretical sense, misdiagnosing the true problem and indicting a party guilty only by association. Contextually, there may still be value in pointing out the flaws in observational research, specifically because this form of research has been so influential in shaping the thinking that predominates in nutrition science. But even if we fixed the problems with the use of observational research, the scientific community would still be failing to consistently produce understanding that informs beneficial action by patients and the population.


Similarly, my statement that the cost of compliance with Y+K’s guidelines was “unfortunate” was misguided, if well-intentioned. Adding to the cost of research would inevitably harm some innocent individuals, unless the total pool of funding expands, but the increased accountability that comes with a larger required budget would only help shift the field in a more effective direction.


This, to me, raises three questions:


1. Can nutrition science realistically shift toward a more effective framework, and if so, what changes in the incentives, the population, and other structures need to happen to get it there?


2. Considering the answer to (1), what role ought academic science play compared to other potential sources of information - such as private research, industry, and even lay observations? What will get us to the truth most quickly and reliably?


3. What should we do with the science that already exists, given the issues Jeff presents, particularly since we know some of it does in fact predict effective treatments or causes of disease?


I don’t know the answers to these questions, but I think answers to them might start to tell us what a more effective engine to generate nutritional knowledge looks like - and what best to do with the knowledge, in all forms, we already have.

Richard Sheehan
January 19th, 2020 at 2:58 pm

Observational trials are not more likely to be false than true. Observational trials may be highly likely to be misinterpreted, but that does not make either the trial itself or the underlying statistical analysis false or inappropriate.


The cartoon does a great job of making this point. The scientists run 20 tests on different-colored jelly beans and find one statistically significant result at the p < 0.05 level. A pretty standard approach. And that result is exactly what one should expect at the 5% level if there is no relationship. In other words, the scientists are testing the null hypothesis that there is no relation against the alternative hypothesis that there is a relation, and they will reject the null hypothesis at a certain level of significance. At a 5% level with 20 tests run, one should expect to find 0.05*20 = 1 significant result. That is, one should expect to find one false positive among the tests undertaken. Simply stated, the p-value tells everyone how often the test will indicate a Type I error, or false positive: rejecting a null hypothesis that is, in fact, true.
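A small arithmetic sketch of that expectation: with 20 independent tests of true null hypotheses at the 5% level, the expected number of false positives is one, and the chance of at least one false positive is roughly 64%.

```python
# Expected false positives and the chance of at least one, for 20 independent
# tests of true null hypotheses at the 5% significance level (sketch).
alpha, n_tests = 0.05, 20

expected_false_positives = alpha * n_tests        # 0.05 * 20 = 1.0
prob_at_least_one = 1 - (1 - alpha) ** n_tests    # 1 - 0.95**20 ≈ 0.64

print(expected_false_positives, round(prob_at_least_one, 2))
```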


If readers or reviewers or journal editors don't understand that point, it's not a problem with the observational trial or with the statistics. It's a problem with either the reader's lack of understanding of statistics or the authors' failure to put the results in proper context.


Statistical tests virtually always focus on Type I errors: falsely rejecting a true null hypothesis. But one should also keep in mind the potential for Type II errors: failing to reject a false null hypothesis. Think of the null hypothesis that smoking does not cause cancer, and of failing to reject that null hypothesis.


The "solution" to the problems mentioned here is likely not splitting the sample, at least for most health-related issues. The probability of an event, e.g. heart attack, tends to be relatively low so a large sample is required to identify whether there is a statistically significant difference between the control group and the testing group. In general, the larger the sample, the higher the power of the test. Splitting samples would mean either substantially more expense for any test or substantially fewer tests or less informative test results.


Do the tests correctly; report the results correctly; undertake statistical analysis that focuses on the robustness of the results; and identify any potential conflicts of interest or bias. Address those points and we won't see statements like "the results of observational trials are more likely to be false than true."

Clarke Read
January 19th, 2020 at 4:19 pm

Richard, the points you make at the end are valid, and your disagreement with the authors reflects a difference in preferred tactics.


Young and Karr's proposed methodology is, as I read it, aiming toward two ends. First, they want to capture some of the value of replication within a single study. But second, and more importantly, they want to increase the potential for overzealous or careless authors to be shamed for taking liberties with the data. If the data were split, as they suggest, and the analyses of the "split" data disagreed with the main analytical body, it would lead readers and editors to question the analytical methods. They are not trying to directly improve the quality of the evidence so much as to create a stronger basis for shaming those who draw conclusions from low-quality evidence, and to improve things indirectly through that sort of accountability. We'd expect that to improve the quality of evidence anyway, if it can be improved, as it would make researchers more hesitant to publish or even start a study with low predictive value.


You are right that one of the unfortunate consequences of this method is that studies would be more expensive and take longer, simply because more data needs to be gathered. It's hard to spitball the full set of consequences of such a change, but I don't know if that would necessarily be a bad thing.


I did a quick search in NIH's RePORTER database of clinical trials. Searching for the tag "observational study" (a frequently used tag within the database) pulls up 1,103 grants, which received a total of $667 million in funding in FY19. (This number probably isn't precisely correct due to the way the database is set up, but it is probably directionally correct.) Those 1,103 grants were spread across 1,003 PIs and 249 institutions and included 347 R01 grants. In other words, there is plenty of money to go around, but it is spread fairly thin. This isn't news to anybody who receives or distributes NIH funding.


We could argue that pushing researchers to increase the quality of their research and their analysis would fix this problem; in the abstract, it would. But the research (and to a larger extent, the journalism-of-research) community has been pushing for these sorts of changes for years, and they have not been effective. It's becoming increasingly hard to believe researchers will police themselves or hold themselves to the necessary standards, individually or as a community. Accountability would help. The researchers who published the papers Young and Karr reviewed here, who found results that subsequently failed to replicate in controlled trials, almost certainly faced no negative repercussions for this failure and were able to use similar methods to come to similar conclusions in future papers. I don't know if Young and Karr's method - a statistics-based solution to a more-than-just-statistics problem - is the right one. But I would hypothesize that if observational research were held to a more demanding standard and had to be done in a way that substantially increased its predictive value, we would need to see fewer, larger grants distributed to fewer, more competent and qualified teams. That has risks, as any concentration of funding or effort does. But the way epidemiological research is managed and produced right now has virtually zero predictive or informative value. With the right barriers in place, the only way to go is up.

Greg Glassman
January 23rd, 2020 at 5:12 pm

Further notes from Jeff Glassman: 



Read says, "one of the unfortunate consequences of this method is that studies would be more expensive and take longer … ." This is the other side of the coin of the attractiveness of Popper's Post Modern Science (PMS) in the first place! No tough statistics are involved in PMS. No tough approvals. Models in PMS do not even have to work, just be approved, in theory, by (1) a peer group, (2) a certified journal, and (3) an alleged consensus. The theory also requires that the model have a falsification clause. In fact, no such study is known to exist. In further fact, the only thing that counts is to be published. Publish or Perish is not just a slogan; it's another fact of life. Popper & Fisher made life easy for the pseudoscientist, and pseudoscientists have flourished.


Further on Read: the real cost of academic science is just now being paid. It is the cost of failure: of studies, of science, and of the goals of science.


We have way too much bad science and way too little good science (outside of industry).
