The Complexities and Confusions of Medical Science

Richard Smith, former editor of the BMJ

May 11, 2020

The following excerpt is reprinted with permission from The Trouble With Medical Journals (Taylor & Francis, 2000).

‘You open the papers one day and read that alcohol’s bad for you. Two days later there’s another report saying it makes you live longer. Do these scientists know what they’re doing?’ You must have heard such sentences. Scientists are sometimes inclined to blame the media, but the reality is that the study of health and disease is hard, incremental and full of false signals. I think it’s important in a book like this to provide some insights into how difficult many of the studies published are both to do and to interpret. I don’t want to provide a crash course in clinical epidemiology, but I will illustrate the difficulties by discussing how you can determine whether or not a treatment works together with some of the traps that you see commonly in medical journals.

This isn’t an easy chapter. I’ve tried to write this book so that people can read it without too much effort, and Beth Kilcoyne, a writer and actress, who read the whole book in draft, told me that she could read the book easily — apart from this chapter. She found it turgid and was left wondering at the end ‘Why bother publishing anything?’ In my anxiety to make clear to readers the complexities and confusions of medical science I’d clearly gone too far. Progress in medicine may not be straightforward, but it has been dramatic in the past 60 years — and publishing research is central to making progress. It may be that most research papers disappear within a few years, but the same is true of literature: only a few great works survive for centuries. So publishing in medical journals is important, and I’ve tried in response to Beth’s criticism to lighten this chapter. I don’t, however, think that there is anything in the chapter that cannot be understood by anybody, and I hope that it doesn’t sound patronizing if I suggest that it may be worth the effort of reading the chapter slowly and carefully.

I’ve chosen to use the example of trying to determine whether a treatment works because it is important to both patients and doctors. In addition, assessing evidence on whether or not a treatment is effective is simpler than working out whether or not a diagnostic test works, what the prognosis might be of a disease or what the adverse effects of a treatment might be. Nevertheless, the history of medicine is largely a history of ineffective and often dangerous treatments. ‘The role of the doctor,’ joked Voltaire, ‘is to amuse the patient while nature takes its course.’ It may for millennia have been acceptable to treat a patient based on a plausible theory or simple experience, but increasingly it isn’t. New drugs must undergo rigorous testing in order to get onto the market, and slowly but surely surgical treatments are also being regulated.

The simplest way to see if a treatment works is to give it to a patient and see what happens. The resulting publication is called a case report and was for many years the mainstay of medical journals. But you don’t need to be much of a scientist to see the limitations of such evidence. What would have happened if the patient hadn’t been given the treatment? How do I know as a reader whether or not the reported patient is like my patient and whether or not what happened to that patient would happen to my patient?

And what about the ‘placebo effect’? We know that whatever you do to patients about one-third of them will improve. There is also evidence that the more severe the intervention — for example, an operation rather than a pill — the more powerful will be the beneficial effect. (A friend of mine has applied for a licence for a drug called ‘placebo’. There is overwhelming evidence from many trials showing that it works in any condition and has a powerful effect, more powerful than the many drugs that actually have rather limited effects.) New drugs are routinely tested against placebos, but using placebos in surgical trials is difficult: it means, for example, making incisions in patients’ skins so that they think that they have had a full operation but actually not performing the operation. Ideally, new surgical treatments should be tested against such sham operations — and when they are, they are shown to be less effective than previously thought.

Then there is a powerful statistical force called ‘regression to the mean’ (the average) (103). If you find somebody with, for example, high blood pressure, it is highly likely that the next time you measure his blood pressure it will be lower. Blood pressure, like most biological variables (cholesterol, mood, heart rate), swings around, and if you measure the variable when it happens to be especially high or low — and so far from the mean — then the chances are that it will be closer to the mean the next time. The treatment might then seem to be working when in reality it is doing nothing. Much of the seeming effectiveness of antidepressants may result from regression to the mean. Patients are treated when particularly depressed and inevitably seem to improve as their mood returns towards the mean. Doctors and other readers of medical journals consistently underestimate the power of regression to the mean.
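Regression to the mean can be seen in a small simulation. The sketch below is only an illustration: the blood pressure figures (a usual level around 130 with day-to-day swings of about 10, and a cut-off of 150 for a ‘high’ reading) are assumptions chosen for the example, not data from any study. Everybody is measured twice and nobody is treated.

```python
import random


def regression_to_mean_demo(n_people=10_000, cutoff=150.0, seed=1):
    """Measure everyone twice with NO treatment.

    Returns the mean first reading, the mean second reading and the
    number of people in the group picked out for a high first reading.
    """
    rng = random.Random(seed)
    first_high, second_high = [], []
    for _ in range(n_people):
        usual = rng.gauss(130.0, 10.0)        # assumed long-run level for this person
        first = usual + rng.gauss(0.0, 10.0)  # assumed day-to-day swing
        second = usual + rng.gauss(0.0, 10.0)
        if first >= cutoff:                   # selected only because the first reading was high
            first_high.append(first)
            second_high.append(second)
    n = len(first_high)
    return sum(first_high) / n, sum(second_high) / n, n


if __name__ == "__main__":
    mean_first, mean_second, n = regression_to_mean_demo()
    print(f"{n} people had a high first reading")
    print(f"mean first reading:  {mean_first:.1f}")
    print(f"mean second reading: {mean_second:.1f}")
```

The second reading in the selected group comes out lower on average even though nothing was done to anybody, which is exactly the improvement that an ineffective treatment would appear to produce.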

The next test for a treatment after giving it to a single patient is to give it to a group of patients — a series or a cohort in the jargon of medical journals. Reports on series of patients are still the staple diet of surgical journals, which is why surgical research has been compared unkindly to comic opera: such studies are not scientifically serious — and as such are a poor basis for surgical practice (104). The problems that apply to case reports also apply to case series plus there is much room for manipulation, either conscious or unconscious.

The best way for a surgeon to get good results is to operate on people who don’t need an operation in the first place. The way to get bad results is to operate on very sick people. This is one of the fundamental problems of ‘league tables’ of performance of surgeons or hospitals. Those who compile such tables are aware of the problem and so try to ‘risk adjust’ — statistically manipulate the results to allow for things like the age, sex, social class and degree of sickness of the patients. But this is an inexact science. Sometimes the ‘risk adjustment’ is overdone, encouraging surgeons to operate on sick patients because they know that their results will ‘improve’ as a result. Or surgeons may ‘game’ the process (105). They might, for example, report patients who have smoked five cigarettes in their lives as smokers and so increase their patients’ recorded risk (because smokers have poorer results from operations than non-smokers) and ‘improve’ the performance of the surgeon after risk adjustment.

Another problem with league tables, in passing, is that we all tend to underestimate the power of chance. We published a study in the BMJ in which a statistician calculated the range of results that would by chance alone occur in a year in a group of surgeons doing a small number of operations each year with a high failure rate (106). The surgeon at the bottom of the list would have results several times worse than the average and many more times worse than the ‘best surgeon’. It’s hard to believe that such differences could all be due to chance, and yet by definition in this case they were. The following year the results would be completely different. ‘The worst’ might now be ‘the best’. This illustrates the difficulty of us all wanting to be operated on by ‘the best surgeon’. Not that we could anyway: she’d be run off her feet and all other surgeons would be unemployed. Plus nobody could ever get trained.
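The power of chance in a league table can be sketched in a few lines. This is not the calculation from the BMJ study; it is a hypothetical illustration in which every surgeon has exactly the same true failure rate, and the numbers (20 surgeons, 30 operations each, a 20% failure rate) are assumptions made for the example.

```python
import random


def simulate_league_table(n_surgeons=20, ops_per_surgeon=30,
                          true_failure_rate=0.2, seed=1):
    """Return one year's observed failure rates, best first.

    Every surgeon is given the SAME true failure rate, so any spread
    in the returned list is due to chance alone.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_surgeons):
        failures = sum(rng.random() < true_failure_rate
                       for _ in range(ops_per_surgeon))
        rates.append(failures / ops_per_surgeon)
    return sorted(rates)


if __name__ == "__main__":
    for year, seed in enumerate((1, 2), start=1):
        table = simulate_league_table(seed=seed)
        print(f"year {year}: best {table[0]:.0%}, worst {table[-1]:.0%}")
```

Changing the seed stands in for moving to the following year: the surgeons are identical, but the table is reshuffled.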

The problems of patient series and risk adjustment have been well illustrated by studies of the benefits of hormone replacement therapy for women at the menopause. It is a major, expensive and time consuming business to do a randomized trial (see below), and so with many treatments we have to make do initially with results from large cohorts of patients. So there are many reports of thousands of women, even tens of thousands, taking hormone replacement therapy. What happens to them is compared with what happens to women who didn’t take hormone replacement therapy. (So this is one step up in terms of evidence from a simple series in that a comparison is made with a group not taking the treatment.) The women taking the therapy seem to have fewer deaths from heart disease, which is important because in women as in men heart disease is the most common form of death. So a small reduction in the risk of death can result in many lives being prolonged. This seems like an important benefit and is heavily promoted, not least by the companies which manufactured the treatments. (Although the BMJ was also rather carried away some years back with the benefits of the therapy and argued that every woman should take it.)

Unfortunately the women who take the therapy are not the same as those who don’t. Women who take the therapy are likely to be better educated, richer, more concerned with their health, less likely to smoke, more likely to exercise and generally healthier. So it’s not surprising that they have fewer deaths from heart disease. Statisticians ‘risk adjusted’ the data and many declared themselves satisfied that the therapy is truly beneficial. Those who are doubtful are denounced (not too strong a word) as naysayers. But when the first results were announced at the end of 2002 from a large randomized trial published in JAMA it emerged that the therapy made you more likely to suffer heart disease (107).

This pattern of studies of a series of patients suggesting benefits from a treatment and randomized trials showing no benefit or even harm is repeated time and time again. I think that I see it happening now with various interventions for dementia. (A month after I wrote this sentence it emerged — from the same large randomized trial — that hormone replacement therapy, which had seemed to protect against dementia, actually made it more likely (108). I’m not claiming supernatural powers but just illustrating the dangers of drawing premature conclusions from scientifically weak studies.)

These studies of hormone replacement therapy have compared women taking treatment with women not taking treatment. Another way to make a comparison is with patients treated in the past — ‘a before and after study’. This could potentially be better than a simultaneous comparison because everybody should have been treated with the old treatment in the past and the new now. This might then avoid the problem of the treated patients being fundamentally different from the untreated patients. But unfortunately there will be changes over time — in staff, the processes in the hospital, the weather and dozens of other things. So it is hard to be sure that any benefit you see in the group given the new treatment comes from the treatment rather than something else.

Then there is the problem of the ‘Hawthorne effect’, which was first noticed in studies conducted in the Hawthorne factory of Western Electric. The studies showed that the minute you started to study people results improved. The very fact of being studied led to improvement. So, if you make a comparison between patients you are studying and results from patients that you simply treated in the past but not as part of a study, then you may see benefits that result simply from the study, not from the treatment.

One of the biggest problems of all in assessing the effectiveness of treatments is bias. Bias is a strong word and has overtones of naughtiness, but it simply means that if you think that a treatment may work (even subconsciously) then lo and behold it may seem to work even though in reality it doesn’t. Bias might result from you trying harder in other ways with people receiving the treatment, excluding from the treatment patients who may be less likely to benefit, or taking measurements in patients in a different way. Or the patients, detecting your enthusiasm for the treatment and wanting to please you, will try and give you the results you want, telling you perhaps that they feel better when they don’t. There are a thousand subtle ways in which bias can arise, and it’s important to make clear that it is not dishonesty. Bias is unconscious and pervasive.

That’s one reason why the double blind randomized trial — where the doctors, the patients and the researchers do not know who is being treated and who isn’t — is so important. Allocating the patients to different groups randomly also means that the problem of patients being different in age, sex, social class, degree of sickness or many other ways that cannot be identified but may be important will also disappear. And it turns out that all the aspects of a trial matter. If patients are not adequately randomized (but allocated to groups alternately or by the first letter of their surname, the day of the week they are admitted to the trial, or whatever) or the trial is not properly double blind, then the results are distorted — usually the weaker the study the more likely it is to show that the treatment works. The bias is towards the positive.

But randomized trials are no panacea. The biggest problem with trials is that they have usually been too small (109). If a treatment is so powerful that everybody who gets it lives and everybody who doesn’t dies (and I don’t think that there is such a treatment), then you probably don’t need a randomized trial. But with most treatments you need to treat dozens of patients in order to have one fewer death or heart attack or whatever. This means that you need large trials — with perhaps thousands of patients — to be sure whether a treatment works or not. Unfortunately — perhaps because of the inherent over-optimism of doctors or perhaps because of the reluctance of pharmaceutical companies, who fund most trials, to get results that they don’t want — most studies have been too small.

There are many other possible problems that can arise with trials and I want to describe just a few more. One problem arises from having many different possible outcomes to a study. You measure not just whether patients live or die but also whether they have heart attacks, strokes or other problems, whether they are admitted to hospital, whether they have time off work and so on. If you measure enough things then you will find differences in some of them between the treated and the untreated that seem too big to be due to chance. They are, in the jargon of medical journals, ‘statistically significant’. But they may still be related to chance. You’ve increased the chance of finding a result that seems not to be due to chance by measuring so many things.
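The arithmetic behind this trap is short. The sketch below assumes, purely for simplicity, that the outcomes are independent of one another and that the treatment truly does nothing; real outcomes such as heart attacks and hospital admissions are correlated, so the figures are indicative only. The function name is made up for the example.

```python
def chance_of_false_alarm(n_outcomes, alpha=0.05):
    """Probability that at least one of n_outcomes looks 'statistically
    significant' at level alpha when the treatment has no effect at all.

    Assumes the outcomes are independent (a simplification).
    """
    return 1.0 - (1.0 - alpha) ** n_outcomes


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(f"{k:2d} outcomes measured: {chance_of_false_alarm(k):.0%} "
              "chance of at least one false alarm")
```

With one outcome the chance of a spurious ‘significant’ result is the familiar one in 20; with 20 outcomes, under these assumptions, it is closer to two in three.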

The problem of measuring many outcomes again really comes down to the trial being too small, as does the next problem: over-analysing the data. A famous saying of statisticians is that ‘if you torture the data long enough they will confess’. You do a trial of X against Y. You find no difference in the outcome. So you then look at different groups within the study, and you will find, by definition, groups — perhaps men aged 40 to 50 who smoke — where there are big differences between those treated with X and those treated with Y. It might be that this is a real difference or it might be chance. It’s important for readers to know whether you hypothesized that such a group would show a difference before you analysed the results. You perhaps had biological reasons for your hypothesis. If you did, the difference might be real. Otherwise, it almost certainly would not be.

Sometimes people will deliberately torture data to produce the result they want. This is misconduct. More often naive researchers — and there are many such in medicine — play around with the data to produce ‘interesting positive results’, unaware of the implications of what they are doing. The wide availability of statistical packages for computers makes this very easy to do.

Leonard Leibovici, an Israeli physician, illustrated some of these problems and others in a very tongue-in-cheek way by conducting a trial of what he called ‘retroactive intercessory prayer’ (110). He and his team prayed, within a randomized trial, for a group of patients who had been in their hospital four years previously. They found that compared with controls the patients they prayed for did better. This paper was published in the Christmas edition of the BMJ and greatly perplexed some readers and journalists. Should it be taken seriously? The authors pointed out that it was human and small minded to think that time flowed in only one direction. God is under no such restrictions. I think that the results can be explained without resorting to the supernatural. It may be that the authors looked at many possible outcomes and reported on only the positive, or it may be, as I explain below, that you have a high chance of coming up with positive results by chance when testing silly hypotheses.

Other sorts of problems that can arise with clinical trials are illustrated by the famous but apocryphal story of a trial of getting patients to run up Ben Nevis (Britain’s highest mountain) as a treatment for heart attack. The fact that the treatment works is illustrated by the 25 patients who completed the treatment all surviving 10 years. But the authors of this important trial (sponsored perhaps by the Scottish Tourist Board) neglected to report the 25 patients who refused the treatment, the 25 lost on the mountain and the 25 who died while running. This story illustrates the importance of keeping results on all patients who enter trials. In the jargon of trials you should conduct ‘an intention to treat analysis’, which will include all those who refused to take the treatment, didn’t take it or who disappeared from the trial. This is the relevant analysis to the doctor and the patient thinking of starting a treatment because many patients will be just like the patients in the trial.

(Reliable evidence suggests that only about one-half of patients take drugs as they are prescribed (111). I heard Richard Doll, perhaps Britain’s pre-eminent doctor researcher of the 20th century and one of those who developed the randomized trial, tell another story that illustrates the difficulty of doing trials. A patient is entered into a double blind randomized cross-over trial [which means at a point unknown to either the doctor or the patient the drug the patient is taking changes]. During the trial the patient asks the doctor: ‘Have you changed my treatment?’ The doctor explains, yet again, that he wouldn’t know. ‘But what,’ the doctor asks, ‘makes you think that I have?’ ‘Well now when I flush the tablets down the toilet they float. They never used to.’)

The outcomes of trials can also be analysed by using results only from those who completed the treatment as described — that is, in the Ben Nevis example, all those who reached the top of the mountain and came back. This is called in the jargon a ‘per protocol analysis’. Such an analysis can be useful to the doctor and patient because it provides information on what might happen to the patient if he or she does follow the treatment. It is common that ‘intention to treat’ analyses will suggest that treatments don’t work and ‘per protocol’ analyses that they do. In such circumstances it may be tempting for the researchers — and particularly the manufacturers of a drug — to report simply the ‘per protocol’ analysis. Would this be wrong? Until recently it probably would not have been regarded as ‘wrong’ and was common. Now, however, as standards are rising, it would perhaps be regarded as wrong — ‘misconduct’ in some form — to not present both analyses.
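The gap between the two analyses is mostly a matter of which denominator is used, as the Ben Nevis numbers show. The sketch below is not a full intention to treat analysis, which would compare randomized groups; it only contrasts the two denominators. It also assumes, because the story does not say, that patients who refused or were lost cannot be counted as confirmed survivors. The function name is invented for the example.

```python
def survival_summaries(completed_survived, completed_died,
                       refused, lost, died_during):
    """Contrast survival among completers with survival among all enrolled.

    Patients who refused or were lost are NOT counted as confirmed
    survivors (an assumption; their outcomes are unknown in the story).
    """
    completers = completed_survived + completed_died
    enrolled = completers + refused + lost + died_during
    per_protocol = completed_survived / completers       # completers only
    confirmed_all_enrolled = completed_survived / enrolled
    return per_protocol, confirmed_all_enrolled, enrolled


if __name__ == "__main__":
    # Counts taken from the Ben Nevis story.
    pp, everyone, n = survival_summaries(
        completed_survived=25, completed_died=0,
        refused=25, lost=25, died_during=25)
    print(f"{n} patients entered the trial")
    print(f"survival among those who completed the run: {pp:.0%}")
    print(f"confirmed survival among all who entered:   {everyone:.0%}")
```

The same trial yields 100% survival or 25% confirmed survival depending on whether the 75 patients the authors left out are counted.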

Another problem with randomized trials lies not just in how they are done but in how they are reported. There is substantial evidence that many trials have been badly reported (112). They have not given full information on patients, not described the methods of randomization and blinding, not given both ‘intention to treat’ and ‘per protocol’ analyses, not given data on adverse effects and failed in many other ways. This is an indictment not only of authors but also of journals, editors and the whole system of peer review. Widespread recognition of these deficiencies in reporting led a group of researchers and editors to develop a standardized form of reporting randomized trials — called CONSORT (Consolidated Standards of Reporting Trials) (113). Many journals have now adopted these criteria and the reporting of trials seems to have improved as a result — although this conclusion was made from a ‘before and after’ study comparing trials in journals that adopted CONSORT and those that didn’t (not, as readers of this chapter will now know, the best way to decide if the intervention worked) (114).

Randomized trials, despite their problems, have become very important within medicine. Pharmaceutical companies must conduct such trials of their drugs before they can be put on the market. Classically the trials of new drugs have been conducted against placebos. The companies were simply required to show that their drugs were pure, safe and effective (that is, better than placebo). They have not had to show that their new drug is better than treatments already available. But the question that matters to doctors and patients is not whether the new drug is better than a placebo but whether it is better than the existing treatment. Many, including the World Medical Association, have declared it unethical to conduct trials against placebo rather than against the standard treatment.

At the BMJ, for example, we would not publish a trial of a new drug against a placebo when there was good evidence (derived from randomized trials) that an existing treatment works. There are many cases, however, where there is no good evidence to support the standard treatment. The standard treatment is simply what doctors have always given.

Increasingly the authorities who make decisions on which drugs will be available are requiring evidence that drugs are not simply pure, safe and effective, but also that they are better in some ways than existing treatments. There are many ways in which they might be better. They could be more effective, have fewer side-effects, be easier to take (perhaps once rather than three times a day) or cheaper. In England and Wales the National Institute for Health and Clinical Excellence (NICE) needs such evidence before advising that a new treatment be made widely available within the National Health Service.

Randomized trials are important to pharmaceutical companies not only for getting their drugs through regulatory authorities but also for the marketing of their drugs. Those who prescribe drugs increasingly want evidence from trials, and a huge trial conducted in many institutions in many countries can itself be a form of marketing — it draws the drug to the attention of a great many doctors and patients. But these trials can cost many millions of dollars to conduct. In such circumstances it can be disastrous for a company to carry out a trial that shows its drug to be inferior to a competitor drug. Not only will the millions spent on the trial be counterproductive but the hundreds of millions spent developing a new drug will be wasted.

Pharmaceutical companies are thus reluctant to conduct trials of their new drug against competitor drugs, the very trials that doctors and patients want. The answer, argues Silvio Garattini, director of the Mario Negri Institute in Milan, and others, may be for public money to be used to conduct such trials. Garattini also points out that companies are clever at conducting trials that will be beneficial for marketing but do not run the risk of damaging their drug. They conduct trials that are big enough to show that their drug is no worse than that of a competitor (a ‘non-inferiority’ trial in the jargon) but not big enough to run the risk of showing it to be worse (115). There are many other ways in which companies can be sure of getting the results they want, and these were brilliantly and wittily parodied in an article by Dave Sackett and Andy Oxman, two of the founders of ‘evidence-based medicine’, in which they describe a research company called HARLOT (How to Achieve positive Results without actually Lying to Overcome the Truth) (116).

Garattini and others mockingly ask that patients entering such trials sign a consent form saying:

‘Draft informed consent for an underpowered “equivalence trial”: “Let us treat you with something that at best is the same as what you would have had before, but might also reduce — though this is unlikely — most of the advantages previously attained in your condition. It might even benefit you more than any current therapy, but, should that actually happen, we will not be able to prove it or let you know whether the new treatment may somehow bother or even harm you more than the standard one, as potential side-effects may be too rare for us to be able to measure them in this study.”’ (115)

I paraphrase: ‘I agree to participate in this trial which I understand to be of no scientific value but will be useful for pharmaceutical companies in marketing their drug.’ The influence of pharmaceutical companies over what is published in medical journals is discussed more fully in chapter 16.

Because so many randomized trials are too small researchers have developed a means of combining the results of many small trials into something called a meta-analysis or systematic review. These reviews also have the benefit that they include a wider range of patients than are usually included in any single trial. A common problem with trials is that they do not include elderly patients, patients with more than one condition or many other groups. Doctors are thus left wondering whether or not the results of the trial are relevant to the patients they see. Systematic reviews can thus have advantages over single randomized trials.

But just like every other methodology they have deficiencies. The essence of doing a systematic review is that you ask a question that is important to doctors and patients, gather all evidence on the question, evaluate the quality of that evidence, and then combine — perhaps statistically — the high-quality evidence. This is conceptually straightforward, but there are severe problems with every step.

Often the questions that matter to patients and doctors are different from those answered by researchers. It may thus be that there is no useful evidence.

The next problem is finding all the available evidence, and the ‘all’ is important. It is easy to be misled if you find just some of the evidence, and finding all relevant evidence is hard because there is so much and it is so disorganized. Major databases — like PubMed, which is compiled by the National Library of Medicine in Washington — contain only some of all published evidence, and then finding the relevant evidence within the database is difficult. Furthermore, much of the evidence is not published at all, and unfortunately an important bias is usually introduced by looking only at the evidence that is straightforward to find. This is because evidence that suggests that a treatment works is likely to be published in major journals and so easy to find, whereas the evidence that suggests that a treatment does not work may not be published at all or may be published in more obscure journals that are not included in the major bibliographic databases.

The next problem is to evaluate the quality of the evidence. There are many different methods for doing this, and they often do not select the same studies. This matters as well because often the poorer-quality studies suggest that a treatment works, whereas the higher-quality studies suggest it doesn’t. The cut-off point used by the person conducting the review may thus make a difference to the conclusion of the review. Another common problem encountered in such reviews is that there are many studies but all of low quality. The conclusion of the review is thus that we do not know if a treatment works or not. We had intense debates over whether or not we should publish such studies in the BMJ, but increasingly we did. Surely patients should be told that there is no good evidence to support a treatment that may be offered to them. Finally, combining results from very different sorts of trials can be both difficult and misleading. There are sophisticated statistical tests for assessing the heterogeneity of trials, but — just as with risk adjustment — they can never entirely compensate for inadequate data.

So far I have discussed the evidence that might be used to assess whether or not a treatment works, and I hope the reader — whose head may be spinning — will agree that this is no simple task and that the studies published in medical journals are hard to do, hard to interpret and may often mislead. John Ioannidis, a researcher from Greece, has gone as far as to argue in an article that has attracted lots of attention that ‘most published research findings are false’ (117). If you spend your time reading medical journals you may more often be misled than informed. You may be wasting your time. You’d be less misinformed if you never read a journal.

I don’t want to discuss the problems with every other sort of study, but if readers would like to read more on this subject in language that they will understand I suggest that they read Trish Greenhalgh’s book How to Read a Paper (118). (Trish is a friend of mine and the book is published by the BMJ Publishing Group of which I was the chief executive, but I will not benefit financially from increased sales of the book.)

Before finishing I do, however, want briefly to discuss the sort of study that associates X with Y — because these are the studies that are most commonly reported in the mass media. X might be alcohol, smoking, exercise, coffee, garlic, sex or a thousand other things, and Y could be death, breast cancer, heart attacks, stroke, depression or many other horrors. Two cartoons tell the story. In the first a newsreader on ‘Today’s Random Medical News’ has behind him three dials indicating that ‘smoking, coffee, stress, etc’ cause ‘heart disease, depression, breast cancer, etc’ in ‘children, rats, men aged 25 to 40’. In the second cartoon a listener with a spinning head is being told ‘Don’t eat eggs … eat more eggs … stay out of the sun … don’t lie around inside’.

These stories come usually from epidemiological studies in which the researchers examine measurements in a large group of people and look for statistical associations between X and Y. Even if optimally done they do not prove causation. The fact that X and Y have a statistical association is a long way from proving causation. There is probably a fairly strong correlation between cases of autism and the numbers of people using personal computers in that both have climbed in the past 15 years, but there is no reason to think that there is any causal link (or maybe now somebody will suggest it).

Too many small studies and imprecise testing of improbable hypotheses in medical journals lead to an excessive number of associations that then spread through the mass media, according to statistician Jonathan Sterne and epidemiologist George Davey Smith (119). The other part of the problem is that medical journals and researchers have been obsessed with a statistical test (called a t-test). It produces something called a p (for probability) value. Traditionally if p is less than 0.05 (meaning that, if X and Y were not really linked, there would be a probability of only 5%, or one in 20, of a finding this strong arising by chance), then the result is taken as positive or ‘true’ — X and Y are linked. And both authors and the media are then quick to suggest that X causes Y, which, as I’ve said, doesn’t follow even if X and Y are truly linked.

These studies can produce ‘false positives’ (when the result suggests a true link but in fact there is no link) and ‘false negatives’ (when the study says there is no link but in fact there is). (This is true, importantly, of all diagnostic tests. No test is perfect. They all produce false positives and false negatives, providing one reason why the art of diagnosis is so hard.)

Sterne and Davey Smith explain why we are deluged with bogus associations by making a plausible assumption that 10% of hypotheses are true and 90% untrue. Their second assumption is that most studies are too small and that studies reported in medical journals therefore have only a 50% chance of getting the right answer. Lots of evidence supports this assumption. They then consider 1000 studies testing different hypotheses. One hundred (that is, 10%) will be true, but 50% of those will be reported as untrue, leaving 50 correctly reported as true. Of the 900 hypotheses that are untrue, 45 (5%) will be reported as true because p<0.05 is used as the threshold for ‘true’. So almost half (45) of the 95 studies reported as ‘positive’ are false alarms.
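
The arithmetic in that argument can be laid out step by step. The sketch below uses only the figures given in the text (1000 studies, 10% of hypotheses true, a 50% chance of detecting a true link, and the 0.05 threshold); the function name `sift_studies` and its parameter names are hypothetical labels chosen for the illustration.

```python
def sift_studies(n_studies=1000, share_true=0.10, power=0.50, alpha=0.05):
    """Count true and false 'positive' studies under the stated assumptions."""
    true_hypotheses = n_studies * share_true         # 100 of 1000
    untrue_hypotheses = n_studies - true_hypotheses  # the other 900
    true_positives = true_hypotheses * power         # true links that are detected
    false_positives = untrue_hypotheses * alpha      # untrue links that pass p < 0.05
    reported_positive = true_positives + false_positives
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "reported_positive": reported_positive,
        "false_alarm_share": false_positives / reported_positive,
    }


result = sift_studies()
print(f"reported positive: {result['reported_positive']:.0f}")
print(f"false alarms:      {result['false_positives']:.0f}")
print(f"false-alarm share: {result['false_alarm_share']:.0%}")
```

With the figures from the text this gives 95 positive reports, 45 of them false alarms, or about 47%. Changing the inputs shows how sensitive the result is: if, hypothetically, studies had an 80% chance of detecting a true link, `sift_studies(power=0.80)` would put the false-alarm share at 36%.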

Doctor and medical journalist, James Le Fanu, has suggested that the answer to the problem would be ‘the closure of all departments of epidemiology’ (120).

The problems of doing and interpreting studies on whether or not treatments work and whether or not X and Y might be linked apply to every other sort of study published in journals, and journals are publishing an ever wider range of studies. For example, the BMJ was publishing steadily more studies that use the methods of the social sciences and economics. We did so because these methods are the best for answering some of the very broad range of questions that arise in healthcare. The methods of social science are, for example, optimal for asking questions like, ‘What do doctors and patients think of a subject and how do they behave?’ Economic methods clearly must be used for assessing the cost-effectiveness of treatments, and healthcare systems have to consider costs, not just clinical effectiveness. This, for example, is exactly what NICE does.

We thus need new methods, but each time a journal publishes new methods the editors must try to understand how to assess the quality of studies — and so must readers. Editors are, of course, paid to read studies and assess their quality, and so they work hard at understanding new methods. It does, however, take a long time to understand them — and most editors never achieve complete mastery of the methods, which is why they need reviewers and advisers. Readers, in contrast, are not paid to understand the methods — and most of them never do. Journals provide readers with an increasingly complex diet of research, most of which most of the readers are not able to assess. They have to trust the journals — despite the evidence that journals mostly publish material of limited relevance and low scientific quality.

The problem of complex and unfamiliar methodologies is likely to get much worse, warns Doug Altman, one of the BMJ’s statistical advisers. The easy and cheap availability of immense computing power means that highly complex calculations can be easily done. But how can editors and statistical advisers review the validity of such tests when they don’t understand them and the results cannot be presented on paper? Salvation might lie, I hope (perhaps naively), in electronic publication of studies.

Authors will publish raw data together with the computer programs used to analyse the data. Editors, reviewers and readers might then be able to repeat the analyses for themselves.

You are convinced, I hope, that there is lots of room for error and misunderstanding in what medical journals publish even when everything is done honestly and in good faith. If we add misconduct and manipulation into the mix then the potential for confusion and harm is greatly increased. The answer is not, however, to abandon medical journals — as Beth was led to conclude — but to promote critical reading and debate. Or maybe Beth is right.

References

Wakefield AJ, Murch SH, Linnell AAJ et al. Ileal-lymphoid-nodular hyperplasia, non-specific colitis and pervasive developmental disorder in children. Lancet 1998;351:637-41.
Laumann E, Paik A, Rosen R. Sexual dysfunction in the United States: prevalence and predictors. JAMA 1999;281:537-44 (published erratum appears in JAMA 1999;281:1174).
Moynihan R. The making of a disease: female sexual dysfunction. BMJ 2003;326:45-7.
Hudson A, McLellan F. Ethical issues in biomedical publication. Baltimore: Johns Hopkins University Press, 2000.
Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. London: Little, Brown, 1991.
Haynes RB. Where’s the meat in clinical journals? ACP Journal Club 1993;119:A23-4.
Altman DG. The scandal of poor medical research. BMJ 1994;308:283-4.
Shaughnessy AF, Slawson DC, Bennett JH. Becoming an information master: a guidebook to the medical information jungle. J Fam Pract 1994;39:489-99.
Bartrip P. Mirror of medicine: a history of the BMJ. Oxford: British Medical Journal and Oxford University Press, 1990.
Chen RT, DeStefano F. Vaccine adverse events: causal or coincidental? Lancet 1998;351:611-12.
Pobel D, Vial JF. Case-control study of leukaemia among young people near La Hague nuclear reprocessing plant: the environmental hypothesis revisited. BMJ 1997;314:101.
Horton R. A statement by the editors of the Lancet. Lancet 2004;363:820-1.
Murch SH, Anthony A, Casson DH et al. Retraction of an interpretation. Lancet 2004;363:750.
Smith R. The discomfort of patient power. BMJ 2002;324:497-8.
Antithrombotic Trialists’ Collaboration. Collaborative meta-analysis of randomised trials of antiplatelet therapy for prevention of death, myocardial infarction and stroke in high risk patients. BMJ 2002;324:71-86.
Cleland JGF. For debate: Preventing atherosclerotic events with aspirin. BMJ 2002;324:103-5.
Bagenal FS, Easton DF, Harris E et al. Survival of patients with breast cancer attending Bristol Cancer Help Centre. Lancet 1990;336:606-10.
Fox R. Quoted in: Smith R. Charity Commission censures British cancer charities. BMJ 1994;308:155-6.
Richards T. Death from complementary medicine. BMJ 1990;301:510.
Goodare H. The scandal of poor medical research: sloppy use of literature often to blame. BMJ 1994;308:593.
Bodmer W. Bristol Cancer Help Centre. Lancet 1990;336:1188.
Budd JM, Sievert ME, Schultz TR. Phenomena of retraction. Reasons for retraction and citations to the publications. JAMA 1998;280:296-7.
McVie G. Quoted in: Smith R. Charity Commission censures British cancer charities. BMJ 1994;308:155-6.
Smith R. Charity Commission censures British cancer charities. BMJ 1994;308:155-6.
Feachem RGA, Sekhri NK, White KL. Getting more for their dollar: a comparison of the NHS with California’s Kaiser Permanente. BMJ 2002;324:135-41.
Himmelstein DU, Woolhandler S, David OS et al. Getting more for their dollar: Kaiser v the NHS. BMJ 2002;324:1332.
Talbot-Smith A, Gnani S, Pollock A, Pereira Gray D. Questioning the claims from Kaiser. Br J Gen Pract 2004;54:415-21.
Ham C, York N, Sutch S, Shaw A. Hospital bed utilisation in the NHS, Kaiser Permanente, and the US Medicare programme: analysis of routine data. BMJ 2003;327:1257-61.
Sanders SA, Reinisch JM. Would you say you ‘had sex’ if…? JAMA 1999;281:275-7.
Anonymous. It’s over, Debbie. JAMA 1988;259:272.
Lundberg G. ‘It’s over, Debbie,’ and the euthanasia debate. JAMA 1988;259:2142-3.
Smith R. Euthanasia: time for a royal commission. BMJ 1992;305:728-9.
Doyal L, Doyal L. Why active euthanasia and physician assisted suicide should be legalised. BMJ 2001;323:1079-80.
Emanuel EJ. Euthanasia: where The Netherlands leads will the world follow? BMJ 2001;322:1376-7.
Angell M. The Supreme Court and physician-assisted suicide: the ultimate right. N Eng J Med 1997;336:50-3.
Marshall VM. It’s almost over: more letters on Debbie. JAMA 1988;260:787.
Smith R. Cheating at medical school. BMJ 2000;321:398.
Davies S. Cheating at medical school. Summary of rapid responses. BMJ 2001;322:299.
Ewen SWB, Pusztai A. Effects of diets containing genetically modified potatoes expressing Galanthus nivalis lectin on rat small intestine. Lancet 1999;354:1353-4.
Horton R. Genetically modified foods: ‘absurd’ concern or welcome dialogue? Lancet 1999;354:1314-15.
Kuiper HA, Noteborn HPJM, Peijnenburg AACM. Adequacy of methods for testing the safety of genetically modified foods. Lancet 1999;354:1315.
Bombardier C, Laine L, Reicin A et al. Comparison of upper gastrointestinal toxicity of rofecoxib and naproxen in patients with rheumatoid arthritis. N Eng J Med 2000;343:1520-8.
Curfman GD, Morrissey S, Drazen JM. Expression of concern: Bombardier et al., ‘Comparison of Upper Gastrointestinal Toxicity of Rofecoxib and Naproxen in Patients with Rheumatoid Arthritis.’ N Eng J Med 2000;343:1520-8. N Eng J Med 2005;353:2813-4.
Curfman GD, Morrissey S, Drazen JM. Expression of concern reaffirmed. N Eng J Med 2006;354:1193.
Laumann E, Paik A, Rosen R. Sexual dysfunction in the United States: prevalence and predictors. JAMA 1999;281:537-44 (published erratum appears in JAMA 1999;281:1174).
Smith R. In search of ‘non-disease.’ BMJ 2002;324:883-5.
Hughes C. BMJ admits ‘lapses’ after article wiped £30m off Scotia shares. Independent 10 June 2000.
Hettiaratchy S, Clarke J, Taubel J, Besa C. Burns after photodynamic therapy. BMJ 2000;320:1245.
Bryce A. Burns after photodynamic therapy. Drug point gives misleading impression of incidence of burns with temoporfin (Foscan). BMJ 2000;320:1731.
Richmond C. David Horrobin. BMJ 2003;326:885.
Enstrom JE, Kabat GC. Environmental tobacco smoke and tobacco related mortality in a prospective study of Californians, 1960-98. BMJ 2003;326:1057-60.
Roberts J, Smith R. Publishing research supported by the tobacco industry. BMJ 1996;312:133-4.
Lefanu WR. British periodicals of medicine 1640-1899. London: Wellcome Unit for the History of Medicine, 1984.
Squire Sprigge S. The life and times of Thomas Wakley. London: Longmans, 1897.
Bartrip PWJ. Themselves writ large: the BMA 1832-1966. London: BMJ Books, 1996.
Delamothe T. How political should a general medical journal be? BMJ 2002;325:1431-2.
Gedalia A. Political motivation of a medical journal [electronic response to Halileh and Hartling. Israeli-Palestinian conflict]. BMJ 2002. http://bmj.com/cgi/eletters/324/7333/361#20289 (accessed 10 Dec 2002).
Marchetti P. How political should a general medical journal be? Medical journal is no place for politics. BMJ 2003;326:1431-32.
Roberts I. The second gasoline war and how we can prevent the third. BMJ 2003;326:171.
Roberts IG. How political should a general medical journal be? Medical journals may have had role in justifying war. BMJ 2003;326:820.
Institute of Medicine. Crossing the quality chasm. A new health system for the 21st century. Washington: National Academy Press, 2001.
Oxman AD, Thomson MA, Davis DA, Haynes RB. No magic bullets: a systematic review of 102 trials of interventions to improve professional practice. Can Med Assoc J 1995;153:1423-31.
Grimshaw JM, Russell IT. Effect of clinical guidelines on medical practice: a systematic review of rigorous evaluations. Lancet 1993;342:1317-22.
Grol R. Beliefs and evidence in changing clinical practice. BMJ 1997;315:418-21.
Smith R. What clinical information do doctors need? BMJ 1996;313:1062-8.
Godlee F, Smith R, Goldman D. Clinical evidence. BMJ 1999;318:1570-1.
Smith R. The BMJ: moving on. BMJ 2002;324:5-6.
Milton J. Areopagitica. World Wide Web: Amazon Press (digital download), 2003.
Coulter A. The autonomous patient: ending paternalism in medical care. London: Stationery Office Books, 2002.
Muir Gray JA. The resourceful patient. Oxford: Rosetta Press, 2001.
World Health Organization. Macroeconomics and health: investing in health for economic development. Report of the commission on macroeconomics and health. Geneva: WHO, 2001.
Mullner M, Groves T. Making research papers in the BMJ more accessible. BMJ 2002;325:456.
Godlee F, Jefferson T, eds. Peer review in health sciences, 2nd edn. London: BMJ Books, 2003.
Relman AS. Dealing with conflicts of interest. N Eng J Med 1984;310:1182-3.
Hall D. Child protection: lessons from Victoria Climbié. BMJ 2003;326:293-4.
McCombs ME, Shaw DL. The agenda setting function of mass media. Public Opin Q 1972;36:176-87.
McCombs ME, Shaw DL. The evolution of agenda-setting research: twenty five years in the marketplace of ideas. J Commun 1993;43:58-67.
Edelstein L. The Hippocratic oath: text, translation, and interpretation. Baltimore: Johns Hopkins Press, 1943.
www.pbs.org/wgbh/nova/doctors/oath_modern.html (accessed 8 June 2003).
Weatherall DJ. The inhumanity of medicine. BMJ 1994;309:1671-2.
Smith R. Publishing information about patients. BMJ 1995;311:1240-1.
Smith R. Informed consent: edging forwards (and backwards). BMJ 1998;316:949-51.
Calman K. The profession of medicine. BMJ 1994;309:1140-3.
Smith R. Medicine’s core values. BMJ 1994;309:1247-8.
Smith R. Misconduct in research: editors respond. BMJ 1997;315:201-2.
McCall Smith A, Tonks A, Smith R. An ethics committee for the BMJ. BMJ 2000;321:720.
Smith R. Medical editor lambasts journals and editors. BMJ 2001;323:651.
Smith R, Rennie D. And now, evidence based editing. BMJ 1995;311:826.
Weeks WB, Wallace AE. Readability of British and American medical prose at the start of the 21st century. BMJ 2002;325:1451-2.
O’Donnell M. Evidence-based illiteracy: time to rescue ‘the literature’. Lancet 2000;355:489-91.
O’Donnell M. The toxic effect of language on medicine. J R Coll Physicians Lond 1995;29:525-9.
Berwick D, Davidoff F, Hiatt H, Smith R. Refining and implementing the Tavistock principles for everybody in health care. BMJ 2001;323:616-20.
Gaylin W. Faulty diagnosis. Why Clinton’s health-care plan won’t cure what ails us. Harpers 1993;October:57-64.
Davidoff F, Reinecke RD. The 28th Amendment. Ann Intern Med 1999;130:692-4.
Davies S. Obituary for David Horrobin: summary of rapid responses. BMJ 2003;326:1089.
Butler D. Medical journal under attack as dissenters seize AIDS platform. Nature 2003;426:215.
Smith R. Milton and Galileo would back BMJ on free speech. Nature 2004;427:287.
Carr EH. What is history? Harmondsworth: Penguin, 1990.
Popper K. The logic of scientific discovery. London: Routledge, 2002.
Kuhn T. The structure of scientific revolutions. London: Routledge, 1996.
www.guardian.co.uk/newsroom/story/0,11718,850815,00.html (accessed 14 June 2003).
Davies S, Delamothe T. Revitalising rapid responses. BMJ 2005;330:1284.
Morton V, Torgerson DJ. Effect of regression to the mean on decision making in health care. BMJ 2003;326:1083-4.
Horton R. Surgical research or comic opera: questions, but few answers. Lancet 1996;347:984-5.
Pitches D, Burls A, Fry-Smith A. How to make a silk purse from a sow’s ear — a comprehensive review of strategies to optimise data for corrupt managers and incompetent clinicians. BMJ 2003;327:1436-9.
Poloniecki J. Half of all doctors are below average. BMJ 1998;316:1734-6.
Writing group for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women. JAMA 2002;288:321-33.
Shumaker SA, Legault C, Thal L et al. Estrogen plus progestin and the incidence of dementia and mild cognitive impairment in postmenopausal women: the Women’s Health Initiative Memory Study: a randomized controlled trial. JAMA 2003;289:2651-62.
Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials? Stat Med 1984;3:409-22.
Leibovici L. Effects of remote, retroactive intercessory prayer on outcomes in patients with bloodstream infection: randomised controlled trial. BMJ 2001;323:1450-1.
Haynes RB, McKibbon A, Kanani R. Systematic review of randomised trials of interventions to assist patients to follow prescriptions for medications. Lancet 1996;348:383-6.
Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408-12.
Altman DG, Schulz KF, Moher D et al., for the CONSORT Group. The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med 2001;134:663-94.
Moher D, Jones A, Lepage L; CONSORT Group (Consolidated Standards for Reporting of Trials). Use of the CONSORT statement and quality of reports of randomized trials: a comparative before-and-after evaluation. JAMA 2001;285:1992-5.
Garattini S, Bertele V, Li Bassi L. How can research ethics committees protect patients better? BMJ 2003;326:1199-201.
Sackett DL, Oxman AD. HARLOT plc: an amalgamation of the world’s two oldest professions. BMJ 2003;327:1442-5.
Ioannidis JPA. Why most published research findings are false. PLoS Med 2005;2:e124.
Greenhalgh T. How to read a paper. London: BMJ Books, 1997.
Sterne JAC, Davey Smith G. Sifting the evidence: what’s wrong with significance tests? BMJ 2001;322:226-31.
Le Fanu J. The rise and fall of modern medicine. New York: Little, Brown, 1999.