Dr James Burger

Despite the large amount of uncertainty in medicine, coming to the correct decision about a patient is imperative, both to help that person and to avoid the soul-destroying consequences of contravening our oath of primum non nocere: first, do no harm. As a result, we have become reliant on evidence-based medicine to guide us in these often-difficult choices.

Evidence-based medicine (EBM) is the “conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients” (1). It helps to improve diagnosis and clinical decision-making (1). EBM forms the basis for our decisions as doctors, as well as for decisions at a policy level. We trust evidence to steer us towards the best way forward. But what happens when it fails to do so?

Incorrect research can have dangerous effects. Consider the repercussions of the infamous article by Wakefield et al linking the MMR vaccine to autism: vaccination rates dropped and subsequent measles outbreaks resulted in the deaths of children, despite the majority of the authors retracting the interpretation and the British Medical Journal publishing a series exposing the fraud in the paper (2). Especially with the proliferation of social media and medical information sharing, there is the potential to spread bad research along with the good.

Owing to sampling and the nature of statistics, we can never be one hundred percent certain of the conclusions of research. However, we now find ourselves in a reproducibility crisis, in which replications of even our landmark studies have often failed to reproduce significant results (3; 4; 5).

In this quest for objectivity, we seem to have become too focussed on hypothesis testing as a litmus test for the importance of research findings. The pervasively used p-value was never intended to single-handedly determine whether a study has important findings (6; 3; 4; 7). Yet our eyes light up at the number of zeros after the decimal point of a p-value, often distracting us from the size of the effect (3; 6).

As doctors, we find ourselves judged not just by our technical skills and knowledge, but by our publications. With the increasingly prevalent requirement of research publications prior to specialisation, it is no longer just the livelihoods of academics and researchers that rely on publication.

Due to the nature of the p-value, the pre-test probability of a hypothesis greatly influences the rate of false positives (6). If researchers start to chase improbable results in order to get published, this must be taken into account when interpreting their findings.
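This relationship can be made concrete with a short sketch. The numbers below are illustrative (a 0.05 significance threshold and 80% statistical power are conventional assumptions, and the function name is ours, not from any cited source), but they show why testing unlikely hypotheses floods the literature with false positives even when every individual test is run correctly:

```python
# Illustrative sketch: how the pre-test probability of a hypothesis
# drives the share of 'significant' findings that are false positives.
# Assumes a significance threshold (alpha) of 0.05 and power of 0.8.

def false_positive_share(prior, alpha=0.05, power=0.8):
    """Fraction of 'significant' results that are actually false positives."""
    true_pos = prior * power          # real effects correctly detected
    false_pos = (1 - prior) * alpha   # null effects crossing p < alpha
    return false_pos / (true_pos + false_pos)

for prior in (0.5, 0.1, 0.01):
    share = false_positive_share(prior)
    print(f"pre-test probability {prior:>4}: "
          f"{share:.0%} of significant results are false")
```

With a 50% pre-test probability only around 6% of significant results are false, but at a 1% pre-test probability the majority of them are: the p-value threshold has not changed, only the plausibility of what was tested.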

There are many ways in which researchers can increase the likelihood of finding a significant result: using multiple different methods, analysing multiple variables, dropping certain data, or controlling for certain variables (3; 4). P-hacking describes the practice of reanalysing data in these ways until it ultimately shows the result the researcher wants (6; 4).
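A minimal simulation shows why this works. Under the null hypothesis a p-value is uniformly distributed, so every extra analysis of pure-noise data (another variable, another subgroup, another model) is effectively another random draw, and the chance that at least one draw dips below 0.05 grows quickly. This sketch is ours, with illustrative parameters, not a reproduction of any cited analysis:

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

def chance_of_false_positive(n_analyses, alpha=0.05, trials=100_000):
    """Monte Carlo estimate: probability that at least one of n_analyses
    independent looks at pure-noise data yields p < alpha."""
    hits = 0
    for _ in range(trials):
        # under the null, each p-value is a uniform random draw
        if any(random.random() < alpha for _ in range(n_analyses)):
            hits += 1
    return hits / trials

for k in (1, 5, 20):
    analytic = 1 - (1 - 0.05) ** k
    print(f"{k:>2} analyses: simulated ~{chance_of_false_positive(k):.0%} "
          f"(analytic {analytic:.0%})")
```

With a single analysis the false-positive rate stays at the nominal 5%, but by twenty analyses of the same noise it exceeds 60%, in line with the inflated rates discussed below.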

Such ‘significant’ results, however, carry vastly inflated false-positive rates of up to 60% (6). When our medical policies and decisions about patients are so closely linked to study results, this carries significant risk of inappropriate allocation of resources and increased patient morbidity and mortality. Boos and Stefanski (2011) also highlighted the limited value of reporting exact p-values and raised concern about the illusion of certainty they create, given the imprecision and variability of these tests (5).

We would hope that research outputs come from a highly objective methodology, but without the raw data one simply cannot tell how much manipulation occurred before the reported results were reached (3; 4). With researchers’ careers hanging on a ‘significant’ result for publication, and without complete blinding and randomisation, results tend to drift towards what the researcher believed to be true at the outset, even unintentionally (4).

Silberzahn et al (2015) showed how researchers can reach very different conclusions from a single dataset. Twenty-nine research teams used an identical dataset to answer the question of whether darker skin-toned players were more likely to receive red cards than lighter skin-toned players. Despite each team making very reasonable analytical decisions, the teams found not just varied levels of significance, but major differences in effect size, with odds ratios ranging from 0.89 to 2.93 (8).

Overall, all of these decisions in statistical analysis influence the results, and there is a need for much more openness about methods (4; 8), with two-stage analyses (6) and crowdsourced analysis (8) suggested as promising ways forward.

The drive for evidence-based medicine has benefited, and hopefully will continue to benefit, our patients and help guide us in policy. Research, though, is not without its flaws, and incorrect statistical decision-making can have far-reaching negative effects on our patients. P-hacking and the manipulation of data by researchers to chase significance is just one of the many factors that have led us to our current reproducibility crisis. A shared commitment is needed from both researchers and scientific journals towards more open methods and the judicious use of p-values, acknowledging their limitations and taking into account confidence intervals and effect sizes. While not perfect, evidence-based medicine is the best way we have to help our patients, and we should continue to think critically and interpret results for ourselves as doctors, scientists and statisticians.

*The views and opinions expressed in this article are that of the author and not necessarily of Thumela as a whole*
If you wish to submit an article of your own, please contact us at james@thumela.org

Works Cited

  1. Evidence based medicine: what it is and what it isn’t. Sackett, D., Rosenberg, W., Muir Gray, J., Scott Richardson, W. No. 7023, 1996, British Medical Journal, Vol. 312, pp. 71-72.
  2. The MMR vaccine and autism: Sensation, refutation, retraction, and fraud. Sathyanarayana Rao, T.S., Andrade, C. 2, 2011, Indian Journal of Psychiatry, Vol. 53, pp. 95–96.
  3. Why Most Published Research Findings Are False. Ioannidis, J. 8, August 2005, PLoS Medicine, Vol. 2, pp. 696-701.
  4. Insel, T. P-Hacking. National Institute of Mental Health. [Online] 14 November 2014. [Cited: 29 March 2017.] https://www.nimh.nih.gov/about/directors/thomas-insel/blog/2014/p-hacking.shtml.
  5. P-Value Precision and Reproducibility. Boos, D., Stefanski, L. 4, 2011, The American Statistician, Vol. 65, pp. 213-221.
  6. Scientific method: Statistical errors. Nuzzo, R. 13 February 2014, Nature, Vol. 506, pp. 150–152.
  7. Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. Goodman, S. 12, 1999, Annals of Internal Medicine, Vol. 130, pp. 995-1004.
  8. Many analysts, one dataset: Making transparent how variations in analytical choices affect results. Silberzahn R., Uhlmann E. L., Martin D. P., Anselmi P., Aust F., Awtrey E., Bahník Š., Bai F., Bannard C., Bonnier E., Carlsson R., Cheung F., Christensen G., Clay R., Craig M. A., Dalla Rosa A., Dam L., Evans M. H., Flores Cervantes I., Fong N., Gamez-Djokic M., Glenz A., Gordon-McKeon S., Heaton T. J., Hederos Eriksson K., Heene M., Hofelich Mohr A. J., Högden F., Hui K., Johannesson M., Kalodimos J., Kaszubowski E., Kennedy D.M., Lei R., Lindsay T. A., Liverani S, Madan C. R., Molden D., Molleman E., Morey R. D., Mulder L. B., Nijstad B. R., Pope N. G., Pope B., Prenoveau J. M., Rink F., Robusto E., Roderique H., Sandberg A., Schlüter E., Schönbrodt F. D., Sherman M. F., Sommer S.A., Sotak K., Spain, Spörlein C., Stafford T., Stefanutti L., Tauber S., Ullrich J., Vianello M., Wagenmakers E., Witkowiak M., Yoon S., & Nosek B. A. 2015, Open Science Framework.
  9. The Abuse of Power. Hoenig, J., Heisey, D. 1, 2001, The American Statistician, Vol. 55, pp. 19-24.