An introduction to EBM (session 4)

skills · beginner · evidence-based medicine · critical appraisal · clinical trials

Published September 9, 2024

Session outline

  • this session is about core EBM statistics:
    • Greenhalgh’s three numbers (Greenhalgh 2019, 54):
      • sample size
      • duration of follow-up
      • completeness of follow-up
    • confidence intervals
    • p-values
    • effect sizes

Exercises

  • E1: what’s this all for?
  • E2: ways to cheat
  • E3: putting it all together

Introduction

This is non-technical and non-comprehensive

  • medical statistics is a big and complicated field
    • there are hundreds of statistical tests
    • many are formidably complex
      • chapter 5 of Greenhalgh (2019) gives really nice advice on navigating them
    • so we’re just looking at a few core areas
  • statistics are widely abused in the health literature
  • this abuse gives rise to general unease about statistics

E1: what’s this all for?

Task
  1. Please review the abstract of Smith (2003) available at bmj.com. What point is the author trying to make?
  2. Please now review the abstract of Yeh et al. (2018) available at bmj.com. What point are the authors trying to make?

Finding out if treatments work is hard

  • most examples are not like the parachute case (Xu and Prasad 2022)
  • most treatment effects are subtle
  • statistics are a vital way of being more sure about the results we get
Note
  • the Number Needed to Treat (NNT) is the number of patients who need to be treated to prevent one additional undesirable outcome (see the sketch below)
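A minimal sketch of the arithmetic, using made-up event rates (these numbers are illustrative, not from any trial): the NNT is the reciprocal of the absolute risk reduction.

```python
# Minimal sketch with made-up event rates: NNT = 1 / absolute risk reduction (ARR)
control_event_rate = 0.12    # 12% of control patients have the undesirable outcome
treatment_event_rate = 0.10  # 10% of treated patients have the undesirable outcome

arr = control_event_rate - treatment_event_rate  # absolute risk reduction
nnt = 1 / arr

print(f"ARR = {arr:.2f}, NNT = {nnt:.0f}")
# -> ARR = 0.02, NNT = 50: treat 50 patients to prevent one extra undesirable outcome
```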

Sample size (n)

Introduction

The problem of growing sample sizes

  • recruitment is hard
  • larger n = harder recruitment
    • may end up relying on a heterogeneous population
      • e.g. Montalescot et al. (2009)
      • 3534 participants
      • 707 sites
      • 30 countries

Confidence intervals

  • or, why make things so large?
  • intuition: the larger our sample, the more precise our estimates

Confidence intervals

  • imagine we’re evaluating an intervention in a small population
  • 10% of the treatment arm get some outcome, compared to 12% in the control arm
  • relative risk in this trial population = 10% ÷ 12% ≈ 83%
  • Q: what will the relative risk be in the whole population?

The idea of sampling error

  • our measure of risk in a small sample might not accurately reflect the true population risk
    • this gap between the sample estimate and the true value is sampling error; its typical size is quantified by the standard error
  • we can’t say exactly how accurate our sample estimate is
    • unless we know the true population risk, which in practice we never do
  • but we can give an interval estimate for our measurement, describing likely lower and upper values within which the true value might lie (see the simulation below)
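To make sampling error concrete, here is a small simulation under assumed ‘true’ risks of 10% (treatment) and 12% (control) - numbers we could never actually know in practice. Each repetition draws a fresh trial-sized sample and computes its relative risk; the spread of those estimates is sampling error in action.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed 'true' population risks (unknowable in real life)
true_treatment_risk, true_control_risk = 0.10, 0.12
n_per_arm = 200  # a smallish trial

# Run 5000 simulated trials, recording each one's estimated relative risk
estimates = []
for _ in range(5000):
    treated_events = rng.binomial(n_per_arm, true_treatment_risk)
    control_events = rng.binomial(n_per_arm, true_control_risk)
    if control_events > 0:  # skip the rare simulated trial with no control events
        # equal arm sizes, so RR is just the ratio of event counts
        estimates.append(treated_events / control_events)

print(f"true RR = {true_treatment_risk / true_control_risk:.2f}")
print(f"middle 95% of sample RRs: {np.percentile(estimates, 2.5):.2f} "
      f"to {np.percentile(estimates, 97.5):.2f}")
```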

Confidence intervals

  • for our example with 83% relative risk, our confidence interval might be something like 75% to 91%
    • informally, “the chances are that the ‘real’ population risk lies between 75% and 91%”
  • we conventionally use 95% confidence for these kinds of EBM estimates
    • strictly, this means that if we repeated the study many times, 95% of the intervals built this way would contain the ‘true’ population value
    • you’ll see these expressed as 95% CIs in print
  • Sedgwick (2014b) is a good next-step reading on CIs; a sketch of how such an interval is calculated follows
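One standard way to build such an interval - sketched here with hypothetical counts, so the numbers won’t exactly match those above - is to work on the log scale, where the standard error of log(RR) has a simple large-sample formula, and then exponentiate back.

```python
import math

# Hypothetical trial counts (not from any real study)
treated_events, n_treated = 100, 1000   # 10% event rate
control_events, n_control = 120, 1000   # 12% event rate

rr = (treated_events / n_treated) / (control_events / n_control)

# Large-sample standard error of log(RR)
se_log_rr = math.sqrt(1/treated_events - 1/n_treated
                      + 1/control_events - 1/n_control)

log_rr = math.log(rr)
lower = math.exp(log_rr - 1.96 * se_log_rr)
upper = math.exp(log_rr + 1.96 * se_log_rr)
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
# -> RR = 0.83, 95% CI 0.65 to 1.07 (this interval crosses 1, i.e. 100%)
```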

E2: ways to cheat

Task
  1. Take Greenhalgh’s list of ways to cheat (Greenhalgh 2019)
  2. Are there any areas where your paper from the previous session might be faulty in this way?

p-values

Introduction

  • imagine that you’re looking at a 95% CI that is reported as follows:
    • “the relative risk of death in the intervention group compared to the control group was 92% (95% CI 72-112%)”
      • what does this result mean to you?
      • is there a real reduction of death in the intervention group?
  • that wide confidence interval means the true population relative risk might well be 100% - the intervention might not reduce the risk of death at all
  • what does this mean for us?

Two kinds of statistical work

  1. descriptive statistics = what do these numbers look like?
  2. inferential statistics = what do these numbers tell us about the wider population? (illustrated below)
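A toy illustration of the distinction, using made-up blood pressure readings: the mean and standard deviation merely describe the sample, while the confidence interval makes an inference about the wider population the sample came from.

```python
import numpy as np

# Made-up systolic blood pressure readings from a small sample
bp = np.array([128, 135, 121, 140, 132, 126, 138, 130])

# Descriptive: what do these numbers look like?
mean, sd = bp.mean(), bp.std(ddof=1)

# Inferential: what do they suggest about the population mean?
se = sd / np.sqrt(len(bp))  # standard error of the mean
lower, upper = mean - 1.96 * se, mean + 1.96 * se  # crude large-sample 95% CI
# (a t-based interval would be more appropriate for a sample this small)

print(f"mean {mean:.1f}, SD {sd:.1f}, 95% CI {lower:.1f} to {upper:.1f}")
```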

Hypothesis testing

  • technically, RCT reports are reports of hypothesis tests
  • a hypothesis is an educated guess about outcomes
  • hypothesis testing is firmly part of inferential statistics
  • hypothesis vs null hypothesis
    • the hypothesis: what we expect to see if the treatment works
    • the null hypothesis: what we expect to see if the treatment does nothing

Does the data support the hypothesis?

  • does the evidence (our sample data) support the hypothesis, or the null hypothesis?
  • we can use a statistical test (conventionally two-tailed, so that differences in either direction count) to find out
  • these tests give us a P-value, which is:

…the probability of obtaining the observed difference between treatment groups…if there was no difference between treatment groups (Sedgwick 2014a)

P-values tell us about support

  • assume the null hypothesis is true
    • i.e. that our treatment does nothing
  • the P-value then tells us how often we’d expect to see a difference at least as large as the one we actually observed, just by chance
  • P-values are probabilities, so they’re reported as numbers between 0 and 1
  • the critical level of significance is the arbitrary threshold below which P-values are taken to be convincing
    • usually P = 0.05 - below that, results are said to be statistically significant
  • a small P-value (P < 0.05) gives us grounds to reject the null hypothesis; a larger P-value means the data are consistent with the null, not that the null is proven true
  • the simulation below shows where a P-value comes from
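One way to see where a P-value comes from is a permutation test, sketched here with hypothetical counts: assume the null hypothesis, shuffle the treatment/control labels many times, and count how often the shuffled difference is at least as large as the one actually observed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial: 20/100 events on treatment, 32/100 on control
treatment = np.array([1] * 20 + [0] * 80)
control = np.array([1] * 32 + [0] * 68)
observed_diff = abs(treatment.mean() - control.mean())  # 0.12

# Under the null hypothesis the group labels are arbitrary: pool and reshuffle
pooled = np.concatenate([treatment, control])
count = 0
n_shuffles = 10_000
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = abs(pooled[:100].mean() - pooled[100:].mean())
    if diff >= observed_diff:  # at least as extreme as the real result
        count += 1

p_value = count / n_shuffles  # two-tailed, because we used absolute differences
print(f"P = {p_value:.3f}")   # small P -> data are surprising under the null
```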

Effect sizes

Introduction

  • an important third statistical tool deals with effect sizes
  • so far, we’ve just looked at individual trial results
  • from now on, though, we’re going to look at pooled results
  • we’ll need a way of comparing outcomes between trials

Cohen’s d

  • informally: take the difference between the treatment and control group means, and divide by the pooled standard deviation (see the sketch below)
    • because we’re being non-technical in this session, we can think of the SD as a measure of variation or spread
    • d is usually reported as an absolute value, so it’s positive
  • that means d expresses the effect size in terms of the natural variation within the study population
  • rough benchmarks: 0.2 is small, 0.5 is medium, 0.8 is large; anything over 1 (a whole standard deviation’s difference) is massive, anything under 0.2 is tiny
  • Sedgwick (2015) is the third in that sequence of articles - really recommended for the technical questions about EBM
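A minimal sketch of the calculation, using made-up outcome scores (the pooled standard deviation used here is one common choice for the denominator):

```python
import numpy as np

def cohens_d(treatment: np.ndarray, control: np.ndarray) -> float:
    """Difference in group means divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    # Pooled SD: weighted average of the two groups' variances
    pooled_var = ((n1 - 1) * treatment.var(ddof=1)
                  + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
    return abs(treatment.mean() - control.mean()) / np.sqrt(pooled_var)

# Made-up scores on some outcome measure
treatment = np.array([14.0, 15.5, 13.2, 16.1, 14.8, 15.0])
control = np.array([12.1, 13.0, 11.5, 12.8, 13.4, 12.2])

print(f"d = {cohens_d(treatment, control):.2f}")
```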

E3: putting it all together

Task
  • here’s a timely reference: The RECOVERY Collaborative Group (2020)
  • what do you make of:
    • the sample size
    • the duration and completeness of follow-up
    • the use of confidence intervals
    • the use of p-values
    • any discussion of effect sizes

References

Greenhalgh, Trisha. 2019. How to Read a Paper. Hoboken, NJ: Wiley-Blackwell. https://nhs-scot-primo.hosted.exlibrisgroup.com/permalink/f/3dhhck/44NHSS_ALMA5146749970003491.
Heneghan, Carl, Ben Goldacre, and Kamal R. Mahtani. 2017. “Why Clinical Trial Outcomes Fail to Translate into Benefits for Patients.” Trials 18 (1). https://doi.org/10.1186/s13063-017-1870-2.
Lortie-Forgues, Hugues, and Matthew Inglis. 2019. “Rigorous Large-Scale Educational RCTs Are Often Uninformative: Should We Be Concerned?” Educational Researcher 48 (3): 158–66. https://doi.org/10.3102/0013189x19832850.
Montalescot, Gilles, Stephen D Wiviott, Eugene Braunwald, Sabina A Murphy, C Michael Gibson, Carolyn H McCabe, and Elliott M Antman. 2009. “Prasugrel Compared with Clopidogrel in Patients Undergoing Percutaneous Coronary Intervention for ST-Elevation Myocardial Infarction (TRITON-TIMI 38): Double-Blind, Randomised Controlled Trial.” The Lancet 373 (9665): 723–31. https://doi.org/10.1016/s0140-6736(09)60441-4.
Penston, James. 2005. “Large-Scale Randomised Trials a Misguided Approach to Clinical Research.” Medical Hypotheses 64 (3): 651–57. https://doi.org/10.1016/j.mehy.2004.09.006.
Sedgwick, P. 2014a. “Understanding P Values.” BMJ 349 (jul11 3): g4550–50. https://doi.org/10.1136/bmj.g4550.
———. 2014b. “Understanding Confidence Intervals.” BMJ 349 (oct06 12): g6051–51. https://doi.org/10.1136/bmj.g6051.
———. 2015. “Randomised Controlled Trials: Understanding Effect Sizes.” BMJ 350 (mar27 2): h1690–90. https://doi.org/10.1136/bmj.h1690.
Shepherd, James. 1995. “The West of Scotland Coronary Prevention Study: A Trial of Cholesterol Reduction in Scottish Men.” The American Journal of Cardiology 76 (9): 113C–117C. https://doi.org/10.1016/s0002-9149(99)80480-9.
Smith, G. C S. 2003. “Parachute Use to Prevent Death and Major Trauma Related to Gravitational Challenge: Systematic Review of Randomised Controlled Trials.” BMJ 327 (7429): 1459–61. https://doi.org/10.1136/bmj.327.7429.1459.
“Streptomycin Treatment of Pulmonary Tuberculosis: A Medical Research Council Investigation.” 1948. BMJ 2 (4582): 769–82. https://doi.org/10.1136/bmj.2.4582.769.
The RECOVERY Collaborative Group. 2020. “Effect of Hydroxychloroquine in Hospitalized Patients with Covid-19.” New England Journal of Medicine 383 (21): 2030–40. https://doi.org/10.1056/nejmoa2022926.
Wallentin, Lars, Salim Yusuf, Michael D Ezekowitz, Marco Alings, Marcus Flather, Maria Grazia Franzosi, Prem Pais, et al. 2010. “Efficacy and Safety of Dabigatran Compared with Warfarin at Different Levels of International Normalised Ratio Control for Stroke Prevention in Atrial Fibrillation: An Analysis of the RE-LY Trial.” The Lancet 376 (9745): 975–83. https://doi.org/10.1016/s0140-6736(10)61194-4.
Xu, Alexander, and Vinay Prasad. 2022. “The Use and Meaning of the Parachute Metaphor in Biomedicine: A Citation Analysis of a Systematic Review and a Randomized Trial of the Parachute for Freefall.” Journal of Comparative Effectiveness Research 11 (6): 383–90. https://doi.org/10.2217/cer-2021-0171.
Yeh, Robert W, Linda R Valsdottir, Michael W Yeh, Changyu Shen, Daniel B Kramer, Jordan B Strom, Eric A Secemsky, et al. 2018. “Parachute Use to Prevent Death and Major Trauma When Jumping from Aircraft: Randomized Controlled Trial.” BMJ, December, k5094. https://doi.org/10.1136/bmj.k5094.