I am not a statistician, but a frequent consumer and purveyor of statistics. My interest in statistics concerns their use in surgical education and research: how can we best apply statistics in a clear-headed versus rote manner? I am drawn to the notion that the purpose of statistics is “to organize a useful argument from quantitative evidence, using a form of principled rhetoric.”1 Statistical analysis has a narrative role to play in our work. But to tell a good story, it has to make sense.
This article is for the non-lover of statistics who wants to learn how statistical analysis can help to tell a good story, and wants to be able to tell that story if called upon to do so. It will focus on one of the core tenets in our belief system, which also turns out to represent a long-standing controversy in the statistical community. That tenet involves the practice of (some would say single-minded, blind-sided, slave-like devotion to) null hypothesis significance testing (NHST) and the use of p<.05 as the break-point for determining “significant” findings. The article will also discuss and advocate for using measures of effect size to examine the strength of our alternative hypotheses and to judge the practical significance of our findings. Throughout, I rely on three excellent articles, one by Roger Kirk2 and two by Jacob Cohen,3,4 and a number of accessible texts that are listed among the references. Let’s start by reviewing some basics.
A “statistically significant” result means that a difference or a relationship between variables found in a random sample is larger (in either direction) than we would expect by chance alone. What do we mean by chance? We mean that there is a low probability of obtaining our result if the null hypothesis of no difference/no relationship is true. Because we use “low probability” as a reason to reject the null hypothesis, you might say we also mean the chance of being wrong, that is, of rejecting a true null hypothesis when we should have retained it. Such a finding is otherwise known as a false positive, or Type I error. We call our accepted chance of being wrong “alpha,” and generally set levels of “less than five percent of the time” (p<.05) or below (<.01, <.001) as our cut-point. If the probability of obtaining a particular result falls below this cut-point, we reject the null hypothesis on the grounds that our finding was so unlikely (occurring only five percent of the time or less under repeated and infinite sampling) that the null can’t be supported.
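To make alpha and the Type I error concrete, here is a minimal simulation sketch in Python. It is my own illustration, not data from any study, and the group sizes, means, and standard deviations are made up. Both groups are drawn from the same population, so the null hypothesis is true by construction, yet roughly five percent of the tests still come out “significant”:

```python
# Illustration only: simulate many two-group studies in which the null
# hypothesis is true, and count how often p < .05 anyway (Type I errors).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_studies = 10_000
false_positives = 0

for _ in range(n_studies):
    # Both groups come from the same population (made-up mean and SD),
    # so any "significant" difference is a false positive.
    treatment = rng.normal(loc=70, scale=10, size=30)
    control = rng.normal(loc=70, scale=10, size=30)
    p_value = stats.ttest_ind(treatment, control).pvalue
    if p_value < alpha:
        false_positives += 1

print(f"False-positive rate: {false_positives / n_studies:.3f}")  # close to alpha
```

The simulated false-positive rate lands near the alpha we chose, which is exactly what “accepted chance of being wrong” means.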
The term “significant” does not mean “a really important finding,” or that a particularly large difference or relationship was found. It means that the likelihood of these data occurring by chance is low enough to make us doubt the null hypothesis. A finding that falls below .01 (“highly significant”) is not necessarily larger, smaller, or more important than one that falls just below .05 (“significant”). Yet this is a common and surprisingly easy inference to make. I actually think it was brilliant marketing on the part of Sir Ronald Fisher (to whom null hypothesis significance testing is credited) to describe findings that are ever-less-likely-due-to-chance as significant, very significant, and highly significant. The temptation to believe that “significance” means the experiment (or other hypothesis test) was hugely successful requires active resistance.
Investigators sometimes calculate p-Values on data that have been obtained from convenience samples (i.e., where neither random selection nor random assignment took place). This is generally frowned on, although I myself do it out of habit, curiosity, and the lazy knowledge that it may highlight patterns in the data. Frowning is warranted because the underlying logic and actual computation of NHST rely on random sampling, under which conditions the sampling distribution of the statistic being tested (e.g., a mean or correlation obtained from repeated and infinite sampling) assumes a predictable shape. The p-Value itself (not to be confused with alpha, the cut-off level) is the probability of obtaining a result at least as extreme as your particular finding, based on this mathematically derived sampling distribution. (Remember the appendices in your college statistics book?)
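For the curious, here is a small sketch of what “a predictable shape” means; the exponential “population” and the sample size are hypothetical numbers of my own choosing. Even though the population is skewed, the means of repeated random samples pile up symmetrically around the population mean, with a spread close to the theoretical standard error:

```python
# Illustration only: under repeated random sampling, the sampling
# distribution of a mean takes on a predictable shape, even when the
# population itself is skewed (made-up exponential "scores").
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=10.0, size=100_000)

# Draw 5,000 random samples of n = 50 and record each sample mean.
sample_means = np.array([rng.choice(population, size=50).mean()
                         for _ in range(5_000)])

print(f"Population mean:            {population.mean():.2f}")
print(f"Mean of the sample means:   {sample_means.mean():.2f}")
print(f"SD of the sample means:     {sample_means.std():.2f}")
print(f"Theoretical standard error: {population.std() / np.sqrt(50):.2f}")
```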
The p-Value does not mean the probability that the null hypothesis is correct. It means the probability of obtaining data like ours, assuming the null is correct. (Even when there are no differences or relationships in the population, roughly five out of every 100 random samples will still produce a result extreme enough to fall below the .05 level.) We may reject the null because our data fell below the .05 level and because the low probability of getting such a result casts suspicion on the null, but that does not mean the null is correct five percent of the time.
The complement of the p-Value (e.g., .95) does not represent the probability that a significant result will be found in replication. When we reject a null hypothesis, all we really know is that our results probably did not occur by chance. We do not know, from this one sample, the probability of finding such a result again. Bummer!
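A quick simulation makes the point; the true effect size, sample sizes, and score scale below are all assumptions of mine, not anything from the article. Even after a first study comes out significant, an identical replication is significant far less than 95 percent of the time:

```python
# Illustration only: with a modest true effect (an assumption), a significant
# first study is followed by a significant replication far less than 95% of
# the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05

def run_study() -> float:
    """One simulated two-group study; returns its p-Value."""
    treatment = rng.normal(loc=72, scale=10, size=30)  # true difference = 4 points
    control = rng.normal(loc=68, scale=10, size=30)
    return stats.ttest_ind(treatment, control).pvalue

first_significant = 0
replications_significant = 0
for _ in range(5_000):
    if run_study() < alpha:
        first_significant += 1
        if run_study() < alpha:  # identical, independent replication
            replications_significant += 1

print(f"Replication rate after a significant result: "
      f"{replications_significant / first_significant:.2f}")  # well below .95
```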
Three main criticisms of NHST as a dominant method of scientific “proof” have been voiced over the past 80 years, ever since Sir Ronald Fisher introduced the concept and laid the basis for its structural models and computations in 1925.
The use of this decision strategy can lead to the anomalous situation in which two researchers obtain identical treatment effects but draw different conclusions from their research. One researcher, for example, might obtain a p-Value of .06 and decide not to reject the null hypothesis. The other researcher uses slightly larger samples and obtains a p-Value of .05, which leads to a rejection. What is troubling here is that identical treatment effects can lead to different decisions.11
In sum, the critics argue that rejecting the null hypothesis tells us neither the probability that the null is true nor the probability that an alternative research hypothesis is true (although we act as though it does). Nor does it tell us anything about the magnitude of the differences or relationships found in our data, only about their likelihood of occurring by chance in a population where the null hypothesis is true. Sounds like a lot of work for little gain, no?
The philosophical argument advanced by Fisher was that ultimately hypotheses cannot be proved, but only disproved. Observing that 3000 people all have two legs does not prove the hypothesis, “Every person has two legs.” Observing even one person without a leg disproves that statement. Finding a null hypothesis to be unsupported by the data, Fisher argued, was therefore the best we could do to advance our own, alternative explanation. By declaring the null false based on probability, he provided a paradigm for testing the viability of a broad range of alternative hypotheses. Because the alpha level is set a priori, this process became (and remains) attractive precisely for the reasons enumerated above by the critics. It reduces the messiness of our data and simplifies the task of interpretation. It appears objective (“based on statistical probability”), and is independent of content or context. It travels well; regardless of the research question at hand, we can generally find a way to test the probability of our data. And yes, it can be used to support an argument that something other than chance has influenced the data.
Aside from the fact that widespread use and endorsement by the scientific community make NHST a fact of life, it seems to me important to report p-Values at appropriate times and in appropriate ways. The “rules” governing “appropriate use” lie beyond the scope of this article, but briefly, NHST works best when:
Knowing that one’s results are not likely due to chance is worth something under these conditions. As stated by Muijs,12 “We still must somehow decide whether our sample parameters, which will contain measurement error, are unusual enough for us to say that they are likely to result from actual population differences.” But I agree with Cohen (and others) that the primary product of research inquiry is one or more measures of effect size, not p-Values.
If the real show is the credibility of our alternative hypothesis, how can we describe the size of observed differences or relationships, or quantify the accuracy of our predictions? This section of the article will touch on some of the more common methods that help us make sense of the data and determine “practical” significance: looking at raw and standardized effects, measures of association, and confidence intervals. The suitability of each method depends on the type of statistical test being employed and the purpose of the research or evaluation. Think of what follows as a refresher on things you might once have learned, or as an introduction to a larger discussion to be held with your friendly statistician.
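Before turning to the odds and risk ratios in Table 1, here is a minimal sketch, using invented scores of my own, of a raw effect in the instrument’s original units, a standardized effect (Cohen’s d), and a 95 percent confidence interval around the raw difference:

```python
# Illustration only: raw effect, standardized effect (Cohen's d), and a 95%
# confidence interval, using invented scores (not data from any study).
import numpy as np
from scipy import stats

treatment = np.array([72, 68, 75, 80, 71, 77, 69, 74, 78, 73], dtype=float)
control = np.array([66, 70, 64, 69, 72, 65, 68, 63, 71, 67], dtype=float)

raw_effect = treatment.mean() - control.mean()   # in the test's original units

# Cohen's d: the raw difference scaled by the pooled standard deviation.
n1, n2 = len(treatment), len(control)
pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                     (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = raw_effect / pooled_sd

# 95% confidence interval for the difference in means (equal-variance t).
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
lower, upper = raw_effect - t_crit * se, raw_effect + t_crit * se

print(f"Raw effect: {raw_effect:.1f} points, d = {cohens_d:.2f}, "
      f"95% CI [{lower:.1f}, {upper:.1f}]")
```

Reported this way, a reader can judge both the practical size of the difference (in test points) and the precision of the estimate, not just whether it cleared p<.05.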
Table 1: Sample Odds Ratios and Relative Risk Ratios for ABSITE Performance

ABSITE Criterion     Treatment Group   Control Group   Risk Difference   Ratio (larger to smaller)
> 35th percentile    .90               .75             .15               1.20 (risk ratio)
< 35th percentile    .10               .25             .15               2.50 (risk ratio)
Odds                 9.0               3.0                               3.00 (odds ratio)
Note: These are not real data
An odds ratio of 1.0 indicates that the odds of meeting the ABSITE criterion are equal for both groups (the odds are 1 to 1); it means there is no relationship between treatment group and outcome. The magnitude of a treatment effect is therefore expressed by its distance from 1.0; numbers below 1.0 indicate a negative relationship (effect) and numbers above 1.0 indicate a positive relationship (effect). If the odds ratio resulting from the above experiment is 3.00, that means the odds of residents in the treatment group reaching the 35th percentile at the next ABSITE administration are three times as great as the odds for those in the control group. As with raw effect sizes, confidence intervals can also be placed around the odds ratio to indicate the upper and lower bounds of this prediction. Like standardized effect sizes on treatment means, odds ratios can be used in meta-analysis to compare the relative efficacy of interventions across multiple studies.

Relative risk ratios focus on the comparative probabilities, across groups, of achieving or failing a criterion. (Think rows, not columns.) As shown in Table 1, the relative risk of falling below the 35th percentile on the ABSITE is two-and-a-half times as great for the control group as for the treatment group (.25 / .10 = 2.50). The probability of reaching the 35th percentile or higher is 1.20 times as great for the treatment group (.90 / .75 = 1.20). You can also say the risk difference is 15 percentage points.
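For readers who like to see the arithmetic spelled out, this short sketch simply reproduces Table 1’s fictional numbers: the risk difference, both relative risk ratios, and the odds ratio.

```python
# Working through Table 1's fictional proportions: risk difference,
# relative risks, and odds ratio for reaching the 35th percentile.
p_treatment = 0.90   # proportion of treatment group above the 35th percentile
p_control = 0.75     # proportion of control group above the 35th percentile

risk_difference = p_treatment - p_control               # .15
risk_ratio_pass = p_treatment / p_control               # .90 / .75 = 1.20
risk_ratio_fail = (1 - p_control) / (1 - p_treatment)   # .25 / .10 = 2.50

odds_treatment = p_treatment / (1 - p_treatment)        # 9.0
odds_control = p_control / (1 - p_control)              # 3.0
odds_ratio = odds_treatment / odds_control              # 9.0 / 3.0 = 3.00

print(f"Risk difference: {risk_difference:.2f}")
print(f"Relative risk of reaching the criterion (treatment vs. control): "
      f"{risk_ratio_pass:.2f}")
print(f"Relative risk of failing the criterion (control vs. treatment): "
      f"{risk_ratio_fail:.2f}")
print(f"Odds ratio: {odds_ratio:.2f}")
```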
In sum, interpretation of surgical and educational research findings should be based on a sensible reading of the size of the effect and estimates of error (accuracy), given the sample size and the probability of the result occurring if the null were true. Deciding whether findings are important involves questions of judgment that go beyond a mechanistic “above/below p=.05” criterion. It is particularly difficult to determine the importance of findings if one cannot translate statistical results back into the instrument’s original units of measure, into English, and then into practical terms, as in, “for every additional week devoted to studying, ABSITE scores increased by two points.” Often we fail to get to this practical interpretation of results, either by design (they reveal a statistically significant but unimportant result) or by default (we don’t know how to get from our data to the statistics and back again). A quip I recently read and enjoyed goes like this:
Always keep in mind the advice of Winifred Castle, a British statistician, who wrote that “We researchers use statistics the way a drunkard uses a lamp post, more for support than illumination.”22
May you reach for illumination by thinking through your numbers and your reasoning behind them. Otherwise, I’ll be seeing you down by the lamp post.
Acknowledgements: thanks to Michael G. Luxenberg, PhD, and Matt Christenson, of Professional Data Analysts, Inc., for reading a draft of this paper.