I am not a statistician, but a frequent consumer and purveyor of statistics. My interest in statistics concerns their use in surgical education and research: how can we best apply statistics in a clear-headed versus rote manner? I am drawn to the notion that the purpose of statistics is “to organize a useful argument from quantitative evidence, using a form of principled rhetoric.”1 Statistical analysis has a narrative role to play in our work. But to tell a good story, it has to make sense.
This article is for the non-lover of statistics who wants to learn how statistical analysis can help to tell a good story, and wants to be able to tell that story if called upon to do so. It will focus on one of the core tenets in our belief system, which also turns out to represent a long-standing controversy in the statistical community. That tenet involves the practice of (some would say single-minded, blind-sided, slave-like devotion to) null hypothesis significance testing (NHST) and the use of p<.05 as the break-point for determining “significant” findings. The article will also discuss and advocate for using measures of effect size to examine the strength of our alternative hypotheses and to judge the practical significance of our findings. Throughout, I rely on three excellent articles, one by Roger Kirk2 and two by Jacob Cohen,3,4 and a number of accessible texts that are listed among the references. Let’s start by reviewing some basics.
What Does “Statistically Significant” Mean?
It means that a difference or a relationship between variables found in a random sample is smaller or larger than we would expect by chance alone. What do we mean by chance? We mean that there is a low probability of obtaining our result if the null hypothesis of no difference/no relationship is true. Because we use “low probability” as a reason to reject the null hypothesis, you might say we also mean the chance of being wrong–that is, of rejecting a true null hypothesis when we should have retained it. Such a finding is otherwise known as a false positive, or Type I error. We call our accepted chance of being wrong “alpha,” and generally set levels of “less than five percent of the time” (p<.05) or below (<.01, <.001) as our cut-point. If the probability of obtaining a particular result falls below this cut-point, we reject the null hypothesis on the grounds that our finding was so unlikely (only five percent of the time or less, through repeated and infinite sampling) the null can’t be supported.
The term “significant” does not mean “a really important finding,” or that a particularly large difference or relationship was found. It means that the likelihood of these data occurring by chance is significantly low enough to make us doubt the null hypothesis. A finding that falls below .01 (“highly significant”) is not necessarily larger, smaller, or more important than one that falls just below .05 (“significant”). This is a common and a surprisingly easy inference to make. I actually think it was brilliant marketing on Sir Ronald Fisher’s part (to whom null hypothesis significance testing is credited) to describe findings that are ever-less-likely-due-to-chance as significant, very significant, and highly significant. The temptation to believe that “significance” means the experiment (or other hypothesis) was hugely successful requires active resistance.
Investigators sometimes calculate p-Values on data that have been obtained from convenience samples (i.e., where neither random selection nor random assignment took place). This is generally frowned on, although I myself do it out of habit, curiosity, and the lazy knowledge that it may highlight patterns in the data. Frowning is warranted because the underlying logic and actual computation of NHST rely on random sampling, under which conditions the sampling distribution of the statistic being tested (e.g., a mean or correlation obtained from repeated and infinite sampling) assumes a predictable shape. The p-Value itself (not to be confused with the alpha, or cut-off level) means the probability of obtaining your particular finding, or one more extreme, based on this mathematically derived sampling distribution. (Remember the appendices in your college statistics book?)
The p-Value does not mean the probability that the null hypothesis is correct. It means the probability of obtaining our data, assuming the null is correct. (Even when there are no differences/relationships in the population, in five out of every 100 random samples we will get a result as large or as small as ours.) We may reject the null because our data fell below the .05 level and because the low probability of getting such a result casts suspicion on it, but that doesn’t mean the null is correct five percent of the time.
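For readers who like to see such claims in action, here is a minimal simulation sketch of the parenthetical point above. It assumes two groups drawn from the very same population (so the null is true by construction), a known standard deviation of 1, and a simple z-test; the function names, sample sizes, and seed are my own illustrative choices, not part of any standard procedure.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value from a z-test of the difference in means.

    Assumes both groups share a known population SD (sigma), which is an
    illustrative simplification rather than a recommended analysis.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = sigma * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(trials=2000, n=30, alpha=0.05, seed=1):
    """Share of simulated experiments with p < alpha when the null is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Both groups come from the SAME population: no true difference.
        group_a = [rng.gauss(0, 1) for _ in range(n)]
        group_b = [rng.gauss(0, 1) for _ in range(n)]
        if two_sided_p(group_a, group_b) < alpha:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"Rejected a true null in {rate:.1%} of simulated samples")
```

The share of rejections should hover near five percent, which is the Type I error rate we agreed to tolerate. It says nothing about how often the null itself is true.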
The complement of the p-Value (e.g., .95) does not represent the probability that a significant result will be found in replication. Having rejected a null hypothesis, all we really know is that our results probably did not occur by chance. We do not know from this one sample the probability of finding such a result again. Bummer!
What’s “Wrong” with Null Hypothesis Significance Testing?
Three main criticisms against NHST as a dominant method of scientific “proof” have been voiced for the past 80 years, ever since Sir Ronald Fisher introduced the concept and laid the basis for structural models and computations in 1925.
- The first criticism is that by focusing on “proof by contradiction,” it doesn’t really tell us what we want to know, yet “we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does.”5 As Cohen and others have written, “What we want to know is, ‘Given these data, what is the probability that Ho [the null hypothesis] is true?’ But as most of us know, what [NHST] tells us is, ‘Given that Ho is true, what is the probability of these (or more extreme) data?’ (ibid).” These two statements are not the same—although they sound very similar. The scientific inference we wish to make is inductive: it goes from the sample to the population and concerns the viability of our alternative or research hypothesis. Null hypothesis testing is deductive: it starts with assumptions regarding the population (i.e., there is no difference/relationship), and calculates the probability of our sample data occurring by chance. In brief, rejecting the null does not really provide proof of the alternative hypothesis, nor does the exercise tell us much about the observed differences/relationships at hand, unless we look beyond the p-Values to the raw data and consider the magnitude of the observed effect.
- The second criticism is that NHST is actually a “trivial exercise,” because the null hypothesis can always be rejected, given a large enough sample size. This is because no two things in this world are ever exactly alike, and if they are unlike to any tiny fraction of a degree then the null hypothesis can be proved wrong if it can be tested on enough people or enough times. As John Tukey wrote,6 “The effects of A and B are always different—in some decimal place—for any A and B. Thus asking ‘Are the effects different?’ is foolish.” Even Fisher was said to treat non-significant findings as “inconclusive results,” or “findings for which we lack sufficient data.”7 Although statistical significance is also influenced by the amount of variability in the sample and the actual difference between groups (between the null and alternative hypotheses), the importance of sample size to NHST cannot be overstated. This has two important consequences:
- Cohen8 demonstrated in the 1960s that most research is underpowered, which means the statistical tests we run have a high probability of failing to detect a likely true difference. This is known as a Type II error: failing to reject a false null hypothesis. Its probability is called “beta.” Increasing the sample size is the primary means under the researcher’s control to reduce the risk of missing true differences. This is important when studying differences or relationships that may be small in magnitude, but have important implications for practice.
- Huge sample sizes can lead to every comparison being labeled “significant.” “Significant” findings can be found between marginally related or even meaningless variables simply because the sample is so large any slight association or variation can be detected. (As Lord pointed out in a delightful parable based on football jerseys, “The numbers don’t know where they came from.”9)
- The third argument against NHST concerns the “ritualistic” use of a fixed alpha level (<.05 or other) and the dichotomous interpretation of data as either supporting or rejecting the null hypothesis. Especially when conclusions are drawn from a single study, the practice invites a mechanistic, reductive, simplistic view of research. To quote Cohen, “The prevailing yes-no decision at the magic .05 level from a single research study is a far cry from the use of informed judgment. Science simply doesn’t work that way.”10 Kirk illustrates further the irony of mechanistic decision making:
The use of this decision strategy can lead to the anomalous situation in which two researchers obtain identical treatment effects but draw different conclusions from their research. One researcher, for example, might obtain a p-Value of .06 and decide not to reject the null hypothesis. The other researcher uses slightly larger samples and obtains a p-Value of .05, which leads to a rejection. What is troubling here is that identical treatment effects can lead to different decisions.11
In sum, the critics argue that rejecting the null hypothesis neither tells us the probability that the null is true, nor the probability that an alternative research hypothesis is true (although we act as though it does). It also does not tell us anything about the magnitude of the differences or relationships found in our data—only about their likelihood of occurring by chance in a population where the null hypothesis is true. Sounds like a lot of work for little gain, no?
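The second criticism, that sample size drives “significance,” is easy to see with a little arithmetic. The sketch below holds a standardized difference fixed and varies only the number of people per group. It leans on a normal approximation with equal group sizes, and the function names and the example effect of .1 are my own assumptions for illustration.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def approx_p_value(d, n_per_group):
    """Two-sided p for an observed standardized difference d (normal approx.)."""
    z = abs(d) * (n_per_group / 2) ** 0.5
    return 2 * (1 - STANDARD_NORMAL.cdf(z))


def approx_power(d, n_per_group, alpha=0.05):
    """Chance of rejecting the null when the true standardized effect is d."""
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(d) * (n_per_group / 2) ** 0.5
    upper_tail = 1 - STANDARD_NORMAL.cdf(z_crit - shift)
    lower_tail = STANDARD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Same tiny effect every time; only the sample size changes.
for n in (10, 50, 200, 1000, 5000):
    print(f"n per group = {n:>4}: p = {approx_p_value(0.1, n):.4f}, "
          f"power = {approx_power(0.1, n):.2f}")
```

With a handful of people the tiny effect is nowhere near “significant” and the test has little power (a likely Type II error); with thousands it sails under any conventional alpha. The effect itself never changed.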
If It’s So “Wrong,” Why Do We All Do It?
The philosophical argument advanced by Fisher was that ultimately hypotheses cannot be proved, but only disproved. Observing that 3000 people all have two legs does not prove the hypothesis, “Every person has two legs.” Observing even one person without a leg disproves that statement. Finding a null hypothesis to be unsupported by the data, Fisher argued, was therefore the best we could do to advance our own, alternative explanation. By declaring the null false based on probability, he provided a paradigm for testing the viability of a broad range of alternative hypotheses. By setting an a priori alpha level, this process became (and remains) attractive precisely for the reasons enumerated above by the critics. It reduces the messiness of our data and simplifies the task of interpretation. It appears objective (“based on statistical probability”), and is independent of content or context. It travels well; regardless of the research question at hand, we can generally find a way to test the probability of our data. And yes, it can be used to support an argument that something other than chance has influenced the data.
Aside from the fact that widespread use and endorsement by the scientific community makes NHST a fact of life, it seems to me important to report p-Values at appropriate times and in appropriate ways. The “rules” governing “appropriate use” lie beyond the scope of this article, but briefly, NHST works best when:
- Randomization has occurred.
- You wish to make inferences about a larger population based on a sample (there is no real need to calculate p-Values if you have access to an entire population, or don’t expect your results to generalize).
- Samples are of a reasonable size (aka, the Goldilocks principle: neither so small as to invite Type II errors, nor so large as to generate meaningless results).
- You are testing a limited number of variables (which thereby limits the number of findings that might represent false positives—i.e., Type I errors).
- The selected alpha level takes into consideration the type of research and hypotheses being tested.
- P-Values are accompanied by measures of effect size and/or confidence intervals.
Knowing that one’s results are not likely due to chance is worth something under these conditions. As stated by Muijs,12 “We still must somehow decide whether our sample parameters, which will contain measurement error, are unusual enough for us to say that they are likely to result from actual population differences.” But I agree with Cohen (and others) that the primary product of research inquiry is one or more measures of effect size, not p-Values.
How Can We Judge the Size of Our Effects?
If the real show is the credibility of our alternative hypothesis, how can we describe the size of observed differences or relationships or qualify the accuracy of our predictions? This section of the article will touch on some of the more common methods that help us make sense of the data and determine “practical” significance: looking at raw and standardized effects, measures of association, and confidence intervals. The suitability of the method depends on the type of statistical test being employed and the purpose of the research or evaluation. Think of this next section as a refresher of things you might have once learned or an introduction to a larger discussion to be held with your friendly statistician.
- Raw effect size. “Raw effects” basically means the raw magnitude of an effect (e.g., the size of the difference between the means of two groups). Statements such as “a difference of 20 percentage points” or “an increase of nearly half a point (.40) on a four-point scale of satisfaction” nearly always communicate something of value mostly because they state results in their original units of scale, which generally have meaning to the people involved with conducting or reviewing the research. A second advantage is that raw effect sizes are not influenced by sample size as other measures can be.
If description rather than statistical inference is your main goal and you have a rationale for determining what represents a “small” versus “large difference,”13 raw effect sizes can be quite useful. It remains important to observe and report the raw differences, such as the difference between treatment means, even when statistical inference is the aim and significance tests [e.g., analysis of variance (ANOVA)] are being employed.
- Standardized effect sizes. When looking at the mean difference between two groups, one can also calculate the standardized effect size. While there are various ways to do this, the standardized effect size is most often calculated as the mean of group A minus the mean of group B divided by the pooled standard deviation of scores14 on the response scale. Thus, if on a 10-point scale group A has a mean of 7.0, group B has a mean of 5.0, and the pooled standard deviation is 3.0, the standardized effect size is (7.0 – 5.0) / 3.0 = .67. This is also known as d, Cohen’s d, D, or delta, and it indicates the difference in outcome for the average person in group A from the average person in group B (i.e., the effect size of a particular treatment).15
Guidelines for interpreting the strength of d supplied by Cohen have some empirical basis. Cohen qualified a standardized effect of .5 as a “medium effect,” one that was “visible to the naked eye of a careful observer.” As reported by Kirk, subsequent surveys have found that .5 approximates the average size of observed effects seen across various fields.16 A standardized difference of .2 is considered a small but nontrivial effect; a difference of .8 is considered a large effect. When the effect size is as large as .8, nearly 50 percent of the frequency distribution of group A does not overlap with the frequency distribution of group B. The percent of “non-overlap” for a medium effect is 33 percent; for a small effect it is about 15 percent.
Why go to the trouble of standardizing the effect size? Because doing so makes it independent of the particular response scale, which is helpful if you need to communicate something about the magnitude of your results to audiences who know nothing about the particular instrument or measure employed in your study. It also enables other investigators to pool similar studies and judge the relative efficacy of different interventions. (A variation of d supplied by Glass,17 which divides the raw effect size by the standard deviation of only the control group, is favored in meta-analyses.)
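The arithmetic for d is simple enough to sketch. The example below reuses the numbers from the 10-point scale illustration and pools the standard deviation for two equal-sized groups. The non-overlap figure uses what I understand to be Cohen’s U1 formula, which assumes normal distributions with equal spread; treat that function, and all the names here, as an illustrative assumption rather than a definitive implementation.

```python
from statistics import NormalDist


def pooled_sd(sd_a, sd_b):
    """Pooled SD for two equal-sized groups: square root of the mean variance."""
    return ((sd_a ** 2 + sd_b ** 2) / 2) ** 0.5


def cohens_d(mean_a, mean_b, sd_pooled):
    """Standardized effect size: mean difference in pooled-SD units."""
    return (mean_a - mean_b) / sd_pooled


def percent_nonoverlap(d):
    """Percent non-overlap (Cohen's U1), assuming normality and equal SDs."""
    u2 = NormalDist().cdf(abs(d) / 2)
    return 100 * (2 * u2 - 1) / u2


d = cohens_d(7.0, 5.0, pooled_sd(3.0, 3.0))
print(f"d = {d:.2f}")
for label, size in (("small", 0.2), ("medium", 0.5), ("large", 0.8)):
    print(f"{label:>6} (d = {size}): "
          f"about {percent_nonoverlap(size):.0f}% non-overlap")
```

Because d is expressed in standard deviation units, the same function works whether the underlying instrument is a 4-point satisfaction item or a 500-point exam.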
- Measures of association. As the name implies, measures of association are concerned with the strength of the relationship between two or more variables. Measures of association help us estimate how much of the overall variability in the data can be attributed to the systematic effects of one or more other variables.
- r and r². When two “continuous” variables (i.e., measured on an interval or ratio scale) are correlated, the correlation statistic r can be squared to indicate the amount of “shared variability” or the degree to which variation in one variable can be “explained,” accounted for, or predicted by the other. Thus, if American Board of Surgery In-Training Examination (ABSITE) scores and scores on the Qualifying Exam (QE) are correlated at r = .57, 33 percent of the variability in QE scores [.57 × .57 = .3249] can be related to variation in ABSITE scores and the remaining 67 percent cannot. Is 33 percent a lot or a little? It’s hard to visualize “variability” as a quantity. It may help to think of it as uncertainty. Is reducing one’s uncertainty about potential QE scores by about one-third important? It is to me, and probably to the resident taking the test.
The answer may depend on the consequences of the inferences being made from the results, on the availability of other (possibly better) information, and on what effect sizes are considered common versus rare in your particular field. Readers can take some satisfaction in knowing that raw correlations are themselves direct measures of association. Correlations in the .2 to .3 range are generally considered modest but not trivial; .5 represents a good sized, moderate correlation; correlations of .8 are considered strong; and correlations above .8 are very strong.18
- R². In multiple linear regression analyses, in which the relationship among several continuous variables is explored, the same trick applies. By squaring R (the overall multiple correlation), we get a measure of the amount of shared variance between a dependent (outcome) variable and two or more independent variables. R² is said to reflect how well a general linear model “fits” the data. [A general linear model refers to our supposition that independent variables A, B, and C correspond to dependent variable D in a straight-line fashion, meaning that the more (or less) of the former three variables, the more (or less) of the latter variable.]
Suppose we wish to predict Qualifying Exam (QE) scores based on residents’ study habits (average number of hours per week) along with their post-graduate year of training. If we find that overall R = .63, we would say that nearly 40 percent of the variance in QE scores can be explained, or accounted for by the combination of training year (experience) and study habits. How good is 40 percent? The same comments made above for r² apply to R², but a rough rule of thumb suggests the following: 11 percent to 30 percent of variance accounted for represents a modest “fit” between our proposed linear model and the data, 31 percent to 50 percent represents a moderate fit, and anything over 50 percent represents a strong fit.19
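The squaring trick and the rule of thumb can be put into a few lines. This is only a sketch: the function names are mine, and how to label values that fall between the stated bands (or below 11 percent) is my own assumption, since the rule of thumb does not say.

```python
def variance_explained(correlation):
    """Proportion of shared variance: the squared correlation (r or R)."""
    return correlation ** 2


def describe_fit(r_squared):
    """Label a fit using the rough rule of thumb quoted above.

    Assumption: anything above 30 percent counts as at least moderate, and
    values under 11 percent are left unclassified.
    """
    percent = r_squared * 100
    if percent > 50:
        return "strong"
    if percent > 30:
        return "moderate"
    if percent >= 11:
        return "modest"
    return "below the rule-of-thumb range"


for label, value in (("ABSITE vs. QE, r", 0.57), ("regression, R", 0.63)):
    shared = variance_explained(value)
    print(f"{label} = {value}: {shared:.0%} of variance, "
          f"{describe_fit(shared)} fit")
```

Both examples from the text land in the moderate band, which is a useful reminder that a correlation of .6 leaves well over half of the variability unexplained.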
- η² (eta-squared). Eta is another correlation coefficient, like r, but it does not assume that the relationship between variables is linear. One of the oldest measures of strength of an experimental effect,20 η² is deployed in ANOVA and its variations for studies with one or more categorical independent variables (e.g., groups, treatments, conditions) and a continuous dependent variable (e.g., scores on the ABSITE). η² is like R² in multiple linear regression analyses: it estimates the proportion of variance in the dependent variable associated with all of the independent variables taken together. Partial η² calculations can be made for studies employing multiple factorial ANOVA designs to estimate the relative effect of different treatments, conditions, or subgroups.
If a study compared three different ways to support residents’ self-study for the ABSITE, and the proportion of variance explained (η²) by these treatments was 11 percent, we would conclude that collectively they had a small effect on raising ABSITE scores. To know whether one intervention was more effective than another, we would look at post-hoc comparisons and their partial η² values.
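For a one-way design, eta-squared is usually computed as the between-groups sum of squares divided by the total sum of squares. The sketch below does exactly that for three small groups. The scores are invented purely for illustration (they are not real data), and the function name is my own.

```python
from statistics import mean


def eta_squared(groups):
    """Eta-squared = between-groups sum of squares / total sum of squares."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2 for group in groups
    )
    return ss_between / ss_total


# Hypothetical scores under three self-study supports (not real data).
study_groups = [
    [48, 52, 55, 60, 45],
    [50, 58, 62, 57, 53],
    [55, 61, 59, 66, 54],
]

effect = eta_squared(study_groups)
print(f"eta-squared = {effect:.2f}")
```

The result reads as a proportion: the share of all the variation in scores that lines up with group membership. A partial version for factorial designs needs the error sum of squares as well, which is a conversation for your friendly statistician.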
- Relative risk ratios/odds ratios. Two other measures of association that are equally confused21 are helpful when the outcome you’re focused on is dichotomous (yes/no) or contains multiple categories that can be reduced to a dichotomy. Odds ratios represent the ratio of one set of odds for achieving an outcome (say for a treatment group) compared to the odds for another (e.g., control group). Each set of odds simply refers to the percent of people within each group that achieved that outcome divided by the percent that did not.
Thus, the calculation of an odds ratio for an ABSITE study group intervention intent on raising scores to the 35th percentile or above (see Table 1) would first calculate the percent of residents in the intervention group that achieved this benchmark and divide it by the percent who did not. (If .90 passed and .10 failed, the odds of reaching the 35th percentile for members in the treatment group is 9.0.) Next, the percent of residents in the control group that achieved this benchmark (.75) is divided by the percent that did not (.25); the control group’s odds of reaching the 35th percentile without intervention is 3.0. The last step is to divide the treatment group odds (9.0) by the control group odds (3.0) to produce the odds ratio (3.0).
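Table 1: Sample Odds Ratios and Relative Risk Ratios for ABSITE Performance

| Outcome | Treatment Group (%) | Control Group (%) | Ratio (larger to smaller) |
|---|---|---|---|
| > 35th percentile | 90 | 75 | 1.20 (risk ratio) |
| < 35th percentile | 10 | 25 | 2.50 (risk ratio) |
| Odds (> / <) | 9.0 | 3.0 | 3.00 (odds ratio) |

Note: These are not real data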
An odds ratio of 1.0 indicates the odds of passing the ABSITE are equal for both groups (the odds are 1 to 1). It means there is no relationship between treatment group and outcome. The magnitude of treatment effect is therefore expressed by the distance from 1.0; numbers below 1.0 indicate a negative relationship (effect) and numbers above 1.0 indicate a positive relationship (effect). If an odds ratio resulting from the above experiment is 3.00, that means the odds of residents in the study group intervention reaching the 35th percentile at the next ABSITE administration are three times as great as those in the control group. As with raw effect sizes, confidence intervals can also be placed around the odds ratio statistic to indicate the upper and lower bounds of this prediction. Like standardized effect sizes on treatment means, odds ratios can be used in meta-analysis to compare relative efficacy of interventions across multiple studies.
Relative risk ratios focus on the comparative probabilities across groups of achieving or failing a criterion. (Think rows, not columns.) As shown in Table 1, the relative risk of failing the 35th percentile on the ABSITE is two-and-a-half times greater for the control group than the study group (.25 / .10 = 2.50). The probability of achieving the 35th percentile or greater is 1.20 times greater for the study group (.90 / .75 = 1.20). You can also say the risk difference is 15 percentage points (.90 – .75).
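Because odds ratios and risk ratios are so easily mixed up, it can help to compute them side by side from the same four proportions. The sketch below uses the made-up ABSITE figures from the example (.90 and .75 reaching the benchmark); the variable and function names are my own.

```python
def odds(p_success):
    """Odds of an outcome: proportion achieving it over proportion not."""
    return p_success / (1 - p_success)


# Proportions reaching the 35th percentile (illustrative, not real data).
treatment_pass, control_pass = 0.90, 0.75

odds_ratio = odds(treatment_pass) / odds(control_pass)
risk_ratio_pass = treatment_pass / control_pass
risk_ratio_fail = (1 - control_pass) / (1 - treatment_pass)
risk_difference = treatment_pass - control_pass

print(f"treatment odds  = {odds(treatment_pass):.2f}")
print(f"control odds    = {odds(control_pass):.2f}")
print(f"odds ratio      = {odds_ratio:.2f}")
print(f"risk ratio (reaching the benchmark) = {risk_ratio_pass:.2f}")
print(f"risk ratio (missing the benchmark)  = {risk_ratio_fail:.2f}")
print(f"risk difference = {risk_difference:.0%}")
```

Notice that the same data yield an odds ratio of 3, a risk ratio of 1.2 for success, and a risk ratio of 2.5 for failure. Which one you report changes how dramatic the intervention sounds, so say clearly which it is.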
- Confidence intervals. Last but not least, confidence intervals represent the inductive analogue to the deductive reasoning of null hypothesis significance testing. Confidence intervals address the real question that Cohen says we want to know the answer to: given these sample data, how confident are we that the same results would be found in the population? What are the upper and lower limits within which the “true” population value (e.g., a mean, correlation, or percentage) can be found?
Confidence intervals can be expressed with varying degrees of probability, or “assuredness.” Most typically and stringently they are set at .95, but lower levels (e.g., .80) may also be appropriate. The more certain we wish to be that the interval captures the population value, the wider the band (upper and lower limits) around our sample value. Confidence intervals are affected by sample size; the larger the sample, the narrower the band (the more precise the estimate). This is because confidence interval calculations use standard errors (i.e., standard deviations divided by the square root of the sample size), which decrease as sample size increases. To calculate a 95 percent confidence interval, start by multiplying the standard error (SE) by 1.96. The confidence interval equals your obtained value +/- 1.96 SE.
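A short sketch shows how the band widens with the confidence level and narrows with sample size. It uses the normal approximation for a sample mean, for which the 95 percent multiplier is about 1.96. The mean of 7.0 and standard deviation of 3.0 echo the earlier 10-point scale example, while the sample sizes and function name are assumptions of mine.

```python
from statistics import NormalDist


def confidence_interval(estimate, sd, n, level=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    standard_error = sd / n ** 0.5
    # Multiplier for the chosen level (about 1.96 when level is .95).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * standard_error
    return estimate - margin, estimate + margin


for n in (9, 36, 144):
    low95, high95 = confidence_interval(7.0, 3.0, n)
    low80, high80 = confidence_interval(7.0, 3.0, n, level=0.80)
    print(f"n = {n:>3}: 95% CI [{low95:.2f}, {high95:.2f}], "
          f"80% CI [{low80:.2f}, {high80:.2f}]")
```

With small samples a t-based multiplier is the more usual choice and gives a somewhat wider band; the normal multiplier is used here only to keep the sketch short.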
Making Meaning Out of Results
In sum, interpretation of surgical and educational research findings should be based on a sensible interpretation of the size of the effect and estimates of error (accuracy), given the sample size and the probability of the result occurring if the null were true. To decide whether findings are important involves questions of judgment that go beyond a mechanistic “above/below p=.05” criterion. It is particularly difficult to determine the importance of findings if one cannot translate statistical results back into the instrument’s original units of measure, into English, and then into practical terms, as in, “for every additional week devoted to studying, ABSITE scores increased by two points.” Often we fail to get to the practical interpretation of results, either by design (they reveal a statistically significant but unimportant result) or by default (we don’t know how to get from our data to the statistics and back again). A recent quip I read and enjoyed goes like this:
Always keep in mind the advice of Winifred Castle, a British statistician, who wrote that, “We researchers use statistics the way a drunkard uses a lamp post, more for support than illumination.”22
May you reach for illumination by thinking through your numbers and your reasoning behind them. Otherwise, I’ll be seeing you down by the lamp post.
Acknowledgements: thanks to Michael G. Luxenberg, PhD, and Matt Christenson, of Professional Data Analysts, Inc., for reading a draft of this paper.
References
1. Abelson RP. Statistics as Principled Argument. 1995; Hillsdale (NJ): Lawrence Erlbaum Associates.
2. Kirk RE. Practical significance: A concept whose time has come. Educational and Psychological Measurement, 1996;56(5):746-759.
3. Cohen J. Things I have learned (so far). American Psychologist, 1990;45(12):1304-1312.
4. Cohen J. The earth is round (p<.05). American Psychologist, 1994;49(12):997-1003.
5. Cohen J. 1994 (ibid), p. 997.
6. Tukey JW. The philosophy of multiple comparisons. Statistical Science, 1991;6: p. 100.
7. Howell DC. Statistical Methods for Psychology, 1987; Boston (MA): PWS Publishers, p. 66.
8. Cohen J. 1990 (ibid), p. 1308.
9. Lord FM. On the statistical treatment of football numbers. American Psychologist, 1953;8:750-751.
10. Cohen J. 1990 (ibid), p. 1311.
11. Kirk RE (ibid), p. 748.
12. Muijs D. Doing Quantitative Research with SPSS, 2004; Thousand Oaks (CA): Sage Publications, p. 80.
13. “Small” vs. “large” may be determined relative to the range of differences seen within a set of comparisons (e.g., multiple items on a survey), or relative to previous findings, to a priori expectations or requirements, or to other guidelines, as described in subsequent sections.
14. To calculate a pooled standard deviation, add the variance of group A to the variance of group B and divide by 2 (to get a mean variance), and then take the square root of the mean variance.
15. Vogt WP. Dictionary of Statistics & Methodology: A Non Technical Guide for the Social Sciences, 3rd ed., 2005; Thousand Oaks (CA): Sage Publications, p. 103.
16. Kirk RE (ibid), p. 750.
17. Glass GV. Primary, secondary, and meta-analysis of research. Educational Researcher, 1976;5:3-8.
18. Muijs D (ibid), p. 145.
19. Muijs D (ibid), p. 166.
20. Howell DC (ibid), p. 307.
21. Liberman AM. How much more likely? The implications of odds ratios for probabilities. American Journal of Evaluation, 2005;26(2):253-266.
22. Norman GR, Streiner DL. PDQ Statistics, 2nd ed., 1999; Hamilton, London, St. Louis: B.C. Decker Inc., p. x.