As an educational psychologist and a recent newcomer to the world of surgical education, I was bemused and more than a little confused to hear words in my first few weeks that sounded like English (my native language), but registered as making little sense. I am not talking about fearsome and wholly unfamiliar terms like anastomotic leaks and jejuno-jejunostomies, but about recognized parts of my vocabulary that were being used in novel ways. A categorical resident? Who was that, and why did it matter? As for matching—well thank goodness we "matched well," but was it a close call? And what about those M+M's? I wanted to send in my cereal box tops for a magical translation device to better understand the speech patterns of these alien creatures.
Physician faculty members who are newcomers to the Accreditation Council for Graduate Medical Education (ACGME) world of outcomes assessment may feel similar unease. Words like "reliable" and "valid" are common enough in everyday speech, yet their precise meanings for assessment and education research (and their importance for how you conduct your business) may elude you. When it comes to interpreting reliability and validity data, some of you probably just want to draw a line and say, "This is as far as I go."
That would be unfortunate, because educators and physician faculty need each other if we are to avoid making serious errors in estimating the competence of our residents and evaluating the efficacy of our training programs. This article decodes the jargon, helps you to think more critically about these unfamiliar terms, and provides some rules of thumb for decisions you may need to make with regard to assessment. Let me begin with some unsettling ideas.
Reliability and validity would not concern us if there were no such thing as variation in human endeavors and no such thing as measurement error. However, we live in a world where human beings do vary in their performance and in their perceptions, and unfortunately there is no such thing as a perfect assessment system, tool, or person (assessor). To make matters worse, the nature of what we assess in surgical education is abstract. We are not measuring square footage in the dining room for the purpose of estimating the correct number of gallons of paint to buy at the hardware store. We're measuring constructs such as "surgical competence" and "program quality." These things cannot be seen, tasted, or touched directly. They have a high potential for subjective interpretation. We can only infer their status through an often elaborate process of definition, followed by contrived means of observation whereby we invite the object of interest to reveal itself, much as if we were coaxing a unicorn out of hiding. Any time we interpret results of an Objective Structured Clinical Examination (OSCE), a survey, a mock oral, or a global evaluation rating tool, some complement of random and systematic errors has influenced our judgment. The question is, how much, and what do we do about it?
We pursue evidence of reliability because we need to know how consistent our method is for obtaining, recording, and classifying information. Would the same information be forthcoming if we surveyed the same patient a second time? Would we get the same count if we reviewed the same charts twice? Do two people looking at the same OSCE performance see and code the same set of behaviors the same way? Do residents who are role playing the same OSCE scenario score similarly, even when a different group of standardized patients is used? Reliability is important because unreliable measures dilute and obscure real differences.1 They underestimate relationships in the data, and therefore lead to underestimations of the competence of residents, or the effectiveness of a residency program.2
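The dilution described above can be made concrete with the classical attenuation formula from test theory, which is brought in here for illustration and is not taken from the article's sources. Under that formula, the correlation we observe between two measures is the true correlation multiplied by the square root of the product of their reliabilities. A minimal sketch in Python, with a hypothetical function name and made-up numbers:

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Expected observed correlation between two imperfectly reliable measures.

    Classical test theory: observed r = true r * sqrt(rel_x * rel_y).
    """
    for rel in (reliability_x, reliability_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliability must lie between 0 and 1")
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical numbers: a true correlation of .60 between skill and outcome,
# an outcome measured without error, and an assessment of varying reliability.
for reliability in (1.0, 0.8, 0.5):
    observed = attenuated_correlation(0.60, reliability, 1.0)
    print(f"assessment reliability {reliability:.2f} -> observed r = {observed:.2f}")
```

Under these assumed values, the same underlying relationship looks progressively weaker as the reliability of the assessment falls, which is the underestimation the paragraph warns about.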
Even with the most carefully developed and controlled measurement scenario possible, there will always be a certain amount of random variation (error) in the data. But before we collect any formal reliability data, we can and should reduce potential sources of systematic error. The "usual suspects," when it comes to systematic error, are:
We pursue evidence of validity because we need to know how accurate our assessment is, and what it means in practical terms, and how to avoid making unwarranted conclusions or decisions based on its results. Reliability and validity are interconnected notions, easily confused. Their differences are often explained through metaphors:
If a reliable tape measure is one that comes up with the same dimensions every time you measure the dining room, a valid measure of the dining room's square footage for the purpose of calculating gallons of paint is one that is based on the dining room (not the living room), and leads to only one trip to the hardware store.
If a reliable camera is one that can be adjusted so as to focus clearly and consistently every time the lens is maneuvered, a valid portrait of Aunt Margaret is a picture actually taken of Aunt Margaret (and not Aunt Sally) and about which everyone agrees, "That captures the essence of Aunt Margaret."
While it is possible for measures to be reliable but not valid, it is difficult for measures to be valid if they are not reliable. If a picture is fuzzy, it is harder to be sure it is Aunt Margaret, even when it is Aunt Margaret.
The essential validity question therefore concerns whether an assessment is measuring what it was designed to measure. It is easier than one might think to design an assessment to measure one set of skills (say, the resident's ability to relate certain technical information to patients when disclosing a medical complication), only to find that it seems to measure something else (such as the standardized patients' impression of how confident the resident appears in explaining a mistake).
Questions of validity arise with every assessment and evaluation mechanism you can think of in surgical education. For example:
Even when we have good information on the reliability of tools, we should always ask ourselves questions about their validity in a particular setting for particular purposes. Some of the traditional ways we explore validity are:
Readers will notice that the nature of the questions listed above moves from being mostly about the content of the assessment to its instrumental use and value. This brings us to a second important, if unsettling, topic.
The sad truth is that reliability and validity are really not properties of any instrument per se, although even educational researchers often speak as though it were so. Reliability and validity are constructs. We can only estimate the degree to which they are present in our assessments indirectly, based on trials of these tools in a particular context for a particular purpose. Each time we administer the tools and calculate the statistics, slightly different estimates will result.
If we radically change the population or the assessment context, the estimates will vary markedly. We can predict in advance some things that will affect our estimates. Typically, it is easier to get higher estimates of reliability and validity when a heterogeneous group of respondents or test-takers participates in the assessment than when a restricted (homogeneous) group participates. It is also typically easier to get higher estimates with larger rather than smaller groups of participants, with longer rather than shorter tests, and with greater rather than smaller numbers of raters.
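The effect of test length and number of raters can be sketched with the Spearman-Brown prophecy formula, a standard result from classical test theory that the article itself does not name. It assumes that the added items or raters are equivalent in quality to the existing ones, and it says nothing about the separate effects of group size or heterogeneity. The function name and numbers below are illustrative only:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after multiplying test length (or rater count)
    by length_factor, assuming the added parts are parallel to the originals."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical example: an OSCE with an estimated reliability of .60.
for factor in (0.5, 1, 2, 3):
    predicted = spearman_brown(0.60, factor)
    print(f"length x{factor}: predicted reliability = {predicted:.2f}")
```

On these assumptions, doubling the number of stations (or raters) would be expected to raise a reliability of .60 to .75, while halving it would lower the estimate.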
To support a claim for reliability and validity we have to engage in a systematic process that goes beyond the instrument per se and includes the meaning we give to the scores and the purpose for which we use them.5 Ideally, this process of collecting evidence begins with a literature review and proceeds down a path that reads like the methods section of a research paper. Using as an example a hypothetical OSCE designed to assess how well residents conduct an "end-of-life" family conference for critically ill patients:
Understanding Reliability and Validity Data
The data used to quantify reliability consist of correlations (or other techniques, such as factor analysis, which depend on patterns of correlation among scores). Reliability coefficients run from zero to 1.0. A perfectly reliable test would have a reliability coefficient of r = 1.0. A test that was entirely full of error would have a reliability coefficient of r = 0.0. Different coefficients can be generated to measure different attributes or signs of "consistency." In general, tests of ability, achievement, and overt performance have higher reliability than attitude scales. As a rule of thumb, one seeks reliability estimates above .80.
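As one simple illustration of how such a coefficient is obtained, the sketch below computes a Pearson correlation between two raters' scores for the same six residents. The scores are invented for the example, and the Pearson coefficient is only one of several possible choices: it captures whether the raters rank residents consistently, not whether one rater is systematically more lenient than the other.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical OSCE checklist totals from two raters watching the same residents.
rater_a = [12, 15, 9, 18, 14, 11]
rater_b = [13, 14, 10, 17, 15, 10]

r = pearson_r(rater_a, rater_b)
print(f"inter-rater correlation r = {r:.2f}")
print("meets the .80 rule of thumb" if r > 0.80 else "below the .80 rule of thumb")
```

The same function could be applied to two administrations of the same test to the same group, which would give a test-retest estimate instead of an inter-rater one.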
The data used to quantify validity also involve correlations (or other techniques, such as multiple regression analysis, that are based on patterns of correlation among scores). In a perfectly valid test, all of the differences reflected in the scores would be attributed to real differences between people in the skills or abilities being assessed. In a completely nonvalid test, all of the differences would be attributed to extraneous factors, unrelated to what the test was designed to measure. Different types of validity data can be collected. Two of the more common forms of "predictive validity" are:
a) Criterion-referenced validity. If the test is being used to certify achievement of a foundation skill that is believed critical for future success, one might reasonably look for evidence of criterion-referenced validity. To do this, scores of a test (e.g., scores of second-year residents on an OSATS of hernia repair) are correlated with a future criterion or outcome of interest (e.g., ratings of the functional outcomes of hernia patients operated on by the same residents during their third and fourth years).
Interpreting the magnitude of validity coefficients is different from interpreting the magnitude of reliability coefficients, because many factors (besides the skills being measured on a particular test) influence future outcomes. At the lower end of what we typically report, a correlation of r = .30 between a test score and a future outcome would be of interest and would suggest further study. A correlation of r = .60, on the other hand, would be cause for excitement. Squaring the correlation gives the proportion of shared variation (.60 × .60 = .36), so in practical terms about 36 percent of the variation in hernia patients' functional outcomes is associated with the technical ability being measured in the OSATS.7 That's worth knowing! It tells both the resident and the program director that the OSATS is measuring relevant skills, and the skills are worth enhancing.
b) Decision accuracy. If assessment results are to serve decisions (e.g., require participation in an ABSITE study group, or delay a resident's progression to the next level), then we should be concerned about the extent to which use of the assessment scores helps us make more accurate decisions. If we can already predict with a high degree of accuracy which residents will do well, or not so well, on a future outcome based on current information alone, then the assessment does not add value (increase predictive accuracy) and isn't very useful. But if our ability to predict is not good, or if we are unsure of how high or low a test score needs to be before rendering a decision, then looking at the decision accuracy of a test becomes important.
Essentially, what we are testing is a designated standard of performance on the test—sometimes called the "cut score" or the "passing score." We want to know whether residents who score above the cut score are more likely to do well on a future test or outcome than those who score below the cut score. When this is our question, the ratio of "correct" to "incorrect" decisions based on the cut score becomes the validity statistic of interest.8 This same ratio is compared to our "base rate" of predicting success, when decisions are made on current knowledge alone. Often, this type of investigation will cause us to move the cut score up or down to achieve the best ratio of correct to incorrect decisions.
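The comparison described above can be sketched in a few lines of Python. Everything in the example is hypothetical: the scores, the outcomes, and the function names are invented, a score at or above the cut is treated as a predicted success, and "current knowledge alone" is simplified to the naive prediction that every resident will succeed.

```python
def decision_counts(scores, succeeded, cut_score):
    """Return (correct, incorrect) decisions when a score at or above
    cut_score is taken as a prediction of later success."""
    if len(scores) != len(succeeded):
        raise ValueError("scores and outcomes must have the same length")
    correct = sum(
        (score >= cut_score) == bool(outcome)
        for score, outcome in zip(scores, succeeded)
    )
    return correct, len(scores) - correct


def base_rate_counts(succeeded):
    """Return (correct, incorrect) if we simply predict that everyone succeeds."""
    correct = sum(bool(outcome) for outcome in succeeded)
    return correct, len(succeeded) - correct


# Hypothetical data: ten residents' test scores and whether each later
# succeeded on the outcome of interest.
scores = [52, 58, 61, 64, 67, 70, 73, 76, 80, 85]
succeeded = [False, False, True, False, True, True, True, True, True, True]

print("base rate (correct, incorrect):", base_rate_counts(succeeded))
for cut in (60, 65, 70):
    print(f"cut score {cut} (correct, incorrect):", decision_counts(scores, succeeded, cut))
```

With these invented data, the naive prediction is right for 7 of 10 residents, a cut score of 65 is right for 9 of 10, and raising the cut to 70 drops accuracy to 8 of 10. This is the kind of comparison that might lead a program to move a cut score up or down.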
Physician faculty and educators need to have a healthy respect for the challenges of developing reliable and valid means of assessing residents and evaluating training programs. The principles and statistical techniques being imported to surgical residency training come from the field of educational and psychological testing,9 where highly standardized, norm-referenced tests of ability, achievement, and personality represent the model applications (think SAT, MCAT, the Minnesota Multiphasic Personality Inventory). It is not easy to translate these principles to residency programs for three reasons: we have small groups of uniformly high-achieving individuals; our settings offer few naturally occurring opportunities to control the conditions of assessment; and we have few criterion performances beyond residency training on which to base predictive validity (only the American Board of Surgery board exams come to mind). To conclude, program directors and other physician faculty may wish to keep the following rules of thumb in mind:
Last but not least, walk humbly in the face of these challenges, but do not shy away from them. The statistical expertise for analyzing data and generating reliability and validity coefficients can be borrowed or purchased. Understanding the underlying principles of reliability and validity, however, is crucial for physician faculty who are responsible for assessment. Without that understanding, the statistics alone will have limited value. Hopefully, this article has contributed to your conceptual understanding.