Your Intergalactic Decoder Ring Has Arrived: "Reliability" and "Validity" Defined

As an educational psychologist and a recent newcomer to the world of surgical education, I was bemused and more than a little confused to hear words in my first few weeks that sounded like English (my native language), but registered as making little sense. I am not talking about fearsome and wholly unfamiliar terms like anastomotic leaks and jejuno-jejunostomies, but about recognized parts of my vocabulary that were being used in novel ways. A categorical resident? Who was that, and why did it matter? As for matching—well thank goodness we "matched well," but was it a close call? And what about those M+M's? I wanted to send in my cereal box tops for a magical translation device to better understand the speech patterns of these alien creatures.

Physician faculty members who are newcomers to the Accreditation Council for Graduate Medical Education (ACGME) world of outcomes assessment may feel similar unease. Words like "reliable" and "valid" are common enough in everyday speech, yet their precise meanings for assessment and education research (and their importance for how you conduct your business) may elude you. When it comes to interpreting reliability and validity data, some of you probably just want to draw a line and say, "This is as far as I go."

That would be unfortunate, because educators and physician faculty need each other if we are to avoid making serious errors in estimating the competence of our residents and evaluating the efficacy of our training programs. This article decodes the jargon, helps you to think more critically about these unfamiliar terms, and provides some rules of thumb for decisions you may need to make with regard to assessment. Let me begin with some unsettling ideas.

Error Is Everywhere

Reliability and validity would not concern us if there were no such thing as variation in human endeavors and no such thing as measurement error. We live in a world where human beings do vary in their performance and in their perceptions, however, and unfortunately there is no such thing as a perfect assessment system, tool, or person (assessor). To make matters worse, the nature of what we assess in surgical education is abstract. We are not measuring square footage in the dining room for the purpose of estimating the correct number of gallons of paint to buy at the hardware store. We're measuring constructs such as "surgical competence" and "program quality." These things cannot be seen, tasted, or touched directly. They have a high potential for subjective interpretation. We can only infer their status through an often elaborate process of definition, followed by contrived means of observation whereby we invite the object of interest to reveal itself, much as if we were coaxing a unicorn out of hiding. Any time we interpret results of an Objective Structured Clinical Examination (OSCE), a survey, a mock oral, or a global evaluation rating tool, some complement of random and systematic errors has influenced our judgment. The question is, how much, and what do we do about it?

What Is Reliability and Why Is It Important?

We pursue evidence of reliability because we need to know how consistent our method is for obtaining, recording, and classifying information. Would the same information be forthcoming if we surveyed the same patient a second time? Would we get the same count if we reviewed the same charts twice? Do two people looking at the same OSCE performance see and code the same set of behaviors the same way? Do residents who are role playing the same OSCE scenario score similarly, even when a different group of standardized patients is used? Reliability is important because unreliable measures dilute and obscure real differences.¹ They underestimate relationships in the data, and therefore lead to underestimations of the competence of residents, or the effectiveness of a residency program.²
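To see how unreliable measures "underestimate relationships in the data," the sketch below applies the standard attenuation formula from classical test theory, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The function name and the example values (a true correlation of .60 and reliabilities of .70) are hypothetical and chosen only for illustration.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Correlation we would expect to observe between two imperfect measures,
    given the true correlation and each measure's reliability
    (classical test theory attenuation formula)."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# Hypothetical example: a true relationship of .60, measured with two
# tools that each have a reliability of .70.
observed = attenuated_correlation(0.60, 0.70, 0.70)
print(f"Observed correlation: {observed:.2f}")
```

In this illustration a real relationship of .60 would show up as roughly .42, which is the "dilution" described above.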

Even with the most carefully developed and controlled measurement scenario possible, there will always be a certain amount of random variation (error) in the data. But before we collect any formal reliability data, we can and should reduce potential sources of systematic error. The "usual suspects," when it comes to systematic error, are:

Flaws associated with the measurement device, such as: ambiguous questions, unclear instructions, inappropriate length (a survey that is so long that fatigue overwhelms the respondent), a confusing scoring key, and poor "sampling" of the intended domain of interest. An example of this would be a situation whereby residents who are taking their mock orals score differently from each other, not because of differences in preparation or ability, but because the cases were not standardized (selection was left up to the examiners), and one resident was given only "easy" cases, and the other was given only "hard" cases.
Flaws associated with the conditions for assessment, such as: no established guidelines for administering the mock oral, or lack of training for the OSCE examiners; lack of space or time for responding to a written survey; lack of resident anonymity in submitting evaluations of faculty, thereby inflating their responses; and any number of environmental mishaps, from fire drills in the middle of a skills lab exam to a computer virus that shuts down and permanently erases an online pretest.
Characteristics associated with the learners / respondents, such as: lack of motivation by faculty to complete global evaluations carefully, or discomfort with giving low scores; resident test anxiety, or its converse (test "wiseness"); resident fatigue after a night of being on call; and other human attributes that cause the given scores to deviate from participants' "true" scores.

What Is Validity and Why Is It Important?

We pursue evidence of validity because we need to know how accurate our assessment is, what it means in practical terms, and how to avoid drawing unwarranted conclusions or making unwarranted decisions based on its results. Reliability and validity are interconnected notions, easily confused. Their differences are often explained through metaphors:

If a reliable tape measure is one that comes up with the same dimensions every time you measure the dining room, a valid measure of the dining room's square footage for the purpose of calculating gallons of paint is one that is based on the dining room (not the living room), and leads to only one trip to the hardware store.

If a reliable camera is one that can be adjusted so as to focus clearly and consistently every time the lens is maneuvered, a valid portrait of Aunt Margaret is a picture actually taken of Aunt Margaret (and not Aunt Sally) and about which everyone agrees, "That captures the essence of Aunt Margaret."

While it is possible for measures to be reliable but not valid, it is difficult for measures to be valid if they are not reliable. If a picture is fuzzy, it is harder to be sure it is Aunt Margaret, even when it is Aunt Margaret.

The essential validity question therefore concerns whether an assessment is measuring what it was designed to measure. It is easier than one might think to design an assessment to measure one set of skills (say, the resident's ability to relate certain technical information to patients when disclosing a medical complication), only to find that it seems to measure something else (such as the standardized patients' impression of how confident the resident appears in explaining a mistake).

Questions of validity arise with every assessment and evaluation mechanism you can think of in surgical education. For example:

Do the global evaluations reflect resident strengths and deficiencies that we believe are really there and can be documented in other ways?
Does the rotation evaluation form reflect the "real" learning outcomes of a rotation, or does it rely on "generic" outcomes that "came from the med school," and aren't as relevant for surgery?
For which operations should we structure OSATS (Objective Structured Assessments of Technical Skills)? On whatever operations come up by chance in April and are supervised by surgeons who happen to be best friends of the program director, or owe the program director a favor, or can be otherwise persuaded to participate? [Not a real-life example, thank goodness.] Or should we build OSATS for a small representative sample of common and complex operations that embody transferable procedures and reflect mastery of core skills expected at junior through senior levels?
Can American Board of Surgery In-Training Examination (ABSITE) scores be used to successfully differentiate between residents who need remediation to improve their basic science and clinical knowledge and those who do not?

Even when we have good information on the reliability of tools, we should always ask ourselves questions about their validity in a particular setting for particular purposes. Some of the traditional ways we explore validity are:

Face validity. Does the measure make sense to the people who are being assessed, as well as those who have to complete it, score it, or base decisions on it? Does it resonate with informed, involved people in a basic, holistic way? Although face validity is the least respected of forms by psychometricians, acceptance by appropriate stakeholders can be critical for authentic participation and use.
Content validity. Does the measure include items that sufficiently cover a designated subject area (e.g., curriculum topics) or field of activity (e.g., cases seen on a rotation, steps in a procedure)? Since sampling of topics, cases, and even steps is often necessary to make the assessment logistically feasible, is the resulting sample representative of the domain of interest?
Concurrent validity. Do the results from one type of assessment seem consistent with results from a similar, but different assessment of the same thing? For example, do residents' scores from a newly developed OSCE on interpersonal and communication skills and professionalism correlate with ratings submitted by senior nurses who know the residents well and can observe such behaviors?
Construct validity. If the main construct we are measuring is "resident competency" (as defined, for example, in the Dreyfus model),³ we may well want to know if an assessment successfully distinguishes between residents who are novices vs. those who are advanced beginners vs. those who are competent. Assuming that assessments are "competency-based," do they show expected progression by PGY level? Are the OSATS scores higher for residents who have gone through a structured skills lab than for those who have not?
Criterion-related or predictive validity. Does performance on one assessment predict future performance or some other valued outcome? Does knowing that a fifth-year resident has a standard score of 500 on the ABSITE⁴ mean that he has a poor chance of passing the qualifying exam for the boards on his first attempt?

Readers will notice that the nature of the questions listed above moves from being mostly about the content of the assessment to its instrumental use and value. This brings us to a second important, if unsettling, topic.

There Is No Such Thing as The Reliability, or The Validity of an Assessment Tool

The sad truth is that reliability and validity are really not properties of any instrument per se, although even educational researchers often speak as though it were so. Reliability and validity are constructs. We can only estimate the degree to which they are present in our assessments indirectly, based on trials of these tools in a particular context for a particular purpose. Each time we administer the tools and calculate the statistics, slightly different estimates will result.

If we radically change the population or the assessment context, the estimates will vary markedly. We can predict in advance some things that will affect our estimates. Typically, it is easier to get higher estimates of reliability and validity when a heterogeneous group of respondents or test-takers participates in the assessment than when a restricted (homogeneous) group participates. Typically, it is easier to get higher estimates with larger groups than smaller groups of participants, with longer rather than shorter tests, and when greater as opposed to smaller numbers of raters are involved.
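The effect of test length (or number of raters) on reliability can be illustrated with the Spearman-Brown formula, a standard result from classical test theory that is not part of the original discussion. It assumes the added items or raters are comparable in quality to the existing ones, which may not hold in practice. The function name and starting reliability of .60 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened (or raters are added)
    by `length_factor`, assuming the additions are comparable to the originals."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical example: a single rater yields a reliability of .60.
for raters in (1, 2, 4):
    print(f"{raters} rater(s): predicted reliability {spearman_brown(0.60, raters):.2f}")
```

Under these assumptions, doubling the number of raters moves the estimate from .60 to .75, and quadrupling it moves the estimate above .80.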

To support a claim for reliability and validity we have to engage in a systematic process that goes beyond the instrument per se and includes the meaning we give to the scores and the purpose for which we use them.⁵ Ideally, this process of collecting evidence begins with a literature review and proceeds down a path that reads like the methods section of a research paper. Using as an example a hypothetical OSCE designed to assess how well residents conduct an "end-of-life" family conference for critically ill patients:

A literature review uncovers the main topics that patients value in an "end of life" family conference; these topics become the framework for a behaviorally anchored rating tool (content validity). These topics are reviewed by a panel of surgeons with decades of experience in critical care (content validity).
The rating tool is pre-piloted using videotapes from a previous OSCE involving similar end-of-life scenarios: results uncover ambiguous items, too many descriptors in the anchors, and poor sequencing, which makes it difficult to score. Revisions are made and the rating tool is then piloted in a live session in which staff members role play the part of residents, and actors play the part of family members. More revisions are made. Two master teaching videotapes are made of "residents" (fellows from the critical care unit) performing "satisfactory-good" and "unsatisfactory-poor" end-of-life conferences. In a training session, actors rehearse their scripts and raters are trained to use the rating form by viewing and scoring the master videotape performances. Scores are discussed and some adjustments in raters' perceptions are made. (All of these steps improve reliability, in that they represent attempts to reduce systematic error and increase consistency of the actors' performance and the raters' use of the tool.)
After the real OSCE takes place, residents are asked to complete evaluation forms in which they rate the extent to which the case scenarios, and the family actors' questions and responses, seemed "true to life" and represented scenarios they had encountered on the wards (face validity).
The rating data are collected and analyzed; the inter-correlations between responses across the items⁶ suggest that collectively, the items seem to be measuring the same core set of skills. Additionally, most of the individual items correlate with the total global score (high internal consistency reliability).
The ratings given by both the professional raters (physicians, nurses) and the family actors are correlated to demonstrate consistency across raters (inter-rater reliability).
After expanding the OSCE to several institutions and using it for several years with second and fourth-year residents and with fellows, differences between the scores are found to co-vary with year of training (construct validity). When intensive education is delivered to the next class of PGY-2 residents, they score at the PGY-4 level (construct validity and a publishable paper!).
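Two of the statistics mentioned in the steps above, Cronbach's alpha for internal consistency and a correlation between raters, can be sketched as follows. All ratings here are invented for illustration (five residents, four items on a 1 to 5 scale), and the function names are hypothetical. A simple Pearson correlation shows whether two raters rank residents consistently; it does not show whether they give identical scores.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows (one per resident),
    each row holding that resident's ratings on every item."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5

# Hypothetical item ratings given by physician raters.
ratings = [
    [4, 5, 4, 5],
    [3, 3, 2, 3],
    [5, 5, 4, 4],
    [2, 3, 2, 2],
    [4, 4, 3, 4],
]
alpha = cronbach_alpha(ratings)

# Hypothetical total scores for the same residents from two rater groups.
physician_totals = [sum(row) for row in ratings]
actor_totals = [17, 12, 16, 10, 15]
inter_rater = pearson_r(physician_totals, actor_totals)

print(f"Cronbach's alpha: {alpha:.2f}")
print(f"Inter-rater correlation: {inter_rater:.2f}")
```

With only five residents these numbers would be far too unstable to report; the point is simply to show what is being calculated.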

Understanding Reliability and Validity Data

The data used to quantify reliability consist of correlations (or other techniques, such as factor analysis, which depend on patterns of correlation among scores). Reliability coefficients run from zero to 1.0. A perfectly reliable test would have a reliability coefficient of r = 1.0. A test that was entirely full of error would have a reliability coefficient of r = 0.0. Different coefficients can be generated to measure different attributes or signs of "consistency." In general, tests of ability, achievement, and overt performance have higher reliability than attitude scales. One seeks reliability estimates above .80.

The data used to quantify validity also involve correlations (or other techniques, such as multiple regression analysis, that are based on patterns of correlation among scores). In a perfectly valid test, all of the differences reflected in the scores would be attributed to real differences between people in the skills or abilities being assessed. In a completely nonvalid test, all of the differences would be attributed to extraneous factors, unrelated to what the test was designed to measure. Different types of validity data can be collected. Two of the more common forms of "predictive validity" are:

a) Criterion-related validity. If the test is being used to certify achievement of a foundation skill that is believed critical for future success, one might reasonably look for evidence of criterion-related validity. To do this, scores of a test (e.g., scores of second-year residents on an OSATS of hernia repair) are correlated with a future criterion or outcome of interest (e.g., ratings of the functional outcomes of hernia patients operated on by the same residents during their third and fourth years).

Interpreting the magnitude of validity coefficients is different from interpreting the magnitude of reliability coefficients, because many factors (besides the skills being measured on a particular test) influence future outcomes. At the lower end of what we typically report, a correlation of r = .30 between a test score and a future outcome would be of interest and suggestive of further study. A correlation of r = .60, on the other hand, would be cause for excitement. In practical terms, that means about 36 percent of the variation in hernia patients' functional outcomes is associated with the technical ability being measured in the OSATS.⁷ That's worth knowing! It tells both the resident and the program director that the OSATS is measuring relevant skills, and the skills are worth enhancing.
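The arithmetic behind the two benchmark correlations is simply the square of each coefficient. A minimal sketch, with a hypothetical function name:

```python
def shared_variance(r):
    """Proportion of variation in the outcome associated with the test score."""
    return r ** 2

for r in (0.30, 0.60):
    print(f"r = {r:.2f}: about {shared_variance(r):.0%} of the variation is shared")
```

A correlation of .30 therefore corresponds to about 9 percent of the variation, and a correlation of .60 to about 36 percent.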

b) Decision accuracy. If assessment results are to serve decisions (e.g., require participation in an ABSITE study group, or delay a resident's progression to the next level), then we should be concerned about the extent to which use of the assessment scores helps us make more accurate decisions. If we can already predict with a high degree of accuracy which residents will do well, or not so well, on a future outcome based on current information alone, then the assessment does not add value (increase predictive accuracy) and isn't very useful. But if our ability to predict is not good, or if we are unsure of how high or low a test score needs to be before rendering a decision, then looking at the decision accuracy of a test becomes important.

Essentially, what we are testing is a designated standard of performance on the test—sometimes called the "cut score" or the "passing score." We want to know whether residents who score above the cut score are more likely to do well on a future test or outcome than those who score below the cut score. When this is our question, the ratio of "correct" to "incorrect" decisions based on the cut score becomes the validity statistic of interest.⁸ This same ratio is compared to our "base rate" of predicting success, when decisions are made on current knowledge alone. Often, this type of investigation will cause us to move the cut score up or down to achieve the best ratio of correct to incorrect decisions.
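A cut-score comparison of this kind might be sketched as follows. The scores and outcomes are invented. The sketch reports the proportion of correct decisions rather than the ratio of correct to incorrect ones, and it takes the "base rate" to mean predicting the more common outcome for every resident. Both are simplifying assumptions, as are the function names.

```python
def decision_accuracy(scores, passed_later, cut_score):
    """Proportion of residents for whom 'score at or above the cut score'
    agrees with whether they later passed."""
    correct = sum((s >= cut_score) == p for s, p in zip(scores, passed_later))
    return correct / len(scores)

def base_rate_accuracy(passed_later):
    """Accuracy achieved by predicting the more common outcome for everyone."""
    n_pass = sum(passed_later)
    return max(n_pass, len(passed_later) - n_pass) / len(passed_later)

# Hypothetical test scores and later pass/fail outcomes for ten residents.
scores = [380, 420, 450, 470, 500, 520, 560, 600, 640, 700]
passed_later = [False, False, True, False, True, False, True, True, True, True]

print(f"Base rate: {base_rate_accuracy(passed_later):.0%} correct")
for cut in (400, 450, 500, 550):
    accuracy = decision_accuracy(scores, passed_later, cut)
    print(f"Cut score {cut}: {accuracy:.0%} correct")
```

In this made-up data set, a cut score of 450 or higher improves on the base rate, while a cut score of 400 adds less. A program would also weigh which kind of error (a false "pass" or a false "fail") is more costly before settling on a cut score.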

Rules of Thumb

Physician faculty and educators need to have a healthy respect for the challenges of developing reliable and valid means of assessing residents and evaluating training programs. The principles and statistical techniques being imported to surgical residency training come from the field of educational and psychological testing,⁹ where highly standardized, norm-referenced tests of ability, achievement, and personality represent the model applications (think SAT, MCAT, the Minnesota Multiphasic Personality Inventory). It is not easy to translate these principles to residency programs for three reasons: we have small groups of uniformly high-achieving individuals; our settings offer few naturally occurring opportunities to control the conditions of assessment; and we have few criterion performances beyond residency training on which to base predictive validity (only the American Board of Surgery board exams come to mind). To conclude, program directors and other physician faculty may wish to keep the following rules of thumb in mind:

There is no such thing as a perfectly reliable assessment, but one that attends to possible sources of error in construction, administration, and preparation of participants will be more reliable than one that does not. Even if you can't calculate reliability coefficients, you can reduce errors of measurement.
Just because a survey, evaluation form, or OSCE rating tool was found to be reliable in another setting does not mean it will be reliable in yours, unless you attend to the possible sources of error listed above and are using it with similar populations under similar conditions.
It is generally easier to establish reliability than validity. A variety of information needs to be considered before one can determine that an assessment is measuring what you think it is measuring, and that its scores have meaning for a designated purpose. Until you have confidence in the data, use the results of assessments with caution. Emphasize the teaching and learning value of assessment. Score the performance, but grade generously: "satisfactory" vs. "needs improvement," or "complete" vs. "incomplete." Do not use the individual scores for making high-stakes decisions. In general, never base high-stakes decisions on one score alone, no matter how reliable or valid you think it is.
Recognize that high-stakes decisions do need to be made during residency training, and that you will therefore need a complement of assessments that you can trust. Recognize that you will need to commit serious time and resources to developing the instruments, raters, and conditions whereby the results will be valid. Importing a "valid tool" from another setting will not solve your problem, unless you attend to and can certify the underlying reliability issues in your own setting. You may also need to engage in a process of review and dialogue whereby stakeholders can attest to the face validity of the tool and mode of application.

Last but not least, walk humbly in the face of these challenges, but do not shy away from them. The statistical expertise for analyzing data and generating reliability and validity coefficients can be borrowed or purchased. Understanding the underlying principles of reliability and validity, however, is crucial for physician faculty who are responsible for assessment. Without that understanding, the statistics alone will have limited value. Hopefully, this article has contributed to your conceptual understanding.

References

1. Rossi RH & Freeman HE. Evaluation: A systematic approach. Newbury Park (CA): Sage Publications; 1979: 230-34.
2. Weiss CH. Evaluation: Methods for studying programs and policies (2nd ed.). Upper Saddle River (NJ): Prentice Hall; 146.
3. Dreyfus HL. Intuitive, deliberative, and calculative models of expert performance. In: Zsambok CE, Klein G, eds. Naturalistic decision making. Hillsdale (NJ): Lawrence Erlbaum Associates, Inc; (year), 17-28.
4. A standard score of 500 represents the mean of all residents who took the test that year, regardless of PGY level. It is roughly equivalent to the mean percent correct score expected of residents at the end of their second year. See Rusucci DA. A brief guide for program directors on how to assess mean changes in ABSITE performance. American College of Surgeons Residency Assist Page; February 16, 2005.
5. Messick S. The meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2): 1989; 5-11.
6. This is calculated with a reliability correlation coefficient called "Cronbach's alpha."
7. The proportion represents the square of the correlation.
8. Schmitz CC & DelMas RC. Determining the validity of placement exams for developmental college curricula. Applied Measurement in Education, 4(1): 1991; 37-52.
9. E.g., see Brown FG. Principles of educational and psychological testing (2nd ed.). New York (NY): Holt, Rinehart and Winston; 1970.