Are Your Assessment Scores and Feedback Reliable? A Statistical Review for the Surgical Educator


Assessments that measure a trainee’s knowledge, skills, and behavior can support inferences about that trainee’s performance in non-exam, authentic clinical situations. Central to the quality of these assessments is their reliability and the evidence supporting it. High reliability is needed for the defensibility of the assessment and of the decisions made about trainees based on its results. This article provides an overview of reliability, types of reliability indices, and applications of reliability to educational assessments within surgery.

Foundations of Reliability

Reliability refers to the consistency of information (scores, feedback, data) gathered from assessments, typically estimated as a ratio of signal (true score variance) relative to noise (error variance). A trainee’s “true” knowledge or skills can never be directly measured. Instead, their observed performance on assessments serves as a proxy for this true knowledge and skill. Items that test their knowledge or performance should follow a sampling strategy to gather information across predetermined test specifications (blueprint) targeting the content measured. Increasing the number of items strengthens the precision, i.e., reliability, of information inferred. Think of it as similar to sampling a population for epidemiological studies.

Figure 1. Illustration of Sampling Across a Test Blueprint

The facts a trainee knows about a topic are represented by green marbles, and the facts the trainee does not yet know are represented by red marbles. By sampling across these topics with a test blueprint, we can obtain a representation of the trainee’s true knowledge. The larger the sample of questions per topic, the more precise the estimate of the percentage of green versus red marbles in each topic.

In Figure 1, if the green marbles represent facts that a resident knows about a surgical topic and red marbles represent facts they do not yet know, we could sample their knowledge on each topic with a multiple-choice test. If we ask only one question per topic, there is a chance we could pull out a red marble, i.e., an item they cannot answer, even if they know most facts about that topic. However, if we systematically sampled 10 marbles or items from each topic, we would be more likely to correctly estimate the percentage of facts they know about that topic.
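
The marble example can be made concrete with a short calculation. The sketch below is illustrative only: it assumes each item is an independent draw (a simple binomial model), and the 80% figure and the function name `standard_error` are hypothetical choices, not values from any real exam.

```python
import math


def standard_error(p_known: float, n_items: int) -> float:
    """Standard error of the estimated proportion of known facts
    when n_items independent items are sampled from one topic."""
    return math.sqrt(p_known * (1 - p_known) / n_items)


# Hypothetical resident who knows 80% of the facts on a topic.
p_known = 0.8

# Compare one item per topic with larger samples per topic.
for n_items in (1, 10, 40):
    se = standard_error(p_known, n_items)
    print(f"{n_items:>2} items per topic: standard error = {se:.3f}")
```

Under these assumptions, the uncertainty around the estimated percentage shrinks as more items are sampled per topic, which is the sense in which more items give a more precise estimate.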

Classical Test Theory

Whenever a trainee’s performance is observed, whether through a multiple-choice test or an objective structured assessment of technical skills (OSATS), error is introduced. Classical test theory summarizes this as:

Observed Performance = True Performance + Error

Our goal as surgical educators is to reduce this error as much as possible in order to increase the reliability or consistency of our assessments.³ Figure 2 depicts the distributions of obtained and true scores, showing a decrease in variance when we exclude variance due to error. When assessing a group of trainees, we can plot their true scores for knowledge or skills by subtracting error from each trainee’s obtained score. These scores would represent their true knowledge or performance if we could remove error from the testing situation. When comparing obtained scores for two trainees, some of the difference is a true difference in knowledge or skill (i.e., one trainee truly knows more than another trainee about a topic). Some of the difference is error or noise. This error is a product of the test sampling, random error, and other test-day factors. If the true difference in trainees’ performances is the signal, this error is the noise.⁴

Figure 2: Variance in trainees’ true performances versus the total variance observed amongst trainees once assessment error is introduced.

Assessment reliability tells us how much of the variance between trainees is due to true differences in knowledge or performance and how much is due to random error introduced during assessment measurement. In other words, reliability is a signal-to-noise ratio:

Reliability = (True Variance) / (Total Variance)
Reliability = Signal / (Signal + Noise)
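
As a small worked example of the ratio above, the sketch below plugs hypothetical variance components into the formula. The numbers and the function name `reliability` are made up for illustration; in practice the true variance is not observed directly and has to be estimated.

```python
def reliability(true_variance: float, error_variance: float) -> float:
    """Reliability = signal / (signal + noise)."""
    total_variance = true_variance + error_variance
    return true_variance / total_variance


# Hypothetical exam A: most score variance reflects true differences.
print(reliability(true_variance=80.0, error_variance=20.0))  # 0.8

# Hypothetical exam B: as much noise as signal.
print(reliability(true_variance=80.0, error_variance=80.0))  # 0.5
```

With the same spread of true ability, the exam with more error variance has the lower reliability.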

Reliability versus Validity

Assessments with sufficient validity evidence measure what they are intended to measure, and nothing else.⁴ For example, if a surgical educator wanted to assess whether a surgical resident understood how to triage ill surgical patients, they would likely construct a blueprint of various surgical patient case scenarios. The resident would then rank the scenarios in order of which patient they would attend to first. The educator would not give the resident a multiple-choice test on biliary anatomy. While anatomical knowledge is also important for surgical residency, it is not the construct of interest here; understanding the trainee’s triage skills is the purpose. Reliability contributes to validity by ensuring that the construct being measured (the signal) is detected amidst potential measurement error (the noise).⁵

Figure 3: Reliability vs Validity

For assessment, scores should be both consistent (reliable) and accurate (valid). Reliability and validity are related. For example, the “unreliable, but valid” bullseye might represent a series of exams where the average performance accurately reflects the class’s average knowledge. However, each trainee’s performance is not consistent from one exam to the next, so any single score is unlikely to be an accurate representation of that trainee’s true knowledge.

Types of Reliability Indices

Test-Retest Reliability

Test-retest reliability evaluates consistency in assessment results over time. It evaluates the error that is introduced when testing at two (or more) different time points.⁶ For instance, if we were to test surgical residents on their knowledge of biliary anatomy, we could have them label a 2-D picture depicting anatomy of the right upper quadrant. We could give them the same 2-D picture to label again two weeks later, and again two weeks after that. Barring any educational interventions during that time period, if they labeled the structures the same way and got the same score each time, the test is a reliable measure, based on test-retest, of their ability to identify 2-D biliary anatomy. Their scores might improve slightly from one test to the next because of the learning effect. Even if they do not study for each test, they are likely to think about their results and learn from prior interactions with the testing instrument. However, if there was a strong correlation between their first, second, and last performances, the test-retest reliability would be high. If the 2-D picture was difficult for the residents to interpret or the instructions were not clear, this could lead to trainees guessing each time they took the assessment.

This variance, resulting from guessing, would decrease the reliability of the test. If the reliability of the test was never checked, such as through test-retest, we would never know that the instructions were unclear. Instead, we might assume that the trainee did not know biliary anatomy and encourage them to study more. This example highlights the importance of checking the reliability of assessment results. It should motivate the surgical educator to evaluate which portions of an assessment or its instructions could be improved to differentiate trainees’ true capabilities from performance error, i.e., to increase reliability. To determine whether performance data are sufficiently reliable, a test-retest should be done with multiple trainees over time. This allows the educator to compare the variability in each trainee’s scores between tests with the variability among their peers. This calculation can be done with the intraclass correlation coefficient (ICC), which is described further below.⁷
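
To illustrate the idea of correlating repeated administrations, the sketch below uses a plain Pearson correlation as a simple stand-in for the ICC; this substitution is our assumption for brevity, and the scores are invented. It contrasts a cohort whose scores all rise by one point (a learning effect) with a cohort whose second scores look like guessing.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical labeling scores for five residents at the first test.
first_test = [12, 15, 9, 18, 14]

# Second test: everyone gains one point (learning effect), same rank order.
retest_learning = [13, 16, 10, 19, 15]

# Second test: scores unrelated to the first, as if residents were guessing.
retest_guessing = [15, 9, 14, 12, 18]

print(round(pearson_r(first_test, retest_learning), 2))
print(round(pearson_r(first_test, retest_guessing), 2))
```

A uniform improvement leaves the correlation high, whereas guessing scrambles the rank order and the correlation drops. Note that a Pearson correlation ignores systematic shifts between administrations; ICC forms that assess absolute agreement do not.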

Internal-Consistency Reliability (Cronbach’s Alpha)

The internal-consistency reliability (Cronbach’s alpha) compares responses on individual items to one another to calculate how much error each item contributes to the total exam score variance. To calculate, the assessment is hypothetically divided in half to create two simulated assessments (split-half reliability) multiple different times. The responses across these two hypothetically separate exams are then compared again and again. Each time the assessment is divided in half in a different way. This allows the educator to see which questions contribute the most error variance (item error variance) relative to the other test items. The test items contributing the highest error variance can then be removed to improve the reliability of the assessment data. The higher the item-error variance, the lower the Cronbach’s alpha. When comparing trainees’ exams, poor reliability, or low Cronbach’s alpha, would make it difficult to accurately assess if one trainee outperformed another trainee.⁸ A large item-error variance creates the noise that drowns out the signal or the true variance between trainees.

Reliability = True Variance / Total Variance
Cronbach's alpha = (Total Variance - Item Error Variance) / (Total Variance)
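
The expression above is a conceptual summary. As a hedged sketch, the code below uses the commonly cited computational form of Cronbach’s alpha, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. That formula is not spelled out in this article, and the Likert responses are invented for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per trainee,
    one column per item), using sample variances."""
    k = len(scores[0])
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 5 trainees x 4 confidence items (1-5 Likert scale).
# Each trainee answers the four items in a similar way.
consistent = [
    [4, 4, 4, 5],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 3],
    [1, 2, 1, 1],
]

# Same scale, but answers within a trainee vary widely across items.
noisy = [
    [4, 3, 2, 5],
    [2, 3, 4, 1],
    [5, 2, 4, 4],
    [3, 4, 2, 3],
    [1, 2, 3, 2],
]

print(round(cronbach_alpha(consistent), 2))
print(round(cronbach_alpha(noisy), 2))
```

In this toy data the consistent responses give an alpha above 0.9, while the noisy responses fall well below the 0.7 level often used for low-stakes assessments.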

Whether a Cronbach’s alpha is considered acceptable is determined by the impact of the test results: 0.9 is used for high-stakes assessments such as national licensure exams, 0.8 for moderate-stakes assessments such as residency in-service exams or medical student clerkship tests, and 0.7 for lower-stakes assessments such as local medical school quizzes.³,⁹

As an example, you might use this reliability measurement when asking your surgical resident to complete a self-assessment form or questionnaire. You could have a surgical resident assign a Likert scale number to indicate their confidence on procedural performances as part of this self-assessment. Using Cronbach’s alpha, you can arbitrarily divide the assessment form and see how often the resident assigns the same Likert scale score to questions assessing a similar construct, for example, confidence with procedural skills. If they select “mostly confident” on all of the procedural confidence questions, that questionnaire is a reliable assessment of procedural confidence. The Cronbach’s alpha would likely be greater than 0.9, indicating excellent internal reliability.¹⁰ If some of the time they respond “somewhat unconfident” and other times they respond “mostly confident” on the procedural confidence questions, the Cronbach’s alpha would likely be low. The internal consistency for procedural confidence would be poor because the responses varied from one question to the next. This inconsistency would make it difficult to discriminate between trainees. There is so much within-questionnaire variation per trainee that the noise drowns out the signal.

Item-error variance can also be used to calculate reliability for multiple-choice exams, such as the national certifying exam. Within these exams, items or test questions can be compared to one another to assess whether they make the exam results more or less reliable. If an item contributes a large amount of error variance to the total variance, it is often removed. Removing it increases the Cronbach’s alpha, i.e., the reliability of the test results. This is one of the purposes of pilot testing new items on the in-service (ABSITE) or board exams prior to deciding if the items should be included in future trainees’ overall scores.

Interrater Reliability (Intraclass Correlation)

Intraclass correlation is one measure of interrater reliability that calculates the agreement between raters. It accounts for the magnitude of how much they disagree—their rater variance—and also accounts for the likelihood that their agreement is due to chance alone.

ICC = (Shared Rater Variance) / [(Shared Rater Variance) + (Rater Error)]

When residents receive ratings from faculty, such as through workplace evaluations, grades on OSCEs, or technical skills ratings on OSATS, we can calculate interrater agreement to ensure the faculty ratings are reliable. For example, a surgical resident may be evaluated by multiple faculty members on whether she can complete components of a laparoscopic cholecystectomy using a competency-based checklist. If a faculty member decides she is competent on one of the procedural components, they will give her a check mark. If they feel she is not yet competent at that task, they will not give her a check mark. There is no partial credit. Given this all-or-nothing checklist scoring system, we can calculate the intraclass correlation coefficient (ICC) to assess the consistency between faculty graders.¹¹

Technical Skill Component	Rater 1	Rater 2
Appropriate Placement of Trocars	X	X
Dissects Peritoneum off Infundibulum	-	-
Achieves Critical View of Safety	-	X
Clips Cystic Artery & Cystic Duct	X	X
Dissects Gallbladder off Liver Bed	X	X
Removes Specimen	X	X
Inspects for Hemostasis	X	-
Closes Port Sites	X	X
X = check mark (rated competent); - = no check mark

Table 1: Rater Agreement with Intraclass Correlation Coefficients

For this laparoscopic cholecystectomy technical performance, the two raters agreed 75% of the time (on 6 of 8 items) on the competency of the resident’s performance. Rater 1 felt the resident was not competent on two items, “dissects peritoneum” and “achieves critical view of safety,” and gave 6/8 check marks. Rater 2 felt the resident was not competent on “dissects peritoneum” or “inspects for hemostasis.” Thus, Rater 2 also gave 6/8 check marks, but a different 6/8. When accounting for rater error, their 75% agreement, measured by ICC, decreases to 0.53.
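
The figures above can be checked with a short calculation. The article does not state which form of the ICC was used; the sketch below assumes the two-way random-effects, absolute-agreement, average-measures form (often written ICC(2,k)), which happens to give approximately 0.53 for Table 1. The function names and the 1/0 coding of check marks are our own choices.

```python
def percent_agreement(ratings):
    """Share of checklist items on which all raters gave the same mark."""
    return sum(1 for row in ratings if len(set(row)) == 1) / len(ratings)


def icc_2k(ratings):
    """ICC(2,k): two-way random effects, absolute agreement,
    average of k raters, from a two-way ANOVA decomposition."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)              # between-item mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Table 1 coded as (Rater 1, Rater 2); 1 = check mark, 0 = no check mark.
table_1 = [
    (1, 1),  # Appropriate Placement of Trocars
    (0, 0),  # Dissects Peritoneum off Infundibulum
    (0, 1),  # Achieves Critical View of Safety
    (1, 1),  # Clips Cystic Artery & Cystic Duct
    (1, 1),  # Dissects Gallbladder off Liver Bed
    (1, 1),  # Removes Specimen
    (1, 0),  # Inspects for Hemostasis
    (1, 1),  # Closes Port Sites
]

print(percent_agreement(table_1))   # 0.75
print(round(icc_2k(table_1), 2))    # 0.53
```

Other ICC forms (for example, single-rater or consistency-based versions) would give different values for the same table, so the form used should be reported alongside the coefficient.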

Rater agreement is often deemed poor if <0.40, good if 0.40-0.75, and excellent if >0.75.¹¹ However, the target agreement differs depending on how the assessment outcomes will be used. For a low-stakes assessment, such as formative feedback provided on technical skills (e.g., OSATS), an ICC greater than 0.75 is often the goal. For summative or high-stakes assessments, such as the oral boards certifying exam, the target agreement is 0.9. Establishing high agreement for rotation evaluations, OSCEs, and OSATS can be difficult because the agreement is dependent on faculty raters. If faculty are not trained on how to rate certain behaviors consistently and are not given behaviorally anchored grading scales, it is unlikely that each faculty member will give the same ratings to a resident. Prior to rater training, each faculty member has their own unique framework of what they consider a “good,” “average,” or “poor” performance. These differing frameworks can introduce error into the assessment results. Training faculty on how to use a particular framework for OSCEs or workplace assessments, and giving them examples of “poor,” “average,” and “good” performances to anchor their grading, can help to generate consistent or reliable results.

Summary

Whether measuring internal consistency, reliability over time, or agreement between assessors, it is important to calculate assessment reliability to evaluate whether trainees are being provided with consistent evaluations, because the information provided to the trainee serves as the basis for feedback, and that feedback should be consistent. These reliability metrics can be difficult to calculate by hand, and we recommend surgical educators work with their local statistician (psychometrician) or a statistical software program to enter the assessment data and calculate the reliability indices. As surgical educators, it is important to understand that a trainee’s exam score does not purely reflect the trainee’s true performance. Rather, error variance is introduced with assessment. It is impossible to eliminate all assessment error. However, by calculating assessment reliability and reviewing trainees’ external assessment reports for reliability indices, we can work towards decreasing the error that is introduced and provide trainees with more reliable, consistent feedback on their surgical knowledge and skills.

References

1. Pagano M, Valadez JJ. Commentary: Understanding practical lot quality assurance sampling. International Journal of Epidemiology. 2010;39(1):69-71.
2. Birnbaum A. Some latent trait models and their use in inferring an examinee’s ability. In: Lord FM, Novick MR, eds. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley; 1968.
3. Downing SM. Reliability: on the reproducibility of assessment data. Medical Education. 2004;38(9):1006-1012.
4. Yudkowsky R, Park YS, Downing SM. Assessment in Health Professions Education. Routledge; 2019.
5. Cook DA, Zendejas B, Hamstra SJ, Hatala R, Brydges R. What counts as validity evidence? Examples and prevalence in a systematic review of simulation-based assessment. Advances in Health Sciences Education. 2014;19(2):233-250.
6. Rousson V, Gasser T, Seifert B. Assessing intrarater, interrater and test–retest reliability of continuous measurements. Statistics in Medicine. 2002;21(22):3431-3446.
7. Vaz S, Falkmer T, Passmore AE, Parsons R, Andreou P. The case for using the repeatability coefficient when calculating test–retest reliability. PLoS One. 2013;8(9):e73990.
8. Geoffrion R, Lee T, Singer J. Validating a self-confidence scale for surgical trainees. Journal of Obstetrics and Gynaecology Canada. 2013;35(4):355-361.
9. Nunnally JC. Psychometric Theory. 2nd ed. McGraw-Hill; 1978.
10. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. American Educational Research Association; 1999.
11. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159-174.