Machines for Surgical Education? A Review of Applications of Deep Learning for Training and Assessment

Introduction

Artificial intelligence (AI) is the study of mathematical algorithms that strive to replicate cognitive functions, such as reasoning, problem-solving, decision making, and object and speech recognition.^1,2 Machine learning is a branch of AI that involves training an algorithm to learn and make future predictions by recognizing patterns without explicit programming. Once trained, these algorithms can predict an outcome (for example, identifying an object) based on their “experience” when presented with new data to which they have not previously been exposed.

Artificial neural networks and deep neural networks are the basis for some of the latest advancements in machine learning. They are inspired by biological nervous systems in their ability to process large volumes of data and parametrize them on many levels (Figure 1). Neural networks process signals in layers of simple computational units (neurons). The connections between these units are weighted so that, after many rounds of training, the network forms an internal representation (a “digital footprint”) that can derive an output (for example, whether or not an object is present in a video) from a given input (for example, pixels from the video feed). Deep learning models (neural networks that contain many layers of hidden neurons between the input and output) are extremely “data-hungry.” Their ability to make accurate and meaningful inferences from data they have never seen before comes from training on a large dataset with many variations and instances. Through repeated iterations of training (for example, using forward propagation and backpropagation), the network progressively fine-tunes the weights of the connections between each layer of neurons to minimize the discrepancy between predicted and actual results.

Figure 1: Schematic representation of a neural network. Each circle represents a “neuron”, a computational unit that performs arithmetic operations on the multiple inputs it receives from other neurons. Through many rounds of training on a training dataset, the connections and weights between the neurons are fine-tuned to optimize the accuracy of the predicted output for a novel set of input data.
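The training loop described above can be illustrated with a minimal sketch. This is not a deep network: it is a single sigmoid neuron trained on a toy dataset (the logical OR function), chosen only because it makes the forward pass, the error signal, and the weight update easy to follow. The function names, learning rate, and number of epochs are illustrative assumptions; deep networks apply the same chain-rule update layer by layer.

```python
import math


def sigmoid(z):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, bias, x):
    """Forward propagation: weighted sum of the inputs passed through a sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def mean_loss(weights, bias, inputs, targets):
    """Mean cross-entropy: the discrepancy between predicted and actual results."""
    total = 0.0
    for x, y in zip(inputs, targets):
        p = forward(weights, bias, x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(inputs)


def train(inputs, targets, epochs=2000, learning_rate=1.0):
    """Full-batch gradient descent on one neuron (hyperparameters are assumptions)."""
    weights = [0.0] * len(inputs[0])
    bias = 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(inputs, targets):
            # Backward pass: for sigmoid + cross-entropy, dLoss/dz = prediction - target.
            error = forward(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi / n
            grad_b += error / n
        # Update step: move each weight against its gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


# Toy dataset (hypothetical): the logical OR of two binary inputs.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
TARGETS = [0, 1, 1, 1]

initial_loss = mean_loss([0.0, 0.0], 0.0, INPUTS, TARGETS)
weights, bias = train(INPUTS, TARGETS)
final_loss = mean_loss(weights, bias, INPUTS, TARGETS)
predictions = [round(forward(weights, bias, x)) for x in INPUTS]

print(f"loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
print("predictions:", predictions)
```

The loss falls as the weights are adjusted, and the rounded predictions come to match the targets. A real surgical application would replace the four toy examples with a large, varied dataset and the single neuron with many layers.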

Using AI for Training and Assessment

Perhaps one of the most exciting capabilities of deep learning is its ability to identify objects and scenes, which in the operating room could support clinically meaningful inferences. Deep computer vision uses deep neural networks with visual data as the input. Pixels from videos and images are fed into a network and processed through many hidden layers to detect specific and relevant features for object detection and image classification. Common architectures for finding patterns in spatial data include convolutional neural networks and residual neural networks; recurrent neural networks are often used alongside them to capture temporal patterns across video frames. To date, deep learning has been used for various diagnostic and prognostic applications in medicine, including the detection of suspicious breast lesions on mammography, diabetic retinopathy, and suspicious skin lesions.
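To make the idea of a layer detecting features from pixels more concrete, the sketch below applies a single hand-written 3×3 filter to a tiny grayscale image. It is an illustration only: in a convolutional neural network the filter values are learned during training rather than specified by hand, and the function name, image, and kernel here are assumptions made for the example. The operation shown is the sliding-window cross-correlation that convolutional layers compute.

```python
def cross_correlate_2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            # Weighted sum of the pixels under the kernel at this position.
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 3x6 image: dark on the left (0), bright on the right (1).
IMAGE = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# A Sobel-style filter that responds to dark-to-bright vertical edges.
VERTICAL_EDGE_KERNEL = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

feature_map = cross_correlate_2d(IMAGE, VERTICAL_EDGE_KERNEL)
print(feature_map)  # [[0, 4, 4, 0]]
```

The response is zero over the uniform regions and large where the window straddles the boundary, which is the sense in which a filter "detects" a feature. Stacking many learned filters across many layers allows progressively more abstract features to be detected.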

Deep computer vision could have far-reaching applications in surgical education, both for training and for assessment. Theories of professional expertise suggest that skill in complex tasks is acquired through focused and deliberate practice of specific competencies.⁶ This requires repeated practice focusing on one’s weaknesses while receiving immediate feedback from a coach, with ample opportunities for repetition until performance reaches the target value. While this process works well in athletics, chess, classical music, and other fields, in surgery it is often difficult to recreate many realistic operative scenarios back-to-back in which a trainee can practice in an experiential manner and assess the outcome of their performance. More importantly, deliberate practice requires not only specific, measurable, and reproducible metrics but also a coach who is available to provide the necessary feedback in real time during training. AI can potentially fill this void by using deep learning and computer vision for automated coaching. For example, virtual reality surgical simulators generate large amounts of kinesthetic and haptic data to which deep learning can be applied to identify patterns that distinguish levels of expertise (novice versus expert), yield information about operative performance, and provide automated feedback.^7–9 A recent study by Malpani et al. showed the value of an automated coaching platform integrated within the virtual reality simulator for the da Vinci Surgical System that is capable of providing either continuous or on-demand guidance as a graphical overlay during practice of technical skills for robotic surgery.¹⁰

Deep learning can also be used to provide on-demand automated coaching for intraoperative decision making. Recently, our group developed a model that can autonomously identify safe zones of dissection (GO zone), dangerous zones of dissection (NO-GO zone), and other anatomical structures during a laparoscopic cholecystectomy based on expert surgeons’ annotations (Figure 2). Using the e-learning platform Think Like a Surgeon, trainees are asked to make annotations on surgical videos to answer specific questions (for example, where would you dissect next; Figure 3). The trainees’ annotated pixels are subsequently compared to the AI’s segmentation of the surgical field, and trainees receive an objective accuracy score along with many opportunities for repetition and deliberate practice without the presence of an actual coach. Ongoing studies are looking to assess improvement in performance on this platform.

Figure 2: Examples of a convolutional neural network that is capable of providing real-time predictions during a laparoscopic cholecystectomy for the GO zone (safe area to dissect) and NO-GO zone (dangerous area to dissect, with a high likelihood of major bile duct injury). Figures courtesy of Dr. Amin Madani.

Figure 3: E-learning platform, Think Like a Surgeon, where trainees are asked to make annotations on surgical videos to answer specific questions (for example, “Where would you dissect next?”; “Where do you expect to find a particular anatomical plane/structure?”). Green annotation denotes the trainee’s mental model of the course of the left recurrent laryngeal nerve during a thyroidectomy. Expert responses are shown as a heat map for immediate real-time feedback. Figure courtesy of Dr. Amin Madani.
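As an illustration of how a trainee’s annotated pixels might be compared with a model’s segmentation, the sketch below computes two common overlap measures for binary masks: intersection over union (IoU) and the Dice coefficient. The text does not specify which accuracy score the platform uses, so the choice of metrics, the function name, and the toy 4×4 masks are all assumptions made for the example.

```python
def overlap_scores(trainee_mask, reference_mask):
    """Return (iou, dice) for two equally sized binary masks given as nested lists."""
    if len(trainee_mask) != len(reference_mask):
        raise ValueError("masks must have the same number of rows")
    intersection = union = trainee_total = reference_total = 0
    for t_row, r_row in zip(trainee_mask, reference_mask):
        if len(t_row) != len(r_row):
            raise ValueError("masks must have the same number of columns")
        for t, r in zip(t_row, r_row):
            t, r = bool(t), bool(r)
            intersection += t and r   # pixel marked in both masks
            union += t or r           # pixel marked in at least one mask
            trainee_total += t
            reference_total += r
    # Two empty masks agree perfectly by convention.
    iou = intersection / union if union else 1.0
    denominator = trainee_total + reference_total
    dice = 2 * intersection / denominator if denominator else 1.0
    return iou, dice


# Hypothetical model output: pixels predicted to lie in the safe zone of dissection.
MODEL_GO_ZONE = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Hypothetical trainee annotation, shifted one column to the right.
TRAINEE_ANNOTATION = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]

iou, dice = overlap_scores(TRAINEE_ANNOTATION, MODEL_GO_ZONE)
print(f"IoU = {iou:.2f}, Dice = {dice:.2f}")  # IoU = 0.33, Dice = 0.50
```

Both scores range from 0 (no overlap) to 1 (identical masks), so either could serve as a repeatable, coach-free feedback signal. A platform used for assessment would still need validity evidence for whichever metric it adopts.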

Similarly, given the widespread availability of advanced digital cameras, image-guided surgery, and high-definition video feeds, video-based assessment has gained momentum as a method for assessing surgical performance.^11–13 Recent data suggest that computer vision can be used to identify specific phases of an operation or to assess specific events during a surgical video (for example, achievement of a Critical View of Safety) with moderate accuracy.^14,15 Ongoing multicenter studies are validating the performance of AI systems for assessing surgical performance, and it is reasonable to expect that such systems will become a fundamental component of credentialing and licensing in the future.

Despite the considerable potential of this technology to benefit surgical education, few, if any, studies have specifically evaluated the use of AI for video-based review of technical performance. The lack of literature on automated assessment is not surprising. Modern AI techniques, such as deep learning, do have the potential to identify features in video that could be predictive of performance; however, such results depend on having a clear definition of the target performance to be achieved. Some work has been published on potential performance targets that could be automated. Hung et al. published work suggesting that automated measures, such as kinematic data from robotic systems, can correlate with outcomes¹⁶ and can differentiate expert and “super expert” surgeons performing robotic prostatectomy.¹⁷ Combined with work from Birkmeyer et al. and Curtis et al. suggesting a correlation between surgical skill and patient outcomes, it is reasonable to hypothesize that analysis of objective measures of performance could be automated.^11,13 However, as discussed below, valid metrics will need to be identified through large-scale research collaborations that can generate representative data to allow for generalization of results.

Challenges for Using AI for Surgical Education

Despite the promise of AI for digitizing and improving surgical training, there are many pitfalls. Any automated surgical coach needs to undergo robust studies to demonstrate validity evidence, and the higher the stakes of the assessment (for example, licensing and credentialing), the more these data need to be valid, reliable, and legally defensible. Moreover, if a neural network is designed to reproduce a specific human behavior or aspect of an operation, there needs to be evidence that this specific item is representative of the underlying construct of surgical expertise. As Russell and Norvig famously stated:

“If a conclusion is made that an AI program is one that thinks like a human, it becomes imperative to understand exactly how humans think. Human behaviour can therefore be used as a ‘map’ to guide the performance of an algorithm.”¹⁸

Qualitative and mixed-methods studies are often useful for establishing which competencies are the highest yield when designing an AI model.^19–21

There are also many nuances in surgery that make computer vision more challenging than in other fields. For instance, detecting anatomical structures and planes of dissection (semantic segmentation) is not simple because most surgical anatomy is covered by fatty and fibrous tissue, and detecting hidden structures without clear boundaries is difficult. Furthermore, training a network on a dataset that has been labeled or annotated by experts requires a gold standard, yet experts will often reach differing assessments and conclusions. While it is logistically easier to achieve expert consensus for classification tasks (for example, whether or not there is a suspicious cancerous lesion), obtaining expert consensus on the boundaries of objects within the surgical field is more challenging. To overcome this, various tools, such as the Visual Concordance Test, have been used to achieve consensus from a panel of expert annotations on the surgical field, and this convergence of annotations serves as the gold standard that is subsequently used to train a network.^22,23
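One simple way to picture the convergence of a panel’s annotations is to count, pixel by pixel, what fraction of experts marked that pixel, and then keep the pixels that enough experts agree on. The sketch below does this for three toy masks. It is not a description of how the Visual Concordance Test itself is implemented: the function names, the agreement threshold, and the masks are assumptions made for illustration.

```python
def agreement_heat_map(expert_masks):
    """Per-pixel fraction of experts who marked the pixel (0.0 to 1.0)."""
    n_experts = len(expert_masks)
    n_rows = len(expert_masks[0])
    n_cols = len(expert_masks[0][0])
    return [
        [sum(mask[r][c] for mask in expert_masks) / n_experts for c in range(n_cols)]
        for r in range(n_rows)
    ]


def consensus_mask(heat_map, min_agreement=0.5):
    """Keep pixels marked by at least `min_agreement` of the panel (threshold is an assumption)."""
    return [[1 if value >= min_agreement else 0 for value in row] for row in heat_map]


# Hypothetical annotations of the same 2x3 frame by three experts.
EXPERT_A = [[1, 1, 0], [0, 1, 0]]
EXPERT_B = [[1, 0, 0], [0, 1, 1]]
EXPERT_C = [[1, 1, 0], [0, 0, 0]]

heat_map = agreement_heat_map([EXPERT_A, EXPERT_B, EXPERT_C])
majority = consensus_mask(heat_map)                       # at least half of the panel
unanimous = consensus_mask(heat_map, min_agreement=1.0)   # every expert agrees

print(majority)   # [[1, 1, 0], [0, 1, 0]]
print(unanimous)  # [[1, 0, 0], [0, 0, 0]]
```

The threshold controls the trade-off: a stricter threshold yields a smaller, higher-confidence training target, while a looser one preserves more of the region where experts partly disagree. The heat map itself can also be kept as a soft label rather than thresholded.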

Another challenge is obtaining adequate data to train a network to a high level of performance. The limited availability of surgical videos can make it difficult to train a network with enough instances and examples for it to make accurate predictions in new cases. Finally, as in other areas where these applications have shown potential, a common criticism is the difficulty of understanding how an algorithm arrives at a decision, which can be problematic in education, where a successful relationship between a teacher and a learner is based on trust, transparency, and feedback.²⁴ However, growing interest in explainable AI (i.e., AI for which an algorithm can also present relevant examples justifying its decision) is leading to improved methods of “opening the black box” of deep learning.²⁵

Ethical and Legal Considerations

As physicians, we do not have an outstanding record of properly introducing new technology or innovation into our field. We often take things at face value and draw conclusions from findings without rigorous assessment. Furthermore, we traditionally underestimate both our own biases and the need to protect our intellectual property. As AI and deep learning permeate various facets of surgical care, it behooves us to reverse this trend.²⁶

Surgeons need to understand that there is tremendous value and interest in raw, unedited operative videos. As we make decisions in this field, we need to recall two things. First, such video needs to be recorded, shared, and used in an appropriate manner for research, education, and quality improvement.²⁶ Second, there is a tremendous opportunity for all members of the surgical community to learn from such multimedia, both on a personal and on a communal basis. As these data become increasingly available and are applied for many purposes to improve surgical care, it is also critical to be cognizant that industry demand is high and that access needs to be regulated with a high degree of scrutiny. Procedures need to be followed to maintain data security and integrity, which often complicates the management of the data itself. Surgeons need to be actively involved in the production, use, and maintenance of algorithms that are derived from these large data repositories.

In addition, actions taken on the basis of machine learning and AI algorithms need to be considered wisely. When done correctly, there is tremendous potential for formative feedback, technical assessment, reducing workload, and learning from big data. Nevertheless, if used for high-stakes decision making, such as credentialing or patient care, AI needs to be introduced in a programmatic and evidence-based fashion, similar to the introduction of other innovative technology.²⁷ For example, although deep computer vision can detect radiological findings, such as suspicious breast lesions, with a high degree of accuracy, this does not imply that such algorithms can simply replace radiologists.²⁸ Variations in technique, patient factors, healthcare systems, disease states, and severity of illness dictate the need to train algorithms on large datasets, eliminate potential algorithmic biases, and validate them using best practices.

Disclosures

Drs. Amin Madani, Adnan Alseidi, and Maria Altieri have no proprietary or commercial interest in any product mentioned or concept discussed in this article. Dr. Daniel Hashimoto is a consultant for Johnson & Johnson Institute, Verily Life Sciences, Worrell, and Mosaic Research Management. He has received grant funding from Olympus. He has a pending patent on deep learning technology related to automated analysis of operative video.

References

Bellman RE. An Introduction to Artificial Intelligence: Can Computers Think? San Francisco: Boyd & Fraser; 1978.
Hashimoto DA, Rosman G, Rus D, Meireles OR. Artificial Intelligence in Surgery: Promises and Perils. Ann Surg. 2018 Jul;268(1):70-76. doi: 10.1097.
Gulshan V, Peng L, Coram M, et al. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216
Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. doi:10.1038/nature21056
McKinney SM, Sieniek M, Godbole V, et al. International evaluation of an AI system for breast cancer screening. Nature. 2020;577(7788):89-94. doi:10.1038
Ericsson KA, Hoffman RR, Kozbelt A, Williams AM. The Cambridge Handbook of Expertise and Expert Performance. Cambridge, United Kingdom: Cambridge University Press; 2018.
Mirchi N, Bissonnette V, Ledwos N, et al. Artificial Neural Networks to Assess Virtual Reality Anterior Cervical Discectomy Performance. Oper Neurosurg (Hagerstown). 2020;19(1):65-75. doi:10.1093/ons/opz359
Winkler-Schwartz A, Bissonnette V, Mirchi N, et al. Artificial Intelligence in Medical Education: Best Practices Using Machine Learning to Assess Surgical Expertise in Virtual Reality Simulation. J Surg Educ. 2019;76(6):1681-1690. doi:10.1016/j.jsurg.2019.05.015
Winkler-Schwartz A, Yilmaz R, Mirchi N, et al. Machine Learning Identification of Surgical and Operative Factors Associated With Surgical Expertise in Virtual Reality Simulation. JAMA Netw Open. 2019;2(8):e198363. doi:10.1001/jamanetworkopen.2019.8363
Malpani A, Vedula SS, Lin HC, Hager GD, Taylor RH. Effect of real-time virtual reality-based teaching cues on learning needle passing for robot-assisted minimally invasive surgery: a randomized controlled trial. Int J Comput Assist Radiol Surg. 2020;15(7):1187-1194. doi:10.1007/s11548-020-02156-5
Birkmeyer JD, Finks JF, O'Reilly A, et al. Surgical skill and complication rates after bariatric surgery. N Engl J Med. 2013;369(15):1434-1442. doi:10.1056/NEJMsa1300625
Lendvay TS, White L, Kowalewski T. Crowdsourcing to Assess Surgical Skill. JAMA Surg. 2015;150(11):1086-1087. doi:10.1001/jamasurg.2015.2405
Curtis NJ, Foster JD, Miskovic D, et al. Association of Surgical Skill Assessment With Clinical Outcomes in Cancer Surgery. JAMA Surg. 2020;155(7):590-598. doi:10.1001/jamasurg.2020.1004
Hashimoto DA, Rosman G, Witkowski ER, et al. Computer Vision Analysis of Intraoperative Video: Automated Recognition of Operative Steps in Laparoscopic Sleeve Gastrectomy. Ann Surg. 2019;270(3):414-421. doi:10.1097/SLA.0000000000003460
Korndorffer JR, Hawn MT, Spain DA, et al. Situating Artificial Intelligence in Surgery: A Focus on Disease Severity. Ann Surg. 2020;272(3):523-528.
Hung AJ, Chen J, Che Z, et al. Utilizing Machine Learning and Automated Performance Metrics to Evaluate Robot-Assisted Radical Prostatectomy Performance and Predict Outcomes. J Endourol. 2018;32(5):438-444. doi:10.1089/end.2018.0035
Hung AJ, Oh PJ, Chen J, et al. Experts vs super-experts: differences in automated performance metrics and clinical outcomes for robot-assisted radical prostatectomy. BJU Int. 2019;123(5):861-868. doi:10.1111/bju.14599
Russell S, Norvig P. Artificial Intelligence: A Modern Approach, Global Edition. Pearson; 2016.
Madani A, Watanabe Y, Vassiliou M, et al. Defining competencies for safe thyroidectomy: An international Delphi consensus. Surgery. 2016;159(1):86-101. doi:10.1016/j.surg.2015.07.039
Madani A, Watanabe Y, Feldman LS, et al. Expert Intraoperative Judgment and Decision-Making: Defining the Cognitive Competencies for Safe Laparoscopic Cholecystectomy. J Am Coll Surg. 2015;221(5):931-940.e8. doi:10.1016/j.jamcollsurg.2015.07.450
Madani A, Grover K, Kuo JH, et al. Defining the competencies for laparoscopic transabdominal adrenalectomy: An investigation of intraoperative behaviors and decisions of experts. Surgery. 2020;167(1):241-249. doi:10.1016/j.surg.2019.03.035
Madani A, Grover K, Watanabe Y. Measuring and Teaching Intraoperative Decision-Making Using the Visual Concordance Test: Deliberate Practice of Advanced Cognitive Skills [published online ahead of print, 2019 Nov 13]. JAMA Surg. 2019. doi:10.1001/jamasurg.2019.4415
Madani A, Keller DS. Assessing and improving intraoperative judgement. Br J Surg. 2019;106(13):1723-1725. doi:10.1002/bjs.11386
Conati C, Porayska-Pomsta K, Mavrikis M. AI in Education needs interpretable machine learning: Lessons from Open Learner Modelling. 2018. arXiv:1807.00154
Gordon L, Grantcharov T, Rudzicz F. Explainable Artificial Intelligence for Safe Intraoperative Decision Support. JAMA Surg. 2019;154(11):1064-1065. doi:10.1001/jamasurg.2019.2821
Bittner JG 4th, Logghe HJ, Kane ED, et al. A Society of Gastrointestinal and Endoscopic Surgeons (SAGES) statement on closed social media (Facebook®) groups for clinical education and consultation: issues of informed consent, patient privacy, and surgeon protection. Surg Endosc. 2019;33(1):1-7. doi:10.1007/s00464-018-6569-2
McCulloch P, Altman DG, Campbell WB, et al. No surgical innovation without evaluation: the IDEAL recommendations. Lancet. 2009;374(9695):1105-1112. doi:10.1016/S0140-6736(09)61116-8
Schaffter T, Buist DSM, Lee CI, et al. Evaluation of Combined Artificial Intelligence and Radiologist Assessment to Interpret Screening Mammograms. JAMA Netw Open. 2020;3(3):e200265. Published 2020 Mar 2. doi:10.1001/jamanetworkopen.2020.0265

About the Authors

Amin Madani, MD, PhD, is an endocrine and acute care surgeon in the department of surgery at University Health Network in Toronto, Ontario, Canada.

Daniel A. Hashimoto, MD, MS, is an associate director of research in the department of surgery, surgical AI and innovation laboratory at Massachusetts General Hospital, Boston, MA.

Adnan Alseidi, MD, EdM, FACS, is a professor of surgery in the department of surgery at the University of California, San Francisco.

Maria S. Altieri, MD, MS, is an assistant professor of surgery in bariatric and minimally invasive surgery, department of surgery at East Carolina University Brody School of Medicine.