Introduction
Surgical skill acquisition is significantly enhanced by timely and accurate feedback, yet such feedback is often lacking in practice. The potential of artificial intelligence (AI) to automate feedback has led to surgical AI systems (SAIS) that assess surgeon skill from video recordings and highlight the video frames most relevant to the assessment. Previous research demonstrated the reliability of SAIS in assessing surgeon skill and the fairness of those assessments, but the reliability and fairness of the AI-generated explanations (the highlighted frames) remained unaddressed; this study investigates that gap directly. Inconsistent or biased feedback from an AI system can hinder surgeon development and create unfair disparities among surgeon groups (e.g., novices vs. experts). The research aims to quantify the reliability of AI-based explanations, compare their consistency across surgeon groups, and introduce a method that improves both reliability and fairness. The study focuses on robot-assisted radical prostatectomy (RARP), specifically the vesico-urethral anastomosis (VUA) step, analyzing the needle handling and needle driving skills.
Literature Review
Existing literature highlights the importance of feedback in accelerating surgical skill acquisition, and several studies demonstrate the positive impact of video-based coaching and feedback on laparoscopic and other surgical techniques. Automating such feedback with AI, however, is a relatively new area. While some studies have explored the reliability of AI-based explanations in related domains, such as medical image analysis and clinical variable prediction, those evaluations lacked systematic, quantitative comparison to human explanations and often ignored the potential for bias. This study addresses the gap in quantitatively comparing AI-based explanations with human explanations in the context of surgical video and proposes a novel strategy to enhance their reliability and fairness.
Methodology
The study uses data from three hospitals across two continents. Surgical videos of the VUA step were segmented into samples corresponding to individual stitches and to the specific activities of needle handling and needle driving. Human raters, trained on the End-to-End Assessment of Suturing Expertise (EASE) taxonomy, provided both binary skill assessments (low/high skill) and explanation annotations identifying the time segments within each sample relevant to their assessment; inter-rater agreement exceeded 80% for both assessments and explanations.

SAIS, a vision transformer model operating on RGB frames and optical flow, generates AI-based explanations in the form of attention scores. Explanation reliability is quantified as the area under the precision-recall curve (AUPRC) between these attention scores and the human annotations, and explanation bias is captured by the worst-case AUPRC, i.e., the reliability for the most disadvantaged surgeon sub-cohort.

To improve explanation reliability and mitigate bias, the study introduces a new strategy called Training with Explanations (TWIX). TWIX incorporates the human explanation annotations into the training process of SAIS, explicitly teaching it to identify important frames by adding a binary classification module to the model. SAIS was evaluated with and without TWIX across the three hospitals, stratifying surgeons into sub-cohorts by prostate gland volume, Gleason score, and surgeon caseload (live surgery data), and by gender (training environment data). An ablation study examines the effect of TWIX under alternative SAIS configurations, such as different input modalities and multi-class skill assessment.
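To make the reliability metric concrete, the sketch below shows one way the AUPRC-based evaluation could be computed, assuming per-frame attention scores from the model and binary human annotations marking relevant frames. The data layout, cohort labels, and aggregation by simple averaging are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: explanation reliability as AUPRC of model attention
# against binary human relevance labels, and explanation bias as the
# worst-case AUPRC across surgeon sub-cohorts.
from collections import defaultdict

import numpy as np
from sklearn.metrics import average_precision_score  # AUPRC estimate


def explanation_auprc(human_labels: np.ndarray, attention: np.ndarray) -> float:
    """AUPRC of attention scores against binary human relevance labels."""
    return float(average_precision_score(human_labels, attention))


def worst_case_auprc(samples: list[dict]) -> tuple[float, dict]:
    """Mean AUPRC per sub-cohort, plus the worst case (most disadvantaged)."""
    by_cohort = defaultdict(list)
    for s in samples:
        by_cohort[s["cohort"]].append(
            explanation_auprc(s["human_labels"], s["attention"])
        )
    per_cohort = {c: float(np.mean(v)) for c, v in by_cohort.items()}
    return min(per_cohort.values()), per_cohort


# Toy usage: two video samples from hypothetical surgeon sub-cohorts.
samples = [
    {"cohort": "low_caseload",
     "human_labels": np.array([0, 1, 1, 0, 0]),
     "attention": np.array([0.2, 0.9, 0.7, 0.1, 0.3])},
    {"cohort": "high_caseload",
     "human_labels": np.array([1, 0, 0, 1, 0]),
     "attention": np.array([0.8, 0.2, 0.4, 0.6, 0.1])},
]
worst, per_cohort = worst_case_auprc(samples)
```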
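The TWIX idea can likewise be sketched as a second, frame-level head trained jointly with the skill classifier. The PyTorch module below is a minimal illustration under that assumption; the architecture, dimensions, and loss weighting are hypothetical and greatly simplified relative to SAIS.

```python
# Minimal PyTorch sketch of the TWIX idea: alongside the video-level skill
# classifier, a per-frame head is trained to predict which frames human
# raters marked as important. All names and dimensions are illustrative.
import torch
import torch.nn as nn


class SkillModelWithTWIX(nn.Module):
    def __init__(self, feat_dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.skill_head = nn.Linear(feat_dim, 1)  # video-level low/high skill
        self.frame_head = nn.Linear(feat_dim, 1)  # TWIX: per-frame importance

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, n_frames, feat_dim) precomputed frame features
        h = self.encoder(frame_feats)
        skill_logit = self.skill_head(h.mean(dim=1)).squeeze(-1)  # (batch,)
        frame_logits = self.frame_head(h).squeeze(-1)  # (batch, n_frames)
        return skill_logit, frame_logits


def twix_loss(skill_logit, frame_logits, skill_label, frame_labels, lam=1.0):
    """Joint objective: skill assessment loss + explanation supervision."""
    bce = nn.functional.binary_cross_entropy_with_logits
    return bce(skill_logit, skill_label) + lam * bce(frame_logits, frame_labels)
```

During training, `frame_labels` would be the binary human explanation annotations; at inference, the frame head's outputs (or the model's attention scores) serve as the explanation.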
Key Findings
The study reveals that while SAIS’ attention-based explanations broadly align with human explanations, a significant explanation bias exists: the reliability of the AI-generated explanations varied across surgeon sub-cohorts. For instance, SAIS was biased against surgeons operating on larger prostate glands and surgeons with lower caseloads. TWIX significantly improved explanation reliability, as measured by higher AUPRC values, across all three hospitals and for both needle handling and needle driving. It also mitigated explanation bias, especially for needle handling, improving reliability for previously disadvantaged surgeon groups, and it often improved SAIS’ overall skill assessment performance, as indicated by higher area under the curve (AUC) values. These gains were consistent across experimental settings, including altered SAIS configurations and different data modalities. Applied to a training environment with medical students, the findings transferred: the AI system showed a bias against male surgical trainees, which TWIX effectively mitigated. Overall, the results demonstrate that AI-generated explanations can be significantly improved by using human explanations as training data.
Discussion
This study addresses critical limitations in the current literature. First, it quantifies the reliability of AI-based explanations for surgical skill assessment from video and compares them systematically to human explanations. Second, it documents the existence of explanation bias and its implications. Third, it offers a novel, effective strategy (TWIX) that addresses both reliability and fairness concerns. The findings are particularly relevant given the increasing integration of AI into surgical training and credentialing, where unreliable or biased feedback can hinder surgeon development and entrench unfair disparities. The methodology and findings contribute directly to the development of trustworthy AI systems for surgical training, ensuring fairness and accelerating skill acquisition for all surgeons. The TWIX framework also extends beyond the surgical domain, promising improvements to video-based assessment and feedback in education and workplace training.
Conclusion
This multi-institutional study demonstrates that AI-based explanations in surgical skill assessment, while often aligning with human explanations, can exhibit significant bias. The proposed TWIX method effectively improves the reliability and fairness of these explanations, leading to improved AI performance across diverse settings. Future research will investigate whether these improved explanations actually accelerate skill acquisition in surgical trainees. Addressing the limitations identified by the study is critical to ensure the ethical and effective use of AI in surgical training.
Limitations
The study focused on a single surgical task (suturing) within a specific procedure (RARP), so the generalizability of the findings to other surgical tasks and procedures needs further investigation. The TWIX method relies on human-generated explanations, which are time-consuming to collect and potentially subjective. Furthermore, while the study showed improvements in fairness, the long-term impact of these improvements on reducing disparities in surgical skill remains to be evaluated.