A multi-institutional study using artificial intelligence to provide reliable and fair feedback to surgeons

D. Kiyasseh, J. Laca, et al.

Reliable performance feedback helps surgeons master surgical skills faster. A groundbreaking AI system assesses surgeon skill from surgical videos and strengthens feedback through Training With eXplanations (TWIX). Conducted by authors including Dani Kiyasseh and Animashree Anandkumar, this research shows how AI can improve surgical education and reduce bias across hospitals.

Introduction
Surgeons rarely receive systematic feedback on their operative performance, despite evidence that such feedback accelerates the acquisition of skills such as suturing. Recently, the authors developed a surgical AI system (SAIS) that assesses surgeon skill from intraoperative video and highlights the video frames most relevant to its assessment. For AI-augmented feedback to be trustworthy, these AI-based explanations must align with expert expectations (reliability) and be equally reliable for all surgeons (fairness). Prior work has shown the promise of explanation methods across domains, but quantitative comparisons of AI-based explanations to human explanations in surgical videos, and assessments of their fairness, are lacking. This study asks whether AI-generated explanations are reliable and fair across surgeon sub-cohorts and proposes a training strategy, TWIX, to improve both alignment with human explanations and fairness. The work evaluates SAIS across multiple hospitals and examines whether incorporating human explanations during training can enhance explanation reliability, mitigate explanation bias, and improve skill assessment performance, including in a trainee simulation environment.
Literature Review
The study situates itself within prior work on attention-based transformer models used to produce explanations in NLP and protein modeling, and in medical AI, where explanation methods highlight salient image regions or clinical variables. Prior evaluations of explanations have often been qualitative or have not compared AI explanations to human ones, and the fairness of explanations is underexplored, especially in video-based surgical contexts. Preliminary fairness-in-explanations studies have appeared in non-surgical domains and have not leveraged human expert rationales. The authors' prior work established SAIS for surgeon skill assessment and examined the fairness of its assessments, but not the quantitative reliability and fairness of video-based explanations relative to human expert annotations. This gap motivates a method that integrates human explanations into training to improve both reliability and fairness.
Methodology
Study design and data: The authors analyze videos of the vesico-urethral anastomosis (VUA) step of robot-assisted radical prostatectomy (RARP) from three hospitals: University of Southern California (USC), St. Antonius Hospital (SAH), and Houston Methodist Hospital (HMH). They focus on two suturing sub-activities: needle handling (grasping the needle) and needle driving (pushing the needle through tissue). From each RARP VUA step (~20 minutes), the start/end times of up to 24 stitches are identified, and within each stitch the start/end of needle handling and needle driving episodes are extracted, yielding 20–30 s video samples per episode.

Data quantities (Table 1):
- Needle handling: USC 78 videos, 912 samples, 19 surgeons; SAH 60 videos, 240 samples, 18 surgeons; HMH 20 videos, 184 samples, 5 surgeons; training environment (LAB) 69 videos, 328 samples, 38 participants.
- Needle driving: USC 78 videos, 530 samples, 19 surgeons; SAH 60 videos, 280 samples, 18 surgeons; HMH 20 videos, 220 samples, 5 surgeons.

Training environment: Medical students without prior robotic experience performed suturing on a gel model with 16 marked entry/exit points, yielding 16 video samples per participant; analysis focuses on needle handling.

Ethics: IRB approval (USC HS-17-00113); informed consent obtained.

Annotations: Skill assessments follow the EASE taxonomy and were produced by trained raters (medical students/residents). Raters trained to >80% inter-rater reliability; final binary labels (low vs. high skill) use the lowest score in case of disagreement. Needle handling skill is determined by the number of needle repositions (fewer implies higher skill); needle driving skill by the smoothness of driving through tissue (smoother implies higher skill).

Explanation annotations: Two trained raters annotated the segments of frames relevant to low-skill assessments per EASE criteria: for needle handling, three or more grasps; for needle driving, four or more needle adjustments during driving or complete backward removal through tissue. Raters trained until intersection-over-union (IoU) >0.80. Only low-skill samples were annotated, to focus on corrective feedback. In disagreements, the intersection of the annotations was used to avoid superfluous frames. For visualization, time indices were normalized (0 to 1) across samples; on average, ~30% of a sample is marked important.

Bias sub-cohorts: For live videos, groups include patient prostate volume (≤49 ml vs >49 ml), Gleason score severity, and surgeon caseload (≤100 vs >100 lifetime robotic cases). For the training environment, sub-cohorts are defined by gender.

AI system (SAIS): SAIS ingests RGB frames and optical flow (motion) and uses a vision transformer (ViT) backbone, pre-trained on ImageNet via self-supervision, to extract spatial features. Temporal relations are modeled with attention-based transformer encoders. A classification-token embedding yields a video-level representation used to predict low vs high skill. Frame-level attention scores from the final transformer layer provide attention-based explanations (higher attention indicates higher importance).

Training and evaluation protocol: SAIS is trained exclusively on USC data using 10-fold Monte Carlo cross-validation with disjoint train/val/test splits by video. Class-balanced splits for both training and evaluation prevent sampling bias. For external evaluation, all 10 models are deployed on SAH and HMH data and metrics are averaged across folds.
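To make the architecture concrete, below is a minimal PyTorch sketch of a SAIS-style temporal head (the class and parameter names are hypothetical; this is not the authors' implementation). It assumes per-frame features from a frozen, self-supervised ViT, and, because nn.TransformerEncoder does not expose its internal attention weights, it reads out frame importance with an explicit attention layer rather than from the encoder's final layer as SAIS does.

```python
import torch
import torch.nn as nn

class TemporalSkillAssessor(nn.Module):
    """Sketch of a SAIS-style head: a temporal transformer over per-frame
    ViT features, with a classification token for the skill prediction and
    attention weights over frames serving as the explanation."""
    def __init__(self, feat_dim=768, n_heads=8, n_layers=2, n_classes=2):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.readout = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, frame_feats):                 # (B, T, D) ViT features
        cls = self.cls_token.expand(frame_feats.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, frame_feats], dim=1))
        # attention of the classification token over frames = explanation
        _, attn = self.readout(x[:, :1], x[:, 1:], x[:, 1:])
        return self.head(x[:, 0]), attn.squeeze(1)  # (B, n_classes), (B, T)
```

A higher attention weight on a frame is interpreted as higher importance, mirroring how SAIS derives its attention-based explanations.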
TWIX module: Training With eXplanations (TWIX) augments SAIS with a per-frame classifier p that maps each frame representation h_t to a frame-importance probability in [0, 1]. Binary human frame-importance labels supervise this module via a binary cross-entropy importance loss, added to the supervised InfoNCE loss from prior work. The importance loss is computed only for low-skill samples (those with explanation labels). TWIX is architecture-agnostic and can be used with or without attention mechanisms. After training, explanations may be derived from attention (attention w/ TWIX) or directly from the TWIX per-frame importance predictions (TWIX).

Metrics: Explanation reliability is measured by the area under the precision-recall curve (AUPRC), comparing binarized AI explanations to human annotations across thresholds. Explanation bias is defined as the disparity in AUPRC between sub-cohorts; mitigation is tracked via improvements in worst-case AUPRC (the disadvantaged group). Skill assessment performance is measured by AUC for binary classification. All results are averaged over the 10 folds, with 95% confidence intervals where reported.

Ablations: Variants include removing optical flow (RGB only) and multi-class skill assessment (low vs intermediate vs high). The impact of TWIX on explanation reliability and bias is assessed across these settings.
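To make the TWIX objective concrete, here is a minimal sketch under the stated assumptions; the names (TWIXHead, importance_loss) are hypothetical, and a plain cross-entropy skill loss stands in for the paper's supervised InfoNCE term:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TWIXHead(nn.Module):
    """Per-frame importance classifier p: h_t -> [0, 1] (hypothetical name)."""
    def __init__(self, feat_dim=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats):  # frame_feats: (B, T, D)
        return torch.sigmoid(self.proj(frame_feats)).squeeze(-1)  # (B, T)

def importance_loss(pred, human_mask, is_low_skill):
    """BCE importance loss, computed only on low-skill samples,
    since only those carry human explanation labels."""
    if is_low_skill.sum() == 0:
        return pred.new_zeros(())
    return F.binary_cross_entropy(pred[is_low_skill],
                                  human_mask[is_low_skill].float())

# Combined objective (sketch): the paper adds this term to a supervised
# InfoNCE loss; a cross-entropy skill loss stands in here.
# total = F.cross_entropy(skill_logits, skill_labels) + \
#         importance_loss(twix_head(frame_feats), human_mask, skill_labels == 0)
```

A sketch of the evaluation metrics follows: explanation reliability pools the per-frame AI importance scores and binary human annotations and computes AUPRC (average precision), while explanation bias is read off as the AUPRC disparity between sub-cohorts, with the worst-case (minimum) AUPRC tracked as the mitigation target:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def explanation_reliability(ai_scores, human_masks):
    """AUPRC of frame-importance scores against binary human annotations,
    pooled over the (low-skill) samples of a cohort."""
    y_true = np.concatenate([m.ravel() for m in human_masks])
    y_score = np.concatenate([s.ravel() for s in ai_scores])
    return average_precision_score(y_true, y_score)

def explanation_bias(scores_by_group, masks_by_group):
    """Bias = AUPRC disparity across sub-cohorts; the worst-case
    (minimum) AUPRC identifies the disadvantaged group."""
    auprc = {g: explanation_reliability(scores_by_group[g], masks_by_group[g])
             for g in scores_by_group}
    return auprc, max(auprc.values()) - min(auprc.values()), min(auprc.values())
```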
Key Findings
- Alignment and bias: SAIS attention-based explanations often align with human explanations but are imperfect and exhibit explanation bias across surgeon sub-cohorts (e.g., groups defined by prostate volume, Gleason score, or caseload). In the training environment, an explanation bias against male trainees was observed.
- TWIX improves explanation reliability (needle handling): attention AUPRC improved with TWIX across hospitals, e.g., at USC from approximately 0.488 to 0.595, at SAH from ~0.629 to 0.687, and at HMH from ~0.551 to 0.617. Direct TWIX frame-importance estimates achieved higher AUPRC than attention-based explanations in several cases, reaching 0.677 at USC, 0.724 at SAH, and 0.697 at HMH.
- Needle driving explanations: TWIX had minimal benefit on explanation reliability for needle driving, likely owing to the higher variability of the task, though classification performance still improved.
- Bias mitigation: TWIX increased worst-case AUPRC (for the disadvantaged sub-cohort) for needle handling across hospitals, indicating effective mitigation of explanation bias. For the prostate-volume groups, worst-case AUPRC improved from 0.50 to 0.60 at USC, 0.62 to 0.75 at SAH, and 0.64 to 0.80 at HMH, with similar mitigation observed across other groupings. For needle driving, mitigation was less evident.
- Skill assessment performance gains (AUC): TWIX improved SAIS classification AUC across hospitals. Needle driving: USC 0.822 → 0.850, SAH 0.800 → 0.837, HMH 0.728 → 0.757. Needle handling: USC 0.849 → 0.859, SAH 0.873 → 0.885, HMH 0.795 → 0.794 (marginal change). These gains demonstrate positive transfer to external hospitals.
- Ablation robustness: TWIX consistently improved explanation reliability and mitigated bias across configurations, including RGB-only input and multi-class assessment. In the multi-class (Multi-Skill) setting, average AUPRC improved from ~0.48 to ~0.67 and worst-case AUPRC from ~0.50 to ~0.68, highlighting versatility.
- Training environment: Findings transferred to the trainee setting: AI explanations aligned with human annotations, TWIX improved reliability and mitigated the bias against male trainees, and SAIS classification performance improved (higher AUC) with TWIX.
Discussion
The study demonstrates that while SAIS can provide frame-level explanations that align with expert judgments, these explanations are not uniformly reliable across surgeon sub-cohorts, risking inequitable feedback. Incorporating human explanations via TWIX improves alignment (reliability) and fairness (mitigating worst-case AUPRC gaps) and enhances skill assessment accuracy, with generalization across hospitals and to a trainee simulation environment. These advances support the feasibility of AI-augmented surgical training and credentialing, where trustworthy, temporally localized feedback can accelerate learning and promote equitable skill development. The results also suggest that explicitly supervising explanation generation helps models focus on clinically relevant frames and avoid spurious cues. Broader implications include applicability to other video-assessment domains where expert rationale can be codified, such as education and workplace training.
Conclusion
This work quantitatively evaluates and improves the reliability and fairness of AI-based explanations for surgical skill assessment. By training SAIS with human explanation supervision (TWIX), the authors achieve higher explanation AUPRC, mitigate explanation bias across surgeon sub-cohorts, and improve skill classification performance across institutions and in a trainee environment. These findings lay groundwork for trustworthy AI feedback in surgical training and credentialing. Future directions include extending to additional surgical activities beyond suturing, evaluating whether AI explanations measurably accelerate trainee learning in prospective trials, exploring outcome-linked explanation reliability (e.g., alignment with intraoperative events or patient outcomes) to reduce dependence on human annotations, and leveraging AI explanations to discover novel operative behaviors that inform surgical best practices.
Limitations
- Scope is limited to suturing activities (needle handling, needle driving); generalization to other technical skills (e.g., dissection) remains to be shown.
- TWIX requires human explanation annotations, which can be time-consuming and subjective; this study mitigated subjectivity via strict EASE criteria and high inter-rater agreement, but scalability remains a concern.
- Reliability was defined as alignment with human explanations, which may overlook other valid definitions (e.g., association with clinically meaningful intraoperative events or outcomes) and may constrain discovery of novel behaviors.
- For needle driving, explanation reliability improvements and bias mitigation were limited, likely due to high task variability; methods to address such variability are needed.
- Whether providing SAIS explanations accelerates trainee learning in practice remains unproven; prospective evaluations are needed.
- While disparities in explanation reliability were measured, the downstream clinical or educational harms of such disparities were not directly assessed.