Medicine and Health
A vision transformer for decoding surgeon activity from surgical videos
D. Kiyasseh, R. Ma, et al.
SAIS is a machine learning system, developed by Dani Kiyasseh, Runzhuo Ma and colleagues, that decodes surgeon activity from robotic surgery videos. By recognizing surgical subphases, gestures and skill levels directly from video, it offers objective insight into operating technique and could change how surgeons receive feedback and improve.
Introduction
Recent evidence shows that postoperative outcomes are strongly influenced by intraoperative surgical activity—the specific actions taken by surgeons and how well they are executed. Yet, for most procedures, intraoperative activity is not measured, hindering the ability to quantify variability across time, surgeons and hospitals, to test links between activity and outcomes, and to provide feedback on operating technique. Manual video review by experts is subjective, unreliable and unscalable. Emerging AI methods can decode elements like steps, gestures and skill from surgical videos but typically handle only a single element at a time and are rarely evaluated for generalization across surgeons, procedures and hospitals. This study proposes SAIS, a unified vision transformer-based AI system that decodes multiple elements of intraoperative activity (subphases, gestures, skills) from surgical videos and rigorously evaluates its generalizability across videos, surgeons, hospitals and procedures.
Literature Review
Prior computational work in surgical AI has used robot kinematics and video to analyze workflow, gestures and skills. Video-based models include 3D convolutional neural networks (I3D), temporal convolutional networks (e.g., MA-TCN), and more recently transformers for phase recognition. Gesture and skill recognition has been studied in controlled datasets (e.g., JIGSAWS) and specific clinical steps (e.g., DVC UCL for dorsal vascular complex suturing), but prior evaluations often risk information leakage (e.g., leave-one-user-out optimized via cross-validation, overlapping surgeons between train and test in DVC UCL). Existing systems typically target a single element (steps, gestures or skills) and provide limited explainability. There is a need for unified, explainable models evaluated under rigorous, real-world generalization settings across surgeons, hospitals and procedures.
Methodology
SAIS is a unified, vision-and-attention-based system operating solely on surgical videos, designed to decode multiple elements of intraoperative activity: subphase recognition, gesture classification and skill assessment.

Data: Surgical videos from three hospitals (USC, SAH, HMH); activities include suturing (VUA in RARP) and dissection (nerve-sparing, NS; and hilar dissection, HD). Annotations follow published taxonomies for suturing subphases (needle handling, needle driving, needle withdrawal), suturing gestures (R1, R2, L1, C1), dissection gestures (cold cut, hook, clip, camera move, peel, retraction), and binary skill labels (low vs high) for needle handling and needle driving. Raters were trained to an inter-rater reliability above 0.8.

Evaluation design: Ten-fold Monte Carlo cross-validation with splits at the video (case) level to test generalization to unseen videos, plus external generalization to unseen hospitals (SAH, HMH) and procedures (HD vs NS). For inference on entire videos, predictions are ensembled over folds and test-time augmentations, with entropy-based abstention and temporal aggregation (see the inference sketch following this section).

Model architecture: Two parallel modality streams ingest RGB frames and optical flow (RAFT-derived). Spatial features are extracted per frame with a frozen self-supervised ViT (DINO), yielding D=384-dimensional representations; ViT attention emphasizes instrument tips, needles and anatomical edges. Temporal modeling uses shared-parameter transformer encoders (4 layers) with temporal positional embeddings and a learnable classification token; the modality-specific video embeddings are summed and passed through projection heads to an E=256 video representation (see the architecture sketch following this section).

Learning objective: Supervised contrastive learning with category-specific learnable prototypes; an InfoNCE loss attracts each video embedding to its correct prototype and repels it from the others. At inference, classification uses cosine similarity to the prototypes followed by a softmax (see the loss sketch following this section). Test-time augmentation offsets the starting frames; an ensemble averages probabilities across folds and augmentations; uncertainty is quantified via entropy; predictions are aggregated over time to reflect longer activities.

Frame selection: For gestures, 10 equally spaced frames are sampled within each labeled segment (typically 1–5 s). For subphases and skills, every 10th frame is sampled over 10–30 s segments; optical flow uses frame pairs 0.5 s apart, aligned with the RGB sampling (see the sampling sketch following this section).

Training details: Features are pre-extracted offline; SGD optimizer, learning rate 1e-4, batch size 8; masking with zero-padding handles variable-length sequences. Ablations remove test-time augmentation, the RGB or flow modality, or self-attention (averaging frames) to quantify the contribution of each component.

Baselines and reproducibility: I3D (RGB+flow) fine-tuned from Kinetics with 16-frame inputs and optimized heads; the external datasets JIGSAWS and DVC UCL are used for comparative gesture recognition with appropriate cross-validation protocols. Ethics approvals were obtained; USC de-identified data are available on request; code is released at the provided repository.
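The frame selection and frozen-ViT feature extraction described above can be sketched compactly. The following is a minimal Python/PyTorch illustration, assuming the publicly released DINO ViT-S/16 backbone (384-dimensional embeddings) as the frozen feature extractor; the preprocessing choices and helper names are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: frame sampling (10 equally spaced frames per gesture segment;
# every 10th frame for subphase/skill segments) and frozen DINO ViT features.
import numpy as np
import torch
import torchvision.transforms as T

def sample_gesture_frames(start_idx: int, end_idx: int, n_frames: int = 10) -> np.ndarray:
    """10 equally spaced frame indices within a labelled gesture segment."""
    return np.linspace(start_idx, end_idx, num=n_frames).round().astype(int)

def sample_subphase_frames(start_idx: int, end_idx: int, stride: int = 10) -> np.ndarray:
    """Every 10th frame over a longer (10-30 s) subphase/skill segment."""
    return np.arange(start_idx, end_idx, stride)

# Frozen self-supervised ViT (DINO ViT-S/16 -> 384-dim per-frame embedding).
vit = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
vit.eval()
for p in vit.parameters():
    p.requires_grad = False

# Assumed ImageNet-style preprocessing (not confirmed by the paper).
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frames):
    """frames: list of PIL images -> (T, 384) tensor of per-frame features."""
    batch = torch.stack([preprocess(f) for f in frames])
    return vit(batch)
```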
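The temporal aggregation stage can be illustrated as follows: a shared-parameter transformer encoder (4 layers) applied to the pre-extracted RGB and optical-flow feature sequences, a learnable classification token, temporal positional embeddings, and a projection to a 256-dimensional video representation. Head counts, initialization and maximum sequence length are assumptions beyond what the text states; this is a sketch, not the released implementation.

```python
# Sketch: dual-stream temporal transformer with a shared encoder.
import torch
import torch.nn as nn

class SAISLikeEncoder(nn.Module):
    def __init__(self, d_feat: int = 384, d_embed: int = 256,
                 n_layers: int = 4, n_heads: int = 6, max_len: int = 512):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_feat))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len + 1, d_feat))
        layer = nn.TransformerEncoderLayer(d_model=d_feat, nhead=n_heads,
                                           batch_first=True)
        # One encoder whose parameters are shared across the two modality streams.
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.project = nn.Sequential(nn.Linear(d_feat, d_embed), nn.ReLU(),
                                     nn.Linear(d_embed, d_embed))

    def _encode(self, feats, pad_mask):
        # feats: (B, T, d_feat); pad_mask: (B, T), True at zero-padded steps.
        B, T, _ = feats.shape
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, feats], dim=1) + self.pos_embed[:, :T + 1]
        mask = torch.cat([torch.zeros(B, 1, dtype=torch.bool, device=feats.device),
                          pad_mask], dim=1)
        return self.encoder(x, src_key_padding_mask=mask)[:, 0]  # CLS output

    def forward(self, rgb_feats, flow_feats, pad_mask):
        # Modality-specific video embeddings are summed, then projected to E=256.
        video = self._encode(rgb_feats, pad_mask) + self._encode(flow_feats, pad_mask)
        return self.project(video)
```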
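The prototype-based supervised contrastive objective amounts to an InfoNCE loss in which the positive for each video embedding is its class prototype. A minimal sketch follows; the temperature value and initialization are assumptions.

```python
# Sketch: learnable class prototypes with an InfoNCE-style objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeInfoNCE(nn.Module):
    def __init__(self, n_classes: int, d_embed: int = 256, temperature: float = 0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_classes, d_embed))
        self.temperature = temperature

    def logits(self, video_embed):
        # Cosine similarity between L2-normalized embeddings and prototypes.
        z = F.normalize(video_embed, dim=-1)       # (B, E)
        p = F.normalize(self.prototypes, dim=-1)   # (C, E)
        return z @ p.t() / self.temperature        # (B, C)

    def forward(self, video_embed, labels):
        # InfoNCE with the correct prototype as the single positive reduces to
        # cross-entropy over prototype similarities.
        return F.cross_entropy(self.logits(video_embed), labels)

# At inference, class probabilities are the softmax over prototype similarities:
# probs = loss_module.logits(video_embed).softmax(dim=-1)
```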
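Finally, the inference scheme (averaging probabilities over fold models and test-time augmentations, then abstaining on high-entropy predictions) can be written in a few lines. The entropy threshold below is an assumed placeholder, not a value reported in the paper.

```python
# Sketch: fold/TTA ensembling with entropy-based abstention.
import numpy as np

def ensemble_predict(prob_list, entropy_threshold: float = 0.5):
    """prob_list: iterable of (C,) probability vectors, one per fold x TTA."""
    probs = np.mean(np.stack(prob_list, axis=0), axis=0)       # average ensemble
    entropy = -np.sum(probs * np.log(probs + 1e-12))           # predictive entropy
    if entropy > entropy_threshold:
        return None, probs, entropy                            # abstain if uncertain
    return int(np.argmax(probs)), probs, entropy
```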
Key Findings
Subphase recognition (suturing VUA, trained on USC): Generalization to unseen USC videos achieved AUCs of 0.951 (needle withdrawal), 0.945 (needle handling) and 0.925 (needle driving). Cross-hospital generalization yielded SAH AUCs of 0.898 (needle driving), 0.870 (needle withdrawal) and 0.857 (needle handling), and HMH AUCs of 0.964 (needle withdrawal), 0.957 (needle driving) and 0.937 (needle handling). Benchmarking on entire VUA videos: SAIS outperformed I3D with F1@10 scores of 50 vs 40 (a sketch of this segmental metric follows this section). Ablations: removing self-attention reduced PPV by about 20 points; removing RGB or flow reduced PPV by about 3 points each, indicating complementary modalities and a critical role for attention.

Gesture classification: Within USC videos, suturing gesture AUCs were 0.837 (R1) and 0.763 (C1); dissection (NS) gesture AUCs were 0.974 (clip, k) and 0.909 (camera move, m), with lower performance for retraction (r) at 0.701, likely due to overlap with other gestures. Cross-hospital (SAH, NS): AUCs of 0.899 (camera move) and 0.831 (clip); the performance drop is attributed to distribution shift (e.g., cold cut fell from 0.823 at USC to 0.702 at SAH). Cross-procedure: hook (h) degraded from an AUC of 0.768 (NS) to 0.615 (HD), reflecting anatomical and contextual differences. External datasets: on JIGSAWS, SAIS achieved 87.5% accuracy vs 90.1% for the best video-based method; on DVC UCL, SAIS improved accuracy roughly 4-fold over a naive majority baseline (vs roughly 3-fold for MA-TCN).

Unannotated full-video gesture decoding (NS, USC): Manual spot-checking of 800 predictions showed good precision and robustness to the left/right neurovascular bundle; hook (h) precision was 0.75 in both locations. The model surfaced an additional untrained "hot cut" category within cold-cut predictions and detected a 60 s camera-move outlier (vs a 1 s average) corresponding to camera removal and inspection.

Skill assessment (needle handling and driving, trained on USC): USC AUCs were 0.849 (needle handling) and 0.821 (needle driving); cross-hospital AUCs were 0.880 (handling) and 0.797 (driving) at SAH, and 0.804 (handling) and 0.719 (driving) at HMH. SAIS consistently outperformed I3D (USC: handling 0.849 vs 0.681, driving 0.821 vs 0.630; SAH: handling 0.880 vs 0.730, driving 0.797 vs 0.656; HMH: handling 0.804 vs 0.680, driving 0.719 vs 0.571), with lower variance across folds (e.g., USC driving SD 0.05 vs 0.12 for I3D).

Explainability and deployment: Attention maps focused on frames depicting needle repositioning (handling) and adjustments/withdrawal (driving), aligning with expert skill criteria. SAIS generated per-case skill profiles and low-skill ratios across cases (SAH), enabling identification of exemplary cases and of targets for training.

Association with patient outcomes: At USC, logistic regression controlling for surgeon caseload and patient age showed higher odds of 3-month urinary continence recovery when SAIS assessed needle driving as high skill: OR 1.31 (95% CI 1.08–1.58, P=0.005) at the sample level and OR 1.89 (95% CI 0.95–3.76, P=0.071) when aggregated per case (a sketch of this analysis follows this section).
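For the whole-video benchmarking above, SAIS and I3D are compared with a segmental F1@10 score. As a reference point, here is a common formulation of segmental F1@k (overlap threshold of k percent) in Python; the paper's exact matching rules may differ in detail, so treat this as an illustrative sketch rather than the evaluation code used in the study.

```python
# Sketch: segmental F1@k from frame-wise label sequences.
def segments(labels):
    """Collapse a frame-wise label sequence into (label, start, end) segments."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def f1_at_k(pred, gt, k: float = 10.0):
    p_segs, g_segs = segments(pred), segments(gt)
    matched = [False] * len(g_segs)
    tp = fp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (gl, gs, ge) in enumerate(g_segs):
            if gl != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        # A predicted segment is a true positive if it sufficiently overlaps an
        # as-yet-unmatched ground-truth segment of the same class.
        if best_iou >= k / 100.0 and best_j >= 0 and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 100 * 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```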
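The reported outcome association follows a standard adjusted logistic regression. Below is a hedged Python sketch using statsmodels that reproduces the form of the analysis (odds ratio, 95% CI and P value for the skill term); the column names and data layout are assumptions for illustration, and the study's actual covariate coding may differ.

```python
# Sketch: adjusted logistic regression for skill vs. 3-month continence recovery.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def skill_outcome_odds_ratio(df: pd.DataFrame):
    # Assumed columns: continence_3mo (0/1), high_skill (0/1),
    # surgeon_caseload, patient_age.
    X = sm.add_constant(df[["high_skill", "surgeon_caseload", "patient_age"]])
    model = sm.Logit(df["continence_3mo"], X).fit(disp=0)
    odds_ratio = np.exp(model.params["high_skill"])
    ci_low, ci_high = np.exp(model.conf_int().loc["high_skill"])
    p_value = model.pvalues["high_skill"]
    return odds_ratio, (ci_low, ci_high), p_value
```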
Discussion
SAIS addresses the need for objective, reliable and scalable decoding of intraoperative surgical activity by unifying subphase recognition, gesture classification and skill assessment in a single architecture. The system generalizes across unseen videos, surgeons, hospitals and procedures, outperforming state-of-the-art 3D CNN baselines and offering frame-level reasoning via attention. Dual-modality inputs (RGB and optical flow) and temporal self-attention are key to performance, as shown by ablations. The ability to infer from unannotated full-length surgical videos enables generation of gesture/skill timelines, detection of outliers and provision of targeted feedback. Preliminary associations between SAIS skill assessments and patient outcomes (urinary continence) support clinical relevance. Compared with prior work focused on single elements or controlled datasets, SAIS demonstrates broader generalization and deeper explainability, positioning it as a dependable foundation for future surgical AI applications, including feedback, education, workflow optimization and outcomes research.
Conclusion
This work introduces SAIS, a unified vision transformer-based system that decodes surgical subphases, gestures and skills from routine robotic surgery videos. SAIS generalizes across institutions and procedures, outperforms strong baselines, provides interpretable reasoning and can operate on unannotated full-length videos to deliver actionable feedback. Code is released to facilitate adoption. Future directions include multi-task training to exploit interdependencies among tasks, curriculum learning (transfer from easier to harder tasks), continual learning to adapt to evolving taxonomies, large-scale multi-institutional validations linking decoded activity to outcomes, deployment studies assessing impact on training and practice, and comprehensive bias and fairness analyses.
Limitations
SAIS, like other supervised systems, can only decode predefined elements from existing taxonomies and cannot discover novel activities without new annotations; it currently lacks continual learning to incorporate new elements over time. Performance may degrade under distribution shifts (e.g., across hospitals/procedures) and depends on video quality and sampling. Associations with patient outcomes are preliminary and need larger, multi-center studies for validation. External datasets may contain data leakage or limited public splits, complicating benchmarking. Some datasets are not publicly shareable due to patient/surgeon privacy, constraining reproducibility. Baselines like I3D exhibited sensitivity to hyperparameters and data splits; while SAIS was more robust, further robustness and uncertainty calibration work is warranted.