Impact of data on generalization of AI for surgical intelligence applications

Medicine and Health

O. Bar, D. Neimark, et al.

This study explores how data volume influences AI's ability to generalize in surgical applications. The researchers developed a deep learning system using a diverse dataset of 1,243 laparoscopic cholecystectomy videos, achieving 91.7% accuracy on an independent test set and demonstrating robustness across different medical settings. This work emphasizes the critical role of large datasets for advancing AI in surgery. Conducted by Omri Bar, Daniel Neimark, Maya Zohar, Gregory D. Hager, Ross Girshick, Gerald M. Fried, Tamir Wolf, and Dotan Asselmann.

Introduction
Surgery is widely performed and outcomes depend significantly on surgeon performance, yet training and credentialing still largely follow traditional models. Minimally invasive surgery produces comprehensive video data, and prior work shows that key performance metrics derived from video can predict outcomes. This motivates AI systems that can provide real-time recommendations and support training. However, machine learning requires large, well-labeled datasets, and surgical data are difficult to collect, annotate, and share; most prior studies use far fewer than 100 videos. Given that surgeons typically need over 100 repetitions of a procedure to reach expertise, and that state-of-the-art video models in other domains are trained on hundreds of thousands of videos, prior small-scale results likely underestimate AI's potential in surgery. This study asks two questions: (1) How many surgical videos are needed to train an AI system to recognize the phases of a procedure? (2) How robust is the model to new sources (surgeons and medical centers)? Using 1,243 laparoscopic cholecystectomy videos, the authors benchmark surgical phase recognition, a foundational task for surgical video analysis and decision support.
Literature Review
The paper situates surgical phase recognition within video action recognition literature, noting large-scale datasets like Kinetics-400 and YouTube-8M have enabled advances for single-label video classification, while surgical tasks require dense, per-second labeling and fine-grained distinctions across temporally adjacent phases. Public surgical datasets have been small (e.g., Cholec80 with 80 videos), limiting generalization. Prior phase definitions (e.g., Twinanda et al.) are adapted to better reflect varied clinical workflows (splitting adhesiolysis and dissection, combining final visualization and extraction). The authors highlight challenges: privacy/regulatory constraints limit data access; annotation requires skilled personnel; and generalization across centers/surgeons is insufficiently studied in previous works.
Methodology
Task: Per-second surgical phase recognition for laparoscopic cholecystectomy, with seven phases: 0 Preparation; 1 Adhesiolysis; 2 Dissection; 3 Division; 4 Separation; 5 Packaging; 6 Final inspection. Phase guidelines were created with expert input and validated through annotation practice. Phases can vary in order and may recur.

Data and preprocessing: 1,243 labeled videos from six medical centers and more than 50 surgeons, including the 80 Cholec80 videos, totaling more than 2.3 million labeled seconds. Videos were standardized with FFmpeg (25 FPS, width 480 pixels, aspect ratio preserved, audio removed). Non-relevant beginning and end segments were trimmed using a background detection model.

Annotation: A two-stage process by trained medical students and surgeons with adjudication; full coverage was enforced, so each second is labeled with exactly one phase. Exclusions included retrograde-approach cases and conversions to open surgery.

Experimental splits: 25% of videos were held out as an independent test set; the remainder was split 80/20 into training and validation sets. Final counts: 745 training, 187 validation, and 311 test videos (all splits at the video level).

Metrics: Per-second accuracy and mean phase accuracy (per-class accuracy averaged across phases). Phase transition timing was also evaluated using temporal windows around phase starts.

Model architecture: A two-stage framework.

1) Short-term spatio-temporal model: an I3D architecture obtained by inflating a 2D ResNet-50 to 3D, with non-local blocks to capture long-range dependencies. Initialization: ImageNet pre-trained ResNet-50, further pre-trained on Kinetics-400 for action recognition; the final layer was replaced with a 7-way phase classifier. Input: 2.56 s clips (64 frames at 25 FPS) centered on each second, with each second classified independently. Training: batch size 16 on 4 GPUs; random spatial crops to 224×224; normalization with ImageNet mean/std; temporal augmentation by choosing a random anchor frame within each second. Optimization: SGD with cross-entropy loss, initial learning rate 0.01, momentum 0.9; validation every 250k samples; learning rate reduced ×0.1 after 10, 20, and 25 validations; training stopped after 30 validations (up to 7.5M samples in total). To reduce temporally distant errors, a progress feature (the Second Duration Ratio: current second divided by total video length) was concatenated to the final-layer features before classification. Output: per-second softmax probabilities, an L×7 matrix per video, where L is the video length in seconds.

2) Long-term temporal model: a bidirectional single-layer LSTM (128 hidden units) over the entire sequence of per-second phase probabilities. Before the LSTM, each second's phase is sampled from a distribution softened by a softmax with temperature T = 11, which serves as data augmentation and reduces overfitting. An embedding layer (size 32) maps phase indices to vectors; the LSTM output is passed to a linear classifier, and cross-entropy is computed per second. Optimization: SGD, learning rate 0.1, momentum 0.9; batch size 1 (variable-length sequences); 30 epochs on 1 GPU. For smaller training sets (<250k seconds), validation and evaluation were performed per epoch with the same learning-rate schedule. For the fine-tuning experiments on an unseen medical center, baseline training used 15 epochs with learning-rate drops after 5, 10, and 13 validations.

Generalization experiments: Performance was assessed across medical centers and individual surgeons. To test adaptation to a new center, baseline models were trained with Medical Center 1 (MC1) excluded, then fine-tuned with varying numbers of MC1 videos (5–200) and evaluated on an MC1-only test subset.

Implementation: Python 3.6, PyTorch 1.1.0.
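The two details most specific to this pipeline lend themselves to a compact sketch: the progress feature concatenated before the 7-way classifier, and the temperature-softened sampling that feeds the bidirectional LSTM. The PyTorch fragment below is a minimal illustration under stated assumptions, not the authors' code: the I3D backbone is omitted, the feature size of 2048 is the standard ResNet-50 output, and all class and variable names (ProgressAwareHead, LongTermPhaseModel) are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHASES = 7  # the seven cholecystectomy phases defined above


class ProgressAwareHead(nn.Module):
    """Stage 1 classifier head (illustrative). The I3D ResNet-50 backbone
    with non-local blocks is assumed to yield one feature vector per 2.56 s
    clip; the Second Duration Ratio is concatenated before the final
    7-way classification layer, as described in the summary."""

    def __init__(self, feat_dim=2048, num_phases=NUM_PHASES):
        super().__init__()
        self.fc = nn.Linear(feat_dim + 1, num_phases)

    def forward(self, feats, second_idx, video_len_sec):
        # feats: (B, feat_dim); second_idx, video_len_sec: (B,) tensors.
        ratio = (second_idx.float() / video_len_sec.float()).unsqueeze(-1)
        return self.fc(torch.cat([feats, ratio], dim=-1))  # (B, 7) logits


class LongTermPhaseModel(nn.Module):
    """Stage 2 (illustrative): temperature-softened sampling of per-second
    phases, an embedding, a bidirectional single-layer LSTM, and a
    per-second linear classifier."""

    def __init__(self, num_phases=NUM_PHASES, embed_dim=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_phases, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=1,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_phases)

    def forward(self, stage1_logits, temperature=11.0):
        # stage1_logits: (L, 7) per-second outputs of the short-term model.
        # Sampling from the softened distribution acts as data augmentation.
        probs = F.softmax(stage1_logits / temperature, dim=-1)
        phase_ids = torch.multinomial(probs, num_samples=1).squeeze(-1)
        x = self.embed(phase_ids).unsqueeze(0)   # (1, L, embed_dim)
        out, _ = self.lstm(x)                    # (1, L, 2 * hidden)
        return self.classifier(out).squeeze(0)   # (L, 7) per-second logits
```

The summary describes the temperature sampling as a training-time augmentation; at inference, a deterministic argmax over the stage-1 probabilities would be the natural replacement, though the paper's exact inference procedure is not spelled out here.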
Key Findings
- Overall performance: on the validation set, accuracy of 90.4% and mean phase accuracy of 86.1%; on the independent test set, accuracy of 91.7% and mean phase accuracy of 87.5%.
- Error characteristics: most confusion occurs between temporally adjacent phases; errors include small temporal shifts, added or dropped segments, and random label flips.
- Phase transition timing: phase starts are correctly aligned within ±45 s for 92.03% (validation) and 91.81% (test) of transitions; for videos above the median per-video accuracy, alignment exceeds 98% at ±45 s, and for videos above the 10th-percentile accuracy it exceeds 93% (a sketch of these metrics follows this list).
- Data scaling (asymptotic performance):
  • The short-term model reaches >80% accuracy with ~100 training videos and >85% near 745 videos.
  • The long-term model achieves ~80% accuracy with ~50 training videos and >90% at 745 videos.
  • Large gains are realized when scaling from tens to hundreds of videos; beyond ~1,000, projected gains taper (an estimated marginal 1–2% per order-of-magnitude increase).
- Generalization across sources:
  • Medical centers: performance is consistent across centers in both validation and test sets; MC4 is lower, likely because it contributed only 23 training videos.
  • Surgeons: accuracy is generally consistent across surgeons and comparable to overall performance; subsets with few videos show higher variance.
- Adaptation to a new center (fine-tuning):
  • Training with MC1 excluded yields test accuracy of 83.15% (short-term) and 89.6% (long-term) on the non-MC1 test set, but only 73.9% (short-term) and 79.2% (long-term) on the MC1-only test set.
  • Fine-tuning with 50 MC1 videos brings the short-term model above 80% and the long-term model to ~88% accuracy; with 200 MC1 videos, the long-term model reaches ~90% and the short-term model recovers its original performance. Hundreds of local videos can thus restore high performance at a new center.
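As a concrete reading of the metrics reported above, the fragment below sketches per-second accuracy, mean phase accuracy, and phase-start alignment within a ±45 s window for one video. This summary does not spell out the paper's exact matching rule for transition timing, so the alignment function is one plausible interpretation, and all function names are illustrative.

```python
import numpy as np


def per_second_accuracy(pred, gt):
    """Fraction of seconds where the predicted phase equals the label."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())


def mean_phase_accuracy(pred, gt, num_phases=7):
    """Per-class accuracy averaged over the phases present in `gt`."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    accs = [(pred[gt == p] == p).mean()
            for p in range(num_phases) if (gt == p).any()]
    return float(np.mean(accs))


def phase_starts(seq):
    """(time, phase) pairs at which the phase label changes."""
    return [(t, seq[t]) for t in range(len(seq))
            if t == 0 or seq[t] != seq[t - 1]]


def phase_start_alignment(pred, gt, window=45):
    """Fraction of ground-truth phase starts matched by a predicted start
    of the same phase within +/- `window` seconds (one plausible reading
    of the transition-timing metric above)."""
    gt_starts = phase_starts(np.asarray(gt))
    pr_starts = phase_starts(np.asarray(pred))
    hits = sum(any(p == phase and abs(s - t) <= window
                   for s, p in pr_starts)
               for t, phase in gt_starts)
    return hits / len(gt_starts)
```

For a full evaluation set, these per-video values would be aggregated across videos (per-second accuracy may also be pooled over all seconds); the summary does not specify which aggregation the paper uses.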
Discussion
The study demonstrates that robust surgical phase recognition benefits substantially from scaling training data from tens to hundreds of videos, with diminishing returns approaching ~1,000 videos for the given model capacity. The two-stage architecture leverages short-term spatio-temporal cues and long-term sequence modeling to achieve high per-second accuracy and precise phase transition timing. Generalization analyses show the model maintains similar performance across different medical centers and surgeons, suggesting resilience to variability in style, equipment, and technique. However, when confronting an unseen center with distributional differences, performance drops can be mitigated effectively by fine-tuning on a modest number of local videos (tens to a few hundred), enabling rapid deployment. The findings support the feasibility of deploying AI tools for post-operative debrief (e.g., highlight reels, event timing) and suggest that while higher accuracy will be required for real-time decision support, current performance meaningfully advances surgical intelligence applications.
Conclusion
This work introduces a high-performing, generalizable deep learning system for per-second surgical phase recognition in laparoscopic cholecystectomy, trained on the largest dataset to date in this domain. Contributions include: (1) a scalable two-stage method with reported hyperparameters enabling reproduction on smaller public datasets; (2) an asymptotic analysis quantifying the data required to achieve high accuracy, highlighting significant gains from hundreds of videos; and (3) demonstration of effective fine-tuning for rapid adaptation to new medical centers. Future research should explore increased model capacity with larger datasets, real-time causal variants for intraoperative support, transferability to more complex or less structured procedures, and finer-grained bias analyses at the patient and equipment levels.
Limitations
- Procedure scope: results are limited to laparoscopic cholecystectomy, a relatively structured procedure; generalization to less structured or longer surgeries is untested.
- Offline design: the models are non-causal and evaluated offline; real-time deployment would require causal adaptations.
- Dataset skew: the data are skewed toward a single medical center (MC1), though the generalization tests mitigate this concern.
- Outliers: reduced robustness to unusual cases (e.g., conversions to open surgery, single-port setups), very low video quality, or unseen tools.
- Bias analysis granularity: bias was evaluated only at the center/surgeon level; patient-level factors (age, BMI, sex, anatomy), equipment type, and time-period effects were not analyzed due to labeling constraints.