Introduction
Surgery is a critical element in disease treatment, with millions of procedures performed annually worldwide. The quality of surgeon performance significantly impacts patient outcomes and healthcare costs. Traditional surgical training relies on mentored, one-on-one instruction, a method that has remained largely unchanged for a century. Artificial intelligence (AI) offers the potential to revolutionize surgical training, improve surgeon performance, and ultimately enhance patient care, especially in minimally invasive surgery (MIS), where video feeds capture essential aspects of the surgical process. AI-based systems capable of providing real-time recommendations during surgery could significantly support surgeons' training and decision-making, leading to reduced complications and improved outcomes. A major challenge in developing such AI systems is the need for large, correctly labeled, and representative datasets. Current research in AI for surgery is often limited by small datasets (typically fewer than 100 videos), hindering the development of robust and generalizable AI models. This study addresses this limitation by utilizing a significantly larger dataset to investigate two key questions: (1) How much surgical video data is necessary to train an AI system that accurately recognizes the major phases of a surgical procedure? and (2) How robust is the learned model to new sources of data (surgeons and medical centers)?
Literature Review
Previous studies applying machine learning to surgical performance have been limited by small datasets (typically far fewer than 100 video recordings). The largest publicly available dataset, Cholec80, contains only 80 videos. This small sample size contrasts sharply with the established need for extensive practice (well over 100 repetitions) for human surgeons to achieve expert performance, and with the scale of datasets used in other domains, such as video classification models trained on over 100,000 videos. The limited dataset sizes in previous studies raise concerns about the generalizability and true potential of AI in surgery. This study aims to address these limitations by using a substantially larger dataset.
Methodology
The study utilized a dataset of 1243 laparoscopic cholecystectomy videos from six medical centers and over 50 surgeons, significantly larger than previously used datasets. The videos were annotated by two specialists, with expert surgeon consultation for ambiguous cases, ensuring high-quality labels. The annotation process defined seven laparoscopic cholecystectomy phases: Preparation, Adhesiolysis, Dissection, Division, Separation, Packaging, and Final inspection. The definitions were based on discussions with expert surgeons, an iterative annotation process, and consideration of both clinical relevance and algorithmic feasibility.

Video preprocessing used FFmpeg to re-encode each video, scale its width to 480 pixels, and remove audio; irrelevant segments were trimmed using a trained background detection model.

The phase classification model was a two-step framework combining a short-term spatio-temporal model with a long-term model. The short-term model, an Inflated 3D ConvNet (I3D) built on a pre-trained ResNet-50 backbone, analyzed short-term spatio-temporal information from 2.56 s clips and generated phase probabilities for each second. The long-term model, a Long Short-Term Memory (LSTM) network, processed the short-term model's outputs sequentially, capturing long-term dependencies to predict a phase label for each second. A softmax with temperature was used to soften the short-term probabilities passed to the long-term model, and data augmentation techniques improved robustness and generalization. Training and evaluation used cross-entropy loss and Stochastic Gradient Descent (SGD), with hyperparameters optimized on a validation set; the Kinetics-400 and ImageNet datasets were used for pre-training.
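The two-step pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `short_term_probs` stands in for the I3D network (here it just applies a temperature-scaled softmax to per-second clip logits), and `long_term_smooth` stands in for the LSTM (here a simple causal moving average over the per-second distributions). Only the data flow — clip-level probabilities refined by a sequential model into per-second phase labels — reflects the described framework.

```python
import numpy as np

# The seven annotated phases from the study.
PHASES = ["Preparation", "Adhesiolysis", "Dissection", "Division",
          "Separation", "Packaging", "Final inspection"]

def softmax_with_temperature(logits, T=2.0):
    """Temperature-scaled softmax; T > 1 flattens (softens) the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def short_term_probs(per_second_logits, T=2.0):
    """Stand-in for the I3D short-term model: map each second's clip
    logits to a softened 7-way phase probability distribution."""
    return softmax_with_temperature(per_second_logits, T)

def long_term_smooth(probs, window=5):
    """Stand-in for the LSTM long-term model: a causal moving average
    over the per-second distributions, then an argmax per second."""
    probs = np.asarray(probs)
    out = np.empty_like(probs)
    for t in range(len(probs)):
        out[t] = probs[max(0, t - window + 1): t + 1].mean(axis=0)
    return out.argmax(axis=1)  # one phase index per second
```

A real system would replace both stand-ins with trained networks; the sketch only shows why softened short-term probabilities plus sequential smoothing suppress one-off misclassifications within an otherwise stable phase.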
Key Findings
The two-step phase classification system achieved 90.4% accuracy and 86.1% mean phase accuracy on the validation set, and 91.7% accuracy and 87.5% mean phase accuracy on the independent test set. Analysis of classification errors showed that the model primarily confused temporally adjacent phases with similar surgical contexts. Phase transition timing aligned with the annotations within a 45-second temporal threshold in over 90% of cases across all videos, and in over 98% for videos with above-median accuracy. Examining the impact of training dataset size revealed that the model exceeded 80% accuracy with 100 videos and reached near-asymptotic performance with around 745 videos. Generalization analysis showed consistent performance across medical centers and surgeons, suggesting that the model was not significantly biased towards specific centers or surgeons. Fine-tuning the model on a small number (50-200) of videos from a new medical center recovered high performance, indicating the model's adaptability to new environments.
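The two reported metrics differ in how they weigh phases: overall accuracy counts every second equally, so long phases dominate, while mean phase accuracy averages per-phase recall, so short phases count as much as long ones. A minimal sketch of both (the exact definitions used in the study are an assumption; this is the standard interpretation):

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of seconds whose predicted phase matches the annotation."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def mean_phase_accuracy(y_true, y_pred, n_phases=7):
    """Per-phase recall averaged over the phases present in the video,
    so brief phases (e.g. Packaging) weigh as much as long ones."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = []
    for k in range(n_phases):
        mask = y_true == k
        if mask.any():  # skip phases absent from this video
            recalls.append(float((y_pred[mask] == k).mean()))
    return float(np.mean(recalls))

# Toy example: one wrong second inside the short phase 1.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2, 2, 2, 2]
print(overall_accuracy(y_true, y_pred))     # → 0.9
print(mean_phase_accuracy(y_true, y_pred))  # → 0.8333...
```

The toy example shows why mean phase accuracy (0.83) sits below overall accuracy (0.90) in the reported results: errors concentrate in short, transitional phases.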
Discussion
The results demonstrate that a deep learning system can achieve high accuracy in detecting surgical phases with sufficient data, addressing the limitation of previous studies using small datasets. The high accuracy and generalization capabilities of the system, supported by a large and diverse dataset, suggest that the model can be successfully adapted for use in various surgical settings. The asymptotic performance analysis indicates that the greatest improvements in accuracy are obtained by increasing the training dataset from tens to hundreds of videos, with diminishing returns thereafter. The findings highlight the critical importance of large, diverse datasets for developing robust and generalizable AI models in surgery, paving the way for the translation of AI applications from research into clinical practice.
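The diminishing-returns pattern can be illustrated with a saturating learning curve of the form acc(n) = a - b·n^(-c). The functional form and all parameter values below are hypothetical, chosen only so the curve qualitatively matches the reported behavior (over 80% accuracy at 100 videos, near-asymptotic by ~745); they are not fitted to the study's data.

```python
import numpy as np

def saturating_accuracy(n, a=0.92, b=0.55, c=0.5):
    """Hypothetical saturating learning curve acc(n) = a - b * n**(-c).
    Parameters are illustrative, not fitted to the study's results."""
    return a - b * np.power(float(n), -c)

# Marginal gain from tens to hundreds of videos vs. hundreds to ~1243.
gain_small = saturating_accuracy(100) - saturating_accuracy(10)
gain_large = saturating_accuracy(1243) - saturating_accuracy(745)
```

Under these illustrative parameters, going from 10 to 100 videos yields roughly 25 times the accuracy gain of going from 745 to 1243, which is the qualitative point the asymptotic analysis makes: most of the benefit comes from the first few hundred videos.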
Conclusion
This study demonstrates the feasibility of creating a highly accurate and generalizable AI system for surgical phase recognition using a large, diverse dataset. The achieved accuracy of over 90% and the successful fine-tuning on a new medical center highlight the potential of AI to assist surgeons in their daily routine. Future research should focus on extending the approach to less structured procedures and incorporating causality constraints for real-time applications. Investigating patient-level and medical-center-level bias factors to further refine the model's robustness and expanding its use to other surgical procedures is also crucial for broader clinical translation.
Limitations
The study focused on a single surgical procedure (laparoscopic cholecystectomy), which may limit the generalizability of the findings to other procedures with less linear phase progressions. The model currently operates offline; integrating causality constraints and developing real-time versions would enhance clinical applicability. While the dataset included videos from multiple medical centers and surgeons, it was skewed towards a single center, potentially impacting the assessment of generalization. The model's ability to handle outliers (e.g., very low video quality or novel surgical tools) also requires further investigation.