Development, deployment and scaling of operating room-ready artificial intelligence for real-time surgical decision support

S. Protserov, J. Hunter, et al.

Sergey Protserov and colleagues tackle the twin challenges of generalizability and scalability in surgical guidance systems. They present a real-time, equipment-agnostic framework for laparoscopic cholecystectomy that generalizes to an independent multicenter dataset and runs in real time even over low-bandwidth connections.

Introduction
Complications from surgery are a major source of morbidity, mortality, and cost, with particular impact in remote and resource-limited regions that lack access to surgical expertise. Many adverse events arise from cognitive errors such as lapses in perception or situational awareness, for example dissecting in the wrong anatomical plane and injuring critical structures. Real-time, data-driven assistance via computer vision–based AI could help surgical teams avoid such errors. Despite numerous promising algorithms for intraoperative perception tasks, their translation to the operating room has been limited by (1) lack of generalizability across heterogeneous surgical video inputs (varying resolution, aspect ratio, frame rate, camera platforms) and (2) dependence on heavy computational resources and high-bandwidth, low-latency connectivity that are often unavailable in typical or resource-poor hospitals. The study’s objective is to develop and evaluate a generalizable, scalable, equipment-agnostic framework for real-time intraoperative decision support accessible from any edge device, using laparoscopic cholecystectomy Go/No-Go zone semantic segmentation as the use case.
Literature Review
Prior work has shown computer vision models can approach expert-level identification of anatomy and surgical tools in minimally invasive surgery. Earlier Go/No-Go segmentation efforts used PSPNet with a ResNet-50 backbone to predict safe and dangerous dissection zones. Enhancements such as label-relaxation and self-supervision have been explored, as well as tool classification via ensembles and surgical phase recognition (e.g., EndoNet). However, these studies primarily demonstrate algorithmic performance and not deployment readiness. Key gaps include handling heterogeneous input formats without geometric distortion, ensuring generalizability across institutions and camera platforms, and achieving real-time inference without reliance on GPUs in hospitals or ultra-fast internet. Heavy backbones (e.g., ResNet, AlexNet variants) can boost accuracy but hinder real-time deployment due to latency and compute demands. No prior solution had convincingly scaled to intraoperative decision support without specialized on-site hardware, limiting dissemination especially to low-resource settings.
Methodology
Study design: The study proceeded in two phases: (1) train and validate lightweight semantic segmentation models (U-Net and SegFormer) to predict Go and No-Go zones in laparoscopic cholecystectomy images, and (2) build a web platform and data pipeline enabling real-time inference on live surgical video streams from pervasive edge devices, evaluating performance under varying network speeds.

Models and preprocessing: The task is formulated as pixel-level semantic segmentation with three classes (Go, No-Go, background). Inputs are downscaled to height 128 while preserving aspect ratio. Two architectures were used:
- U-Net: architecturally simple, with no heavy backbone; includes dropout, batch normalization, and PReLU activation. To handle varying frame shapes and reduce frame-level bias, frames are split into overlapping fixed-size patches for training and inference; predictions on overlaps are averaged and reassembled into full-frame masks (see the patch-inference sketch below).
- SegFormer: transformer-based (MiT) implemented via HuggingFace Transformers with random initialization. Inputs are downscaled to height 128 and padded to a fixed shape.

Datasets and splits: Two datasets were used. Dataset 1 comprises 289 retrospective open-source videos from 37 countries, 153 surgeons, and 136 institutions. Dataset 2 comprises 25 prospectively collected videos from 5 countries, 9 surgeons, and 7 institutions, with per-frame annotations by an independent panel (SAGES Safe Cholecystectomy Task Force). Dataset 1 was split per case within each institution 70/15/15 (train/validation/test) to prevent leakage and balance representation. Models were trained and tuned on Dataset 1; performance is reported on the Dataset 1 test set and on the independent Dataset 2 to assess generalization. Frames were uniformly sampled from each video, and irrelevant frames were excluded.

Annotation and ground truth: Expert surgeons provided freehand pixel-level annotations for the Go and No-Go zones. Because complex anatomy yields annotation variability, visual concordance was used to aggregate labels into a final per-frame mask serving as ground truth.

Training and selection: A class-weighted cross-entropy loss was used, with weights inversely proportional to class frequency (see the loss sketch below). Hyperparameters were tuned on the validation set. The best U-Net used bias terms in convolutional layers, 16 initial features doubling at each encoder stage, PReLU activation, and dropout p=0.2. The best SegFormer used the MiT-B0 configuration. Model selection was based on validation loss.

Web platform and pipeline: The frontend is built in ReactJS and the backend in Flask. Two modes are supported: synchronous live inference (streaming from a built-in camera, screen share, or capture card attached to the laparoscopic tower) and asynchronous batch inference (upload a video or image, run inference on the server, download results). For each pixel, probabilities for Go, No-Go, and background are computed; overlays display green (Go) and red (No-Go) with transparency reflecting confidence, and a user-adjustable slider sets the Go-zone probability threshold for display to match surgeon risk tolerance (see the overlay sketch below).

Infrastructure: The primary server cluster (UHN, Toronto) uses four GPU workers (NVIDIA Tesla P100) coordinated via a round-robin queue; Socket.IO enables bidirectional streaming, with workers pushing predictions directly to clients to reduce hops. A flow-control algorithm dynamically throttles frame transmission to minimize end-to-end delay under constrained bandwidth (see the flow-control sketch below). Client- and server-side rendering optimizations reduced client-side rendering latency to under 4 ms.

Network optimization: Optional downscaling halves width and height on the client before upload (a 4× data reduction); frames are upscaled at the server for inference, and downscaled predictions are returned and upscaled on the client. This reduces bandwidth needs while minimally affecting segmentation quality (<2% drop in Dice, precision, and recall in tests); the resize arithmetic is sketched below.
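To make the patch-based inference scheme concrete, here is a minimal NumPy sketch. The patch size (128×128), stride (64), and class-index ordering are assumptions for illustration; model_fn stands in for the trained U-Net, returning per-class probabilities for a patch.

```python
import numpy as np

def predict_full_frame(frame, model_fn, patch_h=128, patch_w=128, stride=64, n_classes=3):
    """Tile the frame into overlapping patches, run model_fn on each patch,
    and average class probabilities where patches overlap.
    Assumes the frame is at least patch-sized (frames here are height 128)."""
    h, w = frame.shape[:2]
    probs = np.zeros((n_classes, h, w), dtype=np.float32)  # accumulated probabilities
    counts = np.zeros((h, w), dtype=np.float32)            # overlap counts

    ys = list(range(0, h - patch_h + 1, stride))
    xs = list(range(0, w - patch_w + 1, stride))
    if ys[-1] != h - patch_h:                              # cover the bottom border
        ys.append(h - patch_h)
    if xs[-1] != w - patch_w:                              # cover the right border
        xs.append(w - patch_w)

    for y in ys:
        for x in xs:
            p = model_fn(frame[y:y + patch_h, x:x + patch_w])  # (n_classes, ph, pw)
            probs[:, y:y + patch_h, x:x + patch_w] += p
            counts[y:y + patch_h, x:x + patch_w] += 1.0

    probs /= counts                         # average overlapping predictions
    return probs.argmax(axis=0)             # class ordering assumed: 0=background, 1=Go, 2=No-Go

# demo with a dummy model that outputs uniform probabilities
dummy = lambda patch: np.full((3, patch.shape[0], patch.shape[1]), 1 / 3, dtype=np.float32)
mask = predict_full_frame(np.zeros((128, 512, 3)), dummy)
```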
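The class-weighted loss can be sketched in PyTorch as follows; the source states only that weights are inversely proportional to class frequency, so the exact normalization here is an assumption.

```python
import torch
import torch.nn as nn

def class_weights_from_masks(masks, n_classes=3):
    """Weights inversely proportional to per-class pixel frequency
    (the normalization choice is an assumption)."""
    counts = torch.bincount(masks.flatten(), minlength=n_classes).float()
    return counts.sum() / (n_classes * counts.clamp(min=1.0))

# example: a batch of integer label maps (0=background, 1=Go, 2=No-Go)
masks = torch.randint(0, 3, (8, 128, 160))
criterion = nn.CrossEntropyLoss(weight=class_weights_from_masks(masks))

logits = torch.randn(8, 3, 128, 160, requires_grad=True)  # stand-in model output
loss = criterion(logits, masks)
loss.backward()
```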
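A minimal sketch of the confidence-weighted overlay follows; the class ordering, the blending rule, and the way the slider maps to a threshold are all assumptions, with alpha encoding model confidence as described above.

```python
import numpy as np

def make_overlay(probs, go_threshold=0.33):
    """RGBA overlay from per-pixel probabilities ordered (background, Go, No-Go).
    go_threshold models the surgeon-adjustable slider (default 33% as reported)."""
    bg, go, no_go = probs
    h, w = go.shape
    overlay = np.zeros((h, w, 4), dtype=np.float32)

    go_mask = (go >= go_threshold) & (go > no_go)   # show Go in green
    no_go_mask = (no_go > bg) & (no_go >= go)       # show No-Go in red

    overlay[go_mask, 1] = 1.0                       # green channel
    overlay[go_mask, 3] = go[go_mask]               # alpha = Go confidence
    overlay[no_go_mask, 0] = 1.0                    # red channel
    overlay[no_go_mask, 3] = no_go[no_go_mask]      # alpha = No-Go confidence
    return overlay

# usage with random probabilities normalized to sum to 1 per pixel
raw = np.random.rand(3, 128, 160).astype(np.float32)
probs = raw / raw.sum(axis=0, keepdims=True)
rgba = make_overlay(probs)  # composited over the video frame client-side
```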
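The flow-control algorithm itself is not spelled out above. One simple mechanism consistent with the description (throttling transmission so delay stays bounded) is to cap the number of unacknowledged in-flight frames and skip capture when the pipeline is full; the asyncio sketch below is hypothetical, not the authors' implementation.

```python
import asyncio

class FlowController:
    """Throttle frame transmission by capping unacknowledged in-flight frames.
    A simplified stand-in for the paper's flow-control algorithm (details assumed)."""

    def __init__(self, max_in_flight=2):
        self.sem = asyncio.Semaphore(max_in_flight)

    async def send_frame(self, frame, send_fn):
        if self.sem.locked():       # server is behind: skip this frame rather than queue it
            return False
        await self.sem.acquire()
        await send_fn(frame)        # e.g., an emit over Socket.IO
        return True

    def on_prediction(self, _mask):
        self.sem.release()          # a returned prediction frees one in-flight slot

async def demo():
    fc = FlowController(max_in_flight=2)
    async def fake_send(frame):
        await asyncio.sleep(0)      # stand-in for the network send
    sent = await fc.send_frame(b"frame-bytes", fake_send)
    fc.on_prediction(None)          # simulate the server's reply
    print("sent:", sent)

asyncio.run(demo())
```

Skipping frames instead of queueing them is what keeps round-trip delay bounded at low bandwidth: stale frames are never allowed to pile up behind the link.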
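In the browser this resize happens in JavaScript on a canvas; the sketch below uses Python with OpenCV as a stand-in to show the arithmetic (halving each dimension transmits one quarter of the pixels).

```python
import numpy as np
import cv2  # OpenCV, standing in for the browser-side canvas resize

def downscale_for_upload(frame):
    """Halve width and height before upload: 4x fewer pixels to transmit."""
    h, w = frame.shape[:2]
    return cv2.resize(frame, (w // 2, h // 2), interpolation=cv2.INTER_AREA)

def upscale_prediction(mask, full_shape):
    """Upscale the returned low-res mask to display resolution on the client."""
    h, w = full_shape[:2]
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)

frame = np.zeros((480, 640, 3), dtype=np.uint8)
small = downscale_for_upload(frame)               # 240x320: one quarter of the data
# ...the server upscales for inference and returns a downscaled mask...
mask_small = np.zeros((240, 320), dtype=np.uint8)
mask_full = upscale_prediction(mask_small, frame.shape)
```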
Evaluation metrics: Pixel-level precision, recall, Dice Similarity Coefficient (DSC), and Relative Area Error (RAE), computed per frame and averaged (a sketch appears at the end of this section).

Network tests: Bandwidths of 1, 2, 4, 8, 16, and 32 Mbps were simulated with Chrome DevTools throttling; outcomes were frame rate (FPS) and round-trip delay (ms), measured on a 2020 MacBook Pro over WiFi from outside the UHN intranet. Round-trip delay was measured from frame send to prediction receipt.

Ethics and availability: UHN Research Ethics Board approval (20-5349) with a consent waiver for secondary use of anonymized data. Annotated datasets are available via the Global Surgical Artificial Intelligence Collaborative; the demo server, model weights, and code are open source at https://surg-ai.uhndata.io/.
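A minimal sketch of these per-frame metrics follows; the signed RAE definition ((predicted − true) / true area) is an assumption consistent with the positive percentages reported in the findings.

```python
import numpy as np

def per_class_metrics(pred, truth, cls):
    """Pixel-level precision, recall, Dice (DSC), and relative area error (RAE)
    for one class on one frame."""
    p = pred == cls
    t = truth == cls
    tp = np.logical_and(p, t).sum()
    precision = tp / max(p.sum(), 1)
    recall = tp / max(t.sum(), 1)
    dice = 2 * tp / max(p.sum() + t.sum(), 1)
    rae = (p.sum() - t.sum()) / max(t.sum(), 1)  # signed over/under-segmentation
    return precision, recall, dice, rae

# metrics are computed per frame, then averaged across frames
pred = np.random.randint(0, 3, (128, 160))
truth = np.random.randint(0, 3, (128, 160))
for name, cls in [("Go", 1), ("No-Go", 2)]:
    print(name, per_class_metrics(pred, truth, cls))
```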
Key Findings
Model performance (Dataset 2, default 33% Go threshold):
- U-Net: Go — Dice 57% ± 0.05, precision 45% ± 0.04, recall 82% ± 0.07, RAE +92% ± 0.17; No-Go — Dice 76% ± 0.04, precision 68% ± 0.05, recall 92% ± 0.04, RAE +47% ± 0.17. Error rates: 12% of Go pixels misclassified as No-Go; 4% of No-Go pixels misclassified as Go.
- SegFormer: Go — Dice 60% ± 0.05, precision 53% ± 0.05, recall 75% ± 0.07, RAE +48% ± 0.15; No-Go — Dice 76% ± 0.05, precision 68% ± 0.05, recall 92% ± 0.04, RAE +46% ± 0.16. No-Go misclassified as Go was limited to 1%.
Confusion matrices (Dataset 2, 33% threshold) indicate strong No-Go detection (≈0.91 true positive rate) with limited cross-class confusion.

Network performance (with flow control):
- At ≥32 Mbps: ~65.9 fps; round-trip delay ~77–78 ms; inference time 20.75 ms (U-Net) and 25.63 ms (SegFormer).
- At 8 Mbps: ~48 fps at original size, delay ~178–183 ms; with downscaling, ~66 fps, delay ~70–72 ms.
- At 2 Mbps: ~13 fps at original size, delay ~386–387 ms; with downscaling, ~63–65 fps, delay ~121–124 ms.
- At 1 Mbps: ~6.7 fps at original size, delay ~664–671 ms; with downscaling, ~46–48 fps, delay ~176–193 ms.
Downscaling frames and predictions by 2× per dimension reduced transmitted data 4×, enabling ≥60 fps with <150 ms delay at bandwidths as low as 2 Mbps; quality degradation from downscaling was <2% for Dice, recall, and precision. Without flow control, delays exceeded 1500 ms below 8 Mbps and reached ~423 ms even at 32 Mbps with downscaling, underscoring the importance of flow control for real-time usability. Overall, both lightweight models generalized to an independent multicenter dataset and, integrated into the web platform, delivered real-time inference across a wide range of network conditions.
Discussion
The study demonstrates that AI-based intraoperative decision support can be deployed in a scalable, equipment-agnostic manner without relying on on-site high-performance hardware or ultra-fast connectivity. By choosing lightweight architectures and designing a patch-based and padding-aware preprocessing pipeline, the models handle heterogeneous input resolutions and aspect ratios typical of surgical video. The web platform integrates human-in-the-loop controls (probability threshold slider) to align visualization with surgeons’ varying risk tolerance and clinical context, potentially improving adoption. Performance on an independent expert-annotated dataset indicates acceptable and generalizable segmentation, particularly for critical No-Go regions where misclassification as Go was kept very low (1–4%). Network optimizations (flow-control and 4× data reduction via downscaling) allowed high FPS and low latency even at 1–2 Mbps, supporting use in remote, resource-limited environments and helping address global surgical inequities. These findings address the core research goal: enabling real-time, generalizable surgical AI guidance accessible from any edge device and network condition. The results are relevant to broader surgical AI deployments, suggesting a pathway from algorithmic benchmarks to practical, clinician-centered, real-world systems.
Conclusion
This work introduces and validates a generalizable, scalable framework for deploying real-time surgical AI via a web platform, using lightweight U-Net and SegFormer models for Go/No-Go zone segmentation in laparoscopic cholecystectomy as an exemplar. The system achieves robust, expert-validated performance and maintains high frame rates with low latency across diverse network conditions through flow-control and bandwidth-saving downscaling. The platform’s human-in-the-loop design supports surgeon risk calibration and clinical usability. Future directions include incorporating temporal modeling to stabilize predictions across frames, exploring optical flow–based client-side updates to further reduce latency, expanding datasets with challenging edge cases, migrating to geo-distributed cloud infrastructure to minimize geographic latency, and conducting user studies and clinical trials to evaluate workflow integration, outcomes, and cost-effectiveness.
Limitations
- Models trained on individual frames lack temporal context, causing occasional instability in predictions across consecutive frames; adding temporal components may improve consistency but could increase inference time.
- Although the datasets were diverse, additional challenging cases (e.g., severe inflammation, bleeding, complications) are needed to further improve robustness.
- Standard computer vision metrics may not fully capture clinical utility, and expert annotation variability persists despite visual-concordance aggregation.
- Network evaluations used a controlled, static environment with on-site compute; latency induced by geographic distance remains and would require cloud migration to mitigate.
- Real-time performance at extremely low bandwidths still degrades without downscaling; downscaling, while minimal in impact here, could affect other models more strongly.
- The current deployment uses GPU-backed servers; true edge-only inference without servers was not assessed.
- Clinical impact (e.g., reduction in adverse events or operative time) was not evaluated and warrants prospective studies.