Engineering and Technology
Multimodal graph representation learning for robust surgical workflow recognition with adversarial feature disentanglement
L. Bai, B. Ma, et al.
Discover how a team of researchers, including Long Bai and Boyi Ma, tackled the challenges of surgical workflow recognition. Their innovative GRAD approach combines vision and kinematic data to enhance automation and decision-making while overcoming data corruption issues. Experience a leap in surgical technology!
~3 min • Beginner • English
Introduction
The paper addresses the need for robust surgical workflow recognition in robot-assisted minimally invasive surgery (RMIS), where visual occlusions, lighting changes, and data corruption degrade performance. While vision-based methods capture rich semantic context and kinematic signals provide precise motion cues robust to visual artifacts, single-modality systems are insufficient under complex surgical conditions. The research question is how to develop a multimodal approach that (i) deeply exploits intra-modal characteristics of vision and kinematics, (ii) effectively models inter-modal relationships, and (iii) remains robust under domain shifts and corrupted data. The authors propose a graph-based multimodal framework that disentangles visual features across spatial and frequency domains and aligns cross-modal representations adversarially, aiming to improve accuracy, stability, and calibration in real-world surgical environments.
Literature Review
The related work spans three areas.
- Multimodal graph learning: Prior studies demonstrated effective multimodal fusion via GCNs, hypergraphs, masked graph networks, and hybrid models to capture intra- and inter-modal dependencies in domains such as sentiment analysis, survival prediction, and emotion recognition. Graph learning has begun to appear in surgical settings, but existing methods often rely on simple fusion without deep modality-specific exploration or robust graph design.
- Surgical workflow recognition: Early HMM-based and hand-crafted feature methods were surpassed by deep CNNs paired with temporal models (LSTM, TCN). Methods such as SV-RCNet, TMRNet, TeCNO, and Trans-SVNet enhanced long-range temporal modeling, and multimodal vision-kinematics approaches with graph-based fusion improved robustness by leveraging complementary information.
- Network calibration: Conventional calibration methods (Platt scaling, isotonic regression, entropy/focal regularization, uncertainty estimation) are not directly suited to GNNs, which often exhibit miscalibration in the form of under-confidence. Recent GNN-specific strategies include attention temperature scaling and topology-aware calibration. The robustness literature further highlights susceptibility to common corruptions and a positive link between good calibration and robustness.
Methodology
The proposed GRAD framework comprises three components: (1) a Multimodal Disentanglement Graph Network (MDGNet), (2) Vision-Kinematic Adversarial (VKA) training, and (3) a Contextual Calibrated Decoder.
- Kinematic temporal representation: At each timestep, 14-16 kinematic DoFs (positions, orientations, gripper angles; for MISAW, also grip voltage) are fed over time to parallel LSTM and TCN branches, whose outputs are averaged to form robust temporal kinematic embeddings that capture long-range and multi-scale dependencies (see the kinematic encoder sketch after this list).
- Visual representation disentanglement: Video frames are processed in three domains (spatial, wavelet, and Fourier). A ResNet-18 backbone extracts spatial features per frame, followed by a TCN for temporal encoding. Wavelet transforms capture multiscale local structure via approximation and detail coefficients, while Fourier amplitude spectra capture global structural and textural patterns. Identical CNN+TCN pipelines generate temporal features in the wavelet and Fourier domains, producing complementary representations (see the visual disentanglement sketch after this list).
- Multimodal graph learning with GAT: Four nodes (spatial, wavelet, Fourier, and the combined kinematic embedding) are connected via a Graph Attention Network to model intra- and inter-modal relationships with learned attention weights, enabling dynamic importance assignment and better cross-modal message passing than fixed-adjacency GCN/RGCN (see the graph fusion sketch after this list).
- Vision-Kinematic Adversarial training (VKA): To align modality distributions, the visual-domain features (spatial, wavelet, Fourier) act as source modalities and kinematics as the target modality. A discriminator attempts to distinguish the modality of origin while the feature generators aim for a shared, modality-invariant embedding. The adversarial loss includes fake/true terms and an L2 alignment term encouraging proximity of visual and kinematic embeddings (see the VKA sketch after this list).
- Contextual Calibrated Decoder: Heterogeneous embedding fusion combines pre-graph vision-kinematic embeddings (Ev) with graph outputs (Eg) as E = α Ev + β Eg, and a prediction head produces class logits. To improve calibration and robustness, a calibrated cross-entropy loss adds an entropy-minimization regularization term weighted by a calibration coefficient λ (best λ=0.02 reported), encouraging confident, well-aligned predictions (see the calibrated decoder sketch after this list). The final loss is a weighted sum of the calibrated classification and adversarial losses (best γ:δ = 0.9:0.1). Implementation details: frames resized to 320×256 then cropped to 224×224; ResNet-18 backbone with TCN; GAT hidden/output dimension 64 with dropout 0.5; discriminator MLP (64→64, 64→16, 16→1); Adam with learning rate 1e-4 and batch size 64. Datasets: MISAW (27 recordings, 6 gesture classes) and CUHK-MRG (24 sequences, 5 steps; 3-fold CV).
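Kinematic encoder sketch: a minimal PyTorch illustration of the parallel LSTM + TCN branch averaging described above. The module name, the dilated-convolution stand-in for the TCN, and the 64-d hidden size (chosen to match the 64-d graph features in the implementation details) are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class KinematicEncoder(nn.Module):
    """Hypothetical parallel LSTM + TCN encoder for kinematic sequences."""

    def __init__(self, in_dim: int = 16, hidden_dim: int = 64):
        super().__init__()
        # LSTM branch: long-range temporal dependencies.
        self.lstm = nn.LSTM(in_dim, hidden_dim, batch_first=True)
        # TCN branch (illustrative): dilated temporal convolutions for
        # multi-scale context; the paper's exact TCN configuration may differ.
        self.tcn = nn.Sequential(
            nn.Conv1d(in_dim, hidden_dim, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dof) signals (positions, orientations, gripper angles, ...).
        lstm_out, _ = self.lstm(x)                  # (batch, time, hidden)
        tcn_out = self.tcn(x.transpose(1, 2))       # (batch, hidden, time)
        tcn_out = tcn_out.transpose(1, 2)           # (batch, time, hidden)
        # Average the two branches into a single temporal kinematic embedding.
        return 0.5 * (lstm_out + tcn_out)


# Example: a batch of 64 clips, 30 timesteps, 16 kinematic variables per timestep.
embedding = KinematicEncoder()(torch.randn(64, 30, 16))   # (64, 30, 64)
```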
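Visual disentanglement sketch: a minimal illustration of producing the spatial, wavelet, and Fourier views of a frame, assuming PyTorch, torchvision's ResNet-18, and PyWavelets. The Haar wavelet, the grayscale conversion, and the log-scaled amplitude spectrum are illustrative choices; the paper specifies only the three domains and the shared CNN+TCN pipelines (not shown here).

```python
import numpy as np
import pywt
import torch
from torchvision.models import resnet18


def frequency_views(frame: torch.Tensor):
    """frame: (3, 224, 224) RGB tensor. Returns wavelet and Fourier views."""
    gray = frame.mean(dim=0).numpy()                          # luminance proxy (assumed)
    # Wavelet domain: approximation + detail coefficients (local, multiscale structure).
    cA, (cH, cV, cD) = pywt.dwt2(gray, "haar")
    wavelet = torch.from_numpy(np.stack([cA, cH, cV, cD])).float()   # (4, 112, 112)
    # Fourier domain: amplitude spectrum (global structure and texture).
    amplitude = torch.fft.fft2(frame).abs()                   # (3, 224, 224)
    return wavelet, torch.log1p(amplitude)


# Spatial branch: per-frame ResNet-18 features; each domain's features would then
# pass through its own TCN for temporal encoding (not shown).
backbone = resnet18(weights=None)
backbone.fc = torch.nn.Identity()                              # 512-d spatial feature
backbone.eval()

frame = torch.rand(3, 224, 224)
wavelet_view, fourier_view = frequency_views(frame)
with torch.no_grad():
    spatial_feat = backbone(frame.unsqueeze(0))                # (1, 512)
```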
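Graph fusion sketch: a minimal example of the four-node modality graph with attention-based message passing, assuming PyTorch Geometric's GATConv. The 64-d feature size and dropout of 0.5 follow the implementation details; the single layer, full connectivity, and mean pooling into Eg are assumptions.

```python
import torch
from torch_geometric.nn import GATConv

# One node per modality view: spatial, wavelet, Fourier, kinematic (each 64-d).
nodes = torch.randn(4, 64)

# Fully connected directed edges between the four modality nodes (assumed topology).
src, dst = zip(*[(i, j) for i in range(4) for j in range(4) if i != j])
edge_index = torch.tensor([src, dst], dtype=torch.long)

gat = GATConv(in_channels=64, out_channels=64, heads=1, dropout=0.5)
fused_nodes, (att_edges, att_weights) = gat(
    nodes, edge_index, return_attention_weights=True
)
# att_weights holds the learned inter-modal attention coefficients; unlike a
# fixed-adjacency GCN/RGCN, these importances adapt to each sample.
graph_embedding = fused_nodes.mean(dim=0)    # pooled graph output Eg (pooling assumed)
```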
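VKA sketch: a minimal PyTorch illustration of the adversarial objective, using the 64→64, 64→16, 16→1 discriminator sizes from the implementation details. The real/fake label assignment and the exact loss form are assumptions about one plausible formulation, not the paper's precise losses.

```python
import torch
import torch.nn as nn

# Discriminator MLP with the 64->64, 64->16, 16->1 sizes from the implementation details.
discriminator = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
bce = nn.BCEWithLogitsLoss()


def vka_losses(visual_embs: torch.Tensor, kinematic_emb: torch.Tensor):
    """visual_embs: (3, 64) spatial/wavelet/Fourier embeddings (source modalities).
    kinematic_emb: (1, 64) kinematic embedding (target modality)."""
    # Discriminator: tell kinematic ("true") apart from visual ("fake") embeddings.
    d_loss = bce(discriminator(kinematic_emb), torch.ones(1, 1)) \
           + bce(discriminator(visual_embs.detach()), torch.zeros(3, 1))
    # Generators: fool the discriminator so visual features look kinematic-like,
    # plus an L2 term pulling visual embeddings toward the kinematic embedding.
    g_adv = bce(discriminator(visual_embs), torch.ones(3, 1))
    g_align = ((visual_embs - kinematic_emb) ** 2).mean()
    return d_loss, g_adv + g_align


# In practice the two losses would be optimized alternately with separate optimizers.
d_loss, g_loss = vka_losses(torch.randn(3, 64), torch.randn(1, 64))
```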
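Calibrated decoder sketch: a minimal PyTorch illustration of the heterogeneous fusion and calibrated loss. The coefficients follow the reported best settings (α=0.3, β=0.7, λ=0.02, γ=0.9, δ=0.1); the linear prediction head and the exact entropy formulation are assumptions.

```python
import torch
import torch.nn.functional as F

ALPHA, BETA = 0.3, 0.7               # fusion weights for Ev (pre-graph) and Eg (graph)
LAM, GAMMA, DELTA = 0.02, 0.9, 0.1   # calibration and loss-mixing coefficients


def grad_loss(e_v, e_g, head, labels, adversarial_loss):
    fused = ALPHA * e_v + BETA * e_g                 # E = alpha*Ev + beta*Eg
    logits = head(fused)                             # class logits
    ce = F.cross_entropy(logits, labels)
    # Entropy-minimization regularizer: penalizes diffuse predictive distributions
    # to counter the under-confidence typical of GNN outputs.
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    calibrated_ce = ce + LAM * entropy
    # Weighted sum of the calibrated classification loss and the adversarial loss.
    return GAMMA * calibrated_ce + DELTA * adversarial_loss


# Example: 64-d fused embeddings, a linear head over the 6 MISAW gesture classes.
head = torch.nn.Linear(64, 6)
loss = grad_loss(torch.randn(8, 64), torch.randn(8, 64), head,
                 torch.randint(0, 6, (8,)), torch.tensor(0.5))
```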
Key Findings
- State-of-the-art performance:
- MISAW: Accuracy 86.87%, outperforming the next best by 3.12%; Edit score 97.50%; strong OP/OR/OF1 (86.83/88.23/87.52). Average per-class recall and F1 are highest; per-class precision slightly below MRG-Net.
- CUHK-MRG: Accuracy 92.38%, with Edit scores ranging from competitive to perfect (100% in some settings) and the best overall precision/recall/F1 compared with both single-modality and other multimodal baselines.
- Robustness to corruptions (18 types at 5 severity levels, MISAW): GRAD consistently outperforms MRG-Net across all severities and degrades more slowly; at the highest severity, GRAD maintains ≥75% accuracy for most corruption categories, whereas MRG-Net falls below 75% across all types.
- Ablations:
- Components: Adding calibration, visual disentanglement, and VKA generally improves metrics; best results when combined.
- Kinematics modeling: Joint modeling of left+right arms as a single node outperforms separate nodes (e.g., Acc 86.87% vs 83.43%).
- Visual disentanglement: Wavelet alone improves performance; Fourier alone less effective; combining wavelet+Fourier complements spatial features and improves robustness.
- Graph choice: GAT outperforms GCN and RGCN on MISAW (e.g., Acc 86.87% vs 84.79% GCN, 83.42% RGCN; higher Edit and class-level metrics).
- Fusion ratio α,β (Ev vs Eg): Best overall trade-off around α=0.3, β=0.7 (accuracy peaks at α=0.7, β=0.3 but other metrics are slightly worse). Final choice: α=0.3, β=0.7.
- Adversarial source/target: Best when visual (Spatial+Wavelet+Fourier) are sources and Kinematics is target, indicating easier alignment toward kinematic space.
- Loss ratios: Best with γ=0.9 (calibrated CE) and δ=0.1 (adversarial). Calibration coefficient λ best at 0.02.
- Alternative fusion baselines (add, concat, co-/self-attention, gated, BAN, TDA, AFF/iAFF): GRAD’s graph-based fusion yields the best overall performance.
- Qualitative: Temporal segmentation visualizations show improved phase ordering and boundaries on MISAW relative to baselines (MRG-Net, Trans-SVNet, RL-TCN).
Discussion
The findings show that explicitly disentangling visual features across spatial and frequency domains, combined with graph attention-based multimodal fusion and adversarial alignment, yields superior recognition and robustness. Aligning vision and kinematics in a shared space enables compensation under corrupted vision, while the calibrated decoder improves confidence and per-class metrics, particularly recall and F1. The method outperforms single-modality approaches by leveraging complementary semantics (vision) and motion cues (kinematics), and exceeds prior multimodal techniques through dynamic attention-based graph modeling instead of fixed adjacency. Robustness experiments across diverse corruptions confirm that calibration and multimodal alignment jointly improve stability under distribution shifts and visual degradation, supporting clinical applicability in real-world operating environments.
Conclusion
GRAD introduces a robust, graph-based multimodal framework for surgical workflow recognition that integrates visual feature disentanglement (spatial, wavelet, Fourier), GAT-based multimodal fusion, adversarial cross-modal alignment (VKA), and a contextual calibrated decoder. The approach achieves state-of-the-art accuracy on MISAW (86.87%) and CUHK-MRG (92.38%), with strong temporal consistency and robustness under extensive corruption tests. Ablations validate each component’s contribution, including the benefit of joint kinematics modeling, GAT over GCN/RGCN, and calibrated loss design. The method demonstrates slower performance degradation under corruption, indicating practical robustness. Future work includes addressing class imbalance and short-duration gestures, improving efficiency and reducing annotation burden, and validating on larger, real clinical datasets to further assess generalization and applicability.
Limitations
- Class imbalance and short-duration gestures hinder accurate learning, biasing toward frequent categories and causing misclassifications.
- Multimodal training requires substantial resources and precise temporal alignment of modalities and labels, increasing annotation and computational costs; trade-offs between performance gains and resource usage must be evaluated.
- Experiments are conducted on benchmark datasets rather than large-scale real surgical scenes; lack of comprehensive clinical multimodal datasets limits assessment under real-world conditions (lighting variability, occlusions, sensor differences).