Medicine and Health
A fully autonomous robotic ultrasound system for thyroid scanning
K. Su, J. Liu, et al.
Ultrasound diagnosis is operator-dependent, and patient-specific factors introduce further variability, leading to inconsistent results. Increasing autonomy in medical robots has progressed from teleoperation to systems capable of task and conditional autonomy, yet fully autonomous ultrasound (US) scanning remains challenging due to anatomical variability, motion, and safety concerns. Existing US robotic systems often depend on preoperative data or external sensors and can fail when image features are lost. The research question addressed here is whether a fully autonomous robotic ultrasound system can reliably locate, scan, and analyze the thyroid without human intervention, producing clinician-quality images and clinically relevant nodule classifications. The study introduces FARUS, integrating thyroid search, in-plane and out-of-plane scanning, real-time image analysis, and TI-RADS-based risk stratification, aiming to bridge the gap between research prototypes and clinical application.
Autonomy in medical robotics spans levels 0–5, with current robotic ultrasound systems mostly at levels 0–3. Level 0 includes manual and teleoperated systems; level 1 adds visual servo assistance; level 2 performs autonomous acquisition along predefined paths; level 3 autonomously plans and executes acquisitions under supervision. Prior works have used global information from preoperative images or external sensors for path planning, visual servoing for motion compensation, and online image-guided strategies (e.g., in-plane navigation for thyroid volumetry). However, reliance on persistent image features can lead to failure when those features disappear. Clinical practice requires multiple views (transverse and longitudinal) and safety measures to mitigate injury risk. Despite numerous implementations, overall success rates remain limited due to inter-patient variability. This study advances beyond prior in-plane navigation by fusing in-plane and out-of-plane scans, integrating reinforcement learning for target localization, Bayesian optimization for probe orientation, and deep learning for real-time gland and nodule segmentation to support TI-RADS classification.
System design and workflow: FARUS comprises a UR3 six-DOF manipulator carrying a high-frequency 2D linear ultrasound probe mounted via a spring-loaded fixture and a Robotiq FT300-S six-axis force/torque sensor. A Microsoft Azure Kinect DK provides 3D skeleton tracking for coarse localization and 2D visual feedback to a supervising operator. The fully autonomous workflow mirrors clinical practice and comprises four phases: (1) thyroid searching (TS); (2) in-plane scanning (IPS, transverse); (3) out-of-plane scanning (OPS, longitudinal); and (4) multi-view scanning (MVS). The scanning range is 6.47 cm (length) by 5.48 cm (width). Contact force is maintained between 2.0 and 4.0 N to ensure acoustic coupling while avoiding anatomical distortion; scanning halts if participant movement exceeds predefined thresholds or other safety conditions trigger.

Thyroid search with reinforcement learning: Coarse localization uses human skeletal keypoints from the Kinect to position the probe. Because anatomy varies across patients, a fine localization step uses a Deep Q-Network (DQN) trained in a panoramic simulation environment: sequences of labeled thyroid images are aligned into panoramas, augmented by random blending and by shadow masks that simulate imperfect contact. A sliding-window agent learns actions (move left, move right, hold) that navigate the window to an ideal imaging position. Training shows increasing average rewards and episode lengths, and the trained agent guides the robot effectively even with patient motion or transient absence of thyroid features from the frame.

Probe orientation optimization with Bayesian optimization (BO): After thyroid localization, BO optimizes probe pitch and roll using image entropy as the objective, chosen for its sensitivity to image texture and coupling quality. A Gaussian-process surrogate guides the selection of candidate orientations within a budget of N=5 iterations to limit computation and scanning time.
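The BO step above can be sketched with a Gaussian-process surrogate and an upper-confidence-bound acquisition rule. This is an illustrative numpy-only sketch, not the paper's implementation: the quadratic `simulated_entropy` stands in for acquiring a real image and measuring its entropy, and the kernel length scale, UCB weight, and angle grid are assumptions.

```python
import numpy as np

def rbf_kernel(a, b, length=10.0):
    # squared-exponential kernel over 1-D probe angles (degrees)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    # Gaussian-process posterior mean/variance at the query angles
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = rbf_kernel(x_train, x_query)
    ymean = y_train.mean()
    mu = ymean + Ks.T @ np.linalg.solve(K, y_train - ymean)
    v = np.linalg.solve(K, Ks)
    var = np.diag(rbf_kernel(x_query, x_query) - Ks.T @ v)
    return mu, np.maximum(var, 1e-12)

def bayes_opt_orientation(measure_entropy, angle_grid, budget=5):
    # seed with the two extreme angles, then spend the remaining budget on
    # the upper-confidence-bound maximizer of the surrogate
    xs = [float(angle_grid[0]), float(angle_grid[-1])]
    ys = [measure_entropy(x) for x in xs]
    for _ in range(budget - len(xs)):
        mu, var = gp_posterior(np.array(xs), np.array(ys), angle_grid)
        ucb = mu + 2.0 * np.sqrt(var)
        x_next = float(angle_grid[int(np.argmax(ucb))])
        xs.append(x_next)
        ys.append(measure_entropy(x_next))
    best = int(np.argmax(ys))
    return xs[best], ys[best]

# simulated entropy objective: peaks at a pitch of 5 degrees (a stand-in
# for imaging the neck at that orientation and computing image entropy)
def simulated_entropy(angle_deg):
    return 7.17 - 0.001 * (angle_deg - 5.0) ** 2
```

With only five evaluations the loop typically lands near the entropy peak, which matches the paper's observation that the maximum is often found within the first few iterations.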
Empirically, maximal entropy is often found within the first few iterations, and entropy correlates positively with a confidence map of the contact condition, supporting entropy as a fast proxy for image quality.

Deep learning segmentation: Real-time segmentation uses two networks: (1) thyroid gland segmentation with a ResNet-18 encoder and U-Net decoder, trained on SCUTG8K with Dice loss; and (2) nodule segmentation with VariaNet, a ResNeXt-50 encoder with a U-Net decoder. To improve robustness under limited and heterogeneous data, training proceeds in two stages: pre-training on source datasets (TN3K and SCUTNIOK), then transfer learning on SCUTNIK, collected with the deployed probe. VariaNet uses an iso-hybrid loss combining an IoU loss, a feature loss (Gaussian-masked to emphasize isoechoic regions within nodules), and a distance loss (penalizing predictions far from thyroid lobe boundaries derived from pseudo gland labels). This encodes the spatial prior that nodules lie within thyroid tissue and suppresses false positives, especially for hypoechoic and isoechoic nodules.

Scan planning and control: The Kinect is placed ~1.2 m from the seated participant, at ~1.5 m height and a 45° angle; color runs at 1280×720 (90°×59° FOV) and depth in WFOV 2×2-binned mode at 512×512 (120°×120° FOV), both at 30 FPS. The robot approaches the pre-estimated neck region at 10 mm/s, slowing to 5 mm/s near contact; scanning initiates when the contact force reaches 2.0–4.0 N. IPS proceeds from the upper to the lower pole while maintaining gland centering via image-based control that uses segmentation masks and intensity distribution to avoid shadowing. OPS emulates multi-view scanning with ~60° probe rotation and ±6 mm coverage along the principal axis. Degrees of freedom are allocated as follows: force control governs translation along Yp; view switching controls rotation around Yp; BO controls rotation around Zp; and image-based control adjusts translations along Xp and Zp and rotation around Xp to maintain centering and optimal contact.
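The structure of the iso-hybrid loss can be illustrated with a small numpy sketch. This is not the paper's implementation: the equal term weights, the Gaussian-mask parameters, and the use of a binary gland mask in place of distances to pseudo-label lobe boundaries are simplifying assumptions; the sketch only shows how an IoU term, an interior-weighted feature term, and an outside-the-gland penalty might combine.

```python
import numpy as np

def soft_iou_loss(pred, target, eps=1e-6):
    # soft IoU on probability maps in [0, 1]
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return float(1.0 - (inter + eps) / (union + eps))

def gaussian_mask(shape, center, sigma):
    # weight map peaking at the nodule centre, emphasising interior
    # (often isoechoic) pixels over the boundary
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]]
    d2 = (yy - center[0]) ** 2 + (xx - center[1]) ** 2
    return np.exp(-0.5 * d2 / sigma ** 2)

def feature_loss(pred, target, mask):
    # Gaussian-weighted pixel-wise error
    return float((mask * (pred - target) ** 2).sum() / (mask.sum() + 1e-6))

def distance_loss(pred, gland_mask):
    # fraction of predicted nodule mass falling outside the (pseudo) gland
    return float((pred * (1.0 - gland_mask)).sum() / (pred.sum() + 1e-6))

def iso_hybrid_loss(pred, target, gland_mask, center, sigma=8.0,
                    w_iou=1.0, w_feat=1.0, w_dist=1.0):
    m = gaussian_mask(pred.shape, center, sigma)
    return (w_iou * soft_iou_loss(pred, target)
            + w_feat * feature_loss(pred, target, m)
            + w_dist * distance_loss(pred, gland_mask))
```

The distance term is what enforces the spatial prior: a prediction entirely outside the gland mask incurs the full penalty regardless of its shape, which is how false positives outside thyroid tissue are discouraged.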
Visual servo and force control laws regulate contact force and probe orientation.

TI-RADS scoring: Automated TI-RADS assigns scores for composition, echogenicity, margin, shape (taller-than-wide), and echogenic foci. Pixel-intensity distributions within gland and nodule regions inform echogenicity and composition; boundary statistics and ellipse IoU estimate margin regularity; aspect ratio determines shape; calcification proxies contribute to echogenic-foci scoring. Total scores map to TI-RADS levels (TR1–TR5), which drive recommendations for follow-up or fine-needle aspiration (FNA).

Human participants and safety: With IRB approval (Guangzhou First People's Hospital, K-2021-131-04), three cohorts were recruited: (1) 70 college-aged volunteers scanned autonomously (13 of whom were also scanned manually by five doctors); (2) 29 middle-aged and elderly community members for training data; and (3) 19 patients for diagnostic validation. Safety measures included a gel allergy test, a UR3 collaborative robot with collision stop, a constrained workspace (R < 45 cm), contact force limited to 2.0–4.0 N with auto-stop above 4.5 N, and a wheeled chair for participant repositioning. Scanning terminated on large participant motion (>2 cm lateral within 1 s, or >5 cm anteroposterior).
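The score-to-level mapping can be sketched as follows. The level thresholds follow the published ACR TI-RADS chart (0 points → TR1, 2 → TR2, 3 → TR3, 4–6 → TR4, ≥7 → TR5), and the size cut-offs for follow-up versus FNA are the ACR recommendations; treating a 1-point total as TR1 and the function names themselves are assumptions, and this is not the paper's code.

```python
def tirads_level(points):
    """Map ACR TI-RADS feature points to a risk level (chart thresholds)."""
    total = sum(points.values())
    if total <= 1:
        return "TR1"  # chart lists 0 points; a 1-point total is treated as TR1 here (assumption)
    if total == 2:
        return "TR2"
    if total == 3:
        return "TR3"
    if total <= 6:
        return "TR4"
    return "TR5"

def recommendation(level, size_mm):
    """ACR size thresholds (largest diameter, mm) for FNA vs. follow-up."""
    fna_at = {"TR3": 25.0, "TR4": 15.0, "TR5": 10.0}
    follow_at = {"TR3": 15.0, "TR4": 10.0, "TR5": 5.0}
    if level in fna_at and size_mm >= fna_at[level]:
        return "FNA"
    if level in follow_at and size_mm >= follow_at[level]:
        return "follow-up"
    return "none"  # TR1/TR2, or below the follow-up size threshold
```

Under this mapping, a 19 mm TR3 nodule falls in the follow-up band (≥15 mm but below the 25 mm FNA threshold), consistent with the follow-up recommendation reported in the results below.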
- Autonomous operation and image quality: FARUS completed fully autonomous thyroid scanning (TS, IPS, OPS/MVS) with real-time segmentation and TI-RADS scoring. On 19 patients, image quality metrics (confidence, centering error, orientation error, entropy) improved rapidly after contact; centering error approached 0 after ~25 s and remained stable during IPS; variability increased during OPS due to multi-view scanning.
- Bayesian optimization for orientation: Across 89 participants, image entropy increased significantly after BO (paired two-sided t-test, p<0.0005). With a budget of 5 iterations, the maximum entropy (optimal orientation) was often achieved by iteration 2; example entropy values ranged 7.170–7.173 across iterations. Entropy showed a positive correlation with confidence of contact (Pearson r=0.6779). Force beyond 2 N did not materially change median entropy; comfort/safety motivated a 4 N cap.
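Image entropy here is presumably the Shannon entropy of the grey-level histogram, which for 8-bit images is bounded by 8 bits — consistent with the reported values near 7.17. Below is a minimal sketch of that metric, plus the Pearson correlation used in the entropy-confidence analysis; the paper's exact formulation may differ.

```python
import numpy as np

def image_entropy(img, bins=256):
    # Shannon entropy (bits) of the grey-level histogram; upper bound log2(bins) = 8
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())

def pearson_r(x, y):
    # sample Pearson correlation coefficient
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))
```

A well-coupled, texture-rich B-mode frame spreads intensity across many grey levels (entropy near 8 bits), whereas a decoupled probe yields dark, low-entropy frames — which is why entropy tracks the contact-confidence map.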
- Reinforcement learning for thyroid search: DQN training in a panoramic simulator produced rising average rewards and longer episodes; in deployment, the agent correctly commanded lateral probe motions to reach ideal thyroid views and adapted to participant motion, including cases where the thyroid was temporarily absent in the frame.
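The panoramic sliding-window search can be illustrated with a toy 1-D environment. Tabular Q-learning stands in here for the paper's DQN, and the panorama length, target position, and reward shaping (a distance penalty plus a terminal bonus) are assumptions for illustration only.

```python
import numpy as np

class PanoramaSearchEnv:
    """Toy 1-D stand-in for the panoramic simulator: the agent slides a
    window along a panorama and must reach a (hypothetical) ideal
    imaging position. Actions: 0 = move left, 1 = hold, 2 = move right."""
    ACTIONS = (-1, 0, 1)

    def __init__(self, length=15, target=11, start=3):
        self.length, self.target, self.start = length, target, start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        self.pos = int(np.clip(self.pos + self.ACTIONS[action], 0, self.length - 1))
        done = self.pos == self.target
        # dense shaping: distance penalty each step, bonus at the target
        reward = 1.0 if done else -abs(self.pos - self.target) / self.length
        return self.pos, reward, done

def q_learning(env, episodes=500, alpha=0.5, gamma=0.9, eps=0.2,
               max_steps=100, seed=0):
    # epsilon-greedy tabular Q-learning over window positions
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.length, len(env.ACTIONS)))
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = int(rng.integers(3)) if rng.random() < eps else int(np.argmax(Q[s]))
            s2, r, done = env.step(a)
            target_q = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target_q - Q[s, a])
            s = s2
            if done:
                break
    return Q
```

After training, the greedy policy moves the window directly toward the target position — the 1-D analogue of the trained agent commanding lateral probe motions until the thyroid is centred in view.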
- Segmentation performance: The proposed VariaNet with weak supervision and tailored losses outperformed baselines on SCUTNIK, improving IoU by 0.97% over VariaNet-B; ROC AUC reached 0.8862 (UNet 0.8327, DeepLabv3+ 0.8827). Feature and distance losses improved performance on isoechoic and hypoechoic nodules, respectively; transfer learning enhanced robustness for hyperechoic nodules.
- Clinical TI-RADS comparison: In 19 patients, doctors reported nodules in 17 and none in 2. FARUS identified 13 with nodules and 6 without. For 24 nodules detected by both FARUS and doctors, scoring matched exactly in 10; 8 differed by 1 point; 4 differed by 2 points; 1 differed by 4 points. Management recommendations largely agreed, e.g., a 19 mm nodule (Patient 9 #20) recommended for follow-up by both. One case (Patient 10 #21) differed (doctor TR5 vs FARUS TR3), attributed to differing echo characteristics and probe differences.
- Robot vs manual scans: Five doctors scanned 13 participants for comparison. FARUS produced image quality metrics comparable to doctors; robotic IPS showed smaller centering error, likely due to image-based control. Probe motion analysis showed the robot achieved more stable force and velocity, though scanning time per lobe was longer for FARUS (213.0 ± 85.3 s, n=70) versus doctors (67.2 ± 27.6 s, n=13), reflecting conservative speed and dynamic path/force control for safety.
- User feedback and safety: Questionnaire (n=70) indicated most participants felt safe and experienced no pain; some reported anxiety; most did not believe robots can replace doctors. Safety protocols prevented adverse events; entropy and confidence analyses confirmed adequate contact within the 2–4 N range.
The study demonstrates that a fully autonomous robotic system can localize the thyroid, optimize probe orientation, perform multi-view scanning, and generate clinically relevant TI-RADS assessments without human manipulation. Reinforcement learning overcomes failures of feature-dependent controllers by enabling coarse-to-fine search even when the thyroid is not initially visible. Bayesian optimization efficiently tunes probe orientation using entropy as a fast, effective image quality proxy, improving image texture detail and coupling with minimal iterations. Deep learning segmentation with spatial-feature priors (VariaNet) supports reliable nodule detection across echogenicity types and feeds downstream TI-RADS scoring. Comparative analyses show FARUS can produce image quality on par with experienced sonographers, with better centering in IPS, and stable force/velocity profiles. Although FARUS identified fewer nodules than clinicians overall, agreement on shared nodules was strong (10/24 exact, most others within 1 point), indicating clinically meaningful alignment. The system’s autonomous, contact-minimizing workflow positions it as a patient-centered tool for rapid screening and deployment in outpatient and resource-limited settings. Differences in probes and patient postures (upright for FARUS vs supine for clinicians) likely contributed to some scoring discrepancies, particularly for echogenicity/composition. Overall, the findings address the research question by validating feasibility, safety, and preliminary clinical concordance of fully autonomous thyroid US in humans.
This work presents the first in-human study of a fully autonomous robotic ultrasound system for thyroid scanning (FARUS), integrating DQN-based thyroid search, Bayesian optimization for probe orientation, force and image-based control for multi-view scanning, and deep learning segmentation to support automated ACR TI-RADS risk stratification. FARUS achieved high-quality images comparable to manual scans, stable and safe probe interactions, and substantial agreement with clinical TI-RADS scoring on shared nodules. The approach demonstrates translational potential for autonomous US screening in clinics and remote settings. Future work will focus on improving detection of small and isoechoic nodules, expanding and diversifying datasets, modeling and mitigating ultrasound artifacts, incorporating video stream analysis, harmonizing probe characteristics with clinical systems, and conducting larger-scale, prospective clinical validations to assess diagnostic accuracy and safety for higher-risk nodules.
- Detection challenges persist for very small nodules (<~4 mm) and low-contrast/isoechoic lesions; some nodules were missed and a few possible false positives occurred.
- Nodule datasets lack sufficient diversity in size and appearance; more heterogeneous data are needed to improve generalization.
- Ultrasound artifacts were not explicitly modeled or handled in the current algorithmic pipeline.
- Analysis focused on frame-wise images; incorporation of temporal (video) information could improve robustness and detection.
- Differences in probe hardware and patient posture (upright vs clinicians’ supine) affected echogenicity/composition and may contribute to scoring discrepancies.
- FARUS scans were slower than manual scans due to conservative speed and safety controls.
- Recruitment was geographically localized; the diagnostic validation cohort was small (19 patients) with predominantly low-risk nodules; no a priori sample size calculation.
- Image quality assessment lacks a gold standard metric; entropy and confidence maps serve as proxies.