SynthVision - Harnessing Minimal Input for Maximal Output in Computer Vision Models using Synthetic Image Data


Y. Kularathne, P. Janitha, et al.

Discover SynthVision, an approach developed by Yudara Kularathne, Prathapa Janitha, Sithira Ambepitiya, Thanveer Ahamed, Dinuka Wijesundara, and Prarththanan Sothyrajah for building disease detection models from minimal data. From just 10 real guide images, the method generated 500 clinically validated synthetic images, which were used to train a classifier that detects HPV genital warts with high precision and recall. This technique could enable rapid development of diagnostic tools in medical emergencies.

Introduction
The study addresses the challenge of rapidly developing computer vision models for detecting visual symptoms of emerging diseases when large, labeled datasets are unavailable. Traditional data collection is slow, hindering timely deployment of diagnostic tools during outbreaks or bioterror events. The authors propose SynthVision, an approach that uses diffusion models to generate clinically accurate synthetic medical images from a minimal number of guide images, enabling fast dataset expansion and model training. The purpose is to demonstrate that models trained predominantly on synthetic images can achieve high diagnostic performance, improving responsiveness to emergent healthcare challenges.
Literature Review
Prior synthetic image generation in healthcare often relied on GANs (e.g., Deep Pix2Pix GAN) to produce realistic images but suffered from limited diversity, unstable training, and large data requirements. Diffusion probabilistic models (e.g., DDPMs) have emerged as strong alternatives, generating diverse, high-fidelity medical images (e.g., MRI/CT) and improving downstream tasks under data scarcity. Nichol and Dhariwal improved denoising diffusion models for efficient high-quality image synthesis. Diffusion approaches have also been applied to mixed-type EHR synthesis, underscoring their versatility. These advances motivate using diffusion models to rapidly create clinically relevant images for training robust medical vision models under limited real data conditions.
Methodology
Two-phase experimental design.

Phase 1: Diffusion model fine-tuning and synthetic image generation.
- Data: 10 clinically validated guide images of HPV genital warts covering the disease spectrum, each paired with a detailed clinical text prompt.
- Personalization: DreamBooth fine-tuning of Stable Diffusion 1.5 using a unique identifier and class name, with detailed clinical descriptions rather than generic prompts. A class-specific prior preservation loss was applied to encourage diverse but class-consistent images.
- Super-resolution fine-tuning: trained on paired low- and high-resolution images to preserve subtle clinical details.
- Clinical guidance: early trials had difficulty rendering the broader genital area, so generation was refocused on the warts themselves to improve clinical accuracy.
- Training setup: Tesla P100 (16 GB); 2000 UNet steps (learning rate 2e-6); 350 text-encoder steps (learning rate 4e-7); 512×512 image resolution; checkpoints every 500 steps. (A fine-tuning sketch follows this section.)
- Generation and curation: 630 synthetic images were produced across multiple prompts (30–50 per prompt); a generation sketch also follows below. A physician reviewed all images and excluded 130 as clinically inaccurate, yielding 500 clinically accurate synthetic images for training.

Phase 2: Computer vision model development.
- Task: binary classification of HPV genital warts vs. normal (disease negative).
- Datasets:
  - Training: 500 synthetic HPV images (clinically validated) + 500 real normal images.
  - Validation: 50 real HPV + 50 real normal images.
  - Test: 70 real HPV + 70 real normal images (never seen during training or validation), curated to cover variation and severity.
- Model architecture: Vision Transformer ViT-Base-Patch16-224 with added attention dropout; input size 224×224.
- Hyperparameter tuning: epochs explored from 20 to 200 (best at 150); learning rates from 1e-2 to 1e-5 (best 1e-4); optimizers compared (Adam vs. RMSprop), with RMSprop selected.
- Final training settings: image size 224×224; batch size 64; 150 epochs; learning rate 1e-4; RMSprop optimizer. (A classifier training sketch follows after the generation sketch below.)
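The paper fine-tunes Stable Diffusion 1.5 with DreamBooth and a class-specific prior preservation loss. Below is a minimal sketch of one such training step using the Hugging Face diffusers library; the checkpoint name, the AdamW optimizer, and the prior-loss weight of 1.0 are assumptions not stated in the paper, and for brevity the sketch trains only the UNet (the study also fine-tuned the text encoder, for 350 steps at learning rate 4e-7).

```python
# Sketch of a DreamBooth training step with prior preservation, assuming
# the diffusers library; model ID and prior_loss_weight are assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed SD 1.5 checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)           # the VAE stays frozen
text_encoder.requires_grad_(False)  # text encoder handled in its own phase
optimizer = torch.optim.AdamW(unet.parameters(), lr=2e-6)  # paper's UNet lr
prior_loss_weight = 1.0  # weight of the class-prior term (assumption)

def train_step(instance_pixels, instance_ids, class_pixels, class_ids):
    """One step: instance (guide-image) loss + prior-preservation loss."""
    # Concatenate equal-sized instance and class-prior batches.
    pixels = torch.cat([instance_pixels, class_pixels])
    ids = torch.cat([instance_ids, class_ids])

    # Encode to latents and add noise at a random timestep.
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)

    # Predict the noise, conditioned on the clinical text prompts.
    hidden = text_encoder(ids)[0]
    pred = unet(noisy, t, encoder_hidden_states=hidden).sample

    # Split back into the instance half and the class-prior half.
    pred_inst, pred_prior = pred.chunk(2)
    noise_inst, noise_prior = noise.chunk(2)
    loss = (F.mse_loss(pred_inst, noise_inst)
            + prior_loss_weight * F.mse_loss(pred_prior, noise_prior))

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Once fine-tuned, candidate images can be sampled from the personalized checkpoint. The sketch below is illustrative only: the checkpoint path, the "sks" identifier token, and the prompt wording are hypothetical placeholders rather than the authors' actual clinical prompts; the study generated 30–50 images per prompt (630 in total) before physician review.

```python
# Sampling candidate synthetic images from the fine-tuned checkpoint.
# Paths, identifier token, and prompts are hypothetical placeholders.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/dreambooth-checkpoint",  # hypothetical output of fine-tuning
    torch_dtype=torch.float16,
).to("cuda")

clinical_prompts = [
    "photo of sks genital wart, early stage, single small lesion",  # illustrative
    "photo of sks genital wart, severe, clustered lesions",         # illustrative
]

os.makedirs("candidates", exist_ok=True)
count = 0
for prompt in clinical_prompts:
    for _ in range(40):  # the study used 30-50 images per prompt
        image = pipe(prompt, height=512, width=512).images[0]
        image.save(f"candidates/img_{count:04d}.png")
        count += 1
# Every candidate was then reviewed by a physician; clinically
# inaccurate images (130 of 630) were excluded before training.
```

For Phase 2, the sketch below sets up the classifier with the paper's final hyperparameters. It assumes the Hugging Face checkpoint "google/vit-base-patch16-224" as the ViT-Base-Patch16-224 backbone; the attention dropout rate of 0.1 is an assumption (the paper adds attention dropout but does not report the rate), and data loading is omitted.

```python
# ViT-Base-Patch16-224 binary classifier; the dropout rate and checkpoint
# name are assumptions, other hyperparameters follow the paper.
import torch
from torch.optim import RMSprop
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=2,                       # HPV genital warts vs. normal
    attention_probs_dropout_prob=0.1,   # added attention dropout (rate assumed)
    ignore_mismatched_sizes=True,       # swap out the 1000-class ImageNet head
)
optimizer = RMSprop(model.parameters(), lr=1e-4)  # selected over Adam
criterion = torch.nn.CrossEntropyLoss()

def train(model, loader, epochs=150):
    """Final settings from the paper: 224x224 inputs, batch size 64, 150 epochs."""
    model.train()
    for _ in range(epochs):
        for pixel_values, labels in loader:  # loader yields 224x224 image batches
            logits = model(pixel_values=pixel_values).logits
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```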
Key Findings
On the held-out test set of 140 real images (70 HPV, 70 normal):
- Confusion matrix: TP = 66, FN = 4, FP = 0, TN = 70.
- Accuracy: 97% (136/140).
- Per-class precision/recall/F1 (support 70 each):
  - HPV: precision 1.00, recall 0.94, F1 0.97.
  - Normal: precision 0.95, recall 1.00, F1 0.97.
- ROC AUC: 0.993.

These results indicate high sensitivity to HPV cases with zero false positives (perfect specificity on this test set) and balanced precision and recall across classes, despite the HPV training class consisting exclusively of synthetic images. The sketch below shows how each figure follows from the confusion matrix.
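To make the arithmetic behind these figures explicit, the short script below recomputes each reported metric directly from the confusion matrix.

```python
# Recomputing the reported metrics from the confusion matrix
# (TP=66, FN=4, FP=0, TN=70), where "positive" = HPV.
tp, fn, fp, tn = 66, 4, 0, 70

accuracy = (tp + tn) / (tp + fn + fp + tn)  # 136/140 = 0.971
hpv_precision = tp / (tp + fp)              # 66/66   = 1.000
hpv_recall = tp / (tp + fn)                 # 66/70   = 0.943
hpv_f1 = 2 * hpv_precision * hpv_recall / (hpv_precision + hpv_recall)  # 0.971

normal_precision = tn / (tn + fn)           # 70/74   = 0.946
normal_recall = tn / (tn + fp)              # 70/70   = 1.000
normal_f1 = 2 * normal_precision * normal_recall / (normal_precision + normal_recall)  # 0.972

print(f"accuracy = {accuracy:.3f}")
print(f"HPV    P/R/F1 = {hpv_precision:.2f}/{hpv_recall:.2f}/{hpv_f1:.2f}")
print(f"normal P/R/F1 = {normal_precision:.2f}/{normal_recall:.2f}/{normal_f1:.2f}")
```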
Discussion
The findings validate the hypothesis that minimal real data can be leveraged to generate high-quality synthetic images for training effective diagnostic models. By fine-tuning diffusion models (DreamBooth) with only 10 guide images and curating outputs with clinical oversight, the authors trained a ViT classifier that achieved high accuracy, sensitivity, and specificity on real held-out data. This supports the feasibility of rapid dataset synthesis and model deployment during medical emergencies, addressing the bottleneck of data scarcity. The strong performance across metrics and the high AUC underscore the method’s robustness and potential clinical utility, contingent upon further validation on broader, diverse real-world datasets.
Conclusion
SynthVision demonstrates that diffusion model–generated synthetic images from minimal inputs can effectively train high-performing medical vision models. Using only 10 guide images to produce a curated set of 500 synthetic HPV images, the resulting classifier achieved 97% accuracy, high F1-scores, and near-perfect AUC on real test data. This approach enables rapid development and deployment of diagnostic tools in data-limited, emergent scenarios. Future work should validate generalizability across institutions, extend the approach to other diseases and imaging modalities (e.g., chest X-ray, CT, MRI, ultrasound), and further optimize efficiency and computational cost.
Limitations
Generalizability may be limited by the small number of initial guide images and potential biases in the synthetic generation and curation process. The training used synthetic data for the positive class and real data for the negative class, which may introduce domain shift. Results are based on a relatively small, curated test set; broader multi-center validation and assessment against more varied real-world conditions are needed. Computational demands for diffusion fine-tuning and super-resolution may also constrain rapid deployment in resource-limited settings.