
A robust synthetic data generation framework for machine learning in high-resolution transmission electron microscopy (HRTEM)
L. R. DaCosta, K. Sytwu, et al.
Explore how the Python package Construction Zone enables the generation of complex nanoscale atomic structures, supporting the creation of diverse synthetic datasets for training machine learning models to analyze HRTEM images. This research, from Luis Rangel DaCosta, Katherine Sytwu, C. K. Groschner, and M. C. Scott, achieves state-of-the-art nanoparticle image segmentation using solely simulated training data.
Introduction
The paper addresses how to train robust, high-accuracy machine learning models for atomic-resolution HRTEM analysis when large, well-annotated experimental datasets are costly, biased, or impractical to obtain. Supervised ML performance depends strongly on the training data distribution and often generalizes poorly out-of-distribution, making dataset design central to success. Synthetic data from physics-based simulations offer reproducible labels, scalability, and controlled coverage of experimental conditions, but prior TEM ML efforts often relied on limited sets of simple or periodic structures, restricting applicability. The authors propose to overcome these limitations by creating a flexible atomic structure generation tool, Construction Zone (CZ), enabling high-throughput sampling of complex, realistic nanoscale structures and pairing it with HRTEM simulation to build large synthetic datasets. They investigate how simulation fidelity, structural distributions, and imaging-condition distributions in curated synthetic datasets affect neural-network segmentation performance on experimental nanoparticle HRTEM images, aiming to establish robust data curation strategies for experimental deployment using only simulated training data.
Literature Review
The study situates itself within prior work that used simulated data to train deep learning models for various TEM modalities (STEM, diffraction, HRTEM denoising and segmentation). These advances leveraged high-throughput, realistic TEM simulations but typically targeted narrow structure classes (e.g., periodic crystals, few nanostructure types), limiting generalization. Classical segmentation approaches (e.g., Fourier filtering, unsupervised k-means) perform poorly at atomic-resolution HRTEM due to complex contrast and noise, motivating supervised ML. The authors highlight modern simulation tools (e.g., Prismatic, abTEM, MULTEM) and materials toolkits (PyMatgen, ASE, WulffPack) as enabling infrastructure, while noting the gap in general-purpose tools for sampling arbitrary distributions of complex nanoscale structures. Their work aims to broaden this capability and systematically study how curated synthetic data choices impact experimental performance.
Methodology
Data generation and simulation pipeline:
- Structure generation: Use Construction Zone (CZ) to algorithmically sample complex nanoscale structures. For benchmarks, generate several thousand spherical Au nanoparticles with random radii, orientations, positions, and planar defects (twin defects or stacking faults), each placed on a unique amorphous carbon substrate. CZ modules: Generators (atom placement in crystalline/non-crystalline arrangements), Volumes (convex regions; unions allow non-convex objects), Transformations (symmetry operations, strains, chemistry modifications), and Scenes (object interactions, precedence, export). Auxiliary modules include surface analysis (alpha-shape derivatives), utilities (RDF, orientation sampling), and prefabs (e.g., Wulff nanoparticles, multigrain structures). A structure-generation sketch follows this list.
- HRTEM simulation: Use Prismatic multislice at 300 kV and 0.02 nm/pixel resolution. Compute exit wavefunctions for the full structure and for the nanoparticle in vacuum. Apply imaging conditions via the aberration function (defocus, residual aberrations) and focal spread by incoherent averaging over focal points. Include thermal effects via the frozen-phonon approximation (atomic displacements from Debye–Waller factors; eight frozen phonons used). Approximate plasmonic losses by contrast reduction (a weighted mean of the original intensity and a background-adjusted intensity). Model dose-dependent noise via scaled Poisson sampling. Camera MTF effects are not included (the benchmark datasets used scintillator detectors). An image-formation and labeling sketch also follows this list.
- Ground-truth labels: Threshold the phase of the nanoparticle-in-vacuo exit wave (averaged over frozen phonons) to generate segmentation masks. Phase thresholding was found more stable across defocus than intensity thresholding.
- Data curation: From each simulated structure, sample multiple images across varied imaging conditions (defocus distribution, residual aberrations, focal spread), thermal/plasmonic configurations, and noise levels (dose). Extensively track metadata at each stage to enable targeted sampling of training distributions.
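To ground the Generator/Volume workflow described above, here is a minimal sketch that carves a randomly sized, randomly oriented spherical Au nanoparticle out of an FCC lattice. It is written against ASE (one of the toolkits cited in the Literature Review) rather than CZ's own API, and the radius range, seed, and output filename are illustrative assumptions, not values from the paper.

```python
import numpy as np
from ase.build import bulk

# Carve a spherical Au nanoparticle from an FCC supercell: a simplified
# stand-in for composing a crystalline Generator with a spherical Volume.
rng = np.random.default_rng(0)
radius_nm = rng.uniform(1.0, 2.5)  # hypothetical sampling range, in nm

supercell = bulk("Au", "fcc", a=4.078, cubic=True).repeat((16, 16, 16))
center = supercell.get_positions().mean(axis=0)
dists = np.linalg.norm(supercell.get_positions() - center, axis=1)
particle = supercell[np.flatnonzero(dists < radius_nm * 10.0)]  # ASE uses angstroms

# Random orientation, mimicking a CZ Transformation.
particle.rotate(rng.uniform(0.0, 360.0), "z", center="COP")
particle.rotate(rng.uniform(0.0, 360.0), "x", center="COP")
particle.write("au_nanoparticle.xyz")
```

In CZ itself, the particle would then be composed with the amorphous carbon substrate in a Scene, with object precedence resolving overlaps before export to the simulation code.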
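Similarly, the imaging chain and labeling step can be sketched in a few lines of numpy. This uses the textbook defocus-only aberration function χ(k) = πλΔf·k²; focal spread (incoherent averaging over defoci) and frozen-phonon averaging would wrap this function in loops. Unit conventions and the phase threshold are assumptions rather than the paper's exact implementation.

```python
import numpy as np

def hrtem_image(exit_wave, defocus_nm, dose, contrast=1.0,
                px_nm=0.02, lam_nm=1.97e-3, rng=None):
    """Toy HRTEM image formation: defocus aberration, plasmonic contrast
    reduction, and scaled Poisson noise. lam_nm is the 300 kV wavelength."""
    rng = rng or np.random.default_rng()
    ny, nx = exit_wave.shape
    kx = np.fft.fftfreq(nx, d=px_nm)
    ky = np.fft.fftfreq(ny, d=px_nm)
    k2 = kx[None, :] ** 2 + ky[:, None] ** 2

    # Defocus-only aberration function: chi(k) = pi * lambda * df * k^2.
    chi = np.pi * lam_nm * defocus_nm * k2
    psi = np.fft.ifft2(np.fft.fft2(exit_wave) * np.exp(-1j * chi))
    intensity = np.abs(psi) ** 2

    # Plasmonic losses approximated as a weighted mean with the background level.
    intensity = contrast * intensity + (1.0 - contrast) * intensity.mean()

    # Dose-dependent shot noise: expected counts = intensity * dose * pixel area
    # (dose assumed in electrons per nm^2).
    area = px_nm ** 2
    return rng.poisson(intensity * dose * area) / (dose * area)

def phase_mask(exit_wave_vacuum, threshold_rad=0.5):
    """Label generation: threshold the phase of the particle-in-vacuum exit
    wave (the paper averages over frozen phonons; the threshold is a placeholder)."""
    return (np.angle(exit_wave_vacuum) > threshold_rad).astype(np.uint8)
```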
Neural network training and evaluation:
- Architecture: UNet with ResNet-18 encoder/decoder, three pooling/upsampling stages (~14M parameters). A PyTorch sketch of the model, loss, and optimizer follows this list.
- Loss: Mixed categorical cross-entropy + F1-score loss on segmentation masks.
- Optimization: Adam with initial learning rates 0.01 or 0.001, learning rate decay 0.8, batch size 16, trained for 25 epochs without early stopping; best validation loss checkpoint saved.
- Augmentations: Per-image min-max normalization to [0,1], orthogonal rotations, random flips. Experimental images preprocessed with 3×3 median filter and min-max normalization.
- Experimental benchmarks: Three datasets at ~0.02 nm/pixel: (i) large Au nanoparticles (~5 nm), (ii) small Au nanoparticles (~2.2 nm), (iii) small CdSe nanoparticles (~2 nm). Prior best F1 using experimental training data reported as 0.89 (large Au), 0.75 (small Au), 0.59 (CdSe).
- Training protocol: For each dataset condition, draw training samples i.i.d. from simulation databases; train at least five networks with different random initializations and data subsets. Save training histories and metadata. Evaluate with a nondiscretized F1 metric (a preprocessing and evaluation sketch also follows this list).
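As a concrete reference for the architecture, loss, and optimizer bullets above, here is a minimal PyTorch sketch. The segmentation_models_pytorch UNet is a stand-in (its ResNet-18 encoder defaults to five stages rather than the paper's three), and the soft F1 formulation and equal loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F
import segmentation_models_pytorch as smp

# ResNet-18 UNet stand-in; the paper's three-stage variant may differ.
model = smp.Unet(encoder_name="resnet18", encoder_weights=None,
                 in_channels=1, classes=2)

def soft_f1_loss(logits, target, eps=1e-7):
    """Differentiable F1 loss on foreground probabilities (assumed form)."""
    probs = torch.softmax(logits, dim=1)[:, 1]
    tp = (probs * target).sum(dim=(1, 2))
    fp = (probs * (1.0 - target)).sum(dim=(1, 2))
    fn = ((1.0 - probs) * target).sum(dim=(1, 2))
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + eps)
    return 1.0 - f1.mean()

def mixed_loss(logits, target):
    """Categorical cross-entropy plus F1 loss; equal weights are an assumption."""
    return F.cross_entropy(logits, target.long()) + soft_f1_loss(logits, target.float())

# Adam with exponential learning-rate decay (gamma = 0.8), stepped once per epoch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.8)
```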
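The preprocessing and evaluation steps are equally compact; in the sketch below, the "nondiscretized" F1 scores raw foreground probabilities without binarizing them, an interpretation inferred from the metric's name.

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess_experimental(img):
    """3x3 median filter followed by per-image min-max normalization to [0, 1]."""
    img = median_filter(img, size=3).astype(np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-12)

def nondiscretized_f1(probs, mask, eps=1e-7):
    """F1 computed directly on continuous foreground probabilities."""
    tp = float((probs * mask).sum())
    fp = float((probs * (1 - mask)).sum())
    fn = float(((1 - probs) * mask).sum())
    return 2.0 * tp / (2.0 * tp + fp + fn + eps)
```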
Computational considerations:
- Benchmarks were run on a workstation (Xeon Gold 6130 CPU, Quadro P5000 GPU) and on a Perlmutter CPU node. Representative timings:
- Structure generation: ~15.2 s per structure pair (workstation) vs ~4 s (Perlmutter).
- HRTEM simulation: ~1.125 s per frozen phonon (workstation) vs ~0.294 s (Perlmutter).
- Image generation: ~1.4 s per image (workstation GPU) vs ~0.108 s (Perlmutter).
- Model training: ~15 s per 512 images per epoch (GPU); inference: ~0.014 s per 512×512 patch (GPU).
- The pipeline is embarrassingly parallel; FFT operations dominate simulation and image-generation cost.
Key Findings
- Synthetic-only training achieves state-of-the-art experimental performance:
- Maximum F1 on large Au benchmark: 0.9189 (surpasses prior 0.89).
- Maximum F1 on CdSe benchmark: 0.7516 (surpasses prior 0.59).
- Optimized mixed Au/CdSe dataset yields strong cross-dataset generalization; e.g., small Au reaches F1 ≈ 0.863.
- Dataset size effects:
- On large Au, performance increases with dataset size and saturates around F1 ≈ 0.90 with ~8000 images; variability across random initializations decreases with size.
- Noise (dose) sensitivity:
- Performance depends strongly on noise; training with lower doses (noisier images) can improve robustness and yields models with stable performance across higher doses. Models trained at the lowest dose often generalize well across noise levels. Camera noise deviations from pure Poisson likely contribute to dataset-specific trends.
- Simulation fidelity boosts accuracy:
- Including thermal averaging (frozen phonons), residual aberrations, and plasmonic losses can improve F1 by ~0.10–0.15 compared to a baseline without these effects, especially in data-scarce regimes (N=512). Thermal averaging is the most computationally expensive; aberrations and plasmonic modeling are relatively cheap and beneficial.
- Composition and diversity:
- Varying substrate thicknesses in training data improves mean performance and reduces variance versus a single thickness (at fixed N=1024).
- Increasing the number of unique defocus points modestly increases accuracy and strongly reduces variance.
- Increasing the number of unique nanoparticle structures has limited impact if imaging-condition diversity is already broad.
- Training dynamics:
- Networks rapidly achieve high validation F1 on simulated data (>0.90 in few epochs), while experimental performance improves more slowly and continues rising after simulated validation saturates, indicating simulated validation is not a reliable proxy for experimental performance.
- Table 1 highlights (F1 on experimental datasets; columns: ~5 nm Au | ~2.2 nm Au | ~2 nm CdSe):
- Baseline (512 images): 0.710 | 0.740 | 0.681
- Thermal effects (512 images): 0.727 | 0.767 | 0.621
- All simulation effects (512 images): 0.822 | 0.814 | 0.647
- Smaller nanoparticles (1024 images): 0.833 | 0.809 | 0.673
- Varying substrate thickness (1024 images): 0.885 | 0.842 | 0.620
- Optimized Au (8000 images): 0.915 | 0.808 | 0.648
- Optimized mixed Au/CdSe (8000 images): 0.884 | 0.863 | 0.731
- Optimized CdSe (8000 images): 0.799 | 0.852 | 0.752
- Median epochs to reach validation F1 0.90/0.95 on simulated data are small (e.g., 1–3 epochs for optimized sets), underscoring the gap between simulated and experimental performance.
Discussion
The study demonstrates that carefully curated, fully synthetic training data can train supervised neural networks to match and surpass state-of-the-art segmentation performance on atomic-resolution HRTEM images of nanoparticles. The results directly address the central challenge of limited, biased, or costly experimental labels by providing a reproducible, controllable pipeline that matches critical experimental factors. Key insights:
- Matching experimental distributions via simulation fidelity (thermal, aberrations, plasmonic losses) and appropriate noise models materially improves experimental performance.
- Diversity in relevant structural factors (especially substrate thickness) and imaging conditions (defocus, focal spread) primarily reduces variance and stabilizes optimization, ensuring reliable training outcomes across random seeds.
- Dataset size is important, with performance saturating around several thousand images; however, smart curation can recover similar performance with fewer images, reducing development cost.
- High simulated validation scores do not predict experimental performance; direct benchmarking on experimental data is essential.
These findings highlight practical strategies for mitigating distribution shift between synthetic and experimental data. They also emphasize the importance of priors: when the simulated training distribution approximates the experimental prior, models generalize better and benefit from diversity as a form of regularization. The CZ tool is pivotal for sampling realistic structural priors and enabling iterative tuning (potentially within active learning loops) to bring the synthetic distribution closer to the experimental one.
Conclusion
The authors present Construction Zone (CZ), a general-purpose, programmatic framework for generating complex nanoscale atomic structures, and pair it with high-throughput HRTEM simulation to create large, metadata-rich synthetic datasets. Using this end-to-end workflow, they train UNet-based segmentation models that achieve state-of-the-art performance on multiple experimental HRTEM nanoparticle datasets using only simulated training data. They provide actionable data curation guidelines: use sufficiently large datasets (~4k–8k images), include realistic simulation effects (thermal, residual aberrations, plasmonic losses), sample broad and relevant imaging-condition and structural diversity (especially substrate thickness and defocus), and include noise variations. Future directions include improving noise and detector modeling (e.g., camera MTF), integrating generative or domain-adaptation methods to bridge residual simulation–experiment gaps, leveraging active learning to automatically tune synthetic priors, and exploring problem-specific architectures and loss functions to further enhance robustness and accuracy.
Limitations
- Label mismatch: Training labels are physics-based (simulation-derived) while benchmarks use expert manual labels; discrepancies introduce an artificial performance ceiling and potential bias.
- Noise/Detector modeling: The noise model is Poissonian without explicit camera MTF or detector-specific artifacts; this may limit fidelity for certain datasets.
- Structure approximations: Benchmark simulations approximate nanoparticles as spherical Au with limited planar defects on amorphous carbon; real experimental structures may be more diverse.
- Computational cost: Thermal averaging (frozen phonons) substantially increases simulation time and memory; large datasets (thousands of images) are often needed for peak performance.
- Distribution shift: When simulated and experimental priors diverge, curation effects can be obscured and overall performance degrades; careful matching is required.
- Validation proxy gap: High validation performance on simulated data does not reliably predict experimental performance, necessitating experimental benchmarks during development.