Introduction
Machine learning (ML) offers promising advancements in materials characterization, particularly for high-resolution transmission electron microscopy (HRTEM). Supervised learning methods, however, necessitate large, high-quality training datasets, which are often challenging and expensive to obtain experimentally. Manually annotating experimental data is time-consuming and introduces potential human biases. Data simulation provides an attractive alternative, offering high-throughput generation of arbitrarily large datasets with ground-truth annotations, eliminating human error and bias. While previous work has utilized simulated data for TEM analysis, these efforts have often been limited in scope to specific atomic nanostructures or periodic crystals. This work aims to overcome these limitations by developing a flexible framework for generating diverse and complex nanoscale atomic structures, suitable for training robust ML models for HRTEM analysis. The focus is on nanoparticle image segmentation, a crucial preprocessing step for various nanomaterial characterization tasks in HRTEM, where classical image analysis methods often fail due to the complex contrast features at ultra-high magnification.
Literature Review
Recent research has explored the use of simulated training data in TEM analysis. Successful applications include training neural networks on simulated data to analyze crystalline scanning TEM (STEM) and STEM diffraction data, to segment and analyze HRTEM images of 2D materials and nanoparticles, and to denoise HRTEM micrographs. However, these efforts have been confined to specific atomic nanostructures or periodic crystals, limiting the applicability of the resulting models. Current software tools for computational materials science are not designed for generating the complex, diverse nanoscale atomic structures needed to train robust ML models for HRTEM, and this lack of suitable tooling hinders the development of generalizable ML workflows for a wide range of experimental use cases.
Methodology
The authors present Construction Zone (CZ), an open-source Python package designed for the algorithmic, high-throughput generation of arbitrary atomic structures. CZ combines atomic placement (Generators), nano-object creation (Volumes), and nano-object interaction (Scenes) modules with a Transformations module for manipulating objects via symmetry operations and other modifications, and it integrates with existing materials science packages such as PyMatgen, ASE, and WulffPack.

Using CZ, the authors generate a large database of realistic nanostructures and simulate the corresponding HRTEM images with the multislice algorithm implemented in Prismatic. The data generation pipeline spans nanoparticle size, orientation, defects, substrate type, imaging conditions (defocus, aberrations), noise levels (electron dose), thermal effects, plasmonic losses, and more, with extensive metadata tracking to ensure precise control over data distributions.

The generated datasets are then used to train UNet neural networks, built on a ResNet-18 encoder/decoder architecture with a mixed categorical cross-entropy and F1-score loss function, for nanoparticle segmentation. Multiple networks are trained under various conditions to study the impact of dataset characteristics on performance, which is evaluated on three experimental HRTEM datasets of Au and CdSe nanoparticles against previous state-of-the-art results from experimentally trained models. The authors also benchmark the computational cost of each stage of the pipeline on both a workstation and a high-performance computing cluster.
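To make the module structure concrete, here is a minimal sketch of how a CZ-style workflow might be composed in Python. The import paths and call signatures below are assumptions inferred from the module descriptions above, not a verbatim reproduction of the CZ API; consult the package documentation for actual usage.

```python
# Sketch of a CZ-style workflow; czone import paths and signatures are
# assumed from the paper's module descriptions, not verbatim CZ API.
import numpy as np
from pymatgen.core import Lattice, Structure
from czone.generator import Generator    # atomic placement (assumed path)
from czone.volume import Volume, Sphere  # nano-object creation (assumed path)
from czone.scene import Scene            # nano-object interaction (assumed path)

# Bulk FCC gold supplies the sampling rule for atom positions.
au_bulk = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(4.078), ["Au"], [[0, 0, 0]]
)
gen = Generator(structure=au_bulk)

# A Volume clips the generator to a geometric region: a ~2.5 nm sphere.
particle = Volume(alg_objects=[Sphere(radius=12.5)], generator=gen)

# A Scene composes nano-objects, resolves their interactions, and exports
# the coordinates for downstream multislice simulation.
scene = Scene(bounds=np.array([[-30.0] * 3, [30.0] * 3]), objects=[particle])
scene.populate()
scene.to_file("au_nanoparticle.xyz")
```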
Key Findings
The study demonstrates that high-accuracy supervised models for analyzing atomic-resolution HRTEM images can be trained effectively using large, high-quality simulated datasets. The authors' findings highlight several crucial aspects of data curation:
1. **Dataset Size:** Larger datasets (around 4000–8000 images) are needed for optimal performance. Performance saturates at around 8000 images for the large Au nanoparticle dataset.
2. **Simulation Fidelity:** Incorporating simulation effects, such as thermal vibrations, plasmonic losses, and residual aberrations, significantly improves model performance, particularly in data-scarce regimes. These effects can subtly alter the simulated images but have a substantial impact on model accuracy.
3. **Dataset Composition:** Diversity in atomic structures (particularly varying substrate thicknesses) and in imaging conditions (defocus, noise levels) is critical for reducing the variance of model performance and improving generalization across diverse experimental datasets. Varying the substrate thickness reduced performance variance more than increasing the number of unique structures.
4. **Noise:** The study finds a non-monotonic relationship between noise levels and model performance. Noisier training data appears beneficial, particularly for the small Au and CdSe datasets, suggesting a regularization effect in which noise helps the model generalize to noisy experimental data (see the dose-noise sketch after this list). However, the relationship is more complex than a simple Poisson noise model can capture, potentially due to variations in camera statistics across the experimental datasets.
5. **Performance Benchmarks:** The authors achieve state-of-the-art segmentation performance on three experimental benchmarks (large Au, small Au, and CdSe nanoparticles), surpassing previous results obtained with models trained on experimental data. F1-scores of 0.9189 and 0.7516 were achieved for Au and CdSe, respectively, using an optimized simulated dataset. The optimal dataset included a wide variety of atomic structures, imaging conditions, and noise levels, and leveraged a larger training dataset (8000 images).
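As a concrete illustration of the dose-dependent noise in finding 4, below is a minimal sketch of applying Poisson shot noise to a noise-free simulated image at a given electron dose. This is a generic illustration of the principle, not the authors' exact noise pipeline, which would also need to model detector/camera statistics.

```python
import numpy as np

def apply_shot_noise(image, dose, pixel_size, rng=None):
    """Apply Poisson counting (shot) noise to a simulated HRTEM image.

    image:      noise-free intensity, as a fraction of the incident beam.
    dose:       electron dose in e-/A^2.
    pixel_size: pixel edge length in Angstroms.
    """
    rng = np.random.default_rng() if rng is None else rng
    electrons_per_pixel = dose * pixel_size**2     # mean incident counts
    expected_counts = image * electrons_per_pixel  # per-pixel expectation
    noisy_counts = rng.poisson(expected_counts)    # Poisson draw
    return noisy_counts / electrons_per_pixel      # back to intensity units

# Lower dose -> fewer counts -> noisier image.
clean = np.full((256, 256), 0.8)
noisy_low  = apply_shot_noise(clean, dose=50.0,  pixel_size=0.1)
noisy_high = apply_shot_noise(clean, dose=600.0, pixel_size=0.1)
```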
Table 1 summarizes the best F1-scores achieved across the various training datasets. The rightmost columns report the median number of epochs needed to reach validation F1 scores of 0.90 and 0.95 on simulated data; while models may converge very quickly on simulated data, fast convergence does not necessarily correlate with optimal performance on experimental data.
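The F1 term in the training loss mentioned in the Methodology must stay differentiable, which is typically achieved with a "soft" F1 computed from predicted probabilities rather than hard labels. The sketch below is one plausible PyTorch formulation, not the authors' exact implementation; the mixing weight `alpha` is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def mixed_ce_f1_loss(logits, target, alpha=0.5, eps=1e-7):
    """Mixed categorical cross-entropy and soft-F1 segmentation loss.

    logits: (N, C, H, W) raw network outputs.
    target: (N, H, W) integer class labels.
    alpha:  weight between the two terms (assumed hyperparameter).
    """
    ce = F.cross_entropy(logits, target)

    # Soft F1: use probabilities instead of hard predictions so the
    # term remains differentiable.
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()

    tp = (probs * onehot).sum(dim=(0, 2, 3))
    fp = (probs * (1 - onehot)).sum(dim=(0, 2, 3))
    fn = ((1 - probs) * onehot).sum(dim=(0, 2, 3))
    f1 = (2 * tp + eps) / (2 * tp + fp + fn + eps)  # per-class soft F1

    return alpha * ce + (1 - alpha) * (1 - f1.mean())
```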
Figure 4 shows how dataset size and noise level jointly affect model performance. Figure 5 demonstrates the effects of improving simulation fidelity by including thermal effects, residual aberrations, and plasmonic losses. Figure 6 illustrates how varying the substrate thickness, defocus values, and number of unique structures in the training dataset affects model performance.
Computational benchmarks (Table 2) show the time required for each step of the process on both a workstation and a high-performance computing cluster, highlighting the efficiency gains possible with parallel processing.
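Those gains follow from the fact that dataset generation is embarrassingly parallel: each (structure, imaging-condition) pair can be simulated independently. A generic sketch of such a dispatch loop is below; `simulate_hrtem` is a hypothetical stand-in for a multislice run (e.g., a Prismatic call), not the authors' actual pipeline code.

```python
from concurrent.futures import ProcessPoolExecutor

def simulate_hrtem(task):
    """Hypothetical stand-in for one multislice simulation of a
    (structure file, defocus) pair, e.g. a Prismatic run."""
    structure_file, defocus = task
    ...  # build simulation metadata, run multislice, write image + labels
    return structure_file, defocus

# One independent task per structure/defocus combination.
tasks = [(f"structure_{i:04d}.xyz", defocus)
         for i in range(1000) for defocus in (-200.0, 0.0, 200.0)]

if __name__ == "__main__":
    # Tasks share no state, so throughput scales with worker count on a
    # workstation and across nodes on an HPC cluster.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for _ in pool.map(simulate_hrtem, tasks):
            pass
```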
Discussion
The results demonstrate the feasibility of training highly accurate supervised ML models for HRTEM analysis using solely simulated data. The study highlights the importance of a carefully curated dataset: a substantial number of images, high simulation fidelity, and broad coverage of atomic structures and imaging conditions. While simulated data offers significant advantages in dataset size, control over the data distribution, and physics-based ground-truth labels, the authors acknowledge the limitations of the approach. A perfect match between simulated and experimental data distributions is unlikely, leading to potential extrapolation errors and an artificial performance ceiling arising from discrepancies between physics-based and expert-labeled annotations. Future research could explore methods such as CycleGANs to bridge the gap between simulated and experimental data, or active learning strategies that iteratively refine the training set by selecting the most informative data at each step.
Conclusion
Construction Zone provides a robust and flexible framework for generating synthetic datasets for training ML models for HRTEM analysis. The study demonstrates the ability to achieve state-of-the-art performance using purely simulated data, highlighting the importance of dataset size, fidelity, and composition. Future work should explore techniques to further refine the simulation process and integrate active learning strategies for efficient dataset curation.
Limitations
The study focuses primarily on nanoparticle segmentation on amorphous substrates; the generalizability of the findings to other nanomaterials, imaging modalities, or analysis tasks needs further investigation. The current simulation approach does not fully capture all aspects of experimental noise, particularly camera artifacts; modeling these more faithfully might further improve performance. Finally, the ground-truth labels combine physics-based simulations with expert manual annotations, and the inherent uncertainty and potential biases in these labels could limit the overall accuracy achievable.