Introduction
Atomic force microscopy (AFM), particularly frequency-modulation AFM (FM-AFM) with CO-functionalized tips, achieves atomic-scale resolution and reveals the internal structure of molecules through Pauli repulsion contrast. While high-resolution AFM (HR-AFM) has enabled the identification of molecules in specific contexts (e.g., breitfussin A, asphaltenes), general molecular identification from HR-AFM images alone remains a challenge. Solving it requires disentangling the contributions of bonding topology and chemical composition to the image contrast, coping with experimental noise, and handling 3D (non-planar) structures. Previous deep learning (DL) approaches have shown promise but are limited in chemical identification at the single-atom level and in handling 3D structures. Convolutional neural networks (CNNs) have been used for structure determination and electrostatic field prediction, while graph neural networks (GNNs) have been applied to extract molecular graphs. Prior work by the authors demonstrated accurate molecular classification using DL, but that approach is limited to predefined classes and struggles with variations in molecular structure. This paper introduces an approach based on a conditional generative adversarial network (CGAN), a type of model designed for image-to-image translation. Its architecture, comprising a generator and a discriminator, suits the task because the AFM contrast of each atom depends strongly on its local chemical environment.
Literature Review
The literature documents steady advances in AFM techniques for nanoscale imaging and manipulation. CO-functionalized tips in FM-AFM have been instrumental in achieving atomic-scale resolution and revealing detailed molecular structures. Simulation models have helped explain the contrast mechanisms and the factors that shape image formation, including Pauli repulsion, electrostatic forces, and the role of the CO molecule at the tip. HR-AFM has proven valuable in identifying molecules such as breitfussin A and components of asphaltenes. However, general molecular identification from HR-AFM data alone is still largely unsolved. Previous AI-based attempts, primarily employing CNNs, have focused on structure, succeeding with planar molecules but struggling with 3D structures and with distinguishing different chemical species. Variational autoencoders (VAEs) have brought some progress in incorporating features of experimental images, but a comprehensive solution for general molecular identification remains elusive. This study proposes CGANs as a way to overcome these limitations.
Methodology
The authors employed a CGAN model to identify molecules from HR-AFM images. The generator takes as input a stack of 10 constant-height HR-AFM images recorded at different tip-sample distances (spanning 100 pm). The original CGAN architecture was modified by replacing the initial 2D convolutional layers with 3D convolutional layers so that the stack can be processed as a volume. The generator outputs a ball-and-stick representation of the molecule, in which balls represent atoms (with color and size encoding the chemical species) and sticks represent bonds. The discriminator distinguishes real ball-and-stick depictions from those produced by the generator. The QUAM-AFM dataset, which contains simulated AFM images for 686,000 organic molecules, was used for training, validation, and testing, split into 581,000 training, 24,000 validation, and 81,000 test molecules. During training, the AFM simulation parameters were randomized for each input stack to improve robustness and generalization. An image data generator (IDG) augmented the training data with transformations such as rotation, shifting, and flipping, so that the model can cope with variations in image acquisition. The model was trained with a mean absolute error (MAE) loss function, and performance on the validation set was monitored to decide when to stop training.
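The augmentation and loss described above can be illustrated with a minimal sketch. It is not the authors' code: the function names `augment_pair` and `mean_absolute_error`, the 50% flip probability, the shift range, and the cyclic shift (a real image data generator would typically pad instead of wrapping) are all assumptions made for illustration. The point it shows is that one random transformation has to be applied identically to every image in the stack and to the target depiction, otherwise the input and the target no longer correspond.

```python
import random

# Split sizes quoted in the text; they add up to the full dataset.
SPLIT = {"train": 581_000, "validation": 24_000, "test": 81_000}
assert sum(SPLIT.values()) == 686_000


def augment_pair(stack, target, rng, max_shift=2):
    """Apply ONE random flip and horizontal shift to all slices and the target.

    stack  : list of 2D images (lists of rows), one per tip-sample distance
    target : 2D image of the ball-and-stick depiction
    rng    : random.Random instance, so results are reproducible
    """
    flip = rng.random() < 0.5                 # assumed flip probability
    dx = rng.randint(-max_shift, max_shift)   # assumed shift range, in pixels

    def transform(img):
        rows = [row[::-1] if flip else row[:] for row in img]
        # Cyclic shift keeps the sketch short (simplifying assumption).
        return [row[-dx:] + row[:-dx] if dx else row for row in rows]

    return [transform(s) for s in stack], transform(target)


def mean_absolute_error(pred, target):
    """Mean absolute pixel difference between two equally sized 2D images."""
    diffs = [abs(p - t)
             for prow, trow in zip(pred, target)
             for p, t in zip(prow, trow)]
    return sum(diffs) / len(diffs)


# Toy usage: 10 identical 2x4 slices stand in for an AFM image stack.
toy_target = [[1, 2, 3, 4], [5, 6, 7, 8]]
toy_stack = [[row[:] for row in toy_target] for _ in range(10)]
aug_stack, aug_target = augment_pair(toy_stack, toy_target, random.Random(0))

print(mean_absolute_error([[0, 1], [2, 3]], [[1, 1], [2, 5]]))  # 0.75
```

Because the toy slices start out identical to the target, every augmented slice should still equal the augmented target, which is an easy way to check that the transformation was shared across the stack.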
Key Findings
The CGAN model identified molecules accurately from both simulated and experimental AFM images. Tests on 3015 randomly selected structures from the test set showed high accuracy in predicting both structure and chemical composition. The model correctly identified complex molecules, including those whose images are distorted by charge accumulation around electronegative atoms. Accuracy decreased as the height differences within a molecule increased, likely because the current AFM setup yields little information from lower-lying regions of the molecule. Training with gas-phase structures rather than adsorbed configurations nevertheless proved beneficial: it allowed the model to learn local relationships between chemical species and height, which helps it generalize to different adsorption configurations. Error analysis revealed some recurring misclassifications, such as mistaking oxygen for fluorine in certain environments, which underlines how hard it is to distinguish atoms with similar charge distributions and sizes in HR-AFM images.
The model was also tested on experimental AFM images from published studies. Despite the scarcity of suitable data and the variability in experimental conditions, the CGAN identified the molecules in several cases, predicting both structure and chemical composition accurately and in some cases surpassing human experts. It was robust against variations in AFM operation mode and against tip asymmetries, identifying molecules in images acquired with different AFM techniques. In some instances, performance on experimental images even exceeded performance on simulated images, likely because substrate-molecule interactions introduce additional structural information.
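Recurring confusions such as oxygen being read as fluorine can be surfaced by counting mismatched (true, predicted) element pairs. The sketch below is one hedged way to do this, not the analysis used in the study: the helper `confusion_counts` is hypothetical, the labels are invented toy data, and it assumes that atoms in the predicted depiction have already been matched one-to-one with atoms in the reference structure.

```python
from collections import Counter


def confusion_counts(true_species, predicted_species):
    """Count (true, predicted) element pairs for atoms that were misclassified."""
    if len(true_species) != len(predicted_species):
        raise ValueError("atom lists must be matched one-to-one")
    return Counter(
        (t, p) for t, p in zip(true_species, predicted_species) if t != p
    )


# Toy labels, invented for illustration only (not results from the paper).
true_atoms = ["C", "O", "O", "N", "F", "O"]
pred_atoms = ["C", "F", "O", "N", "F", "F"]

errors = confusion_counts(true_atoms, pred_atoms)
print(errors.most_common(1))  # [(('O', 'F'), 2)]
```

Keeping the pair ordered as (true, predicted) separates "oxygen read as fluorine" from the reverse error, which matters when a confusion is asymmetric.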
Discussion
The results demonstrate the potential of the proposed CGAN model for molecular identification from HR-AFM images. Its ability to translate AFM image data into accurate ball-and-stick depictions is a significant step toward general molecular identification, and the high accuracy achieved on both simulated and experimental data points to a robust tool for analyzing complex molecular systems. Its tolerance of variations in experimental conditions and AFM operation modes further broadens its applicability. The observed limitations, namely lower accuracy for highly corrugated structures and occasional misclassification of similar chemical species, indicate where improvement is needed. Training on gas-phase structures, although it cannot capture substrate-molecule interactions, proved effective in improving the model's generalization. Future research should address these limitations, potentially by incorporating more sophisticated image processing techniques, integrating additional data sources, or developing new AFM operation modes.
Conclusion
This work presents a CGAN-based approach for molecular identification from HR-AFM images. The model achieves high accuracy in determining both the structure and the chemical composition of molecules from simulated and experimental data, demonstrating its potential as a tool for chemical analysis. Future research could explore strategies for improving accuracy on highly corrugated molecules, potentially by integrating Bayesian inference or density functional theory (DFT) calculations. More advanced AFM operation modes could further improve data quality and, consequently, the model's performance.
Limitations
The accuracy of the model decreases as molecular height differences increase, because current AFM setups provide little information from the lower-lying regions of 3D structures. The model occasionally misclassifies similar chemical species (e.g., oxygen and fluorine), which shows the need for further refinement in distinguishing atoms with similar charge distributions. Experimental AFM datasets with sufficiently large image stacks and consistent experimental parameters are still scarce. The training dataset, while extensive, might not fully represent the diversity of possible molecular configurations and substrate interactions.