TomoTwin: generalized 3D localization of macromolecules in cryo-electron tomograms with structural data mining
G. Rice, T. Wagner, et al.
Cryo-electron tomography (cryo-ET) enables in situ visualization of macromolecular complexes but accurate 3D localization of particles for subtomogram averaging (STA) remains a major bottleneck due to low SNR, missing wedge artifacts, and crowded cellular environments. Existing deep learning pickers for cryo-ET typically rely on supervised training with large, protein-specific annotations and therefore do not generalize to unseen proteins, limiting usability and throughput. The research question addressed is whether a single generalized, retraining-free model can learn a representation of 3D macromolecular shape that enables accurate de novo localization of diverse proteins in tomograms. The study proposes TomoTwin, a deep metric learning approach that embeds tomogram subvolumes into a high-dimensional space where macromolecules cluster by structural similarity, enabling both reference-guided and de novo clustering-based particle picking.
Prior cryo-ET particle picking approaches include template matching (for example in EMAN2, Dynamo, and PyTom), which is widely used for its throughput but can be less accurate, and supervised CNN-based 3D pickers (for example 3D U-Net architectures and tools such as DeepFinder and VP-Detector), which require large, protein-specific annotated datasets and retraining for each target, hindering generalization and usability. In single-particle cryo-EM (2D), generalized pickers based on deep metric learning and related embeddings have seen successful adoption, but translation to 3D tomograms has been challenging. Deep metric learning has demonstrated strong generalization in other domains by learning embeddings that arrange classes by similarity without retraining.
Overview and workflows: TomoTwin uses a 3D CNN trained with deep metric learning to embed overlapping tomogram subvolumes into a 32-dimensional hyperspherical embedding space, where subvolumes containing the same macromolecule cluster together and different macromolecules lie far apart. Two workflows are supported: (1) reference-based: a single visually identified example particle per protein is embedded to create a target embedding, which is used to compute similarity maps and picks; (2) clustering-based: the global embedding is approximated on a 2D manifold (UMAP), clusters are outlined interactively, and the mean embedding of the selected points serves as the target, enabling de novo picking without a priori references.
Data generation and simulation: Training used simulated tomograms generated with TEM Simulator and reconstructed in IMOD. A curated set of 120 structurally dissimilar proteins (30 kDa–2.7 MDa), together with membranes, fiducials, and noise, made up 123 data classes. Eleven protein sets were simulated, with 8 tomograms per set (512×512×200 voxels at 1.02 nm per voxel) and varying densities. The training/validation split used 800/200 subvolumes per class (98,400 training and 24,600 validation subvolumes in total). Detector and microscope parameters emulated a modern Gatan K3 and contemporary TEM optics. Additional generalization tomograms (including crowded scenarios) and parameter sweeps (tilt range, dose) were simulated for evaluation.
Network architecture: The network is a 3D CNN with five convolutional blocks (3×3×3 kernels) that use group normalization and leaky ReLU. Max pooling (2×2×2) follows the first block, and adaptive max pooling to 2×2×2 follows the final block. A 37×37×37×1 input is transformed into a 2×2×2×1024 tensor, which is flattened and passed through dropout and fully connected layers to produce a 32D feature vector; this vector is L2-normalized so that it lies on a hypersphere in the 32D embedding space.
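The hyperspherical embedding and the two kinds of target (a single reference, or the mean of a selected cluster) can be illustrated with a minimal sketch. This is not TomoTwin's code: the function names are hypothetical, toy 4D vectors stand in for the 32D embeddings, and re-normalizing the cluster mean back onto the hypersphere is an assumption made here for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity; for unit vectors this reduces to the dot product."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def mean_target(embeddings):
    """Average several embeddings, then project the mean back to unit length
    (the re-normalization step is an assumption of this sketch)."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return l2_normalize(mean)

# Reference-based target: the embedding of one hand-picked example.
reference_target = l2_normalize([1.0, 0.0, 0.0, 0.0])

# Clustering-based target: the mean of embeddings selected on the 2D manifold.
selected = [l2_normalize(v) for v in ([0.9, 0.1, 0.0, 0.0], [0.8, 0.0, 0.2, 0.0])]
cluster_target = mean_target(selected)

# Similarity of an arbitrary subvolume embedding to each target.
query = l2_normalize([1.0, 0.1, 0.1, 0.0])
sim_to_reference = cosine_similarity(query, reference_target)
sim_to_cluster = cosine_similarity(query, cluster_target)
```

Because every embedding has unit length, comparing a subvolume to a target only requires a dot product, which keeps the similarity computation over a whole tomogram cheap.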
Metric learning and triplets: The network was trained with a triplet loss on online semi-hard mined triplets of subvolumes (anchor and positive from the same protein class, negative from a different class). Distances between embeddings are measured with cosine similarity; the loss pushes the anchor–positive distance to be smaller than the anchor–negative distance by at least a margin. Training ran for 600 epochs with the Adam optimizer, and the model with the best validation F1 was used for the localization and generalization experiments.
Augmentations and hyperparameters: Online augmentations included random in-plane rotation (xy), voxel dropout (5–20%), small translations (1–2 voxels), and Gaussian noise (σ 0–0.3). Hyperparameter optimization (Optuna) explored learning rate, dropout, optimizer, batch size, weight decay, first kernel size, output nodes, mining strategy, normalization type, loss function, and margins. Group normalization and standard triplet loss outperformed batch normalization and SphereFace/ArcFace variants in this setting.
Embedding and picking pipeline: Tomograms are subdivided into overlapping 37×37×37 subvolumes (stride 2) and embedded. For reference-based picking, a user-selected example per protein is embedded as a target. For clustering-based picking, an interactive UMAP view is used to select clusters, whose mean embeddings act as targets. The Map function computes target-to-subvolume distances to produce similarity maps; Locate identifies similarity peaks with region-growing maximum detection and non-maximum suppression; Pick interactively thresholds the peaks by similarity and size to output coordinates for STA.
Evaluation and datasets: Evaluation used precision, recall, and F1; for experimental data, reference-free 2D/3D classification of extracted subtomograms was used to assess picking quality. Experimental datasets included in vitro mixtures (apoferritin, RhsA, TcdA1 with liposomes), cellular tomograms of Mycoplasma pneumoniae (EMPIAR 10499), and Chlamydomonas reinhardtii pyrenoid (EMPIAR 10694) for RuBisCO, as well as Yersinia entomophaga lysis tomograms for de novo clustering.
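The triplet objective with semi-hard mining described under "Metric learning and triplets" can be sketched as follows. This is an illustrative toy, not the authors' training code: it assumes the distance is one minus cosine similarity, uses 2D vectors in place of 32D embeddings, and the margin of 0.5 and all function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 2 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is farther than the positive by at least `margin`."""
    d_ap = cosine_distance(anchor, positive)
    d_an = cosine_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)

def semi_hard_negatives(anchor, positive, candidates, margin):
    """Keep negatives that are farther than the positive but still inside the margin."""
    d_ap = cosine_distance(anchor, positive)
    return [n for n in candidates
            if d_ap < cosine_distance(anchor, n) < d_ap + margin]

MARGIN = 0.5  # hypothetical value, chosen only for this example

anchor = [1.0, 0.0]
positive = [1.0, 0.1]          # same class, slightly different view
hard_negative = [1.0, 0.05]    # closer to the anchor than the positive
semi_hard_negative = [1.0, 1.0]
easy_negative = [-1.0, 0.0]    # already far beyond the margin

mined = semi_hard_negatives(
    anchor, positive,
    [hard_negative, semi_hard_negative, easy_negative],
    MARGIN,
)
loss_semi_hard = triplet_loss(anchor, positive, semi_hard_negative, MARGIN)
loss_easy = triplet_loss(anchor, positive, easy_negative, MARGIN)
```

Semi-hard triplets still produce a non-zero loss, so they contribute gradient, while avoiding the hardest negatives, which are commonly reported to destabilize early training.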
Computing and timings: Embeddings were computed on a local workstation (Titan V GPU) and on an MPI supercomputer (A100 GPUs). Embedding a large tomogram (608×855×148 voxels) took ~80 min locally (2 GPUs) or ~30 min when distributed.
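To give a feel for why embedding takes tens of minutes, the number of subvolumes in the example tomogram can be estimated from the 37-voxel box and stride of 2 stated in the pipeline description. The sketch below assumes that boxes must fit entirely inside the volume (no padding) and that the whole tomogram is tiled; both are assumptions, and the function names are hypothetical.

```python
def positions_along_axis(size, box=37, stride=2):
    """Number of box placements that fit fully inside an axis of `size` voxels."""
    if size < box:
        return 0
    return (size - box) // stride + 1

def count_subvolumes(shape, box=37, stride=2):
    """Total number of sliding-window subvolumes for a 3D volume."""
    total = 1
    for size in shape:
        total *= positions_along_axis(size, box, stride)
    return total

tomogram_shape = (608, 855, 148)
n_subvolumes = count_subvolumes(tomogram_shape)

# Rough throughput implied by the ~80 min local run, ignoring I/O and overhead.
subvolumes_per_second = n_subvolumes / (80 * 60)
print(n_subvolumes, round(subvolumes_per_second))
```

Under these assumptions the example tomogram yields 286 × 410 × 56 = 6,566,560 subvolumes, each of which must pass through the network once.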
- Validation on simulated data (seen proteins): median F1 across validation tomograms 0.88 (range 0.76–0.98). Median per-protein validation F1 0.92 across proteins sized 30 kDa to 2.7 MDa; occasional outliers corresponded to proteins with especially weak reconstruction contrast.
- Generalization to unseen proteins: On a simulated generalization tomogram containing 7 unseen proteins, the reference-based workflow reached a median F1 of 0.82. Increasing the number of training proteins improved generalization: mean per-protein F1 was 0.49 with 20 proteins, 0.73 with 50, and 0.82 with 100; the full 120-protein model also reached a mean of 0.82 and had the highest median (0.85). In a crowded generalization tomogram (5× particle density), overall mean F1 remained 0.82.
- Multiple proteins in vitro mixture: In 10 tomograms containing apoferritin, RhsA, TcdA1, and liposomes, TomoTwin produced high-fidelity picks with minimal confounding. Manually assessed metrics: TcdA1 recall 0.81, precision 1.0; apoferritin recall 0.91, precision 1.0. 2D class quality confirmed precision.
- Cellular tomograms: In 65 M. pneumoniae tomograms (EMPIAR 10499), reference-based picking of 70S ribosomes followed by pseudo-subtomogram 3D classification yielded ribosomal classes refined to ~15 Å, indicating accurate cellular picking.
- RuBisCO in C. reinhardtii pyrenoid (EMPIAR 10694): Reference-based workflow achieved ~0.8 recall in a crowded matrix from a single reference; STA on 30,954 picks produced a 13.7 Å in situ reconstruction, surpassing a prior 16.5 Å map from supervised workflows that required extensive training data. Clustering workflow yielded similar results.
- De novo clustering in Y. entomophaga: Clustering-based workflow identified two clusters refined by STA to 15.5 Å (70S ribosome) and 18.6 Å (putative bacterial RNA polymerase) using spherical references, demonstrating de novo structural discovery.
- Robustness across acquisition parameters: Simulated sweeps showed mean F1 decreases with more restricted tilt ranges: −5.4% (−50° to 50°) and −10.3% (−40° to 40°) relative to −60° to 60°; and with lower dose: −2.3% (135 e/Ų) and −8.6% (120 e/Ų) relative to 150 e/Ų. Despite degradation, accuracy remained high.
- Comparison to template matching: Across 120 validation proteins and on the generalization tomogram, TomoTwin outperformed EMAN2 template matching in both accuracy (F1) and consistency.
- Usability and throughput: No manual training data or retraining is required per protein; accurate coordinates for RuBisCO were generated with total working time under 1 hour. TomoTwin picks directly in 3D, yielding accurate z-localization.
- Embedding structure: UMAPs of embeddings showed clusters organized by protein identity and approximate size; clustering-based picking achieved mean F1 ~0.84 vs reference-based ~0.82 on the generalization tomogram. Automated HDBSCAN on the manifold recovered 5/7 protein clusters unsupervised.
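The precision, recall, and F1 values reported above depend on how picks are matched to ground-truth positions. The following sketch shows one simple way to compute them; the greedy nearest-neighbour matching within a distance tolerance is an assumption made for illustration and is not necessarily the criterion used in the study, and the coordinates are invented toy values.

```python
import math

def match_picks(picks, truth, tolerance):
    """Greedily match each pick to the nearest unmatched true position
    that lies within `tolerance` voxels; return (tp, fp, fn)."""
    unmatched = list(truth)
    true_positives = 0
    for pick in picks:
        best, best_dist = None, tolerance
        for candidate in unmatched:
            dist = math.dist(pick, candidate)
            if dist <= best_dist:
                best, best_dist = candidate, dist
        if best is not None:
            unmatched.remove(best)
            true_positives += 1
    false_positives = len(picks) - true_positives
    false_negatives = len(unmatched)
    return true_positives, false_positives, false_negatives

def precision_recall_f1(tp, fp, fn):
    """Standard definitions, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example: two picks land near true particles, one is spurious,
# and one true particle is missed.
truth = [(10, 10, 10), (50, 50, 20), (90, 30, 40)]
picks = [(11, 10, 9), (52, 49, 20), (200, 200, 50)]

tp, fp, fn = match_picks(picks, truth, tolerance=5.0)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

Because F1 is the harmonic mean of precision and recall, a picker cannot score well by over-picking (high recall, low precision) or by picking only the most obvious particles (high precision, low recall).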
TomoTwin addresses the central challenge of generalized, accurate particle localization in cryo-ET by learning a shape-based embedding that separates macromolecules in a high-dimensional space. This design eliminates the need for protein-specific annotation and retraining, enabling both reference-guided and de novo workflows. Empirical results demonstrate strong generalization to unseen proteins, robust performance in crowded environments, and applicability to diverse experimental conditions (varying tilt ranges, doses, and detectors). Accuracy is competitive with or superior to template matching and matches that of supervised deep learning approaches, without the associated training overhead. Because embedding and picking operate in 3D, localization is accurate along z as well as in xy. The clustering workflow enables structural data mining of tomograms, supporting de novo discovery of abundant macromolecules and suggesting a path toward unsupervised STA pipelines. Overall, TomoTwin improves both accuracy and usability, facilitating higher-throughput and broader in situ structural studies.
The study introduces TomoTwin, an open-source, generalized deep metric learning tool for 3D localization of macromolecules in cryo-ET. Trained solely on diverse simulated data, the model accurately picks both seen and unseen proteins in simulated and experimental tomograms, outperforms template matching, matches supervised approaches without per-target training, and enables de novo discovery via clustering. TomoTwin’s usability, generalization, and 3D picking advance the feasibility and throughput of STA across varied experimental conditions. Future directions include integration with automated tomogram reconstruction and STA for end-to-end unsupervised analysis, expanding capabilities to membrane proteins and filaments, handling multiple conformations, and leveraging advances in denoising and missing wedge inpainting to push the limits toward smaller macromolecules.
Current limitations include: (1) not designed to pick membrane proteins or filaments; (2) clustering-based workflow requires sufficient copy number for clusters to form, limiting detection of rare species; (3) trained at ~10 Å pixel size, so it is not intended to distinguish multiple conformations of the same protein at the picking stage; (4) practical lower size limit for generalization is approximately 150 kDa due to downscaling (~10 Å/pixel), with smaller targets being challenging to validate via STA; and (5) performance can degrade with reduced tilt ranges and electron doses, though remains robust overall.

