
Computer Science
EPicker: An exemplar-based continual learning approach for knowledge accumulation in cryoEM particle picking
X. Zhang, T. Zhao, et al.
EPicker, developed by Xinyu Zhang and colleagues, applies exemplar-based continual learning to cryo-electron microscopy particle picking. The approach adapts to new datasets while preserving previously learned knowledge, enabling accurate picking of proteins, vesicles, fibers, and other biological objects.
Introduction
The paper addresses the challenge of robust, automated particle picking in single-particle cryo-EM, where micrographs contain heterogeneous content (target particles, degraded proteins, impurities, ice contamination) and where both precision and automation are required. Traditional template-based or feature-engineered methods depend on user-provided templates/features and can be biased. Modern CNN-based supervised approaches learn features from labeled data but face two key issues: (1) unpredictable generalization to unseen datasets with differing features, and (2) catastrophic forgetting when adapting models to new data via fine-tuning, which erodes performance on previously learned particles. Joint training on many datasets can broaden applicability but is computationally expensive and storage-intensive. The study proposes an exemplar-based continual learning approach (EPicker) to continually accumulate knowledge from new datasets while maintaining old knowledge, thereby supporting a generalizable, automated cryo-EM pipeline. The goal is to build a model that learns incrementally from a few new samples without catastrophic forgetting and that can pick diverse biological objects (proteins, vesicles, fibers).
Literature Review
Prior particle picking methods include template or specific-feature matching (FindEM, SIGNATURE, DoGpicker, gAutoMatch, EMAN, RELION), which require user-prepared templates and are prone to bias. Unsupervised or clustering-based methods (DeepCryoPicker, DRPNet) avoid templates but have limitations. CNN-based methods (DeepPicker, DeepEM, Warp, Topaz, crYOLO) demonstrate strong performance and generalization when jointly trained on many datasets (e.g., crYOLO used 53 datasets; Warp recommends centralized data and periodic training). However, joint training is computationally costly and requires large storage. Fine-tuning adapts quickly to new data but yields task-specific models that forget older knowledge (catastrophic forgetting). Continual/incremental learning with knowledge distillation has been effective in natural image object detection. The paper builds on CenterNet (anchor-free detector) and continual learning strategies (e.g., knowledge distillation) to create an efficient, generalizable particle picker for cryo-EM.
Methodology
EPicker architecture and continual learning: EPicker implements an exemplar-based continual learning framework using a dual-path network built upon CenterNet. Two identical branches (A and B) are initialized from the old model. Branch A is frozen (reference for old knowledge); Branch B is trained to become the new model while distilling knowledge from Branch A to avoid forgetting. Training uses both a small exemplar dataset (subset of old datasets) and the new dataset. Exemplar construction: ~200 labeled particles per prior dataset (from one or multiple micrographs), with data augmentation (random flip, random crop). After training, only Branch B parameters are saved.
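A minimal PyTorch-style sketch of this dual-path setup is given below; `build_centernet`, `old_ckpt_path`, and the simple data-loader mixing are hypothetical placeholders rather than EPicker's actual API.

```python
# Sketch only: build_centernet and old_ckpt_path are hypothetical placeholders.
import copy
from itertools import chain

import torch

def init_dual_path(old_ckpt_path, build_centernet):
    """Create frozen reference Branch A and trainable Branch B from the old model."""
    state = torch.load(old_ckpt_path, map_location="cpu")

    branch_a = build_centernet()            # reference branch holding old knowledge
    branch_a.load_state_dict(state)
    branch_a.eval()
    for p in branch_a.parameters():
        p.requires_grad_(False)             # Branch A stays frozen during training

    branch_b = copy.deepcopy(branch_a)      # new model, initialized identically
    branch_b.train()
    for p in branch_b.parameters():
        p.requires_grad_(True)
    return branch_a, branch_b

def mixed_batches(exemplar_loader, new_loader):
    """Iterate over exemplar micrographs and the new dataset in one training pass."""
    yield from chain(exemplar_loader, new_loader)
```

After training, only Branch B's parameters would be saved, consistent with the description above.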
Loss functions: For continual learning, total loss L_Total = L_OD + λ1·L_Dis + λ2·L_Reg with λ1=0.1 and λ2=0.01 (empirical). L_OD is the CenterNet object detection loss: L_k (focal loss for center heatmap) + λ_off·L_off (offset regression, λ_off=1) + λ_size·L_size (size regression, optional; λ_size=0.1). L_Dis (knowledge distillation) minimizes L2 differences between feature maps and predicted heatmaps from Branch B and frozen Branch A on exemplar data. L_Reg penalizes large parameter deviations between new and old models to prevent overfitting to exemplars. Joint training and fine-tuning use only L_OD.
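The loss composition can be sketched as follows, assuming Branch A and Branch B expose their backbone feature maps and predicted heatmaps; the argument names and the exact form of L_Reg are assumptions, while the weights λ1 = 0.1 and λ2 = 0.01 come from the text.

```python
import torch
import torch.nn.functional as F

LAMBDA_DIS, LAMBDA_REG = 0.1, 0.01   # reported weights for L_Dis and L_Reg

def distillation_loss(feat_b, heat_b, feat_a, heat_a):
    """L_Dis: L2 differences between Branch B and frozen Branch A on exemplar data."""
    return F.mse_loss(feat_b, feat_a) + F.mse_loss(heat_b, heat_a)

def parameter_regularization(branch_b, branch_a):
    """L_Reg: penalize large deviations of the new parameters from the old ones."""
    reg = 0.0
    for p_new, p_old in zip(branch_b.parameters(), branch_a.parameters()):
        reg = reg + torch.sum((p_new - p_old.detach()) ** 2)
    return reg

def total_loss(l_od, feat_b, heat_b, feat_a, heat_a, branch_b, branch_a):
    """L_Total = L_OD + lambda1 * L_Dis + lambda2 * L_Reg (continual-learning mode only)."""
    l_dis = distillation_loss(feat_b, heat_b, feat_a.detach(), heat_a.detach())
    l_reg = parameter_regularization(branch_b, branch_a)
    return l_od + LAMBDA_DIS * l_dis + LAMBDA_REG * l_reg
```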
Base detector and feature extractor: CenterNet (anchor-free, keypoint-based) regresses particle center and optionally size. EPicker uses DLA-34 as feature extractor (outperformed ResNet in tests) with upsampling, followed by an object location sub-network producing heatmaps for center, local offset, and size (output stride R=4). For protein particles, size prediction is often disabled (only centers regressed) to reduce compute and improve localization; for size-sensitive objects (e.g., vesicles), size prediction is enabled (radius estimation).
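A simplified sketch of the object-location sub-network is shown below; the channel widths and head layout are assumptions, and the DLA-34 backbone with upsampling (producing features at output stride R = 4) is taken as given.

```python
import torch
import torch.nn as nn

class LocationHead(nn.Module):
    """Center / offset / (optional) size heads on top of upsampled backbone features.

    Assumes the DLA-34 backbone plus upsampling already yield a feature map at
    1/4 of the input resolution (output stride R = 4). Channel counts are assumed.
    """
    def __init__(self, in_channels: int = 64, predict_size: bool = True):
        super().__init__()

        def head(out_channels: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, 256, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, out_channels, kernel_size=1),
            )

        self.center = head(1)                           # particle-center heatmap (single class)
        self.offset = head(2)                           # sub-pixel offset (dx, dy)
        self.size = head(1) if predict_size else None   # e.g. vesicle radius; disabled for proteins

    def forward(self, feat: torch.Tensor):
        out = {
            "heatmap": torch.sigmoid(self.center(feat)),
            "offset": self.offset(feat),
        }
        if self.size is not None:
            out["size"] = self.size(feat)
        return out
```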
Preprocessing and efficiency: Input micrographs are downsampled to a width of 1024 px (aspect ratio preserved), histogram equalized, and converted to 8-bit. Typical picking speed is <0.3 s per micrograph. For typical 10–30 nm particles, this downsampling maintains acceptable centering accuracy (a tolerance of under 1 nm, corresponding to several pixels).
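A rough preprocessing sketch under these settings, using OpenCV for resizing and histogram equalization (the exact order and implementation in EPicker may differ):

```python
import cv2
import numpy as np

def preprocess_micrograph(mic: np.ndarray, target_width: int = 1024) -> np.ndarray:
    """Downsample to 1024 px width (aspect ratio kept), rescale to 8-bit, equalize histogram."""
    mic = mic.astype(np.float32)

    # Downsample so the width becomes 1024 px while preserving the aspect ratio.
    h, w = mic.shape
    scale = target_width / float(w)
    resized = cv2.resize(mic, (target_width, int(round(h * scale))),
                         interpolation=cv2.INTER_AREA)

    # Rescale intensities to 0-255 and convert to 8-bit.
    lo, hi = float(resized.min()), float(resized.max())
    img8 = np.clip((resized - lo) / max(hi - lo, 1e-6) * 255.0, 0, 255).astype(np.uint8)

    # Histogram equalization (cv2.equalizeHist requires a single-channel 8-bit image).
    return cv2.equalizeHist(img8)
```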
Sparse annotation handling: To reduce labeling burden when many positives are left unlabeled, EPicker modifies the center loss to (i) reduce the penalty on high-confidence "negatives" that are likely unlabeled positives and (ii) promote very high-confidence predictions to pseudo-labels. Thresholds: predictions above r1 = 0.7 are promoted to pseudo-labels (treated as unlabeled positives), and predictions above r2 = 0.5 receive a reduced negative penalty.
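One plausible reading of this modified center loss is sketched below as a CenterNet-style focal loss; the exponents ALPHA/BETA and the `down_weight` factor are assumptions, and EPicker's exact pseudo-label handling may differ.

```python
import torch

R1, R2 = 0.7, 0.5      # thresholds reported in the paper
ALPHA, BETA = 2, 4     # standard CenterNet focal-loss exponents (assumed here)

def sparse_center_loss(pred: torch.Tensor, gt: torch.Tensor, down_weight: float = 0.1):
    """Focal center loss that tolerates unlabeled positives under sparse annotation.

    pred, gt: heatmaps in [0, 1] with identical shape. `down_weight` is a
    hypothetical factor controlling how much the penalty is reduced.
    """
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos = gt.eq(1).float()
    neg = 1.0 - pos

    # Standard positive (labeled-center) term.
    pos_loss = -((1 - pred) ** ALPHA) * torch.log(pred) * pos

    # Negative term, modulated by prediction confidence:
    #  - confidence > R1: treated as a pseudo positive, so no negative penalty
    #  - R2 < confidence <= R1: likely an unlabeled positive, so reduced penalty
    neg_weight = torch.ones_like(pred)
    neg_weight = torch.where(pred > R2, torch.full_like(pred, down_weight), neg_weight)
    neg_weight = torch.where(pred > R1, torch.zeros_like(pred), neg_weight)

    neg_loss = -((1 - gt) ** BETA) * (pred ** ALPHA) * torch.log(1 - pred) * neg * neg_weight

    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos
```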
Fiber picking and tracing: Fibers are initially detected as discrete points (like particles). A line-tracing algorithm then links the points into fibers under constraints on maximum curvature (an angle threshold) and a neighborhood radius of r = 100 px (at the downsampled 1024 px width). The algorithm iteratively connects the nearest candidate points that respect the angle constraint; a smoothing step removes points where adjacent segments form angles of less than 0.1 rad.
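A greedy tracing sketch consistent with this description is shown below; the maximum-turn angle `MAX_TURN` is a hypothetical value, since only the 0.1 rad smoothing threshold and the r = 100 px radius are stated.

```python
import numpy as np

RADIUS = 100.0       # neighborhood radius in pixels at 1024 px width
MAX_TURN = 0.5       # hypothetical maximum turn angle (rad) between successive segments
SMOOTH_TURN = 0.1    # points bending by less than this are dropped as redundant

def turn_angle(p0, p1, p2):
    """Angle between segments p0->p1 and p1->p2 (0 means perfectly straight)."""
    v1, v2 = p1 - p0, p2 - p1
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
    return float(np.arccos(np.clip(cosang, -1.0, 1.0)))

def trace_fiber(points: np.ndarray, start: int):
    """Greedily link detected center points (N x 2 array) into one fiber trace."""
    remaining = set(range(len(points))) - {start}
    trace = [start]
    while True:
        cur = points[trace[-1]]
        # Candidates inside the neighborhood radius.
        cands = [i for i in remaining if np.linalg.norm(points[i] - cur) <= RADIUS]
        # Enforce the curvature constraint once the trace has a direction.
        if len(trace) >= 2:
            prev = points[trace[-2]]
            cands = [i for i in cands if turn_angle(prev, cur, points[i]) <= MAX_TURN]
        if not cands:
            break
        nxt = min(cands, key=lambda i: np.linalg.norm(points[i] - cur))
        trace.append(nxt)
        remaining.discard(nxt)
    return trace

def smooth_trace(trace, points):
    """Drop nearly collinear points (adjacent segments turning by < 0.1 rad)."""
    if len(trace) < 3:
        return list(trace)
    keep = [trace[0]]
    for a, b, c in zip(trace, trace[1:], trace[2:]):
        if turn_angle(points[a], points[b], points[c]) >= SMOOTH_TURN:
            keep.append(b)
    keep.append(trace[-1])
    return keep
```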
Datasets and evaluation: For incremental learning experiments, a base model was jointly trained on five datasets: 80S ribosome (EMPIAR-10028), 20S proteasome (EMPIAR-10025), apoferritin (EMPIAR-10146), TccA1 (EMPIAR-10089), Noda-virus (EMPIAR-10203). New datasets: β-galactosidase (EMPIAR-10017), influenza hemagglutinin (EMPIAR-10097), phage MS2 (EMPIAR-10075), CNG (EMPIAR-10081), phosphodiesterase (EMPIAR-10228). From each dataset, 15 micrographs were selected (10 for training, 5 for testing), with manual picks as ground truth. Performance metrics: Average Precision (AP) and Average Recall (AR) at IoU = 0.5. The complexity of a new dataset is defined as C = 100/(AP+AR), computed from the old model's performance on that dataset; the forgetting rate is defined as the reduction in AP/AR on old datasets after training on new data.
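For reference, the two derived quantities can be written as small helper functions (assuming AP and AR are expressed as percentages; the paper may report the forgetting rate differently):

```python
def complexity(ap: float, ar: float) -> float:
    """Dataset complexity C = 100 / (AP + AR), using the old model's AP/AR on the new dataset."""
    return 100.0 / (ap + ar)

def forgetting_rate(score_before: float, score_after: float) -> float:
    """Drop in AP (or AR) on an old dataset after training on new data."""
    return score_before - score_after
```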
Training modes compared: (1) Joint training on multiple datasets from scratch; (2) Fine-tuning of a pre-trained model on new datasets (single-path, no reference); (3) Continual learning (dual-path with exemplars). Time/storage costs were measured for scaling to more datasets.
Key Findings
- Continual learning effectiveness: Incrementally adding new datasets to the base model led to only a small decrease in AP (typically 1–3%) and little to no change in AR compared to joint training on all datasets, indicating minimal forgetting and retained recall. Adding all five new datasets at once showed similar results to adding them successively.
- Catastrophic forgetting with fine-tuning: On an 80S ribosome micrograph (EMPIAR-10028) containing 118 particles, a fine-tuned model (adapted to β-galactosidase) missed 35% of ground-truth particles, whereas continual-learning and joint-training models detected ~96–97% of ground-truth particles, demonstrating that fine-tuning yields task-specific models that forget prior knowledge.
- Generalization and upper bound: Joint training across diverse datasets (molecular weights from ~100 kDa to several MDa, varied shapes/sizes) maintained high AP/AR across all tested sets, reflecting strong generalization of the feature extractor and serving as an upper-bound reference.
- Impact of feature dissimilarity: The authors quantified dataset complexity (C = 100/(AP+AR)) and found that greater dissimilarity between old and new features can reduce how effectively they are merged (causing some forgetting); in practice, however, the experiments showed no significant forgetting overall, and adding low-complexity datasets can even improve the model.
- Efficiency gains: Compared to joint training, continual learning substantially reduced time and storage costs for extending to new features. Example: when adding the 10th dataset sequentially, joint training required ~50 minutes versus ~20 minutes for continual learning; only 1–2 micrographs per dataset are stored as exemplars for future updates.
- Biased vs. unbiased picking: On 26S proteasome datasets, the continual-learning model (general) picked nearly all particle types (CP2RP, CP1RP, CP, including top- and side-views), whereas fine-tuned and from-scratch models picked primarily the targeted CP2RP side-views (more specific/biased) and missed many CP1RP and most CP particles. All three picked similar numbers of CP2RP particles, with fine-tuning most accurate for the specific target.
- General object detection: EPicker accurately detected fibers (curved/straight) via point picking plus line tracing and liposomes with center and radius estimation, including overlapped vesicles, extending beyond protein particles.
- Practical usability: Downsampling, histogram equalization, and 8-bit conversion enabled fast inference (<0.3 s per micrograph). Sparse annotation support reduced labeling effort; 5–10 micrographs per dataset were often sufficient for training.
Discussion
The study addresses the core challenge of maintaining and expanding particle picking capability as new datasets with different features are encountered. By employing a dual-path architecture with knowledge distillation on exemplar data and regularization, EPicker effectively integrates new knowledge while preserving performance on earlier tasks, directly mitigating catastrophic forgetting. Empirical results show that continual learning achieves performance comparable to joint training while dramatically reducing the computational and storage burden associated with retraining on all data. The approach is robust across diverse particle types and sizes, and its general model supports unbiased discovery of heterogeneous particles, which is valuable at early project stages. Conversely, fine-tuning can produce highly specific models when targeted picking is needed, though at the cost of forgetting. The framework also extends to non-particle biological objects (fibers, vesicles) and accommodates sparse annotations, increasing practicality for cryo-EM pipelines with continuous data flow. Overall, EPicker’s continual learning paradigm enhances generalization and scalability, aligning with the needs of automated and evolving cryo-EM workflows.
Conclusion
The paper introduces EPicker, an exemplar-based continual learning system for cryo-EM particle picking built on CenterNet with a dual-path distillation architecture. It supports joint training, fine-tuning, and continual learning, with the latter enabling efficient knowledge accumulation from few new samples while preserving prior capabilities. Experiments demonstrate near–joint-training performance, minimal forgetting, significant reductions in training time and storage, and applicability to diverse biological objects (proteins, fibers, vesicles with size estimation). The method integrates seamlessly into automated pipelines, supports sparse annotations, and delivers fast inference. Code and pretrained models are publicly available, facilitating adoption in real-world cryo-EM workflows.
Limitations
- The authors note that dissimilarity between old and new datasets can affect the effectiveness of merging features, potentially leading to some forgetting, although experiments observed no significant forgetting overall.
- The performance of a general model on completely unseen particles is not guaranteed and depends on the match between accumulated knowledge and new particle features (as discussed in the context of general models).