SEMORE: SEgmentation and MORphological fingErprinting by machine learning automates super-resolution data analysis

S. W. B. Bender, M. W. Dreisler, et al.

SEMORE is a machine learning framework for universal analysis of super-resolution data that provides new insight into protein assemblies and the evolution of their morphology. The research was conducted by Steen W. B. Bender, Marcus W. Dreisler, Min Zhang, Jacob Kæstel-Hansen, and Nikos S. Hatzakis from the University of Copenhagen.

Introduction

Biomolecules form diverse, highly ordered assemblies (e.g., clusters, condensates, aggregates) whose morphology (geometry, size, shape, internal structure) strongly influences their properties and biological functions. Single-Molecule Localization Microscopy (SMLM) techniques such as STORM, PALM, PAINT, DNA-PAINT, and REPLOM overcome the diffraction limit and can probe these assemblies at nanoscale resolution, including temporal evolution in some modalities. However, robust, general, and automated analytical tools to extract, segment, and quantify assemblies from raw localization data are lacking. Existing supervised machine learning approaches require extensive annotated datasets and often generalize poorly across systems, while unsupervised methods (e.g., DBSCAN, OPTICS) typically rely on fixed parameters, struggle with heterogeneous densities and sizes, and do not natively integrate temporal information. The research goal is to develop a universal, unsupervised, and time-aware pipeline that transforms raw spatiotemporal SMLM coordinates into individualized assemblies and quantifies their morphology without a priori knowledge. This study introduces SEMORE to address these challenges, enabling unbiased extraction of assemblies, tracking temporal morphological evolution, and comprehensive quantitative fingerprinting across diverse biological systems and imaging conditions.

Literature Review

The paper reviews limitations of current SMLM analysis approaches. Supervised ML methods can be accurate but demand large, expert-annotated datasets and often specialize to a single system or task, limiting general applicability to novel biological contexts. Unsupervised clustering algorithms commonly used for coordinate data (e.g., DBSCAN, OPTICS, HDBSCAN) mitigate annotation needs but adopt one-size-fits-all parameters, requiring laborious tuning and performing poorly with heterogeneous localization densities, assembly sizes, and overlapping structures. Critically, temporal information is typically excluded from spatial distance metrics, rendering standard clustering ineffective for time-resolved data. Prior advances in ML for quantitative imaging and generalist segmentation (e.g., Cellpose) illustrate potential but do not address the temporal and morphological quantification needs of SMLM protein assemblies. The authors position SEMORE as a general, unsupervised, time-aware alternative that integrates clustering and morphology fingerprinting to overcome these gaps.
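To make the point about temporal information concrete, the following minimal sketch (plain Python, not taken from SEMORE) compares the distance between two localizations computed from (x, y) alone with the distance in z-scored (x, y, t) space. The coordinates and the helper name `zscore_columns` are hypothetical and chosen only for illustration.

```python
import math
import statistics

def zscore_columns(points):
    """Z-score each column of a list of equal-length tuples (population std)."""
    columns = list(zip(*points))
    means = [statistics.fmean(c) for c in columns]
    stds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard against zero spread
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]

# Hypothetical localizations (x, y, frame): the first two lie at nearly the
# same position but far apart in time; the others give the axes some spread.
localizations = [
    (10.0, 10.0, 5),
    (10.1, 10.0, 900),
    (30.0, 25.0, 450),
    (2.0, 40.0, 20),
]

xy_only = zscore_columns([p[:2] for p in localizations])
xyt = zscore_columns(localizations)

# Distance between the first two localizations, without and with time.
d_xy = math.dist(xy_only[0], xy_only[1])
d_xyt = math.dist(xyt[0], xyt[1])
print(f"2D distance: {d_xy:.3f}, 3D (x, y, t) distance: {d_xyt:.3f}")
```

In (x, y) alone the two events are almost indistinguishable, so a purely spatial clustering would tend to merge them; once the frame index is standardized and included, they are clearly separated.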

Methodology

SEMORE comprises two independent modules: (1) a scale-invariant, self-parameterizing clustering module operating in standardized 3D (x, y, t) space that can also accept 2D (x, y) inputs; and (2) a morphology fingerprinting module computing 40+ features from each extracted assembly.

Clustering module (time-aware segmentation):

- Standardization: Z-score the coordinates in x, y, and t to equalize scales and ensure scale-invariant density measures.
- Initial clustering: Automatically select either DBSCAN or HDBSCAN based on field-of-view density (DBSCAN if fewer than ~1500 points per standardized area). Extract high-density regions (candidate assemblies) from noise. A topological failsafe uses 2D binned densities with Gaussian blur and contouring to ensure at least one region of interest is proposed if the initial clustering fails.
- Temporal refinement: If time is available, perform frame-iterative refinement to separate spatially overlapping assemblies that grow over time. Rescale regions via MinMax to balance axes. Compute a region-specific search range from pairwise distances: take the interquartile range of pairwise distances, derive the standard error of the mean, and set the search range to the 95% CI (radius_ratio default 1.96). For each frame in ascending time: run DBSCAN (eps = search range, min_samples = investigate_min_sample, default 50) on unlabeled points to detect new seeds; concurrently, assign growth to existing assemblies using a radius-based nearest neighbor (r = search range). Unassigned points move to the next frame; points still unlabeled after the last frame become noise.
- Smart density filter: Estimate noise density from sampled initial-noise points (excluding the lowest-density quartile to avoid bias). A putative assembly must exceed the mean noise density by at least one standard deviation and meet a minimum point-count threshold (rough_min_points and final_min_points) to be retained. Filtering strictness can be set (strict/loose/none).
- Hyperparameters: Defaults are broadly applicable; key options include investigate_min_sample (default 50), radius_ratio (default 1.96), rough_min_points, final_min_points, and filtering_type. Excessive over-clustering at the initial step can degrade results; otherwise, temporal refinements are robust to small parameter perturbations.

Morphology fingerprinting module (40+ features):

- Compute features in four classes after standardizing and aligning structures along the major axis: (a) Circularity (5 features) via density-based contouring from Gaussian-blurred 2D histograms thresholded at the 90th percentile; (b) Symmetry (7 features) using quadrant counts around the centered structure; (c) Geometric (3 features) via Delaunay triangulation, pruning edges by lognormal fit to edge lengths, and computing area/size and density measures; (d) Graph network (25+ features) using a radius-to-neighbors graph (radius from the geometric class), extracting MST-derived and edge-density statistics (e.g., longest-shortest path).
- Features can be embedded (e.g., UMAP) and clustered (e.g., DBSCAN) for classification, post-processing (e.g., noise removal), and mechanistic interpretation. Feature ranking highlights discriminative features.

Simulations and evaluations:

- Aggregation models: (1) Isotropic growth (Gaussian KDE expansion with time-varying bandwidth); (2) Sterically hindered (random) growth via Monte Carlo-like selection favoring minimal hindrance (hindrance = exp(-distance^2)); (3) Fibril growth (directed diffusion with Gaussian-distributed step lengths and directions; rare branching with Gaussian angle deviations; up to 3 branches). Simulations include homogeneous and heterogeneous noise (uniform and noise seeds), temporal overlap, and varied start/end frames in a 40×40 µm FOV. Each growth type was repeated 50 times (10 aggregates for isotropic/random, 25 for fibrils).
- Sparse aggregates: Simulated fibrils (temporal) and ellipses (static) with 4, 8, 15, 25 detections to test minimal-data performance; blinking artifacts simulated by assigning 1–6 detections per ground-truth position with localization errors drawn from a lognormal-informed Gaussian.
- Degenerative structures: Simulated anisotropic shrinking/depolymerization to test segmentation of shrinking assemblies.
- Performance metrics: Accuracy defined per-aggregate as TP/(TP+FP+FN); precision, recall, and F1 are also used; macro-averaged values are typically reported.

Experimental datasets:

- REPLOM time-resolved insulin aggregation (x, y, t): SEMORE clustering and fingerprinting to identify and classify aggregates (isotropic vs anisotropic) via UMAP and DBSCAN.
- dSTORM Nuclear Pore Complex (NPC): Static images clustered by DBSCAN, then SEMORE fingerprinting classifies individual NPCs, overlapping NPCs, and noise; radii derived from area features and fit with a Gaussian to estimate diameter.
- Additional datasets: dSTORM of FGFR1 (using frame metadata for temporal refinement), sptPALM of syntaxin 1a (temporal evolution of clustering), live-cell PALM of RyR (fine-grained clustering exceeding DBSCAN alone).
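The performance metrics listed above can be sketched in a few lines. This is a minimal illustration rather than the authors' evaluation code: the function names and the example counts are hypothetical, and the accuracy follows the per-aggregate definition TP/(TP+FP+FN) stated in the text.

```python
from statistics import fmean

def aggregate_metrics(tp, fp, fn):
    """Return accuracy, precision, recall and F1 for one aggregate.

    Accuracy uses the definition from the text: TP / (TP + FP + FN).
    """
    accuracy = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def macro_average(per_aggregate):
    """Macro-average: unweighted mean of each metric over all aggregates."""
    keys = per_aggregate[0].keys()
    return {k: fmean(m[k] for m in per_aggregate) for k in keys}

# Hypothetical (TP, FP, FN) localization counts for three aggregates.
counts = [(90, 5, 5), (40, 10, 0), (200, 0, 50)]
per_aggregate = [aggregate_metrics(*c) for c in counts]
macro = macro_average(per_aggregate)
print({k: round(v, 3) for k, v in macro.items()})
```

Because F1 equals 2TP/(2TP+FP+FN), it can never be smaller than this accuracy for the same aggregate, which is worth keeping in mind when comparing the two numbers.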

Key Findings

- Robust segmentation across simulated morphologies with high accuracy and F1, even under strong noise and overlap:
  - Isotropic growth: median accuracy ~83% (mean 81.13% ± 6.74%), F1 = 84 ± 4%.
  - Sterically hindered (random) growth: accuracy 83 ± 7%, F1 = 90 ± 3%.
  - Fibril growth (including overlapping fibrils): accuracy 94 ± 2%, F1 = 98 ± 1%.
  - Baselines: DBSCAN/HDBSCAN accuracies and F1 were much lower (e.g., isotropic 5 ± 1%/36 ± 10% accuracy; fibril 54 ± 13%/30 ± 7%).
  - Performance remained >85% median accuracy across noise levels and independent of noise homogeneity; at extremely low/no noise, smart density filtering can remove true positives.
- Temporal capabilities:
  - Correctly identifies growth onset times with average offset ~−13 frames.
  - Accurately segments shrinking/degenerative morphologies.
  - Dissects fibrils growing within spherulites (otherwise infeasible for state-of-the-art 2D-only methods).
- Sparse data performance:
  - Extracts structures down to 4 detections, maintaining >90% accuracy at biologically relevant noise; fingerprinting fully separates true structures from noise at 8 detections and classifies morphology from 15 detections.
- Morphology fingerprinting:
  - UMAP+DBSCAN of fingerprints cleanly separates isotropic, random, and fibril classes; also isolates noise as a distinct cluster.
  - Post hoc noise correction yields >99% classification accuracy (average F1 ~99.55% ± 0.21%).
  - Further resolves fibrils into branched vs non-branched subtypes via circularity features or boosted decision trees.
  - Robust to blinking: segmentation accuracies >90%; morphology discrimination reliable from 15 detections with or without blinking.
  - Tracks gradual morphology transitions in dynamic simulations, placing intermediates along gradients between classes.
- Experimental validations:
  - REPLOM insulin aggregation: automatically extracted 139 aggregates and classified into anisotropic (sparse, elongated) vs isotropic (dense, spherical-like) clusters within minutes, matching published observations and avoiding mis-segmentation due to spatial overlap.
  - FGFR1 dSTORM: accurate cluster extraction with temporal refinement and smart density filtering; size estimation provided.
  - Syntaxin 1a sptPALM: captured clusters and their temporal evolution.
  - RyR live-cell PALM: accurately captured clusters with granularity beyond DBSCAN alone.
  - NPC dSTORM: from 285 segmentations, fingerprints separated individual NPCs, overlapping NPCs, and noise; radius-based analysis yielded diameter 114 ± 18 nm (N = 213), consistent with prior reports.
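The NPC diameter estimate above is described as derived from area features and a Gaussian fit. A minimal sketch of that kind of calculation follows, assuming each segmented area is converted to the diameter of a circle of equal area (an assumption, since the summary does not state the conversion) and approximating the Gaussian fit by the sample mean and standard deviation. The area values are invented for illustration and are not data from the study.

```python
import math
import statistics

def equivalent_diameter(area_nm2):
    """Diameter (nm) of a circle with the given area (nm^2): d = 2 * sqrt(A / pi)."""
    return 2.0 * math.sqrt(area_nm2 / math.pi)

# Hypothetical per-structure areas in nm^2 (illustrative values only).
areas_nm2 = [9500.0, 10200.0, 11000.0, 8700.0, 10600.0]

diameters = [equivalent_diameter(a) for a in areas_nm2]

# The sample mean is the maximum-likelihood estimate of a Gaussian's center;
# the sample standard deviation is used here as the width estimate.
mu = statistics.fmean(diameters)
sigma = statistics.stdev(diameters)
print(f"diameter = {mu:.1f} +/- {sigma:.1f} nm (N = {len(diameters)})")
```

In practice one would restrict the input to segmentations classified as individual NPCs, since overlapping NPCs and noise would skew the distribution.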

Discussion

SEMORE addresses a key bottleneck in SMLM analyses by providing a universal, unsupervised, and time-aware pipeline to extract and quantify protein assemblies without a priori knowledge or large annotated datasets. By integrating 3D spatiotemporal clustering with temporal refinement and a comprehensive morphology fingerprint, SEMORE robustly handles heterogeneous densities, sizes, overlaps, and dynamics. The approach accurately segments diverse simulated growth modes, remains resilient to strong noise and blinking artifacts, and dissects fine morphological differences (e.g., branched vs non-branched fibrils). In dynamic contexts, it identifies onset times, tracks morphological evolution, and segments shrinking assemblies. Experimental validations across five datasets (REPLOM insulin, NPC, FGFR1, syntaxin 1a, RyR) demonstrate generalization across modalities (STORM, PALM, PAINT, REPLOM), spatial scales (nm–µm), and temporal scales, yielding outputs consistent with or exceeding contemporary methods and enabling rapid analysis. These results support SEMORE as a general-purpose platform for static and 4D SMLM data, capable of accelerating unbiased discovery, mechanistic interpretation via feature ranking, and standardized quantification across laboratories and systems.

Conclusion

The study introduces SEMORE, a modular, unsupervised, and time-aware framework that automates segmentation and morphological fingerprinting of SMLM data. It transforms raw localization coordinates into individualized assemblies and quantifies their morphologies via 40+ descriptive features. Extensive simulations and multiple experimental datasets show high segmentation and classification performance, robustness to noise and blinking, and the capacity to capture temporal morphology evolution. SEMORE shortens analysis time from days to minutes and generalizes across systems and imaging modalities. Future directions include integrating motion-aware/diffusional analysis for laterally moving assemblies, expanding the feature library, and building morphology databases that help map morphology to function and support future supervised learning applications.

Limitations

- Currently optimized for laterally stable assemblies; temporal refinement tolerates only limited lateral movement within the data-driven search range. A motion-aware/diffusional analysis module is planned for future versions.
- At extremely low or absent noise levels, the smart density filter can remove true positives; filter settings may need adjustment for such cases.
- Excessive over-clustering during the initial (DBSCAN/HDBSCAN) step can degrade final segmentation quality; care is advised when tuning initial hyperparameters.
- Common imaging artifacts (e.g., blinking, vibrational/spherical aberrations, drift) should be corrected prior to analysis to avoid misinterpretation, even though SEMORE shows robustness to blinking.
- Experimental ground truth is difficult to define in real datasets, limiting strict quantitative benchmarking beyond qualitative and consistency checks.
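The smart density filter mentioned in the second limitation follows the rule given under Methodology: a candidate must exceed the mean noise density by at least one standard deviation and contain a minimum number of points. The sketch below illustrates that rule only and is not SEMORE's implementation. The function name, the `min_points` default, and the `n_std` knob are hypothetical, densities are taken as precomputed inputs in arbitrary units, and dropping the lowest quartile by sorting is one possible reading of the description.

```python
import statistics

def passes_density_filter(candidate_density, n_points, noise_densities,
                          min_points=10, n_std=1.0):
    """Keep a candidate if it is denser than the noise and has enough points.

    noise_densities: local densities sampled from initial-noise points.
    min_points and n_std are hypothetical stand-ins for the point-count
    threshold and the filtering strictness named in the text.
    """
    # Exclude the lowest-density quartile of the noise sample.
    ordered = sorted(noise_densities)
    kept = ordered[len(ordered) // 4:]
    # Threshold: mean noise density plus n_std standard deviations.
    threshold = statistics.fmean(kept) + n_std * statistics.pstdev(kept)
    return candidate_density > threshold and n_points >= min_points

# Hypothetical noise sample and candidates (arbitrary density units).
noise = [1.0, 1.2, 0.8, 1.1, 0.9, 1.3, 0.7, 1.0]
dense_candidate = passes_density_filter(5.0, 60, noise)
faint_candidate = passes_density_filter(1.1, 60, noise)
tiny_candidate = passes_density_filter(5.0, 3, noise)
print(dense_candidate, faint_candidate, tiny_candidate)
```

Lowering `n_std` or `min_points` in this sketch corresponds loosely to choosing a looser filtering setting, which is the kind of adjustment the limitation refers to.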
