Machine learning for cluster analysis of localization microscopy data

D. J. Williamson, G. L. Burn, et al.

David J. Williamson and colleagues introduce a fast, supervised machine-learning method that accurately classifies millions of points from single-molecule localization microscopy data as clustered or not clustered.

Introduction

The study addresses how to effectively analyze single-molecule localization microscopy (SMLM) data, which consist of lists of point coordinates rather than pixel images. Existing spatial point pattern methods (e.g., Ripley’s K, Getis & Franklin’s local analysis, pair-correlation functions, DBSCAN) often require user-defined parameters and thresholds that can be suboptimal, especially for heterogeneous biological samples. Model-based Bayesian approaches reduce parameter tuning but are computationally intensive and impractical for large datasets. The research question is whether a supervised machine learning approach can accurately and rapidly classify individual localizations as clustered or not clustered directly from coordinate-derived features, enabling robust, scalable cluster analysis without subjective thresholds, and supporting downstream quantification of cluster properties in large, heterogeneous SMLM datasets.

Literature Review

The paper reviews spatial statistics and clustering methods for SMLM: Ripley’s K function and local variants (Getis & Franklin), pair-correlation analyses, DBSCAN, and Bayesian cluster analysis. These methods can be sensitive to parameter choice, point density, and sample heterogeneity; Bayesian methods are accurate but slow because they scan parameters exhaustively. Machine learning, particularly convolutional neural networks, is commonly applied to raster images but is poorly suited to coordinate lists. For SMLM, SR-Tesseler provides fast, interactive segmentation but requires user interaction. The paper’s Discussion covers two further related frameworks: PointNet for point-cloud segmentation, which was trained on uniform densities that differ from SMLM data, and graph-theoretic dominant-sets clustering, which still needs empirical parameter estimation and can be computationally limited. The proposed approach complements these by using supervised neural networks on nearest-neighbor distance sequences tailored to SMLM.

Methodology

Overview: The method (CAML) uses supervised neural networks that take, for each point, a sequence of nearest-neighbor (NN) distances (or differences between consecutive distances) and output a binary label: clustered or not clustered.

Input features: For each localization, distances to its N nearest neighbors (N=100 or 1000). The number of neighbors should exceed the expected maximum points per cluster.
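The per-point input described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `nn_distance_features` and the `use_differences` flag are hypothetical, the search is brute force, and the handling of the first entry in the difference variant is an assumption.

```python
import math


def nn_distance_features(points, n_neighbors, use_differences=False):
    """Sorted distances from each point to its n_neighbors nearest neighbors.

    Brute force for clarity; a spatial index such as a k-d tree would be
    needed for datasets with millions of points.
    """
    if n_neighbors >= len(points):
        raise ValueError("n_neighbors must be smaller than the number of points")
    features = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        row = dists[:n_neighbors]
        if use_differences:
            # Assumption: keep the first distance, then consecutive differences.
            row = [row[0]] + [b - a for a, b in zip(row, row[1:])]
        features.append(row)
    return features


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
print(nn_distance_features(points, n_neighbors=2)[0])                        # [1.0, 5.0]
print(nn_distance_features(points, n_neighbors=2, use_differences=True)[0])  # [1.0, 4.0]
```

Each row has a fixed length N regardless of local density, which is what lets a coordinate list be fed to a fixed-size network input.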
Model architectures (Keras sequential):
- XPILJZ: Input 100 NN distances; two fully connected (dense) layers; output layer for binary classification.
- 07VEJJ: Input 100 NN distances; 1D convolutional layer; max pooling; dropout (20%); two stacked LSTM layers; additional dropout and pooling; flatten; dense layers (32→16); output layer. Designed to capture sequential dependencies in NN sequences.
- 87B144: Same as 07VEJJ but with input 1000 NN distances to handle larger, denser clusters at higher computational cost.
Data simulation for training/validation/testing: Points distributed within irregular 'cell-like' shapes mimicking T cell synapses (TIRF footprint). Cluster scenarios parameterized by: overall point density, points per cluster, fraction of points in clusters, and maximum cluster radius. 711 viable scenarios chosen to yield 1–5 clusters per µm² and intra-cluster densities 1.5×–100× background. Scenarios include a broad biologically plausible density range, including challenging high-density cases.
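A simplified generator for one such scenario might look like the following. It is a sketch under stated assumptions rather than the published simulation: it uses a square field instead of an irregular cell-like outline, hard-edged circular clusters with uniform internal density, and a hypothetical function name and parameter names (`simulate_scenario`, `field_um`, `seed`).

```python
import math
import random


def simulate_scenario(density, points_per_cluster, fraction_clustered,
                      cluster_radius_nm, field_um=2.0, seed=0):
    """Simulate one scenario; returns (points, labels) with coordinates in nm.

    density is in points per square micron; labels are 1 (clustered) or 0.
    """
    rng = random.Random(seed)
    side_nm = field_um * 1000.0
    n_total = round(density * field_um ** 2)
    n_clusters = round(n_total * fraction_clustered / points_per_cluster)
    points, labels = [], []
    for _ in range(n_clusters):
        # Keep centres one radius from the edge so discs stay inside the field.
        cx = rng.uniform(cluster_radius_nm, side_nm - cluster_radius_nm)
        cy = rng.uniform(cluster_radius_nm, side_nm - cluster_radius_nm)
        for _ in range(points_per_cluster):
            # The square root makes points uniform over the disc area.
            r = cluster_radius_nm * math.sqrt(rng.random())
            theta = rng.uniform(0.0, 2.0 * math.pi)
            points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
            labels.append(1)
    # Remaining points form the non-clustered background.
    for _ in range(n_total - n_clusters * points_per_cluster):
        points.append((rng.uniform(0.0, side_nm), rng.uniform(0.0, side_nm)))
        labels.append(0)
    return points, labels


points, labels = simulate_scenario(density=100, points_per_cluster=10,
                                   fraction_clustered=0.5, cluster_radius_nm=50.0)
print(len(points), sum(labels))  # 400 200
```

In this sketch a background point that happens to land inside a cluster disc keeps its non-clustered label; how such points are treated in the original simulations is not stated here.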
Training protocol: NN distances were computed for all points. Mixed samples of clustered and non-clustered points were compiled and split into training, validation, and test sets. For example, 07VEJJ was trained on 500,000 samples (balanced classes), validated on 100,000, and tested on 100,000. Ten-fold cross-validation was performed. An alternative input was also tested: normalized relative xy coordinates for each neighbor (2D vectors), which achieved similar accuracy but required much longer training times (especially for 1000 neighbors).
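One way to draw class-balanced, non-overlapping training, validation, and test sets from a labeled pool is sketched below. The helper `balanced_split` is hypothetical, and it assumes all three sets are balanced, whereas the summary only states this for the training set.

```python
import random


def balanced_split(labels, sizes, seed=0):
    """Return disjoint index lists, one per requested size.

    Each list is half clustered (label 1) and half non-clustered (label 0).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    needed = sum(size // 2 for size in sizes)
    if needed > min(len(pos), len(neg)):
        raise ValueError("not enough samples in one of the classes")
    splits, start = [], 0
    for size in sizes:
        half = size // 2
        # Take the next unused slice from each class, then mix the order.
        idx = pos[start:start + half] + neg[start:start + half]
        rng.shuffle(idx)
        splits.append(idx)
        start += half
    return splits


labels = [1] * 70 + [0] * 70
train, val, test = balanced_split(labels, sizes=(100, 20, 20))
print(len(train), len(val), len(test))  # 100 20 20
```

With the sizes quoted above, the call would be `balanced_split(labels, sizes=(500_000, 100_000, 100_000))` on a pool containing at least 350,000 points of each class.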
Post-processing/segmentation: After labeling, clustered points are grouped into clusters. Each clustered point is grouped with its nearest neighbor if that neighbor is also labeled clustered, and successively more distant neighbors are included until a non-clustered neighbor is encountered. Cluster shapes are formed by placing discs centered on the cluster points (radius proportional to the mean NN distance) and taking their union; the resulting outline is then eroded in proportion to the mean NN distance to give the final cluster shape. Cluster properties (area, shape, points per cluster, densities) are then quantified.
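The grouping rule can be read as follows: walk outward from each clustered point through its neighbors in order of distance, link every clustered neighbor, and stop at the first non-clustered one. The sketch below implements that reading with a union-find structure. It is an interpretation of the description, not the published implementation; the function name is hypothetical and the disc-union and erosion steps are not covered.

```python
import math


def group_clustered_points(points, labels):
    """Return a cluster id per point (-1 for non-clustered points)."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, p in enumerate(points):
        if labels[i] != 1:
            continue
        # Neighbors of point i, nearest first (brute force for clarity).
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(p, points[j]))
        for j in order:
            if labels[j] != 1:
                break  # first non-clustered neighbor ends the walk
            parent[find(j)] = find(i)

    ids, result = {}, []
    for i in range(len(points)):
        if labels[i] != 1:
            result.append(-1)
            continue
        root = find(i)
        ids.setdefault(root, len(ids))
        result.append(ids[root])
    return result


points = [(0, 0), (1, 0), (0, 1), (5, 0), (10, 0), (11, 0)]
labels = [1, 1, 1, 0, 1, 1]
print(group_clustered_points(points, labels))  # [0, 0, 0, -1, 1, 1]
```

In this sketch the walk is unbounded, so a dataset with no non-clustered points would collapse into a single group, and a clustered point whose nearest neighbor is non-clustered becomes a singleton. A practical version would presumably cap the walk at a fixed number of neighbors.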
Performance evaluation on simulated data: Models trained on a range of scenarios and tested on held-out and novel scenarios, including densities from 10–500 points/µm², varied cluster sizes (including beyond training, up to 100 points; additional tests at 150–200 points per cluster), Gaussian-distributed clusters, and structured clusters (lines, rings). Sensitivity analyses on model input window size (50/100/200 NNs) and training on smaller vs larger clusters. Comparative benchmarking with Getis & Franklin LPPA, Bayesian cluster analysis, DBSCAN, and SR-Tesseler. Timings measured where possible.
Extensions:
- 3D model (GAXJPR): Same general architecture as XPILJZ, input 1000 3D NN distances; trained on 3D simulated spherical clusters with axial spread up to 500 nm.
- Multi-class model (3TXKFS): Input from 1000 neighbors; outputs three classes: non-clustered, clustered (round), clustered (fiber), trained on mixed circular and filamentous structures.
Experimental application: Human primary T cells (naive and pre-stimulated) imaged under non-activating (glass) and activating (anti-CD3 + ICAM-1) conditions. dSTORM imaging of Csk and PAG in TIRF. Image reconstruction via ThunderSTORM with specified filtering, drift correction, and re-blinking mitigation (merging within 50 nm and 25 frames). Statistical analyses by Kruskal–Wallis with Dunn’s tests.
Software/data: Python scripts and trained models available at https://gitlab.com/quokka79/caml; simulated data at https://osf.io/xa4zj/.

Key Findings

Model performance on simulated data:
- 07VEJJ: 92.4% accuracy on training and testing sets; F1=0.9243 (precision 0.9245, recall 0.9243). 10-fold CV: 91.8 ± 0.3% accuracy.
- 87B144 (1000 NNs): 94.0% accuracy on training and testing; F1=0.9398 (precision 0.9420, recall 0.9399).
- XPILJZ: 91.9% (train) and 92.0% (test) accuracy; F1=0.9199 (precision 0.9204, recall 0.9199). Recovered cluster properties close to ground truth.
- Robust performance (>90% accuracy) across many novel scenarios, including very high densities and clusters up to the model’s input window; decreased accuracy at low overall densities (e.g., 84.5% at 10 points/µm²; 89.2% at 50 points/µm²) and when clusters exceeded the input window (fragmentation or missed clusters for >100 points per cluster with a 100-NN model).
- Models trained on Euclidean NN distances achieved similar accuracy to coordinate-input models but trained much faster.
- 3D model (GAXJPR): 97.5% accuracy on 3D test data; it also detected clusters in experimental PALM data of LAT-mEos3.2, including clusters that appear elongated because of axial resolution limits.
- Multi-class model (3TXKFS): Accurately distinguished circular clusters and filamentous structures in simulations and performed well on experimental data.
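For reference, the accuracy, precision, recall, and F1 figures quoted above are standard per-point classification metrics. The sketch below computes them for the clustered class from true and predicted labels; the function name is hypothetical. Whether the reported precision and recall were averaged across both classes is not stated here, so the harmonic mean of the quoted precision and recall need not reproduce the quoted F1 exactly.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive (clustered) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```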
Benchmarking:
- CAML was comparable in accuracy to G&F LPPA and Bayesian cluster analysis, and faster than the Bayesian method (which sometimes did not finish on high-density complete spatial randomness (CSR) tests). DBSCAN was sensitive to the choice of epsilon and to the scenario. SR-Tesseler was computationally fast but required significant user interaction. CAML’s speed was competitive, especially on large datasets (roughly 0.5 million points or more).
Biological results (dSTORM of T cell synapses; model 87B144):
- Csk:
  - Naive cells: clusters per µm² increased from 4.85 (median, IQR 3.15–7.38) to 8.80 (IQR 6.75–10.75) upon activation, P<0.0001. Percentage of points clustered increased from 53.2% (IQR 48.3–59.7%) to 61.1% (IQR 53.1–64.4%), P<0.0001. Points per cluster increased from 9 (IQR 6–14) to 11 (IQR 7–18), P<0.0001. Cluster area decreased from 2322 nm² (IQR 1310–4522) to 2144 nm² (IQR 1261–3770), P<0.0001.
  - Pre-stimulated cells: clusters per µm² increased from 6.77 (IQR 5.13–8.37) to 14.37 (IQR 11.03–18.57), P<0.0001. Cluster area decreased from 3093 nm² (IQR 1632–6154) to 1920 nm² (IQR 1030–3655), P<0.0001. Points per cluster higher than naive irrespective of activation (P<0.0001). Between statuses: non-activated pre-stimulated cells had larger Csk clusters than naive (P<0.0001), but clusters became smaller upon activation (P<0.0001).
- PAG:
  - Naive cells: clusters per µm² did not significantly change with activation (10.19 vs 13.89 clusters/µm²; P=0.3303).
  - Pre-stimulated vs naive: fewer PAG clusters in pre-stimulated cells in both non-activated (P=0.0002) and activated (P<0.0001) conditions. Points per cluster decreased with activation in pre-stimulated cells (14 [IQR 7–32] to 11 [IQR 6–20], P<0.0001). More points per cluster in pre-stimulated than naive under non-activating conditions (P<0.0001), no difference when activated (P=0.0763). Cluster areas larger in pre-stimulated cells (non-activated: 19504 nm², IQR 8597–41690; activated: 15669 nm², IQR 8054–30041) than naive (non-activated: 1551 nm², IQR 841–3159; activated: 1359 nm², IQR 766–2522), P<0.0001 in both cases.

Discussion

The proposed supervised neural-network approach classifies individual SMLM localizations as clustered or not using nearest-neighbor distance sequences, enabling accurate, scalable, and fast analysis of entire datasets without subjective thresholds or ROIs. Per-cluster segmentation downstream of point classification allows measurement of cluster area, shape, and density, providing richer spatial insights than global metrics like Ripley’s K. The method handled heterogeneous data and various cluster morphologies despite being trained on simple circular clusters, and it generalized to 3D data and multi-class labeling. Application to T cells revealed biologically relevant remodeling of Csk and PAG clustering with cell status and activation, suggesting altered PAG-mediated recruitment of Csk and potential shifts in inhibitory signaling complexes. Compared with existing methods, CAML reduces parameter sensitivity and user intervention while maintaining accuracy and improving throughput on large datasets. Nonetheless, performance depends on the model’s input window and the representativeness of training simulations, and extreme scenarios (very low densities, clusters larger than the input window) can reduce accuracy.

Conclusion

This work introduces CAML, a supervised machine-learning framework that leverages nearest-neighbor distance sequences to rapidly and accurately classify and quantify clustering in SMLM data. The approach scales to millions of points, minimizes user-defined parameters, and supports downstream per-cluster measurements. It performs competitively with established methods while offering improved automation and speed, and it generalizes to 3D datasets and multi-class structural labels. Demonstrations on T cell data uncovered condition-dependent remodeling of Csk and PAG clustering. Future directions include incorporating additional features (e.g., neighbor angles, localization precision, photon counts), developing models that output cluster membership relationships among neighbors, extending to dynamic live-cell data and multi-channel co-clustering, and exploring unsupervised learning to reduce training bias.

Limitations

Dependence on training simulations: potential bias if simulated scenarios do not reflect experimental structures; models trained on hard-edged circular clusters may underperform on drastically different structures unless retrained.
Fixed input window (number of nearest neighbors): clusters containing more points than the model’s input window can be fragmented or missed, so N must be chosen to exceed the expected number of points per cluster.
Reduced accuracy at very low overall point densities and at extremes of parameter ranges; potential overfitting for large-window models (e.g., 87B144 on hard-edged clusters).
No built-in correction for SMLM artifacts (fluorophore re-blinking, sample drift, chromatic aberration, PSF overlap, labeling density issues); assumes these are corrected during reconstruction.
Does not incorporate localization precision by default; although possible, it was not used here and may affect generalizability across acquisition setups.
Occasional aberrant clustering near irregular cell boundaries or protrusions; typically large, sparse clusters that can be filtered post hoc.
