Machine learning approach for discrimination of genotypes based on bright-field cellular images

G. Suzuki, Y. Saito, et al.

This study tests whether bright-field microscopy images can distinguish single-gene mutant cells from wild-type cells using a machine learning approach. The authors, including Godai Suzuki and Yutaka Saito, use texture features from the images to build and compare morphological profiles of mutant cells.

Introduction

The study investigates whether label-free bright-field (BF) microscopy images contain sufficient morphological information to discriminate single-gene perturbations in human cells. Prior image-based profiling has largely relied on fluorescence microscopy targeting specific organelles, while BF imaging—though lower contrast—offers advantages such as minimal sample preparation, lower cost, and reduced imaging artifacts. The authors pose the research question of whether single-gene knockout mutants can be distinguished from wild-type cells using machine learning on BF images and whether the resulting morphological feature profiles reflect functional relationships among genes. They focus on genes in the ubiquitin-proteasome system with paralog redundancy to ensure nonlethal, subtle phenotypes, testing the limits of BF-based profiling for sensitive genotype discrimination.

Literature Review

The paper situates its work within morphological profiling and high-content imaging literature. Previous studies have successfully used fluorescence microscopy and deep learning for high-throughput phenotyping and organelle-specific analyses. BF imaging has been used less frequently due to lower contrast but has shown capability to classify different cell lines, distinguish infected vs. non-infected macrophages, and identify live vs. dead cells. BF images can encode information about organelle morphology and localization, and nuclei can be detected from BF alone. However, prior BF-based classifications typically addressed large phenotypic differences (e.g., across cell lines or infection status) or used advanced hardware/3D stacks or large CNN datasets. The present work addresses the gap of detecting subtle, single-gene perturbations using standard BF images and interpretable feature-based models.

Methodology

- Cell lines and genome editing: HEK293Ta cells expressing Cas9 were established and used as the parental wild type. Eight nonlethal genes in the ubiquitin-proteasome system with paralog redundancy (PSMA2, PSMA7, PSMB5, PSMB6, PSME1, PSME2, UBQLN1, UBQLN2) were targeted with CRISPR-Cas9 using lentiTRACE-puro sgRNA constructs. Lentiviral transduction, antibiotic selection, single-cell cloning via FACS, and deep sequencing-based genotyping (RCP-PCR indexing, BLASTn alignment, allele frequency filtering) were employed. Clones with frameshift mutations on both alleles were selected. Two independent clones per gene were pooled for analysis.
- Imaging: Knockout clones were seeded at three densities (C400, C2000, C10000). One to three days post-passage, nuclei were stained with Hoechst 33342. BF and fluorescence images were acquired using an IN Cell Analyzer 6000 (20x lens), capturing 25 fields per well. For primary analyses, the C10000 dataset was used; C2000 and C400 served as independent hold-out test sets.
- Single-cell extraction and feature quantification: Nuclei were detected in Hoechst images using a Faster R-CNN detector (ResNet-101 backbone) trained on annotated data. BF image patches were cropped around each detected nucleus to define single-cell regions. From each cell's BF patch, 296 texture features (LPX296/KBI feature extractor) were computed, including intensity distribution statistics, binarized shape-related measurements (e.g., thinness, size, perimeter), and multi-threshold clump-based metrics capturing the number, size, and non-circularity of darker intracellular structures.
- Preprocessing: Each feature was standardized using the mean and SD of the wild-type population. Outliers (beyond ±3 SD) in each mutant-feature pair were removed. Feature datasets from two independent clones per gene were pooled.
- Classification models: For each mutant vs. wild-type comparison, a logistic regression with L1 regularization (LASSO) was trained using glmnet. The model output is y = σ(β'x + α), where σ is the logistic function, x is the feature vector, β is the coefficient vector, and α is the intercept. Nested 10-fold cross-validation optimized the penalty λ (inner loop) and estimated performance by AUC (outer loop); additional metrics are in Supplementary Table 1. Alternative models (SVM, random forest, k-NN) were implemented in scikit-learn, with and without PCA, with hyperparameters optimized via nested CV. Image preprocessing variants (blur, edge enhancement, sharpen) were evaluated for their impact on AUC.
- Feature interpretation: Features with nonzero, large-magnitude coefficients in the L1 models were identified as contributors. The clump-detection procedure was applied across multiple thresholds τ, and the sample means across τ were computed for the number of clumps, average clump size, and area-weighted non-circularity.
- Similarity analysis: The morphological profile of each mutant was defined as the vector of fitted regression coefficients (β). Pairwise Pearson correlations between profiles were computed, and hierarchical clustering (complete linkage) was performed to assess relationships among mutants.
- Data size and reproducibility: Learning curves assessed AUC vs. training set size. Generalization was tested by training on C10000 and evaluating on the independent C2000 and C400 datasets.
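The preprocessing and classification steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the study used glmnet, whereas this sketch substitutes scikit-learn's `LogisticRegression` (whose `C` is an inverse penalty strength, playing the role of 1/λ), the data are synthetic, the ±3 SD outlier removal is omitted, and the helper names `standardize_by_wild_type` and `nested_cv_auc` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score


def standardize_by_wild_type(X, is_wild_type):
    """Z-score every feature using the wild-type mean and SD only."""
    mu = X[is_wild_type].mean(axis=0)
    sd = X[is_wild_type].std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # guard against constant features
    return (X - mu) / sd


def nested_cv_auc(X, y, n_outer=10, n_inner=10, seed=0):
    """Outer folds estimate AUC; inner folds pick the L1 penalty strength."""
    inner = StratifiedKFold(n_splits=n_inner, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=n_outer, shuffle=True, random_state=seed + 1)
    model = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid={"C": np.logspace(-2, 1, 6)},  # assumed search grid
        scoring="roc_auc",
        cv=inner,
    )
    # cross_val_score refits the whole grid search inside each outer fold,
    # so the penalty is never tuned on the data used for scoring.
    return cross_val_score(model, X, y, scoring="roc_auc", cv=outer)


# Synthetic stand-in for per-cell features: 0 = wild type, 1 = mutant.
rng = np.random.default_rng(0)
n_cells, n_features = 200, 20
y = np.repeat([0, 1], n_cells // 2)
X = rng.normal(size=(n_cells, n_features))
X[y == 1, :3] += 1.0  # mutants differ in three features only

Z = standardize_by_wild_type(X, y == 0)
aucs = nested_cv_auc(Z, y, n_outer=5, n_inner=5)
print(f"mean outer-fold AUC: {aucs.mean():.3f}")
```

Standardizing against the wild-type population alone means each feature value reads as a deviation from wild type, which keeps the fitted coefficients comparable across mutants.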

Key Findings

- Discrimination performance: Using logistic regression with L1 feature selection, PSMB5 mutant vs. wild-type discrimination achieved mean AUC = 0.773 (10-fold CV). Across all mutants, AUCs exceeded 0.59, demonstrating above-random discrimination from BF images. Alternative classifiers (SVM, random forest, k-NN, with/without PCA) yielded comparable AUCs, but L1-logistic regression performed best.
- Contributing features: On average, 53 features were selected per mutant model. High-weight features often derived from darker intracellular "clumps" (putative organelles/structures). PSMB5 mutants exhibited more clumps, larger clump areas, and higher non-circularity than wild type. Additional contributing features included intensity distribution statistics and binarized shape metrics (thinness, size, perimeter).
- Functional relationships reflected in profiles: Hierarchical clustering of mutants by regression coefficients grouped paralog pairs with shared functions (PSME1–PSME2; UBQLN1–UBQLN2). Notably, a nonparalog pair (PSMB5–PSMA7) clustered together, consistent with the physical proximity of their encoded subunits in the proteasome core complex.
- Data size and preprocessing: AUC gains saturated with increasing training size for most mutants, except UBQLN1/UBQLN2, where more data may help. Simple image preprocessing (blur, edge enhancement, sharpening) slightly decreased AUC.
- Reproducibility: Models trained on C10000 and tested on the independent C2000 and C400 sets showed slightly reduced but still above-baseline AUCs (>0.5) across all mutants, indicating applicability to independent datasets.
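To make the clump-derived features more concrete, the sketch below counts connected dark regions in a single-cell patch at several thresholds τ and averages the clump count and mean clump size across thresholds. It is an illustration only: the threshold values, the 4-connectivity rule, and the function name `clump_metrics` are assumptions, the actual feature extractor may define clumps differently, and non-circularity is left out because the summary does not give its formula.

```python
import numpy as np
from scipy import ndimage


def clump_metrics(patch, thresholds):
    """Average clump count and mean clump area over several thresholds."""
    counts, sizes = [], []
    for tau in thresholds:
        mask = patch < tau  # pixels darker than the threshold
        labels, n_clumps = ndimage.label(mask)  # 4-connected components
        counts.append(n_clumps)
        if n_clumps > 0:
            areas = np.bincount(labels.ravel())[1:]  # drop background label 0
            sizes.append(areas.mean())
        else:
            sizes.append(0.0)
    return float(np.mean(counts)), float(np.mean(sizes))


# Toy 8x8 patch: bright background, one dark 2x2 blob, one mid-gray pixel.
patch = np.full((8, 8), 200.0)
patch[1:3, 1:3] = 50.0
patch[5, 5] = 100.0

mean_count, mean_size = clump_metrics(patch, thresholds=[75, 150])
print(mean_count, mean_size)  # 1.5 3.25
```

At τ = 75 only the 2x2 blob is detected (one clump of area 4); at τ = 150 the isolated pixel joins as a second clump (areas 4 and 1), which gives the averaged values printed above.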

Discussion

The findings confirm that label-free BF microscopy contains sufficient morphological signal to detect subtle, single-gene perturbations in human cells using interpretable machine learning. The successful discrimination addresses the central question and extends prior BF-based classification beyond markedly different phenotypes to single-gene knockouts. The prominent role of clump-derived features suggests that morphological alterations in organelles or intracellular structures underlie the discriminative signal, consistent with prior evidence that BF images encode organelle morphology and localization. Moreover, similarity of feature coefficient profiles mirrored functional relationships—paralogs clustered together and a pair of physically contacting proteasome subunits grouped—supporting that the extracted profiles capture biologically meaningful effects of gene dysfunction. Compared with prior approaches requiring high-resolution or 3D imaging or large CNN datasets, the study demonstrates that a simpler, feature-based logistic model trained on >670 2D images per class can effectively profile mutants. Generalization tests showed consistent above-baseline discrimination on independent datasets, though with modest performance drops, underscoring both promise and areas for improvement.
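The profile comparison discussed above (Pearson correlation between coefficient vectors, followed by complete-linkage hierarchical clustering) can be sketched as follows. The coefficient profiles here are synthetic, the mutant names are placeholders rather than genes from the study, and using 1 − r as the clustering distance is an assumption, since the summary does not state how correlations were converted to distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_profiles(profiles, n_clusters):
    """Cluster rows (one coefficient profile per mutant) by Pearson correlation."""
    corr = np.corrcoef(profiles)  # pairwise Pearson r between rows
    dist = 1.0 - corr  # assumed conversion from correlation to distance
    np.fill_diagonal(dist, 0.0)
    dist = (dist + dist.T) / 2.0  # remove floating-point asymmetry
    tree = linkage(squareform(dist, checks=False), method="complete")
    return corr, fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic profiles: two pairs of hypothetical mutants with shared effects.
rng = np.random.default_rng(0)
n_features = 296
base_a = rng.normal(size=n_features)
base_b = rng.normal(size=n_features)
names = ["A1", "A2", "B1", "B2"]  # placeholder names, not real genes
profiles = np.vstack([
    base_a + 0.3 * rng.normal(size=n_features),
    base_a + 0.3 * rng.normal(size=n_features),
    base_b + 0.3 * rng.normal(size=n_features),
    base_b + 0.3 * rng.normal(size=n_features),
])

corr, cluster_ids = cluster_profiles(profiles, n_clusters=2)
for name, cid in zip(names, cluster_ids):
    print(name, cid)
```

Because the profiles are regression coefficients on wild-type-standardized features, a high correlation between two mutants means the same features shift in the same direction relative to wild type.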

Conclusion

This study demonstrates that single-gene knockout cells can be discriminated from wild-type cells using texture features from label-free bright-field images and a sparse logistic regression model. Contributing features point to morphological changes of intracellular structures, and feature profiles reflect functional and even physical relationships among gene products. These results establish BF-based, low-cost, and label-free morphological profiling as a viable strategy for genetic perturbation detection, with potential to impact high-throughput phenotyping and functional genomics. Future work should broaden gene coverage beyond the ubiquitin-proteasome system, integrate additional label-free modalities (e.g., phase-contrast, DIC), explore advanced yet interpretable models, and expand datasets for mutants with weaker signals to further improve accuracy and generalization.

Limitations

- Potential off-target effects from CRISPR-Cas9 editing were not ruled out; while two independent clones per gene were pooled, whole-genome sequencing would be needed to assess off-target impacts.
- Scope was limited to eight genes in the ubiquitin-proteasome pathway; generalizability to other pathways and cellular functions remains to be established.
- Independent dataset performance, while above random, was slightly reduced compared to cross-validation, indicating room for improved robustness.
- Simple image preprocessing degraded performance; optimization of preprocessing and feature extraction specific to BF imaging may yield gains.
