Machine learning-aided design and screening of an emergent protein function in synthetic cells

S. Kohyama, B. P. Frohn, et al.

This research by Shunshi Kohyama, Béla P. Frohn, Leon Babl, and Petra Schwille presents a machine learning-aided pipeline for designing and screening proteins with emergent functions, using the MinDE system as a test case. A high-scoring variant fully replaces the wild-type MinE gene in E. coli, pointing to the potential of this approach for engineering cellular functions.

Introduction
The study addresses the challenge of designing proteins with emergent, higher-order cellular functions—behaviors observable only within specific biological contexts, such as pattern formation or membrane deformation. While ML-based generative models have advanced design for individual protein functions (e.g., catalysis, small-molecule binding), predicting and engineering emergent functions remains difficult due to conformational switching and cooperative interactions among proteins, lipids, and nucleotides. The authors propose that tailored, integrated screening—computational and experimental—is required for such functions. They focus on the bacterial MinDE system, where MinD and MinE form ATP-driven reaction-diffusion dynamics to position the division ring via pole-to-pole oscillations. The aim is to create an effective pipeline to screen ML-generated MinE variants for the emergent function of intracellular spatiotemporal pattern formation and to test whether such variants can functionally substitute the wild-type gene in vivo.
Literature Review
The authors situate their work within recent advances in ML-driven protein design, including conditional generative models (e.g., conditioned on Gene Ontology terms or Enzyme Commission numbers) and models that incorporate non-protein atoms. Prior work has successfully generated proteins with individual functions (catalysis, small-molecule binding, spike capping), but emergent functions are harder to predict and validate. The MinDE system is a well-established model for biological pattern formation and has been reconstituted in vitro on membranes and within lipid compartments, highlighting the importance of membrane interactions and spatial confinement. Evolution-based generative approaches like MSA-VAE have been experimentally validated and can produce diverse functional variants, outperforming simpler HMM-based sampling. The study builds on this background to develop an integrated screening pipeline suitable for emergent functions.
Methodology
Design and generation: The team used an MSA-based Variational Autoencoder (MSA-VAE) trained on 5,958 non-redundant MinE sequences (InterPro IPR005527; MSA width 186 columns). The VAE (two 128-unit hidden layers each for encoder and decoder, 16-dimensional latent space, ReLU activations, Adam with learning rate 0.001, 60 epochs) employed a modified ELBO with a down-weighted KL term (weight 0.01) to prevent mode collapse. They sampled 4,000 sequences from the latent space, choosing the amino acid at each position by argmax. Evaluation included amino acid frequency correlations with natural sequences and PCA of latent encodings, which confirmed evolutionary constraints and phylogenetic clustering.
Initial computational filtering: Sequences with ≥60% identity to E. coli MinE were excluded. The remaining sequences were clustered at 60% identity, and one sequence was randomly selected per cluster, yielding 167 heterogeneous candidates.
In silico divide-and-conquer scoring: Structures were predicted with AlphaFold2 Multimer for (i) the MinE:MinD heterodimer (using E. coli MinD) and (ii) the MinE:MinE homodimer. Four sub-scores were computed: (1) membrane binding, via N-terminal hydrophobicity (ProteinSol Patches) on the predicted heterodimer; (2) MinD interaction, via the mean PAE between the MinD-binding helix of MinE and the structured MinD regions; (3) MinE homodimerization, via the mean inter-chain PAE over structured regions; and (4) solubility in E. coli (ProteinSol). Scores were normalized (with inversion for the PAE-based metrics, where lower values are better) and summed to a Function Score ranging from 0 to 4. The 24 highest- and 24 lowest-scoring variants were chosen for experimental tests and double-blinded as synMinEv1–v48.
In vitro screening with synthetic cells and cell-free expression: Variants were synthesized using the PURE cell-free system (1 h incubation). Reaction mixtures with EGFP-MinD and ATP were encapsulated in POPC/POPG (70:30) lipid droplets (water-in-oil emulsions). Confocal imaging (10 s intervals) detected Min oscillation patterns; the outcome (oscillatory vs non-oscillatory) was scored by wave occurrence across droplets. This accelerated screening (about 24 variants/day; all 48 variants in ~2 days) avoided purification.
In vivo screening in E. coli: The fourteen in vitro positives were cloned with GFP-MinD into ΔminDE E. coli (HL1). Oscillations were monitored by live-cell confocal microscopy, and cell phenotypes were categorized as normal, minicell, or filamentous. The five highest-scoring in silico variants that were negative in vitro were also tested to check for false negatives of the in vitro screen.
Functional characterization: Selected high-scoring variants (six purified, including synMinEv25) underwent in vitro assays: MinD ATPase stimulation (NADH-coupled assay with DOPC/DOPG SUVs), membrane binding (QCMD on supported lipid bilayers), and oligomerization (size exclusion chromatography). Growth curves, cell-size distributions, and oscillation period vs cell length were quantified in vivo for synMinEv25 vs wild type.
Post-hoc analyses: Statistical tests (Mann-Whitney-Wilcoxon, AUC) compared in silico scores to experimental outcomes. An improved two-feature function score (MinD interaction + N-terminal hydrophobicity) was derived and benchmarked against sequence identity and HMM-profile scores.
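As a rough illustration of how the four sub-scores could be combined into the 0–4 Function Score, the sketch below min-max normalizes each sub-score across the candidate set, inverts the two PAE-based metrics (lower PAE is better), and sums the results. The summary does not state which normalization the authors used, so min-max scaling is an assumption; the `SubScores` fields, the function names, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubScores:
    """Raw sub-scores for one variant (field names are hypothetical)."""
    name: str
    nterm_hydrophobicity: float  # membrane-binding proxy, higher is better
    mind_pae: float              # MinE:MinD mean PAE, lower is better
    homodimer_pae: float         # MinE:MinE mean inter-chain PAE, lower is better
    solubility: float            # predicted solubility, higher is better


def min_max(values, invert=False):
    """Scale values to [0, 1]; with invert=True the smallest value maps to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # No spread in this sub-score: assign a neutral value (arbitrary choice).
        return [0.5] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


def function_scores(variants):
    """Return {variant name: summed normalized score in the range 0..4}."""
    columns = [
        min_max([v.nterm_hydrophobicity for v in variants]),
        min_max([v.mind_pae for v in variants], invert=True),
        min_max([v.homodimer_pae for v in variants], invert=True),
        min_max([v.solubility for v in variants]),
    ]
    return {v.name: sum(col[i] for col in columns) for i, v in enumerate(variants)}


def select_extremes(scores, k):
    """Return the k highest- and k lowest-scoring names (k = 24 in the study)."""
    if 2 * k > len(scores):
        raise ValueError("k is too large: top and bottom groups would overlap")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], ranked[-k:]


# Made-up example values, for illustration only.
candidates = [
    SubScores("variant_a", 0.8, 5.0, 6.0, 0.7),
    SubScores("variant_b", 0.2, 20.0, 18.0, 0.3),
    SubScores("variant_c", 0.5, 12.5, 12.0, 0.5),
]
scores = function_scores(candidates)
top, bottom = select_extremes(scores, 1)
print(scores, top, bottom)
```

Because each sub-score is scaled relative to the candidate set, a variant's Function Score in this sketch depends on which other variants are scored alongside it. Dropping the solubility and homodimerization columns would give the analogue of the simplified two-feature score.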
Key Findings
- Generation and computational screening: 4,000 MinE variants were generated; 167 remained after identity filtering and clustering. A four-component Function Score (membrane binding, MinD interaction, homodimerization, solubility) ranked candidates; 48 (24 high, 24 low) were selected for experiments.
- In vitro synthetic-cell screening: 14 of 48 variants produced Min oscillations in lipid droplets. Of these, 10 were from the high-score group and 4 from the low-score group. The initial Function Score significantly distinguished in vitro positives from negatives (Mann-Whitney-Wilcoxon p=0.03, AUC=0.68) with minimal correlation to sequence similarity.
- Improved in silico metric: A simplified score combining only MinD interaction and N-terminal hydrophobicity strongly separated in vitro positives from negatives (p=2e-7, AUC=0.92), outperforming sequence identity to the nearest homolog and HMM-profile scoring.
- In vivo screening: 7 of the 10 high-scoring in vitro positives induced Min oscillations in ΔminDE E. coli, while only 1 low-scoring variant did. None of the five highest-scoring variants that were negative in vitro oscillated in vivo, supporting the in vitro filter.
- Full functional substitution: synMinEv25 fully substituted for wild-type MinE in vivo, restoring normal morphology and robust Min oscillations. Growth rates matched wild type (no significant difference at 300 min, Welch's t test), the minicell fraction was comparable (2.1% wt vs 2.3% v25), and median cell length was similar (3.5 µm wt vs 3.4 µm v25) with a narrower spread (2.4 µm wt vs 1.4 µm v25). Oscillation period as a function of cell length tracked wild type, with slightly slower oscillations (<10% difference; 39 s vs 42 s noted).
- Biochemical parity: synMinEv25 matched wild type in membrane binding (QCMD), MinD ATPase stimulation, and oligomerization profiles. Other variants showed deviations that correlated with phenotypes: elevated ATPase stimulation with wild-type-like oligomerization tended toward a filamentous phenotype, whereas wild-type-like ATPase stimulation with larger oligomers tended toward a minicell phenotype.
- Sequence comparisons: synMinEv25 shares <50% identity and <70% similarity with E. coli MinE; its identity to the closest natural homolog is 78.7%. Despite having to satisfy multiple sub-functions, synMinEv25 is active at roughly the sequence distance (~80% identity to the closest homolog) that has been reported as an empirical activity cutoff for engineered enzymes.
- Throughput benefits: Cell-free expression yielded >80% successful variant production versus ~60% success for purification, enabling full experimental screening of 48 variants in ~2 days (24/day).
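The AUC values reported above can be read as the probability that a randomly chosen in vitro positive receives a higher score than a randomly chosen negative. This quantity is the Mann-Whitney U statistic divided by the number of positive-negative pairs. The sketch below computes it directly in plain Python; the function name and the example scores are hypothetical and are not data from the study. A p-value would normally come from a library routine such as `scipy.stats.mannwhitneyu`, which is not reproduced here.

```python
def mann_whitney_auc(positive_scores, negative_scores):
    """AUC as U / (n_pos * n_neg); tied pairs contribute one half."""
    if not positive_scores or not negative_scores:
        raise ValueError("both groups must contain at least one score")
    u = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                u += 1.0   # positive ranked above negative
            elif p == n:
                u += 0.5   # tie counts as half a win
    return u / (len(positive_scores) * len(negative_scores))


# Made-up scores for illustration only (not values from the paper).
oscillating = [3.1, 2.8, 2.5, 1.9]
non_oscillating = [2.6, 2.0, 1.4, 1.1, 0.9]
print(mann_whitney_auc(oscillating, non_oscillating))
```

An AUC of 0.5 corresponds to a score with no discriminative power, and 1.0 to perfect separation of positives from negatives.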
Discussion
The work demonstrates that emergent protein functions can be effectively screened by decomposing them into measurable sub-functions and combining structural predictions with synthetic-cell assays. The divide-and-conquer in silico approach, using AlphaFold2 Multimer-derived features and N-terminal hydrophobicity, provided predictive power beyond sequence-based similarity measures. The in vitro droplet-based system with cell-free expression supplied the necessary membrane environment and rapid throughput to assay spatiotemporal pattern formation. The best in silico and in vitro candidate (synMinEv25) also substituted for the wild-type gene in vivo, which validates the pipeline and indicates that emergent functions can be engineered and transferred to cellular contexts. The study suggests that the approach may generalize to other systems where higher-order behavior emerges from defined sub-functions (e.g., motor proteins requiring track binding, asymmetry, and allostery). It also highlights that in vivo constraints are stricter than in vitro ones: fewer variants oscillated in cells, likely due to confinement and competing interactions (e.g., with MinC). Finally, the work underscores the caution needed when using ML-based function predictors, which can inherit dataset biases, and advocates structure- and surface-feature-based assessments for emergent functions.
Conclusion
This study establishes an integrated in silico–in vitro–in vivo pipeline to design and screen proteins for emergent functions. By generating MinE variants with an MSA-VAE and applying a divide-and-conquer structural scoring, followed by synthetic-cell screening, the authors identified synMinEv25, which fully replaces wild-type MinE in E. coli, restoring growth, morphology, and Min oscillations. The simplified two-feature score (MinD interaction + N-terminal hydrophobicity) strongly predicts emergent function and outperforms sequence similarity and HMM-based measures. The approach is adaptable to other emergent systems by defining appropriate sub-functions and leveraging modular synthetic-cell environments. Future work may refine score weighting based on quantitative phenotypic correlates (e.g., balancing ATPase stimulation and oligomerization), expand to de novo designed proteins, and engineer tailored oscillation dynamics or other cellular behaviors.
Limitations
- The training dataset for MinE was modest (5,958 sequences), and VAE performance was assessed with frequency correlations rather than a held-out test set; generalization is inferred from experimental validation.
- The AlphaFold-based sub-function proxies (PAE for interactions and dimerization; hydrophobicity for membrane binding) are indirect and may not fully capture conformational switching and dynamics.
- Only a subset of variants (48) was experimentally tested, and deeper biochemical characterization was limited to six purified high-scoring variants.
- Correlations between individual in silico sub-scores and in vitro biochemical measures were weak, likely due to small sample sizes and the complexity of emergent functions.
- In vivo validation revealed stricter requirements than in vitro screening; additional cellular factors (e.g., MinC interactions) influence proper division site placement and were not explicitly modeled in the scoring.
- The simplified scoring omitted solubility and dimerization because they lacked discriminative power in this dataset; this weighting may not hold for other proteins or contexts.