Biology

Deep flanking sequence engineering for efficient promoter design using DeepSEED

P. Zhang, H. Wang, et al.

DeepSEED is an AI-powered framework for designing the synthetic promoters that synthetic biology depends on. This research, conducted by Pengcheng Zhang and colleagues, optimizes flanking sequences and uncovers hidden features that enhance the function of *E. coli* and mammalian promoters.

Introduction

Precise control of gene expression is essential in synthetic biology and gene therapy, and it requires synthetic promoters with specific characteristics. Traditionally, promoter design focuses on cis-regulatory elements such as transcription factor binding sites (TFBSs), often represented by position weight matrices (PWMs). However, the sequences flanking TFBSs also significantly affect promoter properties, through physicochemical properties such as DNA shape, the flanking-sequence preferences of specific transcription factors (TFs), and the presence of low-affinity binding sites. These factors are difficult to incorporate into explicit design rules, leaving flanking sequence optimization largely unexplored. Deep learning models show promise in promoter engineering because they capture implicit sequence patterns, but their application to specific promoter types (e.g., inducible promoters) is limited by the scarcity of naturally occurring examples. Current methods therefore rely heavily on expert knowledge, often leading to suboptimal designs. This paper introduces DeepSEED to bridge this gap by integrating expert knowledge with deep learning: experts specify the key elements, and the model optimizes the sequences that flank them.

Literature Review

Existing research emphasizes the importance of cis-regulatory elements, particularly TFBSs, in determining promoter activity. Promoter design methods often manipulate the combination and arrangement of TFBS motifs based on known sequence preferences. However, recent studies increasingly highlight the contribution of flanking sequences to promoter function. These flanking regions influence promoter properties through several mechanisms, including effects on DNA shape, flanking-sequence preferences of individual transcription factors, and low-affinity binding sites that can enhance transcription factor binding. These influences are complex and not easily captured by simple rules, which motivates machine learning approaches. Previous attempts to use deep learning for promoter design have focused on generating de novo promoters, but these methods struggle to design promoters with specific properties because training data are limited. This calls for a hybrid approach that combines data-driven methods with expert knowledge.

Methodology

DeepSEED employs a two-stage approach. Stage I integrates expert knowledge by specifying 'seed' sequences (e.g., TFBSs) based on the desired promoter properties. Stage II optimizes the flanking sequences using two neural networks: a conditional generative adversarial network (CGAN) that generates flanking sequences conditioned on the 'seed' sequences, and a DenseNet-LSTM-based model that predicts promoter properties. The CGAN learns the implicit patterns of flanking sequences from large datasets of natural promoters. The DenseNet-LSTM architecture is designed to capture long-range interactions within promoter sequences and to predict promoter activity. A genetic algorithm (GA) guides the optimization, iteratively refining the flanking sequences to maximize the predicted promoter activity while keeping the 'seed' sequences unchanged. DeepSEED frames this design task probabilistically, as maximizing the joint probability of the promoter sequence and the target property.

Training datasets included those from Johns et al. (for *E. coli* promoters), HACER (for human enhancer sequences in HEK293 cells), and Ernst et al. (for human regulatory regions). The designed promoters were evaluated in *E. coli* (constitutive and IPTG-inducible) and in mammalian cells (Dox-inducible) using reporter gene assays (sfGFP and mRFP). Analyses included k-mer frequency comparisons, DNA shape feature analysis using t-SNE, and BLAST searches to assess sequence novelty. Promoter activities were compared statistically using t-tests.
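The seed-preserving optimization loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the random initialization stands in for the CGAN generator, `toy_predictor` (AT fraction) is a placeholder for the DenseNet-LSTM predictor, and all function names, the mutation rate, the population size, and the seed positions are hypothetical choices made for illustration. The only property the sketch is meant to show is that the GA changes flanking positions while the seed positions stay fixed.

```python
import random

BASES = "ACGT"

def make_mask(length, seeds):
    """Per-position list: the fixed seed base, or None for a free flanking position.

    `seeds` maps a start index to a motif that must stay unchanged (hypothetical layout).
    """
    mask = [None] * length
    for start, motif in seeds.items():
        for offset, base in enumerate(motif):
            mask[start + offset] = base
    return mask

def random_candidate(mask, rng):
    # Stand-in for the CGAN: fill flanking positions at random, keep seeds as given.
    return "".join(b if b is not None else rng.choice(BASES) for b in mask)

def mutate(seq, mask, rng, rate=0.05):
    # Point-mutate flanking positions only; seed positions are never touched.
    return "".join(
        c if mask[i] is not None or rng.random() >= rate else rng.choice(BASES)
        for i, c in enumerate(seq)
    )

def crossover(a, b, rng):
    # Single-point crossover; both parents carry the seeds at the same positions,
    # so the child does too.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def toy_predictor(seq):
    # Placeholder score (AT fraction). An assumption, NOT a model of promoter activity.
    return (seq.count("A") + seq.count("T")) / len(seq)

def optimize_flanks(mask, predictor, generations=50, pop_size=40, seed=0):
    rng = random.Random(seed)
    population = [random_candidate(mask, rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged (elitism), refill with mutated offspring.
        population.sort(key=predictor, reverse=True)
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), mask, rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=predictor)

# Illustrative seeds: the E. coli sigma-70 consensus -35 and -10 boxes, 17 bp apart.
mask = make_mask(40, {5: "TTGACA", 28: "TATAAT"})
best = optimize_flanks(mask, toy_predictor)
print(best)
```

Keeping the seeds in a mask, rather than re-inserting them after each generation, means every candidate is valid by construction, so the predictor never scores a sequence that violates the Stage I design.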

Key Findings

DeepSEED successfully designed high-activity promoters of several types: *E. coli* constitutive, *E. coli* IPTG-inducible, and mammalian Dox-inducible promoters.

For *E. coli* constitutive promoters, DeepSEED-designed sequences exhibited 1.42- to 33.43-fold increases in activity compared to control groups with randomized flanking sequences. The DeepSEED approach also significantly outperformed a previous whole-sequence generation method. K-mer frequency analysis revealed strong correlations between natural and DeepSEED-designed sequences, indicating that the model captured natural sequence patterns. DNA shape feature analysis showed that the shape features of DeepSEED-optimized promoters were closer to those of high-activity natural promoters than to those of randomly generated sequences. t-SNE embedding of DNA shape features showed that DeepSEED optimization moved promoters into high-activity regions of the functional sequence space.

For *E. coli* IPTG-inducible promoters, DeepSEED-designed promoters showed substantial improvements (8.96- to 47.96-fold) over control groups in which *lacO* sites were simply substituted into existing promoters. This highlights the importance of flanking sequence optimization in inducible promoter design.

In the mammalian Dox-inducible system, DeepSEED-designed promoters with three *tetO* sites showed up to 2.46-fold improvements in induced activity compared to the template. Combining flanking sequences from high-performing 3-*tetO* promoters to create 7-*tetO* promoters yielded further improvements (up to 1.23-fold). Notably, DeepSEED-designed promoters performed consistently in both HEK293 and HepG2 cell lines, even though the training data came only from HEK293 cells.
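As a rough illustration of the k-mer frequency comparison mentioned above, the sketch below computes normalized k-mer frequencies for two sets of sequences and correlates them. The choice of Pearson correlation, the value of k, the function names, and the example sequences are assumptions made for illustration; the summary does not state which correlation measure or k the authors used.

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_frequencies(seqs, k):
    """Return normalized frequencies of all 4**k DNA k-mers, in a fixed order."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values())  # must be > 0: needs at least one sequence of length >= k
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[m] / total for m in kmers]

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

# Hypothetical toy inputs; real use would pass natural and designed promoter sets.
natural = ["ACGTACGTAA", "TTGACATATAAT"]
designed = ["ACGTTCGTAA", "TTGACAAATAAT"]
r = pearson(kmer_frequencies(natural, 2), kmer_frequencies(designed, 2))
print(round(r, 3))
```

A correlation close to 1 would indicate that the designed set uses k-mers in roughly the same proportions as the natural set, which is the sense in which the analysis above is read as evidence that natural sequence patterns were captured.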

Discussion

The study demonstrates the significant impact of flanking sequences on promoter activity. DeepSEED's ability to design promoters of several types highlights the power of integrating expert knowledge with deep learning. The correlations between natural and DeepSEED-designed sequences in k-mer frequencies and DNA shape features suggest that the model captures biologically relevant features. The substantial increases in promoter activity observed in both prokaryotic and eukaryotic systems support considering flanking sequences rather than motif arrangement alone. This study used plasmid systems; future work should explore how these AI-designed promoters perform in genomic contexts, where chromatin structure and other epigenetic factors could influence their activity.

Conclusion

DeepSEED provides an efficient and effective method for synthetic promoter design. Its success in improving various promoter types underscores the critical role of flanking sequence optimization. While this study primarily focused on expression level optimization, the framework can be adapted to optimize other properties by training on relevant datasets. Further research could explore the application of DeepSEED to other genetic elements and organisms, and investigate the biological mechanisms underlying the influence of flanking sequences on promoter activity.

Limitations

The study primarily focused on optimizing promoter expression levels. While some inducible promoters exhibited strong induction, others showed elevated basal (leaky) expression. Future work should generate larger datasets to address this and should incorporate other important aspects of promoter design, such as sequence stability and cell-type specificity. Additionally, the experimental validation was performed in plasmid systems; further investigation is needed to assess how these promoters perform in the genomic environment, where factors such as chromatin accessibility and epigenetic modifications come into play.
