Biology
Deep flanking sequence engineering for efficient promoter design using DeepSEED
P. Zhang, H. Wang, et al.
Promoters are core genetic elements that regulate gene expression. Designing synthetic promoters with strong or inducible properties is critical for applications in biosynthetic engineering and gene therapy. Traditional design has focused on cis-regulatory elements such as transcription factor binding sites (TFBSs), for example the -10/-35 elements in prokaryotes and the TATA-box in eukaryotes, often represented by position weight matrices (PWMs) and assembled by expert rules. However, accumulating evidence shows that flanking sequences around TFBSs significantly influence promoter properties via effects on DNA physicochemical shape, specific flanking preferences, and low-affinity binding sites that enhance TF binding. These features are difficult to encode into explicit rules, leaving flanking sequence optimization largely unexplored. While deep learning has generated diverse constitutive promoters by learning from large natural datasets, these data-driven models struggle to design promoters with specific properties (e.g., inducible or tissue-specific) due to the scarcity of suitable training examples. Consequently, most practical synthetic promoters rely on expert-designed motif arrangements, with flanking regions chosen arbitrarily, often yielding suboptimal outcomes. This work addresses the gap by integrating expert-specified motif 'seeds' with deep learning-based flanking sequence optimization to design promoters with desired properties.
Prior work established the central role of TFBS motifs (-10/-35, TATA-box) and their combinatorial arrangements in promoter design, summarized by PWMs. Recent studies have highlighted the importance of motif-flanking sequences, which affect DNA shape (minor groove width, MGW; Roll; propeller twist, ProT; helix twist, HelT), TF-specific flanking preferences, and the contribution of low-affinity binding site clusters to TF binding and regulatory robustness. Deep learning approaches have advanced promoter engineering in both prokaryotes and eukaryotes, generating de novo constitutive promoters by learning shared patterns from large datasets. However, these models face limitations for inducible or tissue-specific promoters due to limited training examples, and expert designs still dominate such tasks. Previous optimization often lacked systematic strategies for flanking sequences, leading to high variability and reliance on trial and error. This study unifies expert knowledge with data-driven modeling to explicitly target flanking sequence optimization.
DeepSEED formalizes promoter design as maximizing the joint probability P(s, T), where s is the sequence and T the target property. The sequence s is decomposed into 'seed' motifs m (set by expert knowledge) and flanking regions f. By the chain rule and Bayes' rule, the optimization is split into two stages: (Stage I) maximize P(m|T) by selecting seed motifs m* (e.g., -10/-35, tetO, lacO, miniCMV) consistent with T; (Stage II) maximize P(f|m*, T) ∝ P(f|m*)·P(T|f, m*), where the omitted normalizer P(T|m*) does not depend on f. Stage II is implemented by: (1) a conditional GAN (CGAN) generator with attention layers to model P(f|m*), generating flanking sequences conditioned on m*; (2) a DenseNet-LSTM predictor to estimate P(T|f, m*), i.e., promoter properties; and (3) a genetic algorithm (GA) optimizing latent variables z fed to the generator to maximize the predictor output F(G(m*, z), m*).
Model architectures and training: The CGAN uses multi-head attention to capture long-range dependencies, with residual blocks and a WGAN-GP objective for stable training. An L1 reconstruction loss encourages recovery of masked flanking regions during training. The predictor begins with 1D convolutions (64 channels) to capture local patterns, followed by an LSTM for regional relationships and DenseNet blocks (4 blocks with 2, 2, 4, 2 layers; growth rate 32; kernels 1 and 3) to extract long-range features; a final fully connected layer outputs property scores. For E. coli, the training set comprises 165 bp functional promoters (from Johns et al.) filtered by positional constraints on the -10/-35 elements and spacer lengths. Generator input concatenates one-hot motif seeds with random vectors in flanking regions; discriminator input concatenates seeds and generated sequences. Training used Adam (lr 1e-4, beta1 0.5, beta2 0.9), batch size 32, 50,000 batches.
Optimization: The GA (sko Python package) optimizes z with 5×1024 seeds, mutation probability 0.005, for 100 epochs in vectorized mode.
Datasets: Prokaryotic tasks used Johns et al. (29,249 regulatory sequences from 184 prokaryotic genomes) to train both the generator and the predictor. Mammalian Dox-inducible tasks used HACER HEK293 enhancers (26,604 sequences) to train the generator and Ernst et al. MPRA tiles (15,720 regions) to train the predictor. For mammalian training, known motifs (JASPAR) in input sequences were annotated and preserved, while flanking regions were masked so that the generator learned to reconstruct them.
Design tasks and experimental validation: (1) E. coli constitutive promoters: fixed -10/-35 motifs served as seeds, and DeepSEED generated flanking sequences. Initial seeds were three iGEM promoters (BBa_J23119, J23118, J23114). Controls: Control-1 was extended to 165 bp with random flanks; Control-2 preserved seeds but randomized other regions. Activities were measured via sfGFP fluorescence (and mRFP for a subset) in LB, M9, and EZ-rich media. BLAST against the E. coli K12 genome assessed similarity; motif scans (FIMO) checked for unintended second promoters. (2) E. coli IPTG-inducible promoters: seeds included -10/-35 elements, spacer length, and 2/3/4 lacO sites with appropriate spacing for loop formation. Backbones were 25 randomly chosen constitutive promoters from the dataset. A substitution control directly replaced backbone regions with lacO sites; DeepSEED optimized flanks around these seeds. Induced activities and fold-changes were measured (0.1 mM IPTG). Performance was compared to pLlacO1 and placUV5. (3) Mammalian Dox-inducible promoters: Tet-On TRE system with miniCMV and tetO sites as seeds. The generator was trained on HEK293 HACER enhancers and the predictor on Ernst et al. tiles. To avoid increased basal expression, sequences with additional TF binding motifs were filtered out. First, 3-tetO truncated TRE promoters were optimized and tested; then top-performing 3-tetO flanks were combined to assemble 7-tetO promoters. Induced activity (EYFP) and fold change were measured by flow cytometry in HEK293; selected promoters were also tested in HepG2.
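The Stage II search (a GA over latent variables z, scored by the predictor on the generator's output) can be illustrated with a minimal, standard-library sketch. Everything here is a stand-in and an assumption: `toy_generator` and `toy_predictor` are hypothetical placeholders for the trained CGAN and DenseNet-LSTM, the 37 bp `TEMPLATE` is invented for illustration, and the hand-rolled GA is not the sko implementation used in the study. The sketch only shows the shape of the loop: seeds stay fixed, z changes, and fitness is F(G(m*, z), m*).

```python
import random

BASES = "ACGT"
# Hypothetical template: fixed seed motifs (-35 TTGACA, -10 TATAAT); 'N' marks flank positions.
TEMPLATE = "NNNNTTGACANNNNNNNNNNNNNNNNNTATAATNNNN"


def toy_generator(z):
    """Stand-in for G(m*, z): map each latent value in [0, 1) to a flank base, keep seeds fixed."""
    latent = iter(z)
    return "".join(
        BASES[min(int(next(latent) * 4), 3)] if c == "N" else c for c in TEMPLATE
    )


def toy_predictor(seq):
    """Stand-in for F(s, m*): an arbitrary score (A/T fraction of the flanks), not a real model."""
    flank = [b for b, c in zip(seq, TEMPLATE) if c == "N"]
    return sum(b in "AT" for b in flank) / len(flank)


def optimize_latent(pop_size=64, generations=30, p_mut=0.05, rng_seed=0):
    """Simple elitist GA over latent vectors z; returns best sequence, score, and score history."""
    rng = random.Random(rng_seed)
    n_latent = TEMPLATE.count("N")
    pop = [[rng.random() for _ in range(n_latent)] for _ in range(pop_size)]

    def fitness(z):
        return toy_predictor(toy_generator(z))

    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        elite = pop[: pop_size // 4]  # survivors carried over unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            # Uniform crossover followed by per-gene mutation.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            child = [rng.random() if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = elite + children
    best = max(pop, key=fitness)
    return toy_generator(best), fitness(best), history


if __name__ == "__main__":
    seq, score, _ = optimize_latent()
    print(seq, round(score, 3))
```

Because the elite is carried over unchanged, the best score never decreases between generations. In a real setting the fitness call would be the expensive step, which is presumably why the study runs the GA in vectorized mode.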
Analyses: Saliency maps from the predictor highlighted influential positions, which were clustered by k-means. k-mer (k=4–6) frequencies were compared between natural and DeepSEED-designed sequences, both overall and in distal/proximal regions. DNA shape (MGW, Roll, ProT, HelT) profiles were computed near -10 motifs and across entire promoters; the features were embedded with DeepInfoMax and visualized via t-SNE to relate shape features to activity. Semantic sequence space embeddings (DeepInfoMax + t-SNE) were used to compare natural and DeepSEED sequences across species. Sequence diversity was assessed via edit distances; BLAST e-values quantified genomic similarity.
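Two of the sequence-level analyses above, k-mer frequency correlation and edit distance, are simple enough to sketch directly. The following is an illustrative standard-library version, not the authors' code; the function names are hypothetical, and the sketch assumes sequences contain only A, C, G, and T.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_freqs(seqs, k):
    """Relative frequency of every possible k-mer (fixed ACGT order) pooled over seqs."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def edit_distance(a, b):
    """Levenshtein distance via a rolling one-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]
```

A comparison of two sequence sets would then be `pearson(kmer_freqs(natural, 4), kmer_freqs(designed, 4))`, where `natural` and `designed` are lists of sequences supplied by the reader.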
- Importance of flanking sequences: Predictor saliency maps of 2,000 functional E. coli promoters showed strong influence of -10/-35 elements and distinct, cluster-specific influential patterns in flanking regions, underscoring their contribution to activity.
- Capturing implicit patterns: DeepSEED-designed E. coli promoters recapitulated natural k-mer frequencies (k=4–6) with high correlations across entire promoters (Pearson r up to 0.98) and in distal/proximal regions (r=0.85 and 0.73). For Dox-inducible designs preserving tetO, k-mer frequencies correlated strongly with natural sequences (r≈0.95). DNA shape features around the 5′ ends of -10 motifs (TATAAT, TATAAA) in DeepSEED sequences resembled natural promoters and differed from random. t-SNE embeddings of DNA shape features separated high vs low activity promoters; DeepSEED optimization shifted sequences toward high-activity regions.
- Novelty and diversity: DeepSEED-generated promoters showed edit-distance differences comparable to random-flanked controls and lower BLAST similarity to the natural E. coli genome than prior designs; semantic embeddings colocated DeepSEED and natural E. coli promoters, indicating learning of species-specific promoter semantics.
- E. coli constitutive promoters: Relative to Control-2 (randomized flanks with seeds fixed), DeepSEED increased activity by 1.42× (J23119 group; p=1.91E-03), 4.11× (J23118; p=1.18E-09), and 33.43× (J23114; p=8.22E-09). Compared to a prior whole-sequence generation method, DeepSEED achieved a 6.73× average activity increase (p=1.75E-24). Activities measured with sfGFP correlated with mRFP (Pearson r=0.83). Designed promoters maintained high activity across LB, M9, and EZ-rich media and did not introduce second high-activity promoters in flanks (FIMO scans).
- E. coli IPTG-inducible promoters: Direct substitution of lacO reduced induced activity by 52.8% (2 lacO), 80.8% (3 lacO), and 97.1% (4 lacO). DeepSEED restored and enhanced induced expression by 8.96× (2 lacO; p=2.15E-23), 6.54× (3 lacO; p=5.54E-08), and 47.96× (4 lacO; p=2.65E-06) over substitution controls. Fold-change improved by 2.20× (3 lacO; p=5.00E-04) and 2.94× (4 lacO; p=9.32E-05) on average; 2 lacO designs had higher basal expression and lower fold-change on average. Many DeepSEED designs outperformed pLlacO1 and showed tunable trade-offs between induced level and fold-change by varying lacO count. t-SNE of predictor features placed DeepSEED designs in high-activity regions.
- Mammalian Dox-inducible promoters: Among twelve 3-tetO designs, 75% exceeded the 3-tetO template in induced activity (up to 2.46×) and 50% improved fold-change (up to 1.41×); some 3-tetO designs approached the induced activity of the canonical 7-tetO TRE at 54.4% of its length. Combining optimized 3-tetO flanks into 7-tetO promoters yielded 77.8% with higher induced activity (average 1.13×, max 1.23×) and 83.3% with higher fold-change (up to 1.61×); 72.2% improved both metrics. Many designs generalized from HEK293 to HepG2 with consistent performance.
Overall, DeepSEED efficiently optimized flanking sequences to achieve desired promoter properties across prokaryotic constitutive, IPTG-inducible, and mammalian Dox-inducible contexts, while producing diverse, novel sequences that preserve key statistical and biophysical features.
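The results above are reported in three kinds of ratio: fold-change (induced over basal expression), relative activity (a design over its control), and percent reduction. A short sketch of how these quantities relate is given below. The helper names are hypothetical, and any fluorescence inputs would be the reader's own measurements; only the percentages quoted in the comment come from the text.

```python
def fold_change(induced, basal):
    """Induction fold-change: induced expression divided by basal (uninduced) expression."""
    if basal <= 0:
        raise ValueError("basal expression must be positive")
    return induced / basal


def relative_activity(design, control):
    """Activity of a design relative to its control, e.g. 4.11 means 4.11x the control."""
    if control <= 0:
        raise ValueError("control activity must be positive")
    return design / control


def retained_fraction(percent_reduction):
    """Fraction of activity left after a stated percent reduction."""
    return 1.0 - percent_reduction / 100.0


# Reported reductions for direct lacO substitution: 52.8%, 80.8%, 97.1%.
# These correspond to retaining roughly 0.472, 0.192, and 0.029 of backbone activity.
retained = [retained_fraction(p) for p in (52.8, 80.8, 97.1)]
```

Note that a design can raise induced activity while lowering fold-change if basal expression rises faster, which is the trade-off described for the 2 lacO designs.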
The study addresses the challenge of optimizing promoter flanking sequences, which are often neglected in traditional motif-centric designs, by integrating expert-defined seed motifs with deep learning to learn implicit sequence preferences. The results demonstrate that flanking regions substantially affect promoter activity, and DeepSEED can capture relevant sequence statistics (k-mers) and DNA shape features linked to function. By unifying constitutive and inducible promoter design under a probabilistic, conditional optimization framework, DeepSEED overcomes limitations of purely data-driven generative models that require many examples of the target class. Across E. coli constitutive and IPTG-inducible tasks and mammalian Dox-inducible promoters, DeepSEED consistently improved constitutive or induced activity and/or fold-change relative to controls, with robust performance across media and cell types. Embedding analyses showed that optimized designs occupy high-activity regions in learned functional spaces, suggesting compatibility between generated flanks and seed motifs. Sequence similarity analyses indicate designs are novel rather than copied. While interpretability remains limited, observed alignment of DNA shape features with activity provides mechanistic plausibility. These findings validate the knowledge-data co-driven approach for promoter engineering and suggest broader applicability to other regulatory elements and properties.
DeepSEED introduces a two-stage AI-aided framework that combines expert-specified seed motifs with deep learning-based flanking sequence generation and prediction to efficiently design synthetic promoters with desired properties. It learns implicit patterns in flanking regions, preserves key sequence and DNA shape features, and generates diverse, novel promoters with improved activities and inducibility across prokaryotic and eukaryotic systems. The framework generalizes across different promoter types by encoding desired functions in seeds while optimizing flanks globally. Future work should expand training datasets for additional objectives (e.g., minimizing basal/leaky expression, enhancing sequence stability, achieving cell type specificity), improve interpretability to derive explicit flanking rules, and validate performance in genomic integration contexts. The same strategy could extend to designing other regulatory DNA elements across organisms.
Experimental validation was performed in plasmid contexts; genomic integration effects (chromatin accessibility, nucleosome positioning, epigenetics) may alter behavior and require further validation. The current model primarily optimizes expression levels due to limited high-throughput datasets for other properties; some inducible designs exhibited elevated basal expression. Interpretability of the learned features is partial (k-mers, DNA shape) and underlying mechanisms remain to be fully elucidated. Task-specific datasets (e.g., for leakiness, stability, tissue specificity) are needed to train models targeting additional constraints.