Engineering and Technology
Promiscuous molecules for smarter file operations in DNA-based data storage
K. J. Tomek, K. Volkel, et al.
The study addresses how to organize, access, and search data in DNA-based storage systems where information is stored as many distinct, freely diffusing DNA molecules. Conventional architectures avoid nonspecific primer–address interactions by enforcing a large Hamming distance (HD) between address sequences, which shrinks the address space and limits metadata and functionality. The authors hypothesize that thermodynamically driven, imperfect DNA hybridizations, typically treated as hindrances, can be predictably tuned via environmental parameters (temperature, primer concentration, ionic conditions) so that the same primer differentially accesses subsets of strands. The goal is to exploit this tunable promiscuity to expand the usable address space and implement in-storage functions, demonstrated here as a File Preview capability that retrieves low-resolution versions of files before full access, potentially reducing sequencing cost and latency in practical systems.
Prior DNA data storage systems demonstrated feasibility up to gigabyte scales using PCR-based random access and hybridization-based separations, typically requiring 20-nt addresses designed to be at least 6–10+ HD apart to avoid cross-hybridization. This constraint drastically reduces the address space (from a theoretical 4^20 sequences for 20-nt addresses to only thousands of non-interacting addresses) and limits system capacity and metadata-rich search. Thermodynamic models and empirical work have shown that hybridization affinity and kinetics depend on sequence mismatches, temperature, concentration, and ionic conditions. Recent efforts leveraged nonspecific interactions for in-storage search, but broad, controllable exploitation for file operations remained underexplored. The present work builds on thermodynamic principles and PCR optimization literature to intentionally use controlled nonspecific binding to enable multi-level access within files, akin to progressive access in formats like progressive JPEG, but implemented molecularly.
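The effect of a minimum-HD rule on address space can be illustrated with a minimal Python sketch. It is not the address design procedure used in any of the cited systems: the greedy filter, the random candidate pool, and the function names are assumptions made for illustration, and real address design also screens for GC content, secondary structure, and melting temperature.

```python
import random

BASES = "ACGT"


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def greedy_address_set(candidates, min_hd):
    """Keep a candidate only if it is at least min_hd away from every kept address.

    Hypothetical helper: a simple greedy filter, not an optimal code construction.
    """
    kept = []
    for seq in candidates:
        if all(hamming(seq, other) >= min_hd for other in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    # Theoretical number of distinct 20-nt sequences.
    print("theoretical 20-nt address space:", 4 ** 20)

    rng = random.Random(0)  # fixed seed so the run is repeatable
    pool = ["".join(rng.choice(BASES) for _ in range(20)) for _ in range(2000)]
    for min_hd in (6, 10):
        kept = greedy_address_set(pool, min_hd)
        print(f"min HD {min_hd}: kept {len(kept)} of {len(pool)} random candidates")
```

Raising `min_hd` makes the filter reject more candidates, which is the address-space cost the authors aim to relax by tolerating controlled near-matches.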
The authors combine computational modeling, experimental screening, and end-to-end file encoding/decoding:
- Thermodynamic/Monte Carlo hybridization modeling: Using NUPACK, they simulated primer hybridization likelihood versus HD by mutating a storage primer to generate variants with HD 1–N and computing ΔG for binding to reverse complements. Binding was called when ΔG < −10 kcal/mol. At least 10,000 trials per HD were run to estimate hybridization probability.
- qPCR verification across HD: Using a single primer pair (0 HD) and 10 reverse binding sequences at each even-numbered HD (0 to 20), they quantified successful amplifications across primer concentrations (125–500 nM), template concentrations (2×10^3–2×10^6 strands/μL), and annealing temperatures (40–60 °C), using Cq thresholds and melt curves to define success.
- Competitive PCR assays: They set up reactions with two templates accessed by the same primer pair: a 200 bp template with perfect primer sites (0 HD) and a 60 bp template with mismatched sites (typically 2 HD). They varied annealing temperature (40–60 °C) and primer concentration (125–500 nM) to observe selective amplification under stringent (higher temperature/lower primer) versus promiscuous (lower temperature/higher primer) conditions via gel electrophoresis and high-sensitivity fragment analysis. Libraries combining six distinct 2 HD forward and five distinct 2 HD reverse binding sites (paired combinatorially) were screened to quantify amplification tunability (ratio changes of mismatch to perfect products between conditions).
- Parameter sweeps for specificity/promiscuity: Using strands with 0, 2, 4, 5 (accidental), and 6 HD binding sites tagged with unique restriction sites, they quantified the fraction of each HD amplified under varying conditions. Parameters included annealing temperature (40–60 °C), primer concentration (62.5–1000 nM), MgCl2 (0.75–3 mM), KCl (50–200 mM; amplification inhibited beyond ~150 mM in their setup), detergents (0.1–1% Triton X-100, Tween-20), BSA (0.1–1%), DMSO (2–8%), betaine (0.1–3.5 mM), cycle number (25–40), and anneal/extension times (20–90 s). Capillary electrophoresis quantified contributions of each HD class.
- File encoding with progressive partitions and HD-tagged addresses: JPEG images (Wuflab logo; two Wright glider photos; Earth) were arranged into progressive scans (42 scans; e.g., Y[0], Cb[0], Cr[0], then grouped AC bands). Scans were grouped into 3–4 partitions per file so that early partitions yield recognizable low-resolution images (grayscale/DC), with later partitions adding color and higher frequencies. Each partition was encoded as DNA blocks: 1665-byte blocks arranged as 185×9 bytes; a Reed-Solomon (RS) outer code (n=255, k=185, d=71) added redundancy; each row was indexed (2 bytes) and protected by an RS inner code (n=14, k=11, d=4). Bytes were mapped to 8-nt comma-free, GC-balanced codewords with edit distance ≥2. Primer binding sites for each partition were assigned to different HD classes (e.g., 0 HD for Preview, 4 HD for Intermediate, 6 HD for Full). A central restriction site facilitated partition quantification. Flanking common primers enabled library-wide amplification. To balance PCR efficiency, mismatched partitions were stored with higher physical copy numbers (e.g., 10× for lower HD mismatches, 100× for 4 HD, 1000× for 6 HD in screening; later analyses discuss 1:100 ratios for 4 HD vs 0 HD).
- Background generation and scaling: An unrelated text file (U.S. Declaration of Independence) was amplified using error-prone PCR (with dPTP and 8-oxo-dGTP) to generate a noisy, nonspecific background equivalent to ~1.5 GB of data (assuming 200-nt strands with 17 bytes data per strand and 10× physical redundancy). Image file accesses were performed in the presence of this background.
- Sequencing and decoding: Products from selected conditions were subjected to Illumina NGS. Reads were clustered with Starcode (message passing; Levenshtein distance 8; cluster ratio 5) seeded with expected strands to quantify per-strand read counts. Decoding reconstructed JPEGs by partition indices; incomplete later partitions yielded lower-resolution images while preserving readability of earlier partitions.
- Economic and density analysis: They modeled storage density overhead as a function of Preview percentage and copy number ratios, and sequencing cost savings for finding one target among multiple similar files using Preview versus full-file sequencing.
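The block-encoding parameters quoted in the list above can be cross-checked with a short arithmetic sketch in Python. It only recomputes quantities from the stated numbers. Treating each 14-byte inner codeword as the payload of one strand, and applying the outer code across rows, is an assumption about the layout; the class and property names are hypothetical.

```python
from dataclasses import dataclass


def rs_min_distance(n: int, k: int) -> int:
    """Minimum distance of a Reed-Solomon code (it is MDS): d = n - k + 1."""
    return n - k + 1


@dataclass(frozen=True)
class BlockLayout:
    data_rows: int = 185    # k of the outer code
    row_bytes: int = 9      # data bytes per row
    outer_n: int = 255      # rows after outer-code redundancy
    index_bytes: int = 2    # per-row index
    inner_n: int = 14       # bytes per row after inner-code redundancy
    nt_per_byte: int = 8    # 8-nt codeword per byte

    @property
    def data_bytes(self) -> int:
        return self.data_rows * self.row_bytes

    @property
    def inner_k(self) -> int:
        # Inner code protects the row data plus its index.
        return self.row_bytes + self.index_bytes

    @property
    def payload_nt(self) -> int:
        # Assumption: one inner codeword is the payload of one strand.
        return self.inner_n * self.nt_per_byte

    @property
    def code_rate(self) -> float:
        # Data bytes divided by all bytes written for the block.
        return self.data_bytes / (self.outer_n * self.inner_n)


if __name__ == "__main__":
    layout = BlockLayout()
    print("data bytes per block:", layout.data_bytes)
    print("outer d:", rs_min_distance(layout.outer_n, layout.data_rows))
    print("inner k, d:", layout.inner_k, rs_min_distance(layout.inner_n, layout.inner_k))
    print("payload nt per row:", layout.payload_nt)
    print(f"overall code rate: {layout.code_rate:.3f}")
```

The recomputed values agree with the stated ones: 185×9 gives 1665 data bytes, and n − k + 1 gives d=71 for the outer code and d=4 for the inner code.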
The key findings are:
- PCR stringency is tunable: Modeling and experiments show hybridization/amplification likelihood decreases with increasing HD, plateauing around HD ≈10. Lower annealing temperatures and higher primer concentrations increase nonspecific amplification; higher temperatures and lower primer levels increase specificity.
- Competitive PCR demonstrates controllable selectivity: With one primer pair, stringent conditions amplify only 0 HD strands, whereas promiscuous conditions co-amplify 0 HD and 2 HD strands. A library of 2 HD variants shows sequence-dependent tunability across both temperature and primer concentration changes.
- Multi-level File Preview implemented: Files were partitioned and tagged so that 0 HD yields a low-resolution Preview, adding 4 HD yields an intermediate image, and adding 6 HD yields full access. An unintended 5 HD site in payload prevented 6 HD access in one experiment but enabled a sharp transition between 4 and 5 HD, confirming fine-grained control.
- Parameter influence on access layers: Annealing temperature had the strongest effect (40–45 °C increased mismatched fractions; 55–60 °C increased specificity). Primer and MgCl2 concentrations inversely correlated with specificity; KCl increased specificity up to ~150 mM (beyond which PCR was inhibited). Distinct condition sets could selectively access 0; 0–2; 0–2–4; and 0–2–4–5 HD subsets, yielding corresponding image qualities.
- Robustness in large, noisy backgrounds: In a 1.5 GB-equivalent nonspecific background, Preview strands constituted ~0.036% of total strands. Stringent conditions reliably amplified and decoded Preview images for all four files; promiscuous conditions retrieved full images without background contamination.
- Cost and density tradeoffs quantified: With ~5% of strands designated as Preview, compensating for PCR efficiency required storing ~100× more copies per unique 4 HD strand than 0 HD (1:100), leading to ~95× physical space versus theoretical minimum; reducing the copy ratio to 1:10 or 1:5 cuts overhead to ~9.5× or ~4.8×, respectively. For a 15-file search where only previews are initially sequenced, a 5% Preview reduces the cost to find and fully sequence a target file by 85.3% versus full-file sequencing; using 1% Preview yields a 91.7% cost reduction. Cost advantage increases as Preview fraction decreases and as the number of files searched increases.
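The density figures in the last finding can be approximated with a simple weighted-average model, sketched below in Python. The formula is an assumption rather than the authors' stated model: it takes one copy per Preview (0 HD) strand, `copy_ratio` copies per remaining strand, and measures overhead against one copy of every strand. The function name is hypothetical.

```python
def physical_overhead(preview_fraction: float, copy_ratio: float) -> float:
    """Physical copies stored per unique strand, relative to one copy of each.

    Assumed model: Preview strands are stored once, all other strands are
    stored copy_ratio times to compensate for lower PCR efficiency.
    """
    if not 0.0 <= preview_fraction <= 1.0:
        raise ValueError("preview_fraction must be between 0 and 1")
    if copy_ratio < 1:
        raise ValueError("copy_ratio must be at least 1")
    return preview_fraction * 1.0 + (1.0 - preview_fraction) * copy_ratio


if __name__ == "__main__":
    for ratio in (100, 10, 5):
        overhead = physical_overhead(0.05, ratio)
        print(f"5% Preview, 1:{ratio} copy ratio -> {overhead:.2f}x")
```

Under this assumed model, a 5% Preview gives about 95.05×, 9.55×, and 4.80× for copy ratios of 1:100, 1:10, and 1:5, close to the ~95×, ~9.5×, and ~4.8× reported. The sketch does not attempt to reproduce the sequencing cost percentages, whose underlying model is not given here.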
The work demonstrates that thermodynamically controlled, imperfect DNA hybridizations can be leveraged as a feature rather than a drawback in DNA storage. By tuning annealing temperature, primer concentration, and ionic conditions, the same primer can differentially access subsets of strands carrying progressively more data, enabling a practical File Preview operation. This addresses the challenge of limited address space and improves data access workflows by reducing sequencing burden when searching or triaging files. The approach also suggests broader system-level functionalities: prioritizing frequently accessed data, adjustable search stringency, and deduplication by sharing common data across files via promiscuous binding. While unintended 5 HD matches in payload disrupted 6 HD access in one case, the result also validated sharp thresholding between HD classes under tuned conditions. Overall, the findings show that previously disallowed address similarities can be harnessed to expand usable address space and implement in-storage operations without sacrificing file-level random access, provided conditions are carefully optimized.
This study introduces controllable primer promiscuity as a mechanism to implement multi-level file operations in DNA storage, exemplified by a File Preview that retrieves low-resolution images using the same primer pair under stringent conditions and full images under more permissive conditions. The approach works across multiple images and in a realistic, gigabyte-scale noisy background. Quantitative analyses show substantial sequencing cost savings for file search and triage, with tunable tradeoffs in physical storage density. Future work should: (i) refine encoding to avoid unintended near-match sites in payloads; (ii) systematically characterize off-target risks under promiscuous conditions at larger scales; (iii) co-optimize polymerases, primers, and access conditions to minimize copy number imbalances; and (iv) extend the paradigm to metadata encoding, adjustable search stringency, tiered storage access, and deduplication strategies.
The study has several limitations:
- Unintended 5 HD payload binding sites (from adjacent codewords) interfered with intended 6 HD access in one design, indicating the need for stringent payload screening and QC in encoding.
- More promiscuous PCR conditions may increase off-target interactions in data payload regions, particularly in extreme-scale systems; comprehensive characterization is needed.
- To balance amplification efficiencies, higher physical copy numbers of mismatched (higher HD) partitions were required, reducing storage density; dominance of 0 HD reads was observed without copy number rebalancing.
- Access specificity depends sensitively on environmental parameters (temperature, salts, additives), requiring careful optimization and potentially limiting robustness across platforms or polymerases.
- The demonstrated tunability is shown primarily for HD differences up to 6, with practical transitions observed between nearby HD classes; generalization to broader address spaces and diverse sequence contexts warrants further study.
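The payload screening called for in the first limitation could take the form of a sliding-window scan, sketched here in Python. This is an illustration under stated assumptions, not the authors' QC procedure: it flags any payload window whose Hamming distance to a primer binding site, or to its reverse complement, is at or below a chosen threshold. The function names and the example sequences are hypothetical, and a real screen would also need to weigh thermodynamics (e.g., mismatch position and ΔG) rather than HD alone.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def near_match_sites(payload: str, site: str, max_hd: int):
    """Return (start, strand, hd) for every payload window within max_hd of the site.

    strand is "+" for a match to the site itself and "-" for a match to its
    reverse complement.
    """
    targets = {"+": site, "-": reverse_complement(site)}
    width = len(site)
    hits = []
    for start in range(len(payload) - width + 1):
        window = payload[start:start + width]
        for strand, target in targets.items():
            hd = hamming(window, target)
            if hd <= max_hd:
                hits.append((start, strand, hd))
    return hits


if __name__ == "__main__":
    site = "GATTACAGATTACAGATTAC"          # hypothetical 20-nt binding site
    near = "CATTACAGATTACAGATTAG"          # same site with 2 substitutions
    payload = "TTTT" + near + "CCCC"       # hypothetical payload fragment
    for start, strand, hd in near_match_sites(payload, site, max_hd=5):
        print(f"window at {start} on strand {strand}: HD {hd}")
```

Because near-match sites can arise where adjacent codewords meet, a scan like this would need to run on the fully assembled strand rather than on individual codewords.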