Rewritable two-dimensional DNA-based data storage with machine learning reconstruction

C. Pan, S. K. Tabatabaei, et al.

Discover the groundbreaking work of Chao Pan and colleagues as they unveil 2DDNA, an innovative two-dimensional molecular data storage system that encodes images and metadata within the same DNA. The research harnesses machine learning to reconstruct stored images without conventional error-correction redundancy, demonstrating DNA's potential as a rewritable memory with minimal image degradation.

Introduction
DNA-based storage offers extreme density, durability, and nonvolatility, but it faces practical challenges: high synthesis cost, the lack of simple rewriting mechanisms, long write–read latencies, and missing-oligo errors from solid-phase synthesis. Image data are typically compressed before storage, and even a single bit error can propagate catastrophically during decompression, making robust error correction necessary. However, synthesis and sequencing error rates vary widely across platforms and can increase with PCR and rewriting, and powerful channel codes such as low-density parity-check (LDPC) codes require accurate channel error estimates to avoid the mismatched-decoder problem. This variability complicates robust design without resorting to worst-case redundancy.

The authors introduce 2DDNA, a hybrid two-dimensional storage system that records high-volume image content in the DNA sequence and lower-volume metadata via backbone nicks. Sequence-based information is dense but not easily rewritable; nick-based metadata are easily erased and rewritten (preserving privacy) and can be read simultaneously with the sequence content. The approach combines simple per-channel image compression with machine learning and computer vision (ML/CV) reconstruction to avoid global error-correction redundancy. For images with highly granular details, optional unequal error protection using LDPC codes targets only sensitive facial features. The system aims to eliminate worst-case redundancy, avoid parameter-mismatch issues, support multiresolution-like quality that adapts to channel error rates, and enable efficient metadata rewriting.
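To see why mismatched channel estimates matter, consider a toy calculation (ours, not the paper's): a belief-propagation LDPC decoder turns each received bit into a log-likelihood ratio (LLR) computed from an assumed error rate, so an optimistic estimate inflates the decoder's confidence in every corrupted bit.

```python
import math

def bsc_llr(p):
    """LLR magnitude for a binary symmetric channel with crossover probability p."""
    return math.log((1 - p) / p)

p_true, p_assumed = 0.019, 0.005   # ~1.9% actual bit errors vs. an optimistic estimate
print(f"matched LLR:    {bsc_llr(p_true):.2f}")     # ~3.94, appropriately cautious
print(f"mismatched LLR: {bsc_llr(p_assumed):.2f}")  # ~5.29, overconfident decoder input
```

Systematically inflated LLRs like these can push belief propagation toward confident but wrong codewords, the failure mode that the JPEG+LDPC simulations in the Methodology make concrete.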
Literature Review
Prior DNA storage work demonstrated high-density information encoding in synthetic DNA and explored error-correcting codes, silica encapsulation, and random-access architectures. JPEG-compressed images are highly sensitive to errors, often requiring significant redundancy to prevent catastrophic decompression failures. Joint source–channel coding approaches (including LDPC-based schemes) mitigate errors but still require redundancy and rely on accurate channel parameter estimates; LDPC decoders are sensitive to mismatched log-likelihood ratios. Earlier rewritable systems used overlap-extension PCR or hybridization/strand displacement, both more complex than ligation-based rewriting. Prior nick-based storage (DNA Punch Cards) recorded information via enzymatic nicking on native DNA. The present work generalizes these ideas to two dimensions (sequence and topology), replacing worst-case error-correcting code (ECC) redundancy with ML-based image post-processing and introducing nick-based metadata superimposed on sequence content for efficient, permanent erasure and rewriting.
Methodology
Sequence dimension encoding: Images are split into R, G, and B channels and aggressively quantized from 256 to 8 intensity levels (3 bits per pixel). To preserve spatial locality, each channel is traversed along a Hilbert space-filling curve to produce a 1D sequence; differential encoding then replaces each symbol with its difference from the previous one, biasing the distribution toward small values, and Huffman coding exploits the resulting symbol frequencies. These operations are performed separately for each of the eight intensity levels. Binary outputs are mapped to 196-nt DNA oligos, subdivided as a 20-nt prefix primer, a 10-nt address (preceded by a 3-nt RGB indicator), eleven 13-nt information-bearing segments, and a 20-nt suffix primer. Constrained mappings enforce GC content between 40% and 60% and limit G-runs to at most three; the mapping converts 16 address bits to 10 nt and 22 information bits to 13 nt, with synchronizing markers and addressing enabling PCR-based random access. An optional variant uses basic 3-bit quantization without lossless compression, at the cost of larger file sizes.
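To make the serialization stage concrete, here is a minimal sketch of the quantize → Hilbert-scan → differential-encode steps described above (the Huffman stage is omitted). The d2xy routine is the standard Hilbert index-to-coordinate conversion; the function names and the toy 8×8 channel are illustrative, not taken from the paper.

```python
import numpy as np

def d2xy(n, d):
    """Standard Hilbert-curve conversion: index d -> (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def encode_channel(channel):
    """Quantize an 8-bit channel to 8 levels, scan it along the Hilbert curve,
    then differentially encode so that adjacent (usually similar) pixels yield
    small values the Huffman coder can compress well."""
    n = channel.shape[0]                  # assume a square, power-of-2 side
    quantized = channel // 32             # 256 levels -> 8 levels (3 bits)
    scan = [quantized[y][x] for x, y in (d2xy(n, d) for d in range(n * n))]
    diffs = [int(scan[0])] + [int(scan[i]) - int(scan[i - 1]) for i in range(1, len(scan))]
    return diffs                          # input to the Huffman coder

channel = np.arange(64, dtype=np.uint8).reshape(8, 8) * 4   # toy 8x8 channel
print(encode_channel(channel)[:10])
```

The Hilbert scan matters because it keeps spatially adjacent pixels adjacent in the 1D sequence, so the differences stay small and the subsequent entropy coding is effective.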
Topological dimension encoding: Metadata (e.g., ASCII letters) are encoded through the ON–OFF inclusion of native nicking endonucleases, with each enzyme representing one bit of a 7-bit ASCII code: a '1' corresponds to the enzyme being present (ON), a '0' to its absence (OFF). Enzymes are chosen for high site-specificity (preventing nonspecific cleavage) and for mutually large Hamming distances between recognition sequences (reducing cross-nicking). This superimposes metadata on the same synthetic DNA without altering its sequence content.

Rewriting: Nicks are sealed with T4 DNA ligase to erase the metadata; new metadata can then be written by re-nicking.

DNA synthesis and sequencing: Eight Marlon Brando still images were encoded. After compression and mapping, the total sequence length was 2,317,896 nt across 11,826 oligos (eight pools, one per intensity level), synthesized by IDT. Sequencing used Illumina MiSeq with standard protocols. Consensus sequences were formed by aligning reads with error-free addresses; most sequences were recovered perfectly, with some oligos missing or erroneous.

Sequence dimension decoding and ML post-processing: Initial reconstruction from the decoded oligos exhibits discoloration blocks caused by errors and missing oligos. A three-step ML/CV pipeline addresses this: (1) automatic discoloration detection exploits the redundancy of storing R, G, and B separately; pairwise channel differences (R–G, R–B, G–B) flag likely corrupted pixels via their low-frequency occurrences and generate masks; (2) deep-learning inpainting (Gated Convolution and EdgeConnect) fills the masked regions using learned priors; (3) bilateral and adaptive median filtering, with additional enhancement, reduce blocking artifacts and blend the inpainted regions.

Optional unequal error protection (UEP): For images with granular facial details, selected oligos containing key facial features receive protection from regular systematic LDPC codes of rate 0.75, adding 391 oligos (≈3.3% redundancy) to improve reconstruction of those areas.

JPEG+LDPC comparison: Simulations compared 2DDNA with joint source–channel coding using JPEG compression plus LDPC decoding under a 0.8% substitution error rate and a 0.7% missing-oligo rate (≈1.9% bit error rate). Mismatched LDPC channel parameters degraded recovery and triggered JPEG decoding failures, highlighting the sensitivity of that approach to parameter mismatch.

Topological readout and decoding: Metadata are read by sequencing the nicked pools after denaturation and conversion to dsDNA. A prefix–suffix pattern search counts fragment pairs bounded by enzyme recognition sites to infer, by thresholding the counts, which enzymes were present; a minimal detection sketch appears at the end of this section. For simultaneous two-dimensional readout, two subpools per intensity level are used: one ligated (nick-free) to reconstruct the sequence content and serve as an alignment reference, and one nicked for the topological readout.

Methods details: PCR amplification conditions, library preparation, nicking protocols with enzyme lists, and sequencing workflows are provided; primer design satisfies Hamming-distance, correlation, balance, and primer-dimer constraints. Random access is performed by isolating subpools, ligating to enable PCR, and then sequencing; alternatively, random access can use magnetic beads carrying primers that correspond to addresses, avoiding the ligation step.
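As a rough illustration of the pattern-search readout, the sketch below infers which nicking enzymes were applied by counting sequenced fragments whose boundaries coincide with each enzyme's cut sites in a reference sequence. The enzyme names, recognition sites, count threshold, and the simplification that a nick falls at the start of the recognition site are all placeholders, not the paper's actual enzyme set or detection rule.

```python
# Placeholder recognition sites; the paper's Methods list the actual enzymes.
ENZYME_SITES = {
    "Enz0": "GAGTC", "Enz1": "GCAATG", "Enz2": "CCTCAGC",
    "Enz3": "GGATC", "Enz4": "CACGAG", "Enz5": "GTCTC", "Enz6": "CGTCTC",
}

def cut_positions(reference, site):
    """Positions in the reference where this enzyme's recognition site begins."""
    return {i for i in range(len(reference)) if reference.startswith(site, i)}

def decode_character(fragments, reference, threshold=10):
    """Infer one 7-bit ASCII character from sequenced fragment boundaries.

    A nick splits the strand, so fragments of a nicked pool start or end at an
    applied enzyme's cut sites; counting such fragments and thresholding the
    count decides each ON ('1') / OFF ('0') bit."""
    bits = []
    for enzyme in sorted(ENZYME_SITES):          # fixed bit order: Enz0..Enz6
        cuts = cut_positions(reference, ENZYME_SITES[enzyme])
        count = 0
        for frag in fragments:
            start = reference.find(frag)
            if start >= 0 and (start in cuts or start + len(frag) in cuts):
                count += 1                       # boundary coincides with a cut site
        bits.append("1" if count >= threshold else "0")
    return chr(int("".join(bits), 2))
```

With seven enzymes, each pool encodes one character, so the eight pools together spell an eight-letter word such as "ILLINOIS".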
Key Findings
- Experimental demonstration on eight images (8,654,400 original bits) yielded 2,317,896 nt of encoded sequence across 11,826 oligos (eight intensity-level pools). Illumina MiSeq sequencing and consensus assembly recovered 11,726 sequences perfectly and 22 with minor errors that did not significantly affect image quality; 78 oligos were highly corrupted or missing.
- Without error-correction redundancy, initial reconstructions showed discoloration artifacts. The ML/CV pipeline (automatic discoloration detection, deep inpainting, smoothing/enhancement) substantially improved visual quality, yielding high-quality replicas with undetectable or small visual degradation.
- Optional unequal error protection targeting facial features with rate-0.75 LDPC codes added 391 oligos (≈3.3% overhead) and improved reconstruction of granular facial details compared with no redundancy.
- Comparison with JPEG+LDPC joint source–channel coding showed that LDPC decoding is highly sensitive to mismatched channel parameters; under realistic error rates (≈0.8% substitutions, 0.7% missing oligos), incorrect parameter assumptions led to decoding failures and JPEG decompression errors (e.g., OSError), whereas the 2DDNA ML approach avoided both global redundancy and catastrophic failures.
- Topological metadata storage: the word "ILLINOIS" (56 ASCII bits) was written across the eight pools using seven nicking enzymes (one per ASCII bit), fully erased via T4 DNA ligase (no residual nicks detected), and rewritten as "GRAINGER" with error-free reconstruction. Enzyme count tables showed clear separation between used and unused enzymes, and the second enzyme set eliminated spurious nicking.
- Sequencing coverage averaged 112×, notably lower than in some prior work (e.g., 3,000×), yet sufficient for high-quality reconstruction with the ML pipeline.
- Information density: 3.73 bits/nt referenced to the original raw image bits (reflecting acceptable distortion after reconstruction) and 1.40 bits/nt relative to the quantized image bits; the arithmetic is worked out after this list. Primer/address overhead for random access was 27% (53 nt per 196-nt oligo). The reported storage densities correspond to approximately 0.91 ZB/gram (raw basis) and 0.34 ZB/gram (quantized basis).
- Robustness: simulations and supplemental analyses indicate the approach tolerates low coverage and error rates up to ~7% while maintaining reconstructability via ML post-processing.
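The density figures follow directly from the reported totals; a quick consistency check (our arithmetic, assuming 8-bit pixels quantized to 3 bits and the 53-nt primer/address overhead stated above):

```python
raw_bits  = 8_654_400            # original bits across the eight images
total_nt  = 2_317_896            # synthesized nucleotides
oligos    = 11_826
assert total_nt == oligos * 196  # every oligo is 196 nt

quantized_bits = raw_bits * 3 // 8   # 8-bit pixels quantized to 3 bits

print(raw_bits / total_nt)        # ~3.73 bits/nt, raw-image basis
print(quantized_bits / total_nt)  # ~1.40 bits/nt, quantized basis
print(53 / 196)                   # ~0.27 primer/address overhead per oligo
```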
Discussion
The 2DDNA system leverages two orthogonal molecular dimensions—sequence and backbone topology—to encode high-volume image content and rewritable metadata within the same DNA substrate. By combining tailored compression and per-channel encoding with ML/CV post-processing, it circumvents the need for worst-case error-control redundancy and avoids LDPC parameter mismatch issues that complicate traditional DNA storage pipelines. Using RGB channel separation as inherent redundancy enables automated discoloration detection and targeted inpainting, yielding high-quality reconstructions even with missing or erroneous oligos. The topological layer enables permanent, privacy-preserving metadata erasure and rewriting via a simple ligation step, contrasting with more complex prior rewriting methods. Simultaneous two-dimensional readout is achieved using nick-free reference sequencing for alignment and a dedicated nicked subpool for metadata detection. The demonstrated coverage requirements are modest, suggesting practical viability. Overall, 2DDNA reduces synthesis overhead, supports random access, enhances robustness to varying channel qualities, and lays groundwork for multidimensional molecular storage and in-memory computing applications.
Conclusion
This work introduces 2DDNA, a multidimensional DNA data storage platform that encodes image content in sequence and metadata in backbone nicks, enabling simultaneous readout and efficient, permanent erasure/rewriting via ligation. By replacing global error-correction redundancy with ML/CV-based reconstruction (discoloration detection, inpainting, smoothing/enhancement), the system achieves high-quality image recovery at modest sequencing coverage and demonstrates robustness to channel variability. Optional unequal error protection selectively improves critical features with minimal overhead. The platform attains high effective information density and demonstrates practical metadata rewriting on synthetic DNA. These results establish foundations for storing heterogeneous datasets with rewrite capabilities and for nontraditional applications such as parallel in-memory computing. Future work may further optimize enzyme sets for nicking specificity, expand to additional molecular dimensions (e.g., concentration), and scale to larger, more diverse datasets and media.
Limitations
- Readout challenges arise from nick-induced fragmentation, which causes assembly ambiguities; decoding requires algorithmic prefix–suffix pattern searches and, for joint readout, sequencing of a ligated reference subpool.
- Nicks prevent direct PCR amplification; random access requires ligation prior to PCR or alternative bead-based methods.
- Reconstructions of images with highly granular details may remain blurred without the optional unequal error protection, so a small amount of targeted redundancy is needed for best fidelity.
- LDPC-based approaches are sensitive to mismatched channel parameters; while 2DDNA avoids global LDPC redundancy, scenarios demanding stronger guarantees might still require carefully parameterized ECC.
- Primer/address sequences introduce ≈27% overhead per 196-nt oligo to support random access, reducing net bits/nt compared with very long gBlocks.
- Enzyme selection is critical: insufficient sequence specificity or small Hamming distances between recognition sites can cause cross-nicking. The study improved results by refining enzyme choices between writing rounds.
- Performance and robustness at very large scales, across diverse data types, and under extreme error/coverage regimes require validation beyond the eight-image test set.