Low cost DNA data storage using photolithographic synthesis and advanced information reconstruction and error correction

P. L. Antkowiak, J. Lietard, et al.

Discover the DNA storage system developed by Philipp L. Antkowiak and colleagues. Using massively parallel light-directed synthesis, this cost-effective method tackles high sequence error rates with tailored encoding and reconstruction algorithms, achieving perfect data recovery. The approach is demonstrated by storing the sheet music of a Mozart string quartet, showcasing the potential of DNA for data storage.

Introduction
DNA has recently been showcased as a powerful medium for digital information storage, with early demonstrations by Goldman et al. and Church et al. storing around 1 MB of data. Interest in DNA storage is driven by its potential for high information density and longevity, alongside advances in parallel DNA synthesis and sequencing technologies. Enzymatic DNA synthesis methods have emerged as alternatives to phosphoramidite chemistry but are not yet capable of producing highly diverse libraries suitable for large-scale storage. This work investigates whether lower-cost, higher-speed, lower-precision traditional chemical synthesis—specifically light-directed maskless photolithography—can be used despite its higher sequence error rates. The central research question is whether data can be reliably recovered from highly erroneous synthetic DNA by leveraging advanced encoding, reconstruction, and error-correction algorithms. The authors encode nearly 100 kB of data into 60-nt oligos synthesized on a rapid light-directed array platform and develop a decoding pipeline designed for high-error regimes, aiming to demonstrate perfect data recovery despite substantially elevated synthesis error rates.
Literature Review
The paper situates its contribution within prior DNA data storage research that relied on high-quality, low-error oligos from commercial providers (e.g., Twist, Agilent via inkjet printing, and CustomArray via electrochemical deprotection). Early demonstrations stored ~0.65–0.75 MB with varying logical densities. Organick et al. introduced data randomization to mitigate GC imbalance and homopolymer issues and demonstrated large-scale random access. Typical prior synthesis and sequencing error rates were about 0.5% substitutions and <0.1% insertions/deletions. Enzymatic synthesis approaches (e.g., Lee et al.; Palluk et al.) show promise but are not yet suited for building large, highly variable libraries. Compared with these, the present work explores photolithographic, maskless synthesis using digital micromirror devices (DMDs), trading synthesis fidelity for cost and speed. The authors explicitly forgo sequence constraints (e.g., avoiding homopolymers) and rely instead on invertible pseudorandomization and stronger error-correcting codes to tolerate significantly higher error rates.
Methodology
Synthesis platform and design: The authors use a maskless, light-directed photolithographic microarray DNA synthesis system. A DMD (1024×768 mirrors) projects UV (365 nm) patterns onto a flow cell to site-specifically deprotect the Bz-NPPOC-protected 5′-OH groups of the growing oligonucleotides. Features are formed from groups of 8×6 mirrors, yielding a 128×128 array (16,384 possible oligos). Because stepwise coupling yields are relatively low (95–99%), oligo length is limited to 60 nt. To maximize the length available for data, the oligos carry no embedded PCR primer sites, saving the ~40 nt that constant adapters would otherwise occupy. A 3′ dT linker attached via a base-sensitive succinyl moiety gives every sequence a terminal T at the 3′ end.

Encoding and indexing: The input data (99,103 bytes of ZIP-compressed MusicXML files of Mozart's String Quartet K458) are pseudorandomized to avoid long homopolymers and extreme GC content and to promote pairwise near-orthogonality of the oligos. The data are partitioned into K = 10,977 sequences, each carrying 84 data bits (6 blocks of 14 bits). An outer Reed–Solomon (RS) code over GF(2^14) of length N = 16,383 appends N − K = 5,406 parity sequences, producing 16,383 sequences in total. A 24-bit index (of which only 14 bits are needed for this experiment) is added to each sequence, giving 24 + 84 = 108 information bits, i.e., 18 six-bit symbols. For inner protection, each sequence of 18 six-bit symbols is RS encoded over GF(2^6) with 2 added parity symbols, giving 20 symbols (120 bits). Finally, the bits are mapped to DNA bases two at a time (00→A, 01→C, 10→G, 11→T) to produce 60-nt oligos. The index is placed internally rather than at the oligo ends, where error rates are highest. (A minimal sketch of this per-sequence layout is given below, after the library preparation details.)

Photolithographic synthesis details: Slides are silanized with N-(3-triethoxysilylpropyl)-4-hydroxybutyramide and assembled with a 50 µm PTFE gasket into a synthesis cell. A UV LED (Nichia NVSU333A), coupled through a mirror-tunnel homogenizer and an Offner relay, illuminates the ON mirrors for 35 s at ~85 mW cm⁻² (≈3 J cm⁻²). During UV deprotection, 1% w/w imidazole in DMSO assists Bz-NPPOC removal. Phosphoramidite coupling, activated with 0.25 M dicyanoimidazole in acetonitrile, proceeds for 15 s and is followed by oxidation (I2 in pyridine/H2O/THF). Nucleobases are protected with tac (A, G) and acetyl (C) groups. Synthesis of the >16,000 unique sequences requires 187 cycles. After synthesis, terminal 5′-Bz-NPPOC groups are removed by UV exposure; base deprotection and cleavage of the 3′ succinyl ester are carried out in ethylenediamine/toluene (1:1) for 2 h at room temperature, and the eluates are desalted (ZipTip C18) and quantified.

Library preparation and sequencing: Because the oligos lack constant primer regions, the Accel-NGS 1S Plus kit (Swift Biosciences) is used to attach adapters directly to the single-stranded DNA, enabling conversion to double-stranded DNA and addition of full Illumina adapters by PCR. The kit appends a polynucleotide tail (median length 8 nt; 80% G, 20% A) to the 3′ ends. The libraries (181 bp after full adapter ligation) are sequenced on an Illumina NextSeq. For the 100 kB file (file 1), ~30 million reads are obtained; the average read length of 63 nt exceeds the 60-nt design length because of the appended tails and insertions.
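To make the per-sequence layout described above concrete, here is a minimal Python sketch (not the authors' code). It assembles one 60-nt oligo from a 24-bit index, 84 data bits, and two inner Reed–Solomon parity symbols supplied by a hypothetical inner_rs_parity helper; for simplicity the index is placed at the front, whereas the paper embeds it away from the error-prone oligo ends.

# Illustrative sketch of the per-sequence layout (not the authors' code).
# inner_rs_parity is a hypothetical helper returning the two GF(2^6)
# Reed-Solomon parity symbols for the 18 information symbols.

BASES = "ACGT"  # 2-bit mapping: 00->A, 01->C, 10->G, 11->T

def bits_to_oligo(bits):
    """Map a 120-bit string to a 60-nt oligo, two bits per base."""
    assert len(bits) == 120
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, 120, 2))

def build_oligo(index, data_bits, inner_rs_parity):
    """Assemble one oligo: 24-bit index + 84 data bits + 2 parity symbols.

    index           -- integer sequence index (only 14 bits were needed here)
    data_bits       -- string of 84 bits, i.e. 6 outer-code blocks of 14 bits
    inner_rs_parity -- callable returning two 6-bit parity values (ints)
    """
    info_bits = format(index, "024b") + data_bits                      # 108 bits
    symbols = [int(info_bits[i:i + 6], 2) for i in range(0, 108, 6)]   # 18 symbols
    symbols += inner_rs_parity(symbols)                                # 20 symbols
    return bits_to_oligo("".join(format(s, "06b") for s in symbols))   # 60 nt

Any standard GF(2^6) Reed–Solomon encoder could stand in for the parity helper; the point of the sketch is the symbol layout and the 2-bit base mapping.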
Error characterization and modeling: The GC content of the reads is balanced. Per-base error rates within a 60-nt window are ~2.6% substitutions, ~6.2% deletions, and ~5.7% insertions, with errors increasing toward the 3′ end. Base-frequency profiles across positions show shifts in the index region that are consistent with stochastic deletions occurring upstream of the index. A Monte Carlo simulation that introduces deletions, insertions, and substitutions and appends 3′ tails reproduces the observed base distributions; the best overlap is obtained with deletion, substitution, and insertion probabilities of ~11.7%, ~3.3%, and ~5.0%, respectively. Spatial analysis across the DMD array shows higher read yields in the center and lower yields at the edges and corners, owing to variations in optical intensity and homogenization; error rates are more homogeneous across the surface.

Decoding pipeline: Because raw error-free reads are extremely rare (a fraction of roughly 10⁻⁷), a multi-stage recovery pipeline is used:
- Clustering: A custom locality-sensitive hashing (LSH) approach groups similar reads; a naive hash on the first 16 nt also works but yields slightly inferior clusters.
- Multiple alignment: Clusters with fewer than 5 reads are discarded. For clusters of 5–15 reads, MUSCLE computes a multiple alignment; for larger clusters, random subsets of 15 reads are aligned, since computational cost grows sharply beyond 15. Candidate sequences are extracted from the alignments by weighted majority voting with weights A, C, G, T = 1 and deletion (gap) = 0.4, reflecting the higher likelihood of deletions (a simplified sketch of this vote follows at the end of this subsection).
- Inner decoding: Candidate sequences are mapped back to 6-bit symbols and decoded with the inner RS code, yielding sequences ordered by their indices.
- Outer decoding: The outer RS code over GF(2^14) corrects the residual errors and erasures and can recover the data even when up to ~33% of the sequences are missing (fewer if errors must also be corrected).

Scaling experiments: Two additional datasets are synthesized with the same parameters: file 2 (~323 kB across 49,149 sequences; ~100.7 million reads) and file 3 (~1.3 MB across 196,596 sequences; ~195.6 million reads). As array density increases, spot size decreases and synthesis quality drops, increasing insertion rates by ~2% (file 2) and ~4% (file 3) relative to file 1, while substitution and deletion rates remain relatively stable. With the original redundancy, perfect recovery is not achieved for files 2 and 3; simulations indicate that tripling the outer-code redundancy would enable perfect recovery, with projected consumables costs per MB falling substantially at these larger scales.
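The weighted vote referenced in the decoding pipeline above can be illustrated with a simplified Python sketch (not the published pipeline code): given gap-aligned reads from one cluster, each column is decided by weighted counts, and a winning gap simply deletes that position from the candidate.

# Simplified sketch of the weighted majority vote over an aligned cluster.
# The actual pipeline works on MUSCLE alignments and includes further
# candidate filtering; this only shows the column-wise vote itself.
from collections import Counter

WEIGHTS = {"A": 1.0, "C": 1.0, "G": 1.0, "T": 1.0, "-": 0.4}  # gaps vote with 0.4

def consensus(aligned_reads):
    """Column-wise weighted majority vote over equal-length aligned reads."""
    candidate = []
    for column in zip(*aligned_reads):
        score = Counter()
        for ch in column:
            score[ch] += WEIGHTS.get(ch, 0.0)
        winner = max(score, key=score.get)
        if winner != "-":          # a winning gap means the column is dropped
            candidate.append(winner)
    return "".join(candidate)

# Toy example with three aligned reads from one cluster:
# consensus(["ACG-T", "A-GAT", "ACGAT"]) returns "ACGAT"

Down-weighting gaps reflects the fact that deletions are the dominant synthesis error, so aligned gaps are treated as weaker evidence than observed bases.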
Key Findings
- Storage of 99,103 bytes (~100 kB) of compressed Mozart sheet music was demonstrated in 16,383 photolithographically synthesized 60-nt oligos, using 60% redundancy and achieving a net information density of ~0.94 bits/nt.
- Sequencing (Illumina NextSeq) produced ~30 million reads with an average length of 63 nt; the appended tails and indels explain lengths above 60 nt. GC content was balanced.
- Measured per-base error probabilities within a 60-nt window were ~2.6% substitutions, ~6.2% deletions, and ~5.7% insertions, with errors increasing toward the 3′ end. Base-frequency shifts around the index are consistent with stochastic deletions occurring before the index.
- Monte Carlo modeling matched the observed distributions with deletion ~11.7%, substitution ~3.3%, and insertion ~5.0% probabilities, supporting a synthesis origin of the errors (underexposure and coupling inefficiency causing deletions; optical scattering/diffraction causing insertions; appended tails causing apparent insertions/substitutions at the 3′ end). A minimal error-channel sketch follows this list.
- Raw error-free reads are exceedingly rare (a fraction of roughly 10⁻⁷). After trimming reads longer than 60 nt down to 60 nt, only ~0.04% of reads are error-free. Among 15 million untrimmed reads, only ~0.09% of the distinct target sequences occur at least once without error; after trimming, ~31% of sequences appear at least once without error, which is still insufficient for the outer code alone.
- Recovery pipeline results (file 1): after clustering and weighted majority voting, 393,004 candidate sequences were obtained; ~7% of the candidates are error-free, and 74% (naive clustering) to 83% (LSH clustering) of the original sequences appear at least once without error among the candidates. Outer RS decoding then corrects ~3.4% erasures and ~13.3% errors, enabling perfect data recovery.
- Spatial analysis of synthesis shows higher yields in the center and lower yields at the edges and corners, attributed to variations in optical intensity/homogenization and in β-carotene UV-absorber coverage; error patterns are comparatively uniform across the array.
- Scaling (files 2 and 3): insertion error rates increased by ~2% and ~4% relative to file 1; with the same redundancy the data could not be fully recovered, but increasing the outer-code redundancy by ~3× would enable perfect recovery. Projected consumables costs are ~160 US$ per MB (file 2) and ~40 US$ per MB (file 3), compared with ~530 US$ per MB reported for the small-scale file 1 (consumables only).
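To illustrate the kind of error-channel simulation referred to in the Monte Carlo finding above, the following minimal Python sketch (an assumption of independent per-position errors; the authors' simulation may differ in detail) corrupts a designed oligo with the reported best-fit probabilities and appends a G/A-rich 3′ tail.

# Minimal error-channel sketch (assumed independent per-position errors).
import random

P_DEL, P_SUB, P_INS = 0.117, 0.033, 0.05   # best-fit probabilities from the paper
BASES = "ACGT"

def corrupt(oligo, tail_len=8):
    """Apply insertions, deletions and substitutions, then append a 3' tail."""
    out = []
    for base in oligo:
        if random.random() < P_INS:                        # insertion before this base
            out.append(random.choice(BASES))
        if random.random() < P_DEL:                        # base deleted
            continue
        if random.random() < P_SUB:                        # base substituted
            base = random.choice(BASES.replace(base, ""))
        out.append(base)
    out += random.choices("GA", weights=[0.8, 0.2], k=tail_len)  # library-prep tail
    return "".join(out)

Generating many such corrupted copies of the designed sequences and comparing positional base frequencies against the sequencing data is the essence of the fit described above.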
Discussion
The study addresses the core challenge of DNA data storage cost and synthesis speed by demonstrating that highly erroneous, low-cost photolithographic synthesis can still yield reliable data retrieval when paired with tailored encoding and reconstruction algorithms. By randomizing data, using nested RS codes, and introducing an LSH-based clustering plus high-quality multiple alignment and weighted majority voting, the pipeline transforms noisy reads into sufficiently accurate candidates for successful inner and outer decoding. Error analyses confirm that the dominant errors originate from the synthesis process—primarily deletions due to deliberate underexposure and coupling inefficiencies, and insertions from optical crosstalk—rather than from library preparation or sequencing. Spatial analyses identify optical nonuniformities as the main drivers of yield inhomogeneity. Despite raw reads being almost entirely erroneous, the decoding pipeline corrects substantial error and erasure rates (13.3% errors, 3.4% erasures after inner decoding) to recover the file perfectly. Scaling experiments reveal that as spot density increases, insertion errors rise and recovery becomes challenging at the original redundancy, but modest increases in outer-code redundancy would restore perfect recovery. Economically, larger-scale synthesis substantially reduces cost per MB, suggesting that with further hardware and chemistry optimization (improved light homogenization, exposure tuning, and reagent scaling), DNA data storage with photolithographic synthesis can approach highly competitive costs while accepting higher error regimes handled in software.
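As a back-of-the-envelope consistency check (assuming the quoted percentages refer to the N = 16,383 outer-code symbols), the reported error and erasure load fits within the classical Reed–Solomon decoding bound 2·errors + erasures ≤ N − K:

N, K = 16383, 10977                # outer Reed-Solomon code parameters
errors   = round(0.133 * N)        # ~2,179 erroneous candidate sequences
erasures = round(0.034 * N)        # ~557 missing sequences
print(2 * errors + erasures, "<=", N - K)   # 4915 <= 5406, so decoding succeeds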
Conclusion
This work shows that archival data can be stored and perfectly recovered from highly erroneous DNA synthesized via low-cost, maskless photolithography. The main contributions are: (1) a practical synthesis platform producing 60-nt oligos without embedded primers; (2) a robust encoding framework using data randomization, nested Reed–Solomon codes, and careful index placement; (3) an advanced reconstruction pipeline combining LSH-based clustering, MUSCLE-based multiple alignment, and weighted majority voting; and (4) a comprehensive error characterization attributing elevated error rates chiefly to synthesis. The approach reduces costs relative to prior high-fidelity synthesis-based systems and shows promising cost scaling at larger array sizes. Future work should focus on optimizing optical uniformity and exposure to mitigate deletions and crosstalk, refining chemistry (e.g., coupling efficiencies), integrating adaptive redundancy based on measured error profiles, improving computational efficiency of alignment/clustering, and scaling arrays and reagent usage to target near 1 US$ per MB storage costs while maintaining reliable recovery.
Limitations
- Elevated synthesis error rates (notably deletions and insertions) are intrinsic to the fast, low-cost photolithographic process used; while the decoding pipeline compensates, it increases computational demands (multiple alignment is the main bottleneck).
- Recovery of the larger datasets (files 2 and 3) was not achieved with the same redundancy due to increased insertion rates from denser arrays; additional redundancy (~3×) would be required.
- Yield inhomogeneity across the array (lower at edges and corners), caused by optical intensity and homogenization variations, affects read coverage.
- Library preparation steps introduce 3′ tails that can confound fixed-window error accounting, contributing to apparent insertions/substitutions at the 3′ end.
- The study omits embedded primer regions to save synthesis length, necessitating specialized ssDNA adapter-ligation methods; this may complicate standard library workflows.
- No formal statistical hypothesis testing was performed; Monte Carlo modeling and empirical comparisons were used for validation.