Computer Science

Molecular-level similarity search brings computing to DNA data storage

C. Bee, Y. Chen, et al.

Discover a groundbreaking technique that executes similarity searches over a massive DNA-based database of 1.6 million images, using hybridization probes to provide accurate results. This innovative research by Callista Bee, Yuan-Jyue Chen, and others showcases a molecular implementation that rivals state-of-the-art in silico algorithms.

Introduction

The study addresses the lack of content-based retrieval capabilities (e.g., similarity search) in DNA-based data storage systems, which traditionally rely on key-based retrieval via PCR primers that act like filenames. With the rise of DNA as a dense, durable storage medium and the inherently parallel nature of molecular operations, the authors propose leveraging DNA hybridization to perform near-data computing over high-dimensional feature representations of content. They frame document similarity as a geometric problem in feature space (e.g., using neural network activations such as VGG16 FC2) and seek a feature-to-DNA encoding that preserves similarity, so that reverse-complemented query sequences preferentially hybridize with targets representing semantically similar items. Prior cluster-based approaches reduce content-based retrieval to key-based retrieval but lack granularity and adaptability. This work instead proposes a continuous learned encoding that enables scalable, content-based molecular similarity search without retraining when new items are added, and demonstrates it on a database of 1.6 million images.

Literature Review

The paper situates its contribution within: (1) the evolution of DNA computing from exponential-scale combinatorial encodings to strand-displacement circuits and algorithmic self-assembly; (2) DNA data storage advances enabling random access via PCR primers (key-based retrieval); (3) early proposals for DNA content search requiring similarity-preserving encodings but incompatible with arbitrary digital encodings; (4) prior cluster-based mappings (e.g., k-means with fixed codebooks) that reduce content retrieval to exact matching but lack graded similarity and adaptability; and (5) modern content-based retrieval using deep feature extractors (e.g., VGG16/FC2) and ANN indices (e.g., HNSW, faiss, annoy, RPForest). The authors build on their earlier proof-of-principle work by introducing a learned, continuous feature-to-sequence encoding and a differentiable hybridization predictor to scale molecular similarity search to 1.6 million images.

Methodology

Data and feature extraction: Use OpenImages dataset. Extract 4096-dimensional feature vectors from images using VGG16 (FC2 activations) with pretrained weights.
Similarity labeling: Based on analysis of Euclidean distances in feature space, pairs with distance ≤ 75 are labeled "similar"; others "not similar". This threshold is a property of the feature space and chosen once per document type.
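The labeling rule can be sketched in a few lines. This is a minimal illustration, assuming feature vectors are plain lists of floats; the function names are hypothetical and only the threshold of 75 comes from the text.

```python
import math

# Euclidean distance cutoff reported in the text for this feature space.
SIMILARITY_THRESHOLD = 75.0


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_similar(a, b, threshold=SIMILARITY_THRESHOLD):
    """Return 1 if the pair is labeled 'similar' (distance <= threshold), else 0."""
    return 1 if euclidean_distance(a, b) <= threshold else 0
```

In the study the vectors would be 4096-dimensional VGG16 FC2 activations; the same function applies unchanged to vectors of any length.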
Sequence encoder architecture: Fully connected neural network mapping 4096-d features to an 80-nt DNA sequence in a softmax one-hot representation (4 channels per position summing to 1). Architecture: 4096 → 2048 ReLU → output layer producing 80×4 softmax channels. To discretize outputs, select the maximum channel per position. Apply per-position entropy regularization to encourage confident base calls and reduce indeterminate outputs.
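The discretization and entropy terms described above can be illustrated as follows. This is a sketch under stated assumptions, not the authors' code: the channel order A, C, G, T and the function names are hypothetical, and the entropy is computed in nats and averaged over positions.

```python
import math

BASES = "ACGT"  # assumed channel order; the source does not specify it


def softmax(logits):
    """Numerically stable softmax over one position's four logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discretize(position_probs):
    """Pick the highest-probability base at each position."""
    return "".join(
        BASES[max(range(4), key=lambda i: probs[i])] for probs in position_probs
    )


def mean_entropy(position_probs):
    """Mean per-position Shannon entropy (nats); 0 means fully confident calls."""
    total = 0.0
    for probs in position_probs:
        total -= sum(q * math.log(q) for q in probs if q > 0.0)
    return total / len(position_probs)
```

Adding a multiple of `mean_entropy` to the training loss penalizes positions where probability mass is spread across several bases, which is one plausible way to realize the per-position entropy regularization mentioned above.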
Differentiable hybridization predictor: A convolutional neural network that takes a pair of one-hot sequences and estimates hybridization yield between a target sequence and the reverse complement of a query. Includes a local match layer (3-mer window comparisons allowing misalignments), followed by pooling and convolution, and global averaging to produce an estimated yield.
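To give a feel for what a local match computation does, here is a hand-written, non-learned proxy. It is not the predictor from the paper: it works on discrete strings rather than one-hot tensors, has no trainable parameters, and the function name, the `max_shift` parameter, and the averaging step are all assumptions made for illustration.

```python
def local_match_score(target, query, k=3, max_shift=1):
    """Fraction of k-mer windows in `target` found in `query` within +/- max_shift.

    `query` is the encoded query sequence before reverse complementing, so a
    high score suggests its reverse complement would pair well with `target`.
    """
    if len(target) != len(query):
        raise ValueError("sequences must have the same length")
    n_windows = len(target) - k + 1
    if n_windows <= 0:
        raise ValueError("sequences are shorter than the window size")
    matches = 0
    for i in range(n_windows):
        window = target[i:i + k]
        # Allow small misalignments by checking neighboring offsets in the query.
        for shift in range(-max_shift, max_shift + 1):
            j = i + shift
            if 0 <= j <= len(query) - k and query[j:j + k] == window:
                matches += 1
                break
    return matches / n_windows
```

The learned predictor replaces this fixed rule with convolutional layers whose weights are fit to NUPACK-simulated yields, which lets gradients flow back to the encoder.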
Training loop (alternating phases):
• Encoder training: Sample batches in which 50% of pairs are "similar". Encode each feature pair into softmax one-hot sequences and pass them to the predictor to estimate reaction yield. Use cross-entropy loss between the similarity labels (0/1) and the predicted yield to update encoder parameters via gradient descent. Apply entropy regularization to keep outputs near-discrete.
• Predictor training: Discretize encoder outputs to deterministic one-hot sequences and simulate hybridization yields with NUPACK at 21 °C, with equimolar (1 nM) target and query and appended primer toeholds (target: reverse primer appended; query: 6 bases of RP appended and reverse complemented). Compute cross-entropy between the NUPACK-simulated yield and the predictor estimate, and update predictor parameters. The encoder is frozen during this phase.
Alternate the two phases until convergence. The predictor remains simpler and less accurate than NUPACK overall, but it is specialized to encoder outputs and sufficient for guiding encoder optimization.
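Both phases use a cross-entropy between a target value in [0, 1] and a predicted yield in [0, 1]. The sketch below shows that loss in plain Python, assuming yields are expressed as fractions; the clipping constant `eps` and the function names are hypothetical, and the gradient updates themselves are omitted.

```python
import math


def binary_cross_entropy(target, predicted_yield, eps=1e-7):
    """Cross-entropy between a target in [0, 1] and a predicted yield in [0, 1].

    Encoder phase: `target` is the 0/1 similarity label.
    Predictor phase: `target` is the NUPACK-simulated yield.
    """
    # Clip to avoid log(0) when the prediction saturates.
    p = min(max(predicted_yield, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def batch_loss(targets, predicted_yields):
    """Mean cross-entropy over a batch of pairs."""
    if len(targets) != len(predicted_yields):
        raise ValueError("targets and predictions must have the same length")
    losses = [binary_cross_entropy(t, p) for t, p in zip(targets, predicted_yields)]
    return sum(losses) / len(losses)
```

Using the same loss form in both phases means the encoder is pushed toward sequences whose predicted yield is high for similar pairs and low otherwise, while the predictor is pushed toward agreement with the simulator.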
Database construction for lab experiments: Hold out 1.6 million images from OpenImages V4 as the molecular database. Encode each image into a single synthesizable DNA oligo sequence comprising: (i) unique 30-nt error-correctable barcode mapped from integer ID (0–16,777,215) via pseudo-random permutation then Reed–Solomon coding with homopolymer-free 5-nt symbols; (ii) 80-nt feature-encoding region; (iii) conserved primer regions (FP, RP, IP) for PCR and processing.
Query design and retrieval protocol: For each query image (not in database), synthesize a 5′-biotinylated probe oligo containing the reverse complement of the query’s encoded feature sequence plus a short toehold (first six bases of RP). Anneal probe with a prepared sample of the database (feature region exposed via linear PCR with IP*). Use streptavidin magnetic beads to capture stable query-target duplexes. Sequence the captured pool (Illumina NextSeq) and count reads per barcode to infer retrieved items.
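The probe sequence construction can be sketched as below. This assumes, following the description of the simulated constructs, that the toehold is appended to the encoded feature sequence before the whole string is reverse complemented; the function names are hypothetical, and the biotin modification is a synthesis detail that is not represented in the string.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def design_probe(feature_seq, reverse_primer, toehold_len=6):
    """Probe = reverse complement of (feature region + first bases of RP).

    The ordering of feature region and toehold is an assumption based on the
    text; the 5' biotin is added at synthesis and is not part of the string.
    """
    return reverse_complement(feature_seq + reverse_primer[:toehold_len])
```

For example, `design_probe("ACGT", "GGGGGGTTTT")` returns `"CCCCCCACGT"`, which is the reverse complement of the feature region followed by the six-base toehold.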
Oligo layouts and processing: Database oligo layout includes unique barcode and feature regions and conserved FP, RP, IP regions. PCR with FP and RP* to amplify; linear PCR with IP* to expose feature region. Query oligo: biotinylated reverse complement of features with toehold. Stable complexes captured via beads. USER enzyme digestion used to release captured strands before sequencing.
Laboratory workflow (overview): (1) PCR enrichment of synthesized pool; (2) linear amplification with forward primer; (3) linear amplification with internal primer to create partially dsDNA exposing feature region; (4) hybridization with query; (5) streptavidin bead capture; (6) release via USER digestion; (7) PCR enrichment of released oligos; (8) addition of randomized region for sequencing diversity; (9) ligation of Illumina adapters; (10) Illumina NextSeq sequencing. Reagents sourced from Twist (database oligos), IDT (queries), Thermo Fisher (Dynabeads), NEB (USER).
In silico benchmarking: Compare molecular method against ANN algorithms (HNSW, faiss-quantized, annoy, RPForest). For each algorithm and query, vary parameters to obtain candidate sets of different sizes; measure proportion of database retrieved vs 100-nearest-neighbor recall. Where algorithms could not produce very small candidate sets, assume linear subsampling to extend curves.
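The two quantities being compared can be computed as in the following sketch, assuming item IDs are hashable and the true nearest neighbors of a query are known from an exhaustive search; the function name is hypothetical.

```python
def recall_and_proportion(retrieved_ids, true_neighbor_ids, database_size):
    """Return (recall over the true neighbors, fraction of the database retrieved)."""
    retrieved = set(retrieved_ids)
    neighbors = set(true_neighbor_ids)
    if not neighbors:
        raise ValueError("true_neighbor_ids must not be empty")
    if database_size <= 0:
        raise ValueError("database_size must be positive")
    recall = len(retrieved & neighbors) / len(neighbors)
    proportion = len(retrieved) / database_size
    return recall, proportion
```

As a rough worked example using figures reported in the findings, a candidate set of 1,831 items from a database of about 1.6 million corresponds to roughly 0.11% of the database.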
Scaling simulations: Run NUPACK simulations on an additional 5.5 million images to assess yield distributions vs Euclidean distance to query and rank top targets by simulated yield.

Key Findings

The learned encoder maps nearby image feature vectors to DNA sequences whose reverse complements preferentially hybridize, enabling molecular similarity search.
Experimental molecular filtering over a 1.6 million-image DNA database enriches visually similar images: higher sequencing read depths correspond to targets with smaller Euclidean distances to the query.
Tradeoff between recall and proportion retrieved (exemplars from Fig. 3), shown as approximate retrieved set size → 100-NN recall at successive read-count thresholds:
• Query 1 (cat): 13 → 0.01; 58 → 0.05; 141 → 0.10; 1,831 → 0.50; 315,736 → 0.94.
• Query 2 (building with fireworks): 5 → 0.01; 76 → 0.05; 301 → 0.11; 1,592 → 0.51; ≥350,516 → 0.96.
• Query 3 (Lego sushi): 22 → 0.01; 439 → 0.05; 2,335 → 0.10; 224,097 → 0.35; no retrieved sets with recall > 0.35.
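Retrieved sets like these come from thresholding per-barcode read counts. A minimal sketch of that step, assuming sequencing reads have already been decoded to barcode identifiers (the decoding and error correction are not shown, and the function names are hypothetical):

```python
from collections import Counter


def count_reads(decoded_barcodes):
    """Tally how many sequencing reads map to each barcode."""
    return Counter(decoded_barcodes)


def retrieved_at_threshold(read_counts, min_reads):
    """Barcodes observed at least `min_reads` times."""
    return {barcode for barcode, n in read_counts.items() if n >= min_reads}
```

Raising `min_reads` shrinks the retrieved set and lowers recall, while lowering it has the opposite effect, which produces the tradeoff listed above.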
Even with aggressive filtering (high read thresholds), the top retrieved images remain relevant to the query. If read counts reflect concentrations, filtered mixtures could be diluted ~1000× while still retrieving relevant images, conserving sequencing resources.
Performance comparison (Fig. 4): The molecular method’s recall vs proportion-retrieved curves are comparable to state-of-the-art in silico ANN algorithms (HNSW, faiss-quantized, annoy, RPForest) under the same database and queries, indicating competitive resource efficiency (sequencing reads vs disk operations), though with higher latency.
Scaling simulations to 5.5 million additional images show that highest NUPACK-simulated yields are concentrated among visually similar images, supporting feasibility of aggressive filtering at larger scales.

Discussion

The findings demonstrate near-data processing using DNA as both storage medium and computational substrate via hybridization, enabling massively parallel operations on high-dimensional features. This molecular similarity search achieves enrichment of nearest neighbors with resource efficiency comparable to leading electronic ANN methods, while exploiting physical parallelism of hybridization. Although current latency (minutes to hours per query) is higher than disk-based retrieval at this scale, the approach may scale favorably to much larger databases without proportional increases in processing time due to inherent parallelism. Energy considerations are favorable since hybridization releases energy; primary energy costs arise from initial synthesis, thermal cycling for annealing, and sequencing readout, which can be reduced via low-power sequencing (e.g., nanopores) and by diluting filtered mixtures. The method generalizes to other data types given suitable feature extractors and could extend to patterns such as classification and time-series analysis. Engineering considerations include automation (digital microfluidics) for scalable, low-cost operation and tuning feature-region length to balance specificity and capacity.

Conclusion

This work presents the first large-scale demonstration of molecular-level similarity search over a DNA-encoded database of 1.6 million images using a learned, continuous feature-to-DNA encoding and a differentiable hybridization predictor. The molecular retrieval achieves enrichment of true nearest neighbors and performs competitively with state-of-the-art electronic ANN algorithms in terms of recall vs proportion retrieved, highlighting DNA’s potential for near-data computing at scale. Future directions include: automating the full protocol via digital microfluidics; optimizing feature-region length and sequence design to balance specificity and robustness; extending to other modalities with appropriate feature extractors; integrating low-power sequencing for efficient readout; and developing hybrid molecular-electronic systems that couple DNA storage and computation with conventional infrastructure.

Limitations

The search paradigm is fixed at database creation (e.g., visual similarity), limiting query types without rebuilding the index for new tasks.
Query latency is relatively long (minutes to hours), though throughput can be increased via batching; still slower than disk reads for current database sizes.
Sequence length trade-offs: longer feature regions may suffer from nonspecific interactions and secondary structures; shorter regions reduce encoding capacity and discriminability, requiring empirical optimization.
The hybridization predictor is an approximation specialized to encoder outputs and less accurate than full thermodynamic models (e.g., NUPACK), though sufficient for training.
Requires upfront training per data type and synthesis of encoded data and queries; energy and time costs are front-loaded.
Current system depends on laboratory processes (PCR, annealing, bead capture, sequencing) that need robust automation for practical deployment.
