Rewritable two-dimensional DNA-based data storage with machine learning reconstruction

C. Pan, S. K. Tabatabaei, et al.

Discover the groundbreaking work of Chao Pan and colleagues as they unveil 2DDNA, an innovative two-dimensional molecular data storage system that encodes images and metadata within the same DNA. The research harnesses machine learning to reconstruct stored images without conventional error-correction redundancy, demonstrating DNA's potential as a rewritable memory with minimal image degradation.

Introduction
DNA-based storage offers extreme density, durability, and nonvolatility, but it faces practical challenges: high synthesis cost, the lack of simple rewriting mechanisms, long write–read latencies, and missing-oligo errors from solid-phase synthesis. Image data are typically compressed before storage, and even a single bit error can propagate catastrophically during decompression, making robust error correction necessary. However, synthesis and sequencing error rates vary widely across platforms and can increase with PCR and rewriting, and powerful channel codes such as low-density parity-check (LDPC) codes require accurate channel error estimates to avoid the mismatched-decoder problem. This variability complicates robust design without resorting to worst-case redundancy.

The authors introduce 2DDNA, a hybrid two-dimensional storage system that records high-volume image content in the DNA sequence and lower-volume metadata via backbone nicks. Sequence-based information is dense but not easily rewritable; nick-based metadata are easily erased and rewritten (preserving privacy) and can be read simultaneously with the sequence content. The approach combines simple per-channel image compression with machine learning and computer vision (ML/CV) reconstruction to avoid global error-correction redundancy. For images with highly granular details, optional unequal error protection using LDPC codes targets only sensitive facial features. The system aims to eliminate worst-case redundancy, avoid parameter-mismatch issues, support multiresolution-like quality that adapts to channel error rates, and enable efficient metadata rewriting.
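To see why mismatched channel estimates matter, consider a toy calculation (ours, not the paper's): a belief-propagation LDPC decoder turns each received bit into a log-likelihood ratio (LLR) computed from an assumed error rate, so an optimistic estimate inflates the decoder's confidence in every corrupted bit.

```python
import math

def bsc_llr(p):
    """LLR magnitude for a binary symmetric channel with crossover probability p."""
    return math.log((1 - p) / p)

p_true, p_assumed = 0.019, 0.005   # ~1.9% actual bit errors vs. an optimistic estimate
print(f"matched LLR:    {bsc_llr(p_true):.2f}")     # ~3.94, appropriately cautious
print(f"mismatched LLR: {bsc_llr(p_assumed):.2f}")  # ~5.29, overconfident decoder input
```

Systematically inflated LLRs like these can push belief propagation toward confident but wrong codewords, the failure mode that the JPEG+LDPC simulations in the Methodology make concrete.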
Literature Review
Prior DNA storage work demonstrated high-density information encoding in synthetic DNA and explored error-correcting codes, silica encapsulation, and random-access architectures. JPEG-compressed images are highly sensitive to errors, often requiring significant redundancy to prevent catastrophic decompression failures. Joint source–channel coding approaches (including LDPC-based schemes) mitigate errors but still require redundancy and rely on accurate channel parameter estimates; LDPC decoders are sensitive to mismatched log-likelihood ratios. Earlier rewritable systems used overlap-extension PCR or hybridization/strand displacement, both more complex than ligation-based rewriting. Prior nick-based storage (DNA Punch Cards) recorded information via enzymatic nicking on native DNA. The present work generalizes these ideas to two dimensions (sequence and topology), replacing worst-case error-correcting code (ECC) redundancy with ML-based image post-processing and introducing nick-based metadata superimposed on sequence content for efficient, permanent erasure and rewriting.
Methodology
Sequence dimension encoding: Images are split into R, G, and B channels and aggressively quantized from 256 to 8 intensity levels (3 bits per pixel). To preserve spatial locality, each channel is traversed along a Hilbert space-filling curve to produce a 1D sequence; differential encoding then replaces each symbol with its difference from the previous one, biasing the distribution toward small values, and Huffman coding exploits the resulting symbol frequencies. These operations are performed separately for each of the eight intensity levels. Binary outputs are mapped to 196-nt DNA oligos, subdivided as a 20-nt prefix primer, a 10-nt address (preceded by a 3-nt RGB indicator), eleven 13-nt information-bearing segments, and a 20-nt suffix primer. Constrained mappings enforce GC content between 40% and 60% and limit G-runs to at most three; the mapping converts 16 address bits to 10 nt and 22 information bits to 13 nt, with synchronizing markers and addressing enabling PCR-based random access. An optional variant uses basic 3-bit quantization without lossless compression, at the cost of larger file sizes.
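To make the serialization stage concrete, here is a minimal sketch of the quantize → Hilbert-scan → differential-encode steps described above (the Huffman stage is omitted). The d2xy routine is the standard Hilbert index-to-coordinate conversion; the function names and the toy 8×8 channel are illustrative, not taken from the paper.

```python
import numpy as np

def d2xy(n, d):
    """Standard Hilbert-curve conversion: index d -> (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the quadrant when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def encode_channel(channel):
    """Quantize an 8-bit channel to 8 levels, scan it along the Hilbert curve,
    then differentially encode so that adjacent (usually similar) pixels yield
    small values the Huffman coder can compress well."""
    n = channel.shape[0]                  # assume a square, power-of-2 side
    quantized = channel // 32             # 256 levels -> 8 levels (3 bits)
    scan = [quantized[y][x] for x, y in (d2xy(n, d) for d in range(n * n))]
    diffs = [int(scan[0])] + [int(scan[i]) - int(scan[i - 1]) for i in range(1, len(scan))]
    return diffs                          # input to the Huffman coder

channel = np.arange(64, dtype=np.uint8).reshape(8, 8) * 4   # toy 8x8 channel
print(encode_channel(channel)[:10])
```

The Hilbert scan matters because it keeps spatially adjacent pixels adjacent in the 1D sequence, so the differences stay small and the subsequent entropy coding is effective.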
Topological dimension encoding: Metadata (e.g., ASCII letters) are encoded through the ON–OFF inclusion of native nicking endonucleases, with each enzyme representing one bit of a 7-bit ASCII code: a '1' corresponds to the enzyme being present (ON), a '0' to its absence (OFF). Enzymes are chosen for high site-specificity (preventing nonspecific cleavage) and for mutually large Hamming distances between recognition sequences (reducing cross-nicking). This superimposes metadata on the same synthetic DNA without altering its sequence content.

Rewriting: Nicks are sealed with T4 DNA ligase to erase the metadata; new metadata can then be written by re-nicking.

DNA synthesis and sequencing: Eight Marlon Brando still images were encoded. After compression and mapping, the total sequence length was 2,317,896 nt across 11,826 oligos (eight pools, one per intensity level), synthesized by IDT. Sequencing used Illumina MiSeq with standard protocols. Consensus sequences were formed by aligning reads with error-free addresses; most sequences were recovered perfectly, with some oligos missing or erroneous.

Sequence dimension decoding and ML post-processing: Initial reconstruction from the decoded oligos exhibits discoloration blocks caused by errors and missing oligos. A three-step ML/CV pipeline addresses this: (1) automatic discoloration detection exploits the redundancy of storing R, G, and B separately; pairwise channel differences (R–G, R–B, G–B) flag likely corrupted pixels via their low-frequency occurrences and generate masks; (2) deep-learning inpainting (Gated Convolution and EdgeConnect) fills the masked regions using learned priors; (3) bilateral and adaptive median filtering, with additional enhancement, reduce blocking artifacts and blend the inpainted regions.

Optional unequal error protection (UEP): For images with granular facial details, selected oligos containing key facial features receive protection from regular systematic LDPC codes of rate 0.75, adding 391 oligos (≈3.3% redundancy) to improve reconstruction of those areas.

JPEG+LDPC comparison: Simulations compared 2DDNA with joint source–channel coding using JPEG compression plus LDPC decoding under a 0.8% substitution error rate and a 0.7% missing-oligo rate (≈1.9% bit error rate). Mismatched LDPC channel parameters degraded recovery and triggered JPEG decoding failures, highlighting the sensitivity of that approach to parameter mismatch.

Topological readout and decoding: Metadata are read by sequencing the nicked pools after denaturation and conversion to dsDNA. A prefix–suffix pattern search counts fragment pairs bounded by enzyme recognition sites to infer, by thresholding the counts, which enzymes were present; a minimal detection sketch appears at the end of this section. For simultaneous two-dimensional readout, two subpools per intensity level are used: one ligated (nick-free) to reconstruct the sequence content and serve as an alignment reference, and one nicked for the topological readout.

Methods details: PCR amplification conditions, library preparation, nicking protocols with enzyme lists, and sequencing workflows are provided; primer design satisfies Hamming-distance, correlation, balance, and primer-dimer constraints. Random access is performed by isolating subpools, ligating to enable PCR, and then sequencing; alternatively, random access can use magnetic beads carrying primers that correspond to addresses, avoiding the ligation step.
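As a rough illustration of the pattern-search readout, the sketch below infers which nicking enzymes were applied by counting sequenced fragments whose boundaries coincide with each enzyme's cut sites in a reference sequence. The enzyme names, recognition sites, count threshold, and the simplification that a nick falls at the start of the recognition site are all placeholders, not the paper's actual enzyme set or detection rule.

```python
# Placeholder recognition sites; the paper's Methods list the actual enzymes.
ENZYME_SITES = {
    "Enz0": "GAGTC", "Enz1": "GCAATG", "Enz2": "CCTCAGC",
    "Enz3": "GGATC", "Enz4": "CACGAG", "Enz5": "GTCTC", "Enz6": "CGTCTC",
}

def cut_positions(reference, site):
    """Positions in the reference where this enzyme's recognition site begins."""
    return {i for i in range(len(reference)) if reference.startswith(site, i)}

def decode_character(fragments, reference, threshold=10):
    """Infer one 7-bit ASCII character from sequenced fragment boundaries.

    A nick splits the strand, so fragments of a nicked pool start or end at an
    applied enzyme's cut sites; counting such fragments and thresholding the
    count decides each ON ('1') / OFF ('0') bit."""
    bits = []
    for enzyme in sorted(ENZYME_SITES):          # fixed bit order: Enz0..Enz6
        cuts = cut_positions(reference, ENZYME_SITES[enzyme])
        count = 0
        for frag in fragments:
            start = reference.find(frag)
            if start >= 0 and (start in cuts or start + len(frag) in cuts):
                count += 1                       # boundary coincides with a cut site
        bits.append("1" if count >= threshold else "0")
    return chr(int("".join(bits), 2))
```

With seven enzymes, each pool encodes one character, so the eight pools together spell an eight-letter word such as "ILLINOIS".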
Key Findings
- Experimental demonstration on eight images (8,654,400 original bits) yielded 2,317,896 nt of encoded sequence across 11,826 oligos (eight intensity-level pools). Illumina MiSeq sequencing and consensus assembly recovered 11,726 sequences perfectly and 22 with minor errors that did not significantly affect image quality; 78 oligos were highly corrupted or missing.
- Without error-correction redundancy, initial reconstructions showed discoloration artifacts. The ML/CV pipeline (automatic discoloration detection, deep inpainting, smoothing/enhancement) substantially improved visual quality, yielding high-quality replicas with undetectable or small visual degradation.
- Optional unequal error protection targeting facial features with rate-0.75 LDPC codes added 391 oligos (≈3.3% overhead) and improved reconstruction of granular facial details compared with no redundancy.
- Comparison with JPEG+LDPC joint source–channel coding showed that LDPC decoding is highly sensitive to mismatched channel parameters; under realistic error rates (≈0.8% substitutions, 0.7% missing oligos), incorrect parameter assumptions led to decoding failures and JPEG decompression errors (e.g., OSError), whereas the 2DDNA ML approach avoided both global redundancy and catastrophic failures.
- Topological metadata storage: the word "ILLINOIS" (56 ASCII bits) was written across the eight pools using seven nicking enzymes (one per ASCII bit), fully erased via T4 DNA ligase (no residual nicks detected), and rewritten as "GRAINGER" with error-free reconstruction. Enzyme count tables showed clear separation between used and unused enzymes, and the second enzyme set eliminated spurious nicking.
- Sequencing coverage averaged 112×, notably lower than in some prior work (e.g., 3,000×), yet sufficient for high-quality reconstruction with the ML pipeline.
- Information density: 3.73 bits/nt referenced to the original raw image bits (reflecting acceptable distortion after reconstruction) and 1.40 bits/nt relative to the quantized image bits; the arithmetic is worked out after this list. Primer/address overhead for random access was 27% (53 nt per 196-nt oligo). The reported storage densities correspond to approximately 0.91 ZB/gram (raw basis) and 0.34 ZB/gram (quantized basis).
- Robustness: simulations and supplemental analyses indicate the approach tolerates low coverage and error rates up to ~7% while maintaining reconstructability via ML post-processing.
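The density figures follow directly from the reported totals; a quick consistency check (our arithmetic, assuming 8-bit pixels quantized to 3 bits and the 53-nt primer/address overhead stated above):

```python
raw_bits  = 8_654_400            # original bits across the eight images
total_nt  = 2_317_896            # synthesized nucleotides
oligos    = 11_826
assert total_nt == oligos * 196  # every oligo is 196 nt

quantized_bits = raw_bits * 3 // 8   # 8-bit pixels quantized to 3 bits

print(raw_bits / total_nt)        # ~3.73 bits/nt, raw-image basis
print(quantized_bits / total_nt)  # ~1.40 bits/nt, quantized basis
print(53 / 196)                   # ~0.27 primer/address overhead per oligo
```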
Discussion
The 2DDNA system leverages two orthogonal molecular dimensions—sequence and backbone topology—to encode high-volume image content and rewritable metadata within the same DNA substrate. By combining tailored compression and per-channel encoding with ML/CV post-processing, it circumvents the need for worst-case error-control redundancy and avoids LDPC parameter mismatch issues that complicate traditional DNA storage pipelines. Using RGB channel separation as inherent redundancy enables automated discoloration detection and targeted inpainting, yielding high-quality reconstructions even with missing or erroneous oligos. The topological layer enables permanent, privacy-preserving metadata erasure and rewriting via a simple ligation step, contrasting with more complex prior rewriting methods. Simultaneous two-dimensional readout is achieved using nick-free reference sequencing for alignment and a dedicated nicked subpool for metadata detection. The demonstrated coverage requirements are modest, suggesting practical viability. Overall, 2DDNA reduces synthesis overhead, supports random access, enhances robustness to varying channel qualities, and lays groundwork for multidimensional molecular storage and in-memory computing applications.
Conclusion
This work introduces 2DDNA, a multidimensional DNA data storage platform that encodes image content in sequence and metadata in backbone nicks, enabling simultaneous readout and efficient, permanent erasure/rewriting via ligation. By replacing global error-correction redundancy with ML/CV-based reconstruction (discoloration detection, inpainting, smoothing/enhancement), the system achieves high-quality image recovery at modest sequencing coverage and demonstrates robustness to channel variability. Optional unequal error protection selectively improves critical features with minimal overhead. The platform attains high effective information density and demonstrates practical metadata rewriting on synthetic DNA. These results establish foundations for storing heterogeneous datasets with rewrite capabilities and for nontraditional applications such as parallel in-memory computing. Future work may further optimize enzyme sets for nicking specificity, expand to additional molecular dimensions (e.g., concentration), and scale to larger, more diverse datasets and media.
Limitations
- Readout challenges arise from nick-induced fragmentation, which causes assembly ambiguities; decoding requires algorithmic prefix–suffix pattern searches and, for joint readout, sequencing of a ligated reference subpool.
- Nicks prevent direct PCR amplification; random access requires ligation prior to PCR or alternative bead-based methods.
- Reconstructions of images with highly granular details may remain blurred without the optional unequal error protection, so a small amount of targeted redundancy is needed for best fidelity.
- LDPC-based approaches are sensitive to mismatched channel parameters; while 2DDNA avoids global LDPC redundancy, scenarios demanding stronger guarantees might still require carefully parameterized ECC.
- Primer/address sequences introduce ≈27% overhead per 196-nt oligo to support random access, reducing net bits/nt compared with very long gBlocks.
- Enzyme selection is critical: insufficient sequence specificity or small Hamming distances between recognition sites can cause cross-nicking. The study improved results by refining enzyme choices between writing rounds.
- Performance and robustness at very large scales, across diverse data types, and under extreme error/coverage regimes require validation beyond the eight-image test set.