Microsoft is carrying out experiments with synthetic DNA for digital data storage and has recently agreed to purchase ten million strands of DNA from genetics startup Twist Bioscience.
Microsoft is known to be working on DNA storage in collaboration with the University of Washington. The joint research group recently presented a paper describing an entire architecture for a DNA-based archival storage system, as depicted in the following picture.
A DNA storage system consists of a DNA synthesizer that encodes the data to be stored in DNA, a storage container with compartments that store pools of DNA that map to a volume, and a DNA sequencer that reads DNA sequences and converts them back into digital data.
One interesting problem that DNA storage has to solve is addressing. The basic unit of DNA storage is a DNA strand, made of roughly 100–200 nucleotides and capable of storing 50–100 bits of information. This means that a typical data object is mapped to a large number of DNA strands. The researchers use a key-value architecture, whereby the key is first mapped to the pool containing the required strands, and a random access mechanism then retrieves those strands from within the pool.
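To get a feel for these numbers, the following sketch estimates how many strands an object needs. It assumes every strand carries a fixed payload in the 50–100 bit range quoted above and ignores any per-strand overhead for addressing or redundancy. The function name `strands_needed` and the 1 MB object are illustrative and not taken from the paper.

```python
def strands_needed(object_bytes: int, payload_bits_per_strand: int) -> int:
    """Estimate the number of strands required to hold an object.

    Assumes the whole payload is usable for data, which is optimistic:
    a real system also spends bits on addressing and redundancy.
    """
    if object_bytes < 0:
        raise ValueError("object_bytes must not be negative")
    if payload_bits_per_strand <= 0:
        raise ValueError("payload_bits_per_strand must be positive")
    total_bits = object_bytes * 8
    # Integer ceiling division: a partially filled strand still counts.
    return (total_bits + payload_bits_per_strand - 1) // payload_bits_per_strand


one_megabyte = 1_000_000  # hypothetical object size
for bits in (50, 100):
    print(f"{bits} bits per strand: {strands_needed(one_megabyte, bits)} strands")
```

Under these assumptions, even a 1 MB object is spread across 80,000 to 160,000 strands, which is why a scheme for locating the right strands is needed.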
Another interesting aspect is data representation. Since DNA is a sequence of four bases (A, C, G, T), the most direct approach is to represent data in base 4, e.g., 01110001 maps to 1301 in base 4, and to the DNA sequence CTAC. The researchers instead chose a base-3 representation: each base-3 digit selects one of the three nucleotides that differ from the previous one, so the same nucleotide never appears twice in a row, which makes sequencing less error-prone. For example, the binary string 01100001 is first translated by a Huffman code into the base-3 string 01112, which is then mapped to the DNA sequence CTCTG.
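Both mappings can be sketched in a few lines of Python. The base-4 mapping (A=0, C=1, G=2, T=3) follows the CTAC example above. For the base-3 variant, the sketch assumes a rotating code in which each ternary digit picks one of the three nucleotides different from the previous one. The rotation table and the starting nucleotide A are assumptions chosen so that 01112 yields CTCTG as in the example above, and the function names are illustrative.

```python
BASES = "ACGT"  # index of each base is its base-4 digit: A=0, C=1, G=2, T=3


def binary_to_dna_base4(bits: str) -> str:
    """Map each pair of bits to one nucleotide (00->A, 01->C, 10->G, 11->T)."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


def ternary_to_dna_rotating(trits: str, previous: str = "A") -> str:
    """Map base-3 digits to nucleotides without repeating the previous one.

    Digit d selects the nucleotide (d + 1) positions after the previous
    nucleotide in the cyclic order A, C, G, T. This table and the default
    starting nucleotide are assumptions made for this sketch.
    """
    out = []
    for t in trits:
        d = int(t)
        if d not in (0, 1, 2):
            raise ValueError(f"not a base-3 digit: {t!r}")
        # Stepping 1 to 3 positions around a 4-cycle never lands on the same base.
        previous = BASES[(BASES.index(previous) + d + 1) % 4]
        out.append(previous)
    return "".join(out)


print(binary_to_dna_base4("01110001"))   # CTAC
print(ternary_to_dna_rotating("01112"))  # CTCTG
```

The sketch takes the base-3 string as given and does not cover how the binary data is turned into base-3 digits.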
More information about how DNA storage works, including encodings to improve reliability and a few experiments that were carried out, can be found in the paper mentioned above.
According to Twist Bioscience, DNA-based archival technology has two key advantages over traditional digital storage: a much longer lifespan, with recent data showing that DNA data storage could last up to 2,000 years, and a higher data density, which can reach one trillion GB in a single gram of DNA.
According to Microsoft and University of Washington researchers, DNA storage is not to be considered an alternative to flash memories or hard drives:
“We envision DNA storage as the very last level of a deep storage hierarchy, providing very dense and durable archival storage with access times of many hours to days.”
The key idea is that DNA synthesis and sequencing may be arbitrarily parallelized, thus making it possible to reach the required read and write bandwidth.
Doug Carmean, the DNA storage project lead for Microsoft, clarified that their initial tests using Twist DNA “demonstrated that we could encode and recover 100 percent of the digital data”, yet there is still a lot of work to be done before a commercially viable product is available.