How DNA data storage works

DNA data storage is a big deal. Partly, it's because we're based on DNA, and any research into manipulation of that molecule will pay dividends for medicine and biology in general -- but in part, it's also because the world's wealthiest and most powerful corporations are growing discouraged by cost estimates for data storage in the future. Facebook, Apple, Google, the US government, and more are all making astounding investments in storage ("exabyte" is the buzzword now). But even these mega-projects can only put off the inevitable for so long; we are simply producing too much data for magnetic storage to keep up, without a major unforeseen shift in the technology.

That's why a company like Microsoft recently decided to invest in the prospect of storing information with a totally different sort of tech: biotech. It might seem off-brand for the software giant, but teaming up with academics to take on molecular biology has produced stunning results: The team was able to store and perfectly recall digital data with incredible storage density. According to an accompanying blog post, they managed to pack about 200 megabytes of data into just a fraction of a drop of liquid, including a compressed music video from the band OK Go. Even more impressive, that data was stored in a quickly and easily accessible form, making it more akin to computer RAM than computer storage.

So how did they accomplish this incredible feat?

First, they had to convert the digital code of 1's and 0's to a genetic code of A's, C's, T's, and G's, then take this lowly text file and manually construct the molecule it represents. Each of these is a feat in and of itself. DNA storage requires cutting-edge techniques in data compression and security to design a sequence both info-dense enough to realize DNA's potential and redundant enough to allow robust error-checking to improve the accuracy of information retrieved down the line.
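To make the first step concrete, here is a minimal sketch of turning bytes into bases and back. The two-bits-per-base mapping and the function names `encode` and `decode` are assumptions chosen for illustration, not the scheme the Microsoft team used; real encodings add redundancy and avoid sequences that are hard to synthesize or read.

```python
# Hypothetical mapping: two bits per nucleotide. Real schemes differ.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of A/C/G/T, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Reverse of encode: every four bases become one byte again."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"OK")
print(strand)          # CATTCAGT
print(decode(strand))  # b'OK'
```

With this toy mapping every byte costs four bases and a single misread base corrupts the data, which is why practical designs trade some of that density for error-checking.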

Very little of the technology on display here is new, since the most important parts of the system have existed much longer than mankind itself. But if all the data necessary to code for Albert Einstein was contained within the nucleus of every single cell of Albert Einstein's body, as it was, then this classical approach to data storage must have something going for it. Researchers in this field set out to understand and harness that something, and they're getting better at it seemingly every couple of months.

At the end of the day, DNA's key special attribute is data storage density: how much information can DNA fit into a given unit volume? The NSA's largest, most notorious data center is an enormous, sprawling complex full of networked racks of magnetic storage drives -- but according to some estimates, DNA could take the volume of data contained in about a hundred industrial data centers and store it in a space roughly the size of a shoe box.

DNA achieves this in two ways. One, the coding units are very small, less than half a nanometer to a side, where the transistors of a modern, advanced computer storage drive struggle to beat the 10 nanometer mark. But the increase in storage capacity isn't just ten- or a hundred-fold; it's thousands-fold. That differential arises from the second big advantage of DNA: it has no problem packing three-dimensionally.
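A back-of-the-envelope calculation shows where "thousands-fold" can come from. This sketch assumes each storage unit is an idealized cube at the sizes quoted above and ignores all packaging, wiring and chemistry overhead, so treat the result as a rough scaling argument rather than a measured figure.

```python
# Feature sizes quoted in the text, in nanometers (idealized as cube edges).
dna_unit_nm = 0.5
transistor_nm = 10.0

# How many DNA units fit along one transistor-sized edge.
linear_ratio = transistor_nm / dna_unit_nm

# Packing on a flat plane scales with the square of that ratio...
area_ratio = linear_ratio ** 2

# ...while packing in three dimensions scales with the cube.
volume_ratio = linear_ratio ** 3

print(linear_ratio, area_ratio, volume_ratio)  # 20.0 400.0 8000.0
```

Under these assumptions, small size alone gives a 20-fold gain along one edge; the jump to 8,000-fold comes only once the third dimension is used.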

[Figure: sequencing has gotten much faster and cheaper over time, which is good, because we need to sequence DNA data to read it.]

See, transistors are generally aligned on a flat plane, meaning their ability to fully use a given space is pretty low. We can of course stack many such flat boards one atop another, but at that point a new and totally debilitating problem arises: heat. Managing it is one of the most challenging parts of designing new transistor-based technologies, whether they're processors or storage devices. The more tightly you pack silicon transistors, the more heat you'll create, and the harder it will be to ferry that heat away from the device. This both limits the maximum density and requires that we supplement the cost of the drives themselves with expensive cooling systems.

With its super-efficient packing structure, the DNA double helix offers a great solution. Chromatin, the DNA-protein system that makes up chromosomes, is essentially a very complex mechanism designed to allow an inherently sticky molecule like DNA to roll up really tight, yet still unroll quickly and easily later on, when certain patches of DNA are needed by the body.

[Figure: a simplified look at how DNA packs so tightly into three-dimensional space.]

This at-hand nature of the chromatin system, which allows any gene to be "called" from any part of the genome with roughly equal efficiency, has led the researchers to dub their storage system a DNA version of a computer's random access memory, or RAM. Like RAM, the physical location of a piece of data within the drive isn't important to the computer's ability to access that information.

However, storing information in DNA differs from computer RAM in some pretty significant ways. Most notable is speed; part of what makes RAM RAM is that its easy-access system is also a quick-access system, allowing it to hold data the computer might need at an instant's notice, and make it available on those timescales. DNA, on the other hand, is significantly harder and slower to read than conventional computer transistors, meaning that in terms of access speed it's actually less RAM-like than your average computer SSD or spinning magnetic hard drive.

That's because the incredible abilities of evolution's data storage solution were tailored to evolution's unique needs, and those needs don't necessarily include performing thousands of "reads" per second. Regular, cellular DNA data storage has to untangle the complex chromatin structure of stable DNA, then unwind the DNA double helix itself, make a copy of the sequence of interest, then zip everything right back up the way it was -- it takes a while.

For our purposes, we must then add the extra step of reading the DNA. In this case, that's achieved by using an age-old technique in biotech labs called the polymerase chain reaction (PCR) to amplify, or repeatedly duplicate, the sequence we want to read. The stretches of DNA we care about are marked with little target sequences that allow the PCR proteins to bind and the replication process to begin. The whole sample is then sequenced, and everything but the many-many-many-times repeated sequence we amplified is discarded. What remains is our sequence of interest.
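The amplify-then-discard idea can be sketched as a toy simulation. Everything here is an illustrative assumption: the four-base tags, the three-strand `pool`, and the helper names `amplify` and `read_back` are made up, and real PCR uses primer pairs and chemistry that a list of strings cannot capture.

```python
from collections import Counter

# Hypothetical pool: each strand is a 4-base target tag followed by a payload.
TAG_LEN = 4
pool = [
    "ACGT" + "CATTCAGT",  # file 1
    "TTGA" + "GGCATCAA",  # file 2
    "CCAG" + "ATATCGCG",  # file 3
]


def amplify(strands, tag, cycles=10):
    """Toy PCR: each cycle doubles every strand that starts with the tag."""
    out = list(strands)
    for _ in range(cycles):
        out += [s for s in out if s.startswith(tag)]
    return out


def read_back(strands, tag, cycles=10):
    """Amplify one tag, 'sequence' everything, keep the dominant strand."""
    sample = amplify(strands, tag, cycles)
    counts = Counter(sample)  # stand-in for sequencing the whole sample
    strand, _ = counts.most_common(1)[0]
    return strand[TAG_LEN:]  # strip the tag to recover the payload


print(read_back(pool, "TTGA"))  # GGCATCAA
```

After ten doublings the tagged strand has 1,024 copies against a single copy of each of the others, so picking the most common sequence recovers it regardless of where it sat in the pool.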

In cells, genes are turned "on" and "off" largely by changing the availability of these target sequences to the always-waiting machinery of transcription. This can be done via the winding and unwinding of chromatin, the direct addition or removal of a blocker protein, or even interaction with other areas of the genome to promote or preclude transcription. In a man-made data storage system, we could theoretically make something better suited to our needs, stronger or more efficient or less wasteful on forms of security we don't need for this purpose, but that would require a level of sophistication in protein engineering that still seems a ways out.

