Every day, we create 2.5 quintillion bytes of data. This data includes everything from the documents we save on our personal devices, to customer databases, to the messages we send each other on social media. In fact, by some estimates, we create data so quickly that 90 percent of the world’s data was created within the last two years alone.
Even if your business is responsible for only an infinitesimal fraction of the data created every day, chances are you’re still creating a substantial amount of new data every day, and that data needs to be backed up and stored. We all know about the importance of backups; without an up-to-date copy of your data stored in a safe place, a disaster could prove devastating to your company. The problem? Every time you back up your data, you create a new copy, so to speak. And when you back up again and again, you could wind up with multiple copies of every single byte of information, copies that take up a great deal of space on your storage platform.
Consider this: An employee sends an email with a 2 MB attachment. That email is then included in the nightly backup, every night for the next 90 days, according to company policy. Each time the email is copied to backup, the attachment is included, and eventually that single email takes up 180 MB of space on the backup server. Now, if this were the only file being backed up each night, it wouldn’t be a big deal. But imagine that the employee sent a variation of that email to 50 people, and so did her entire team. Suddenly, the data burden becomes unwieldy.
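The arithmetic behind that example can be sketched in a few lines of Python. The 2 MB attachment and the 90-night retention come from the example above; the idea that each of the 50 recipients’ mailboxes holds its own full copy is an assumption made for illustration, and the function name is hypothetical.

```python
# Figures from the example above: a 2 MB attachment kept for 90 nightly backups.
ATTACHMENT_MB = 2
RETENTION_NIGHTS = 90


def backup_footprint_mb(copies_per_night: int) -> int:
    """Space used when every nightly backup stores every copy in full."""
    return ATTACHMENT_MB * copies_per_night * RETENTION_NIGHTS


single_email = backup_footprint_mb(1)        # one mailbox holds the attachment
fifty_recipients = backup_footprint_mb(50)   # assumption: one full copy per recipient

print(single_email, "MB")
print(fifty_recipients, "MB")
```

Under that assumption, one sender’s message to 50 people already occupies 9,000 MB of backup space, before counting the rest of the team.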
There is a solution, though. Data deduplication has been around for several decades, but it is only now starting to gain a larger following thanks to the explosion of data creation. In fact, some data centers offer deduplication as a standard service to customers who want to maximize their storage capacity.
How Deduplication Works
One of the biggest misconceptions about data deduplication is the idea that when duplicate data is found, it’s discarded, leaving only the original data. That’s not exactly how deduplication works.
The easiest way to imagine the deduplication process is to think of reams of data being segmented into chunks. Each chunk, which is often around 8 KB or larger, is run through a hash function to produce a short fingerprint, known as a hash. That hash is compared against the hashes of the chunks already on the server. If the hash is new, the chunk is unique and is backed up at its full size. If there is a match, the first copy of the chunk is kept, and the duplicate is replaced with a small reference that takes up a fraction of the space. That reference directs users to the original should they try to access the files from a backup.
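To make the chunk-and-reference idea concrete, here is a minimal sketch in Python. It is a toy, not how any particular product works: the fixed 8 KB chunk size, the choice of SHA-256, and the `DedupStore` class with its method names are all assumptions made for illustration. It also trusts the hash completely and ignores collisions.

```python
import hashlib
import os

CHUNK_SIZE = 8 * 1024  # fixed 8 KB chunks; an illustrative choice


class DedupStore:
    """Toy in-memory store that keeps one copy of each unique chunk."""

    def __init__(self) -> None:
        self.chunks = {}  # hash (hex string) -> chunk bytes

    def write(self, data: bytes) -> list:
        """Store data and return its 'recipe': the ordered list of chunk hashes."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Only a chunk with a previously unseen hash is actually stored.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)  # the reference back to the stored copy
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild the original data by following the references."""
        return b"".join(self.chunks[fingerprint] for fingerprint in recipe)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


# Back up the same 2 MB attachment 90 times.
store = DedupStore()
attachment = os.urandom(2 * 1024 * 1024)
recipes = [store.write(attachment) for _ in range(90)]

logical_bytes = 90 * len(attachment)
print("logical size:", logical_bytes, "bytes")
print("stored size: ", store.stored_bytes(), "bytes")
```

Every backup can still be restored in full with `store.read(recipes[i])`, yet the store holds roughly one copy of the attachment plus the lists of hashes.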
The entire deduplication process takes place behind the scenes, so the user never sees the hashing process or realizes that they have been redirected to an original file. No information is ever lost, so to speak; redundant copies are simply stored as references to maximize storage space. In fact, with highly redundant backup sets, deduplication can make it possible for a company to store up to 30 TB of data on a 1 TB disk, saving a significant amount of money.
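Storing 30 TB on a 1 TB disk corresponds to a deduplication ratio of 30:1. As a rough sketch, the fraction of raw space saved at a ratio of r:1 is 1 - 1/r; the helper name below is hypothetical, and the ratios in the loop are arbitrary sample values.

```python
def space_saved(ratio: float) -> float:
    """Fraction of raw capacity saved at a deduplication ratio of ratio:1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1 - 1 / ratio


for ratio in (2, 5, 10, 20, 30):
    print(f"{ratio:>2}:1 saves {space_saved(ratio):.1%} of the raw space")
```

The savings climb quickly at low ratios and then flatten out, so each further increase in the ratio frees less additional space than the one before it.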
A Few Words of Warning
The benefits of deduplication are clear: who doesn’t want to save money by eliminating waste? However, before you demand deduplication of your data, there are some important points to consider.
- You cannot deduplicate 100 percent. Because deduplication shrinks data rather than deleting it, the data will always require some space. In addition, higher ratios bring diminishing returns, contrary to what you might think: a 10:1 ratio already saves 90 percent of the space, and doubling it to 20:1 saves only 5 percent more. In short, you may need to adjust your settings to find the “sweet spot” to maximize space.
- “Hash collisions” are possible. A hash collision is the exceedingly rare case in which two different pieces of data produce the same hash, so one could be mistaken for a duplicate of the other. It’s rare, but possible, and you need to be prepared and take precautions against such collisions.
- File formats that are already compressed do not deduplicate well. Formats such as JPEG and MP3 are already compressed, and are difficult to compress further.
- Deduplication can slow performance. Running the hashing and comparison required to deduplicate data uses a lot of CPU power, so it’s best if this function is carried out on a dedicated machine.
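One common precaution against hash collisions is to compare the actual bytes whenever two hashes match, and to keep both chunks if they differ. The sketch below shows that idea in Python. The `store_chunk` function, the index layout, and the deliberately weak `length_only` hash (used only to force a collision for the demo) are all hypothetical, and the byte comparison adds to the CPU and I/O cost mentioned above.

```python
import hashlib


def sha256_hex(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def store_chunk(index: dict, chunk: bytes, fingerprint_of=sha256_hex) -> tuple:
    """Store chunk and return (hash, slot), verifying bytes whenever a hash matches."""
    fingerprint = fingerprint_of(chunk)
    candidates = index.setdefault(fingerprint, [])
    for slot, existing in enumerate(candidates):
        if existing == chunk:  # byte-for-byte match: a true duplicate
            return fingerprint, slot
    # New hash, or same hash with different bytes (a collision): keep this chunk too.
    candidates.append(chunk)
    return fingerprint, len(candidates) - 1


def length_only(chunk: bytes) -> str:
    """Deliberately weak 'hash' for the demo: equal-length chunks collide."""
    return str(len(chunk))


index = {}
first = store_chunk(index, b"aaaa", length_only)
second = store_chunk(index, b"bbbb", length_only)  # same hash, different data
repeat = store_chunk(index, b"aaaa", length_only)  # genuine duplicate

print(first, second, repeat)
```

The colliding chunk lands in its own slot instead of being silently replaced by a reference to the wrong data, while the genuine duplicate reuses the first slot.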
Still, despite the drawbacks, deduplication is an effective way to maximize storage space while meeting your backup needs. When it’s done well, you won’t notice a difference in performance or access, only improvements in the space you have available.