Maximum Data Reduction (PART ONE): Beyond Deduplication
By Dipesh Patel, Senior Product Marketing Manager, CommVault
Deduplication is not Data Reduction. Or rather, with all of the buzz around dedupe, it's important to note that deduplication is not the only means of reducing the amount of data being backed up. Data Reduction is a broader category that includes deduplication among other technologies and approaches. This is something I've been expounding upon during recent Innovate 8 events.
It seems to me that, in pushing the next "big" thing, many folks out there are happy to pin their hopes and dreams solely on deduplication. But they would be doing themselves a disservice. Deduplication is a great approach, but on its own it doesn't solve the underlying problems in processes, people, or platforms. If you don't fix the root causes, no amount of deduplication will prevent you from facing the same issues again in one, two, or (if you're lucky) three years.
For one thing, in many cases it doesn't solve the issues on the front end, on primary storage. Most folks are implementing deduplication for backup and archive, where they get the greatest data reduction across multiple backup/archive cycles. However, if the breakdown in your backup process actually originates up front, with the size of your primary datasets and/or the length of your backup windows, then deduplication may not be the total answer. So even though the backed-up data will occupy less space, each uncompressed full backup job (generally required if you're using a device-based target dedupe approach) still has to churn through larger and larger amounts of production data.
Even if you use dedupe on the primary/production storage tier (NetApp comes to mind as one of the most successful production-tier dedupe vendors), the data still gets rehydrated when it comes time to back it up. So if you have 10TB of file data that dedupes down to 5TB on your Tier 1 storage, when it comes time to back up that data (using CommVault, EMC Legato, etc.), you still end up processing the full 10TB. So deduplication of production data helps reduce the costs of your most expensive tier of storage but doesn't shrink your backup windows or the network/processing load.
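To make that arithmetic concrete, here's a minimal back-of-the-envelope sketch in Python. The inputs (10TB of logical data, a 2:1 primary-tier dedupe ratio, and a 400 MB/s backup throughput) are illustrative assumptions, not measured figures.

# Back-of-the-envelope sketch: why primary-tier dedupe doesn't shrink the
# backup window. All figures below are illustrative assumptions.

TB = 1024**4  # bytes in a terabyte (binary)

logical_data   = 10 * TB        # 10TB of file data on Tier 1 storage
primary_dedupe = 2.0            # assumed 2:1 dedupe on the production tier
throughput     = 400 * 1024**2  # assumed backup throughput: 400 MB/s

# Space consumed on the expensive production tier does shrink...
stored_on_tier1 = logical_data / primary_dedupe

# ...but a full backup reads the rehydrated, logical data set, so the
# backup window and network load are driven by the full 10TB.
full_backup_bytes = logical_data
backup_window_hrs = full_backup_bytes / throughput / 3600

print(f"Stored on Tier 1:     {stored_on_tier1 / TB:.1f} TB")
print(f"Moved by full backup: {full_backup_bytes / TB:.1f} TB")
print(f"Backup window:        {backup_window_hrs:.1f} hours")

The numbers make the point: shrinking what sits on the production tier doesn't change what the backup job has to read, move, and process.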
In the next post, I'll walk through a simple exercise to illustrate how to implement a maximum data reduction approach that delivers more bang for your buck (compared to dedupe alone). In the meantime, I'd love to hear from you. What key technologies do you think Data Reduction includes besides dedupe?
SOURCE: CommVault Systems