Maximum Data Reduction (PART THREE): Beyond Deduplication
By Dipesh Patel, Senior Product Marketing Manager, CommVault
In the previous two posts we talked about how maximum data reduction is more than "just" dedupe.
Keep in mind I said "more than." Deduplication definitely has a role in the data reduction story. So once you actually know where your data is, how much of it is actually being accessed or modified, and have archived what's "stale," then it's time to consider how deduplication can lower costs even further. Why dedupe now, versus at the beginning? With the stale data moved off to Tier2, you not only have the potential to effectively double the usable capacity on your Tier1 storage (assuming 50% of the data is found to be stale), but you also reduce the amount of dedupe processing you need to do on the data that remains. And that's true no matter whom you ultimately choose for deduplication.
So, what would happen with deduplication in the picture? For the sake of argument, let's use a conservative dedupe ratio of 5:1. Keep in mind that deduplication ratios are driven far more by data change rates and retention periods than by any particular vendor's technology. So, for ease of comparison, we're assuming you get 5:1 no matter which dedupe vendor you choose.
Our baseline case started with backing up the same 10TBs of data 30 times = 300TBs of cumulative raw backup data. With a 5:1 ratio, that reduces down to 60TBs. Sounds great, and it is if you're only paying $2K/TB for the backup disk capacity. Given that number, the costs come down even further than "archive alone," down to $220K. That's $100K for the 10TBs on primary storage (at $10K/TB), plus $120K for the 60TBs on Tier2 storage.
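If you want to see the arithmetic laid out, here's a minimal sketch of the dedupe-only scenario in Python, using the same figures quoted above ($10K/TB for Tier1, $2K/TB for Tier2); the variable names are mine, purely for illustration:

```python
# Back-of-the-envelope cost model for the dedupe-only scenario described above.
# Figures match the post's assumptions; names are illustrative only.

TIER1_COST_PER_TB = 10_000   # $10K/TB primary (Tier1) storage
TIER2_COST_PER_TB = 2_000    # $2K/TB backup (Tier2) storage

production_tb = 10           # production data kept on Tier1
full_backups = 30            # 30 full backups retained
dedupe_ratio = 5             # conservative 5:1 reduction

raw_backup_tb = production_tb * full_backups        # 300 TB cumulative raw
deduped_backup_tb = raw_backup_tb / dedupe_ratio    # 60 TB actually landing on disk

tier1_cost = production_tb * TIER1_COST_PER_TB      # $100K
tier2_cost = deduped_backup_tb * TIER2_COST_PER_TB  # $120K

print(f"Dedupe only: ${tier1_cost + tier2_cost:,.0f}")  # $220,000
```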
In this example, it's great that the Tier2 storage requirements were dramatically reduced, but there's still NO relief from the backup volumes/pressure on the front-end production systems.
Deduplication and Beyond
But the real magic around data reduction happens when you combine the approaches. After all, with Simpana 8 (shameless plug here), you can archive and deduplicate all on one technology platform. I'm not talking about superficial integration through a common management console, but deep integration rooted in a common codebase rather than a set of technology silos.
In that case, holding the above variables constant (10TBs of production data, 50% of which is stale; 30 days of full backups; a dedupe ratio of 5:1; Tier1/Tier2 costs of $10K/$2K per TB respectively), your costs actually go down to $112K.
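The post gives only the $112K total, so here's one plausible way the line items could break out under the same assumptions; treat this as a reconstruction for the sake of the arithmetic, not an official bill of materials:

```python
# A sketch of how the combined archive + dedupe number could break down.
# Only the $112K total appears in the post; the line items below are reconstructed
# from the stated assumptions.

TIER1_COST_PER_TB = 10_000   # $10K/TB Tier1
TIER2_COST_PER_TB = 2_000    # $2K/TB Tier2

production_tb = 10
stale_fraction = 0.5         # 50% of production data is stale
full_backups = 30
dedupe_ratio = 5

active_tb = production_tb * (1 - stale_fraction)    # 5 TB stays on Tier1
archived_tb = production_tb * stale_fraction        # 5 TB moves to Tier2

tier1_cost = active_tb * TIER1_COST_PER_TB                                    # $50K
backup_cost = (active_tb * full_backups / dedupe_ratio) * TIER2_COST_PER_TB   # 30 TB -> $60K
archive_cost = (archived_tb / dedupe_ratio) * TIER2_COST_PER_TB               # 1 TB -> $2K

print(f"Archive + dedupe: ${tier1_cost + backup_cost + archive_cost:,.0f}")   # $112,000
```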
So you've now gone from $700K as a baseline, to $360K with archiving alone, to $220K with deduplication alone, and finally down to just $112K with archiving and deduplication combined on commodity storage.
Now THAT is a truly magical way to maximize data reduction.
The caveat here, of course, is that you'll still pay something for the ability to dedupe, even when it happens in-line during the backup/archive process before the data ever hits disk. But combining archiving with deduplication to commodity storage still leaves an awful lot of room for savings all-around.
So the bottom line is that if you really want to save, you should pursue a combination of approaches to maximize data reduction. And before we leave this topic, let's come back to the first piece I originally mentioned: identifying and categorizing your data based on demonstrated usage. Accurate reporting on your data will not only alert you to where you're likely to run out of capacity, but will also help you quantify which data is best suited to archiving for space management in the first place. It's the logical first step on your way to maximum data reduction.
SOURCE: CommVault Systems