Deduplication For Everyone

If you have a lot of duplicate data sitting around on laptops, USB drives, and external hard drives, you would probably like to clean it up and reclaim some of that wasted space. But you probably don’t have time to go through and delete all the duplicates, or you’re worried that you will misidentify a duplicate and accidentally delete your only copy.

I ran into this problem recently when I came dangerously close to running out of storage space on my NAS box, which was running FreeNAS 0.7. I purchased an external hard drive to handle the overflow, but it was not convenient to keep the drive attached whenever I needed something. Nor was it fun searching through both the NAS and the external drive whenever I was looking for a file and couldn’t remember where it was. I didn’t have time to manually delete duplicate data (and I knew I had a ton of it), so I started looking for a free NAS solution that incorporated deduplication (or dedup). I quickly discovered that Sun/Oracle’s ZFS filesystem had gotten dedup in late 2009, so I started researching how I could build a NAS box with OpenSolaris and the new and improved ZFS. After much searching (and struggling to get OpenSolaris to even behave in VMware ESXi 3.5), I found that someone else had already done the work for me.

NexentaStor is a free (for home use anyway) OpenSolaris-based storage “appliance” that has ZFS and dedup! I loaded up my old Dell PowerEdge server with some hard drives, installed NexentaStor, turned on deduplication, and started dumping my data onto it. I wish I could say the copy went flawlessly, but it didn’t. The appliance froze up several times, requiring a hard power-off each time. My guess is that the ZFS logbias property was set to latency instead of throughput and that this was triggering some kind of weird race condition, but I don’t know for sure. What I do know is that once I changed logbias to throughput, it never froze up again. Anyway, my dedup ratio ended up being around 2.20x, which means that for every 2.2 GB of data I wrote, only 1 GB of disk space was used. I ended up with a huge amount of free space.
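To make the dedup ratio concrete, here is a minimal Python sketch of how block-level deduplication arrives at a number like 2.20x. It is only an illustration, not how ZFS is actually implemented: the function name `dedup_ratio`, the fixed 4096-byte block size, and the sample data are all assumptions made up for this example.

```python
import hashlib


def dedup_ratio(data: bytes, block_size: int = 4096) -> float:
    """Return logical bytes written divided by unique bytes stored.

    Hypothetical helper: splits data into fixed-size blocks and treats
    blocks with the same SHA-256 digest as duplicates.
    """
    unique_blocks = {}  # digest -> length of the block stored once
    logical = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        logical += len(block)
        digest = hashlib.sha256(block).digest()
        unique_blocks.setdefault(digest, len(block))
    physical = sum(unique_blocks.values())
    return logical / physical if physical else 1.0


# Made-up sample: 5 distinct blocks, written so that 11 blocks land on disk.
distinct = [bytes([i]) * 4096 for i in range(5)]
sample = b"".join(distinct + distinct + distinct[:1])

print(f"{dedup_ratio(sample):.2f}x")  # prints 2.20x
```

Eleven blocks written against five blocks stored is the same 2.2-to-1 relationship described above, just at toy scale.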

But there was one problem. When creating the ZFS storage pool, I didn’t make it redundant. I just strung all my disks together to maximize the storage space, and that’s a very bad idea. If you remember nothing else about ZFS, remember this: once you create a plain striped ZFS storage pool, you can’t convert it to raidz after the fact. (The most you can do is attach a mirror disk to each existing disk, and I didn’t have the spare drives for that.) I had to pull all the data off, delete the pool, and create a new one of type raidz1, the logical equivalent of RAID-5. This ate up a lot of my available storage space, and I ended up having to split some of my easily replaceable data off onto a FreeNAS server. I also turned on gzip compression, which didn’t do much. On the new pool my compression ratio is about 1.02x, while my dedup ratio is 1.74x.
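For a rough feel of how much space raidz1 costs compared with just stringing the disks together, here is a small Python sketch. The disk sizes are hypothetical (they are not my actual drives), the function name `usable_capacity_gb` is made up for this example, and the math ignores ZFS metadata and padding overhead, so treat the numbers as approximations.

```python
def usable_capacity_gb(disk_sizes_gb, layout):
    """Approximate usable space for a single-vdev pool (hypothetical helper).

    "stripe": every disk contributes all of its space, no redundancy.
    "raidz1": one disk's worth of space goes to parity, and each disk
              only contributes as much as the smallest disk in the set.
    """
    if not disk_sizes_gb:
        raise ValueError("need at least one disk")
    if layout == "stripe":
        return sum(disk_sizes_gb)
    if layout == "raidz1":
        if len(disk_sizes_gb) < 2:
            raise ValueError("raidz1 needs at least two disks")
        return (len(disk_sizes_gb) - 1) * min(disk_sizes_gb)
    raise ValueError(f"unknown layout: {layout}")


# Hypothetical example: four 500 GB disks.
disks = [500, 500, 500, 500]
print(usable_capacity_gb(disks, "stripe"))  # prints 2000
print(usable_capacity_gb(disks, "raidz1"))  # prints 1500
```

In this made-up case, redundancy costs a quarter of the raw space, which is the kind of hit that pushed some of my data onto a second server.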

Overall, I highly recommend trying NexentaStor if you absolutely must have deduplication or redundancy. ZFS is a rock-solid file system, despite Apple turning it down for OS X. I applaud the Nexenta folks for creating this product, and not just creating it, but making it very robust (you can get to a shell if you want or need to) and user-friendly at the same time. But if you just want a quick and dirty NAS solution and don’t care about dedup, I recommend FreeNAS.