How to set up Windows Deduplication

Setting up Windows Deduplication in Server 2012 or Server 2012 R2 is easy.

You can do this through Server Manager, but I find it is faster and simpler to do it using Windows PowerShell.

Click the PowerShell icon on your taskbar and when the PowerShell window opens, enter the following commands:

Import-Module ServerManager
Add-WindowsFeature -name FS-Data-Deduplication
Import-Module Deduplication

This installs deduplication.
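
If you want to confirm that the feature installed correctly before going on, a quick optional check is to list the feature and the cmdlets the Deduplication module provides:

Get-WindowsFeature FS-Data-Deduplication
Get-Command -Module Deduplication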

Next, choose the volume that you want to enable deduplication on, such as volume E:, and enter:

Enable-DedupVolume E: -UsageType Default
Set-DedupVolume E: -MinimumFileAgeDays 0

This sets the deduplication usage type to Default, which is a good fit for backup files. Setting -MinimumFileAgeDays to 0 makes files eligible for deduplication no matter how recently they were written, which is also a good choice.
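
To double-check that the volume picked up those settings, you can query it. This is just a sketch using the same example volume letter, and the property names shown are the ones I would expect Get-DedupVolume to return:

Get-DedupVolume E: | Format-List Volume, Enabled, UsageType, MinimumFileAgeDays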

Once you write or copy files to the deduplicated volume, you need to run a deduplication job to actually deduplicate them. Windows sets up an automatic job in the Windows Task Scheduler to deduplicate the volume, and you can also run a deduplication job manually, on demand.
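
As a minimal sketch of a manual run, again using the example E: volume, you can start an optimization job and then watch it:

Start-DedupJob -Volume E: -Type Optimization
Get-DedupJob
Get-DedupStatus E:

Get-DedupJob shows the running job and its progress, and Get-DedupStatus reports the space savings once jobs have completed.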

To get the best deduplication performance, read How to Make Windows Deduplication Go Faster.

How to Make Windows Deduplication Go Faster

While investigating Windows 2012 R2 Deduplication for the benefit of my customers, I have been testing the Windows Deduplication ingest process (Start-DedupJob) on a number of servers and desktops.

Windows Deduplication is post-process deduplication, which means you copy the raw file onto a server volume that has deduplication enabled, and then run a deduplication job to compare the contents of the file with all the files that are already on the volume.

Two different files may have some content that is the same within them, and deduplication compares the blocks of data looking for matching data. It then stores this data as a series of ‘chunks’, and replaces the data with ‘reparse points’, which are indexes to the chunks.

With full backup files, this can reduce the new disk space taken by the next backup by 96% in many cases.

The deduplication job can be run manually, or by using the task scheduler.
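
To see the schedules Windows created when you enabled the feature, and any job that is currently running, these two commands are enough (the schedule names vary slightly between builds, so treat the output as the authority):

Get-DedupSchedule
Get-DedupJob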

The first thing that concerned me about Windows Deduplication was Microsoft’s suggestion that the maximum speed we could expect was 100 gigabytes an hour. That is 107 billion bytes an hour in real-world numbers, or only about 30 million bytes a second. Fortunately, I could never get the process to run this slowly, even on older servers.

For my testing, I went through lots of different tweaks of the command line trying to get every last bit of performance out of the deduplication process.

As I tested different processor, drive, and memory combinations, I found different things that seemed to be the bottleneck for the process.

When I first tested deduplication, even after I figured out the fastest combination of deduplication job parameters, I could see in the Task Manager Performance tab that the disk drives were not heavily used and none of the CPU cores were pegged near full usage.

My first thought was that the head movement on the drives during random access was slowing the process. So I switched to SSDs and saw a small performance boost, but the CPU was still not busy.

I scratched my head and said, let’s try a server with faster memory. The first system had 667-speed memory, so I moved to a newer server with a newer processor and 1066-speed memory. The process sped up quite a bit, but the CPU core was still not saturated, and the SSD wasn’t busy either.

I switched to a consumer desktop of recent vintage, a Dell 3647 with an i5 and 1600-speed memory. I installed Windows Server 2012 R2 on it so it would support deduplication.

Windows deduplication sped up a lot, and for the first time, a single core was saturated. Windows Deduplication seems to be doing most of its processing on a single core.

Since random access didn’t seem to be a big factor, I switched back from SSDs to hard drives so I could process larger amounts of test data. The deduplication process seems to combine its random accesses and serialize them.

Next I got a Dell XPS desktop with an i7 at 4.0 GHz, also with 1600-speed memory.

This made deduplication even faster.

At this point I configured things as what I call a RackNStack server, using an Akitio rackmount 4-drive RAID array set up as RAID 10, connected to a desktop sitting on top of it in the rack through (gasp) USB 3.0.

I switched to a Dell Small Form Factor business-class 7020 desktop and am continuing my testing.

Somewhere along the way I got the idea to go to Control Panel / Power Options and set the power plan to High Performance. This instantly improved performance by about 30%. It works on both desktops and servers, so try it on your other Windows servers and see what it does for you. Windows is supposed to ramp up the CPU automatically under load, but that doesn’t work well with deduplication.
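
If you would rather script that change than click through Control Panel, powercfg can set the plan from a command prompt. The GUID below is the standard High Performance plan on most Windows installs, but check the output of the first command to confirm it on yours:

powercfg /list
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c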

I also created what I call the Instant Dedupe Appliance: a Dell 7020 with a Western Digital 8 TB or 12 TB Duo drive connected over USB 3.0. The Western Digital Duo has two drives that can be configured as a RAID 1 mirror, so you get 4 TB or 6 TB of usable deduplication space. Some of that will be a landing zone for the raw data files before you deduplicate them.

Of course, you are welcome to run deduplication on a ‘Real’ server if you prefer.

The parameters that have worked the best for me are:

Start-DedupJob -Volume F: -InputOutputThrottleLevel None -Priority High -Preempt -Type Optimization -Memory 80

Replace the volume letter with the volume you are deduplicating. The -Priority High parameter seems to do nothing at all; for testing, I went into Task Manager and manually raised the priority to high.
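
If you would rather script that priority bump than use Task Manager, the deduplication job runs inside a host process (fsdmhost.exe on my systems; confirm the name in Task Manager’s Details tab if yours differs), and you can raise its priority from PowerShell once the job has started:

Get-Process fsdmhost -ErrorAction SilentlyContinue | ForEach-Object { $_.PriorityClass = 'High' }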

-Memory 80 means use 80% of system memory for the deduplication process. That is fine on a server that is dedicated to storing and deduplicating backup files.

In deduplicating backup files, you will find that the first day’s deduplication runs the slowest.  This is most likely because much of this processing is actually the compression of the data in the deduplication chunks before storing them. In the following backup cycles, most of the data is likely to be identical, so relatively fewer unique deduplication chunks are being compressed and stored.

Even though you tell Windows Deduplication to use 80% of memory, it won’t at first, unless your server has a tiny amount of memory such as 4 GB. Your second and subsequent deduplication runs will use more memory.

Our deduplication test set of actual customer backup files is a little over a trillion bytes. Using an Akitio MB4 UB3 rackmount RAID enclosure with 4 Western Digital Red 4TB NAS drives, the first day’s deduplication ran at 92.7 million bytes a second, or 334 billion bytes an hour.

The second day’s deduplication ran at over 110 million bytes a second, and 400 billion bytes an hour.
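
If you want to reproduce this kind of throughput measurement on your own hardware, one simple approach is to time a blocking optimization job and divide the amount of data you copied to the volume by the elapsed seconds. The volume letter and data size here are just examples:

$elapsed = Measure-Command { Start-DedupJob -Volume F: -Type Optimization -Memory 80 -Wait }
$bytesIngested = 1TB
"{0:N0} bytes per second" -f ($bytesIngested / $elapsed.TotalSeconds)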

Running deduplication with the WD Duo drive is a little faster than the Akitio, but it’s also half the useful storage.

Be sure to upgrade to Windows Server 2012 R2 if you are on Windows Server 2012, since the deduplication is up to twice as fast.

We combine Windows Deduplication with Replacador to do dedupe-aware replication of the deduplicated volume over the Internet or to an external drive.

The brand name deduplication appliances will be faster and sexier than using Windows Deduplication. They may have some features that you really want, particularly if you are using deduplication for other things besides backup files.

For deduplicating backups, Windows Deduplication is great, and it generally costs about a third as much as the leading entry level deduplication appliance.

You might even consider getting two deduplication appliances for each location, and clustering them.

The hidden value of deduplication

The term deduplication is generally used to refer to block-level hash-based processing of multiple data files to shrink them to the smallest possible representation on disk.

A unique cryptographic hash, such as the 20-byte SHA-1, is calculated for each unique block of data. The block of data is then compressed and stored with a hash index, and a pointer to the index takes the place of the raw data.

The original data file is replaced by a set of ‘reparse points’ that are the indexes of the ‘chunk storage’ that contains the compressed, hashed blocks.

When you want to read the file again, Windows transparently reassembles the original blocks using the reparse points and the chunk storage.
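
To make the chunk-and-pointer idea concrete, here is a toy PowerShell sketch of block-level, hash-based deduplication. It is only an illustration of the concept, not the Windows engine: it uses fixed-size 64 KB chunks instead of variable-size chunking, skips compression, and the file path is just an example.

$chunkSize  = 64KB
$sha1       = [System.Security.Cryptography.SHA1]::Create()
$chunkStore = @{}                                                  # hash -> block data (the 'chunk store')
$pointers   = New-Object System.Collections.Generic.List[string]   # stand-in for the 'reparse points'

$stream = [System.IO.File]::OpenRead('D:\Backups\example.bak')     # example path
$buffer = New-Object byte[] $chunkSize
while (($read = $stream.Read($buffer, 0, $chunkSize)) -gt 0) {
    $hash = [System.BitConverter]::ToString($sha1.ComputeHash($buffer, 0, $read))
    if (-not $chunkStore.ContainsKey($hash)) {                     # keep only blocks we have not seen before
        $chunkStore[$hash] = $buffer[0..($read - 1)]
    }
    $pointers.Add($hash)                                           # the file becomes a list of chunk references
}
$stream.Close()
"{0} blocks stored as {1} unique chunks" -f $pointers.Count, $chunkStore.Count

The ratio of blocks to unique chunks is, in miniature, the same effect that lets a second day’s backup land almost entirely on chunks that already exist.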

Deduplication can be used with live data, but where it really shines is in the storage of backup files.

For example, if you take an uncompressed, raw backup file of 600 GB and deduplicate it, it might take 200 GB on disk, with most of the savings on the first day coming from the compression of the data.

When you take the next day’s backup of 600 GB, deduplication will replace most of the data with pointers to existing dedupe chunks, and the total new storage used might be 15 GB.

So a deduplicated volume that is 8 TB in size might hold 200 TB of raw backups, or roughly 330 of those 600 GB backups, at a 25-to-1 deduplication ratio.

Deduplication makes it feasible to keep many backup cycles online at your fingertips.

This is one of two reasons why companies have spent billions of dollars on deduplication appliances.

The second reason is the hidden benefit of deduplication.

Deduplication reduces the size of each new backup cycle’s data footprint to a fraction of the amount taken by a complete backup.

If a 600 GB backup is reduced to 30 GB of reparse points and new chunk storage, it suddenly becomes reasonable to replicate that data across the internet to a second deduplication volume. Or even a third, or a fourth.
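
To put rough numbers on it: at an assumed 100 megabit per second internet link, 30 GB takes on the order of 40 minutes to transfer, while the full 600 GB backup would take more than 13 hours.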

Replication makes it possible to move just the changes to the offsite disaster recovery copy.

Windows Deduplication does not include replication, but there is a third-party solution called Replacador that runs the Windows Deduplication job and then replicates the changes across the Internet, a local area network, or to an external drive.

With the addition of replication, Windows Deduplication can offer the SMB a backup deduplication solution at a much lower price than a dedicated appliance.