Replicating a Veeam Backup Repository with Windows Deduplication

I have already discussed the advantages of using Windows Deduplication for Veeam Backup Repositories, and introduced the idea of using Replacador to replicate a Windows Deduplication volume.

Now we will set up Replacador on a Windows Deduplication volume that holds a Veeam Backup Repository, and then run the replication process.

First browse to C:\LaserVault\Replacador and execute the ReplacadorConfig application.

We will use this to define a replication task for our source and destination volumes.

In this first example, we will replicate the volume to another volume on an external drive on the same server.  The same process also runs across a network or the Internet.  We will give an example of that later.

In the Replication Configuration Screen, press the Add Definition button and define a replication task.

In this case, we make the task name vmbackup, and the machine name is '.' (a period means the current machine).

The Volume Path is D:\. Normally, a deduplication volume will be a drive letter on the current machine.

[Screenshot: replacador4]

Don’t click OK yet.

Each replication needs at least one Destination. You can actually replicate to multiple destinations at once.

Click the Add Destination button and a new form opens to define the destination.

In this case we are replicating to another volume on the same server, so the Machine Name is ‘.’ for the local server again.

In the case of a different network location, this could be the network (DNS) name of the target machine, or its IP address.

The username and password are local credentials on the target machine. In this case we are using administrator, but you could use the system account or whatever is appropriate. The user needs sufficient authority to run Windows Deduplication garbage collection on the target machine.

The volume path is the local pathname on the target machine to the deduplication volume that will be a clone of the source deduplication volume.

The UNC path is the UNC version of this target deduplication volume, consisting of \\machinename\volume.

In the case where the target is on this local machine, just use the volume path again.

[Screenshot: replacador5]

Now click OK, then OK, then OK, and your replication is configured.

The Replacador Manual explains how to run the Replacador Transfer program from the command line or through the Task Scheduler, but since we are just testing, we will simply click to execute the ReplacadorTransfer application. Since there is only one replication task defined, and the default action of the application is to replicate, it will do exactly what we want.
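As a sketch of the scheduled approach, something like the following would register a nightly run (the executable path follows the install folder above; the task name and run time are made up for illustration):

$action  = New-ScheduledTaskAction -Execute 'C:\LaserVault\Replacador\ReplacadorTransfer.exe'
$trigger = New-ScheduledTaskTrigger -Daily -At 10pm
Register-ScheduledTask -TaskName 'Replacador Nightly' -Action $action -Trigger $trigger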

When you first start the Replacador Transfer, it looks like nothing is happening. Actually, Replacador Transfer starts a deduplication job on the source volume to make sure that everything is ready for replication. If you have already deduplicated the volume, this part of the task will take only a minute or two.

Once the source volume is deduplicated, a command window will open and display the replication progress.

[Screenshot: replacador7]

You can get a better idea of what is going on by looking at the Task Manager performance screen.

[Screenshot: replacador8]

Replication is really just a specialized copy task that uses very little CPU and memory. The limiting factor is the speed of reading, transmitting, and writing the data on the target volume.

The whole point of replication is to reduce the data traffic to the minimum needed to move the changes from the source volume to the destination volume. The first replication will be a large one, which is why it is sometimes a good idea to replicate to an external drive to seed the actual target server volume. After that, each replication should be a small fraction of the original deduplication volume content, even for a new full backup.

When Replacador is done, you will have an exact copy of the deduplicated volume on the target server. Each time you replicate in the future, only the changed chunks and reparse points on the deduplicated volume will be sent to the target volume. The original files are never rehydrated at any point in the process.


Replication for Windows 2012 R2 Deduplication with Replacador

Windows Deduplication is a free feature in Windows Server 2012 and Server 2012 R2.  It works great and I recommend it.

I've also worked with other deduplication systems, including Data Domain, Avamar, ExaGrid, GreenBytes, OpenSolaris, and Nexenta.

Deduplication works especially well for backup files. With a deduplication system, you can store many backups on one deduplication appliance, because deduplication only stores each unique chunk of data once.

Besides the obvious advantage of taking up much less disk space for each new backup, deduplication reduces the amount of data transmission you need in order to replicate the deduplicated backup across the network or Internet to an offsite location.

The reduction in data seems magical when you first encounter it. For most people, it actually makes it possible to replicate a backup in a reasonable amount of time with the Internet connection you already have.

The one problem with using Windows Deduplication instead of another backup appliance is that it does not have replication built into it.

We decided to do something about this, so we wrote Replacador as a replication system for Windows Deduplication.

Replacador looks at two Windows Deduplication volumes and keeps them synchronized. First, the source volume is optimized to turn all in-policy files into deduplication chunks and reparse points. Then all the new or changed chunks on the source volume are copied to the destination volume, along with the reparse points. Anything that has been deleted from the source volume is deleted from the destination volume.

Finally, garbage collection is run on both source and destination volumes to keep them synchronized.
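For the curious, the cycle maps roughly onto the standard dedup cmdlets. Here is a minimal sketch, where the chunk-and-reparse-point copy in the middle is Replacador's own work rather than a built-in cmdlet, and the machine name 'target' and the volume letters are hypothetical:

Start-DedupJob -Volume D: -Type Optimization -Wait        # 1. optimize the source volume
# 2. Replacador copies new or changed chunks and reparse points to the
#    destination volume, and deletes anything removed from the source
Start-DedupJob -Volume D: -Type GarbageCollection -Wait   # 3. garbage collect the source
Invoke-Command -ComputerName target -ScriptBlock {        # 4. garbage collect the destination
    Start-DedupJob -Volume E: -Type GarbageCollection -Wait
}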

The source volume has to be a locally attached volume on the source system. The destination volume can be a second volume on the source system, such as an external drive, but usually it is a volume on another server across the network or the Internet. Both servers must be running the same version of Windows Server 2012. We suggest R2, because the deduplication process is much faster on R2.

The volumes do not have to be the same size, but you will probably want them to be.  The second volume has to have enough room for all the chunks, the reparse points, and some extra room for garbage collection to run.

The reason we support replicating to an external drive is to make it easy to ‘seed’ a new remote deduplication volume when you first start deduplicating.  You can replicate to an external drive on the source system, carry it to the remote system, and replicate from the external drive to the destination drive one time.  This can save days or even weeks of data transmission, in some cases.

This also makes it possible to replicate back from the destination volume to an external drive, to quickly restore a source system after catastrophic loss of the first volume due to tornado, fire, flood, etc.

An external disk drive can also be used as a source volume for deduplication. Some external drives support RAID data protection. A good example is the Western Digital Duo series: the Duo 8 TB costs less than $350 and provides 4 TB of protected storage, and there is also a 12 TB Duo with 6 TB of available storage.

There is a beta test version of Replacador available. You will also need an authorization code to use Replacador.

Here is the PDF of the documentation.

Replacador Configuration and Use

To get an activation code, browse to c:\LaserVault\Replacador and click the Replacador Configuration application.

Click the Authorization button on the lower left.

[Screenshot: replacador1]

Replacador generates a unique serial number for your server. Copy the contents of the Serial Number field, paste it into an email, and send it to ReplacadorCode@laservault.com.

We will send you a code good for 30 days.

[Screenshot: Replacador2]


When you receive the code, paste it into the Authorization Key field and click OK.

Next, set up and run Replacador.

Veeam Backup Repository Settings for Windows Deduplication 2012 R2

To use Windows Deduplication with Veeam Enterprise Plus, you will most likely want to use a real Windows Server and create a deduplication-enabled volume for your Veeam backups.  Your Veeam backups will be stored in a Veeam Backup Repository, which is a folder holding all the files.

Windows Deduplication ingestion is a CPU- and memory-intensive procedure, and it is probably best not to run it in a VM.

For the same reason, it is best to run Windows Deduplication on a server that is not being used in production.

In another blog post, we show you how to roll your own deduplication appliance.

You can have multiple Veeam backup repositories on the same deduplicated volume.  For example, if you have three different Hyper-V servers, each with its own collection of VMs, you could have three repositories on one deduplicated volume.

You can also have multiple volumes with deduplication enabled on the same Windows server.  You might want to do this because the different Hyper-V or VMWare hosts have too many VMs for the volume you are using, or because the VMs are very different from each other and won’t deduplicate as well as if you organize them on separate volumes, with all the Linux MySQL VMs on one volume for example.

In this example, we are working with two different servers.  Server H is the host, which is running Windows 2012 R2 with Hyper-V and is hosting multiple VMs.  It has Veeam Enterprise Plus installed on it for backup.

Server R is the Repository server.  This is where Veeam will remotely install the Veeam Backup Repository agent and NFS.

You will be doing all your typing and viewing on Server H, while Veeam will install its software across the network on Server R.

Within Veeam Enterprise Plus, click on Backup Repositories, then right-click and add a new repository.

Click on Microsoft Windows Server, then Next.

[Screenshot: Edit Backup Repository]

Put in the IP address or network name of the server.

At this point Veeam will ask you for the Username and Password to use on the repository server.

[Screenshot: veeam3]

Browse to or create the folder name for the backup repository.

[Screenshot: veeam4]

Set the Storage Compatibility Settings for deduplication.

[Screenshot: Storage Compatibility Settings]

These are the best settings for a backup repository that will be on a volume with Windows Deduplication enabled (typically, this means aligning backup file data blocks and decompressing backup data blocks before storing).

The benchmark shown on this blog was run with these settings.

Veeam will ask to install its own NFS service. OK this with these settings.

[Screenshot: veeam6]

Veeam will then install the repository software and NFS on repository server R.

[Screenshot: veeam7]

Now go back to the backup job you have set up, or create a new job, and point it to your new repository.

Run the backup job. When it is complete, go to the repository server and run Windows deduplication, or use Replacador to do this.
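If you kick the deduplication off by hand, a minimal optimization job on the repository volume (assuming it is D:) looks like this; the parameters are discussed in How to Make Windows Deduplication Go Faster:

Start-DedupJob -Volume D: -Type Optimization -Priority High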

The first deduplication will not run as fast as your second and following deduplications.

Veeam Backup with Windows Deduplication Benchmark

Some people wonder why anyone would use Windows Deduplication on a Veeam Backup Repository.   Doesn’t Veeam have its own deduplication and replication?

The people at Veeam actually recommend using Windows Deduplication and have a great writeup about it that you can download here.

Veeam has great deduplication and replication, but it is within a single backup.  Veeam deduplicates the backup of one VM against another VM in the same backup.

Windows Deduplication deduplicates blocks of data across many backups.  For example, when I deduplicated a full Veeam backup of multiple VMs a second time, 800 GB of backup deduplicated down to 10 GB.  That means that using Windows Deduplication on the Veeam Repository reduces my full replication over the Internet from 800 GB to 10 GB, on the second and following full backups.

Of course most people using Veeam are going to do full backups periodically, but incremental backups one or more times each day.

My first incremental backup with Veeam takes about 20 GB of repository storage. Deduplicating that with Windows Deduplication takes it down to about 2 GB of disk usage. This means I can protect my 800 GB of VMs using 2 GB of actual disk storage per backup cycle, and replicate it in just a few minutes.

My Veeam repository testbed is actually on a Dell business desktop for price and performance reasons I explain elsewhere on this blog.

Dell 7020, i7-4790 3.4 GHz, 16 GB 1600-speed DDR3, Small Form Factor (SFF) (price about $850, 12/2013)

ProBox USB 3.0 four-drive enclosure ($99)

with 4 WD Red NAS 4 TB drives (about $600, 3/2015)

Windows Server 2012 R2 (your price will vary up to $800)

The Veeam backup time, including sending it across a 1 gigabit network, was about an hour and a half for the first full backup.

Windows deduplication of 836,809,182,740 bytes
Elapsed time is 8963 seconds
93,362,622 bytes per second
336,105,439,904 bytes per hour

The first time you run Windows deduplication on a backup file, much of the time is used in the compression of the chunks. Therefore, the deduplication time is likely to be longer than your daily deduplication of second and following backups.

Windows reports the dedupe status on the volume containing the Veeam Backup Repository.
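These figures can be pulled at any time with the deduplication PowerShell module; a minimal query, assuming the repository volume is D:, is:

Get-DedupVolume -Volume D: | Format-List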

Volume : D:
Capacity : 7.27 TB
FreeSpace : 6.98 TB
UsedSpace : 304.06 GB
UnoptimizedSize : 785.56 GB
SavedSpace : 481.5 GB
SavingsRate : 61 %
OptimizedFilesCount : 2
OptimizedFilesSize : 779.34 GB
OptimizedFilesSavingsRate : 61 %

My second Veeam backup is a forward incremental backup. This is what Veeam suggests you use when storing backups on a Windows-deduplicated volume.

The entire backup ran in 6 minutes and 47 seconds.

26,859,267,777 bytes
261 seconds
102,909,071 bytes per second

Windows deduplication is ingesting at 370,472,655,600 bytes per hour. Microsoft says it will only go 100 GB an hour. I did juice it up a bit by running the process at high priority.

But wait, there’s more! (As they say on TV).  The next full backup will be faster.

Meanwhile, let’s look at the volume usage.

Volume : D:
Capacity : 7.27 TB
FreeSpace : 6.97 TB
UsedSpace : 305.54 GB
UnoptimizedSize : 810.63 GB
SavedSpace : 505.09 GB
SavingsRate : 62 %
OptimizedFilesCount : 3
OptimizedFilesSize : 804.35 GB

That’s nice – 26 billion bytes of backup in less than 2 GB of disk space.

I ran the incremental again and the size transferred was a lot smaller.

Volume: D:

Job processed space (bytes): 3,837,792,131
Job elapsed time (seconds): 81
Job throughput (MB/second): 45.18

The throughput doesn't look as good, but it takes time just to start and stop the program, and the whole run was less than a minute and a half.

Volume : D:

Capacity : 7.27 TB
FreeSpace : 6.97 TB
UsedSpace : 306.17 GB
UnoptimizedSize : 814.22 GB
SavedSpace : 508.06 GB
SavingsRate : 62 %

My real concern in doing all this is how long it will take to replicate Veeam backups over the Internet.  I don’t want to be schlepping tapes all over the place.

I have the impression that a lot of Veeam users are only doing full backups once a month or even less.  But I am an old school kind of guy and I really want to be able to do a full backup once a week.  With Veeam deduplication that would still be a lot of data, I think.  But what about with Windows deduplication?

This time I chose ‘Active Full’ backup on Veeam.

Once again the backup to the repository took about an hour and a half.

But look at the windows deduplication processing:

837,381,046,512 bytes processed

5169 seconds
162,000,589 bytes per second
583,202,121,773 bytes per hour
837,381,046,512 bytes of backup used 10 GB of new disk space

Volume : D:

Capacity : 7.27 TB
FreeSpace : 6.96 TB
UsedSpace : 316.07 GB
UnoptimizedSize : 1.56 TB
SavedSpace : 1.25 TB

583 billion bytes an hour! 

While the overall deduplication rate of the volume is not that high yet, the new full backup used up 10 GB for almost 800 GB of new data.  If this rate holds, I should be able to store about 500 full Veeam backups on this volume.

Of course, my original plan was to store incremental forwards with a full backup once a week.  I will probably still do that, but I could also just do full backups for a long long time.

(Shameless product plug) What is going to make this nice for me is using Replacador to replicate the Windows Deduplication volume offsite and to the cloud. Windows Deduplication cuts the backup size down so much that I can replicate a day's backups in under ten minutes and a full backup in an hour or so.

How does this scale?  If you are doing anything up to about 10 TB of Veeam full backup once a week, with incrementals the rest of the time, you could process it with this system.  You would want to put in 8 TB hard drives, I expect.  I’m running these RAID 10 through Windows file services on USB 3.0.

Of course a "real" server would have faster hard drives, and with faster memory and a sufficiently fast processor you might go even quicker than this. We will be testing Windows deduplication on a new Dell 530 with 2133-speed memory soon and hope to bring you even better numbers.

According to our testing, single-core speed and memory speed are the most important factors in Windows deduplication ingestion.

Windows deduplication can add a lot of value to the Veeam backup process.  It allows you to store more backups in less space, and replicate them in far less time, than with Veeam alone.


Backing up with Veeam to your Windows 2012 R2 Deduplication Appliance

If you are following along with our idea of making your own deduplication appliance, you might be interested in using it as one or more Veeam Backup Repositories.

Veeam has some postings about using Windows Deduplication with Veeam backups, and they seem to think it is a great idea.

I do too, and I think our Replacador replication for Windows Deduplication makes things even better.

We have been backing up our own physical Windows servers for years with plain old Windows Backup. As we gradually migrated our servers to our first big Hyper-V server (called Borg1 for some reason), we kept using Windows Backup, doing full backups every night, deduplicating, and replicating.

When I first heard of Veeam, it was in reference to deduplication. When I read about Veeam's positive attitude toward Windows Deduplication, I became even more interested. So we decided to install the Veeam Enterprise Plus trial edition.

I set it up on our Borg1 server and defined the backup job for most of our VMs.  I skipped our document management VM for now because I’m impatient to run the tests faster and that data doesn’t change much.  I will add that in later for production.

I set up a Veeam Backup Repository on one of our UBD servers running Windows 2012 R2 with a deduplication volume. Actually I just twiddled my thumbs while Veeam did all the work.  I did get to make some important decisions about the settings for the Backup Repository, which I will share with you in a future post.

I set the system up for forward incremental backups with a full backup once a week.

The first Veeam backup took about an hour and a half and moved 800 GB of data across the network.  I ran Windows Deduplication on the volume and it compressed and deduped about 60%.  The deduplication job ran in a couple of hours, and of course was mostly compression for the first day.

I couldn’t wait a whole day to do another Veeam backup, so I did the same backup again after a couple of hours.  This was over the same 800 GB set of VMs.

The Veeam backup ran in 6 minutes and 47 seconds. It sent about 26 billion bytes of data to the Backup Repository.

Windows Deduplication ran in 15 minutes or so. The 26 GB of data on disk became 1 GB of deduplicated data. Over our 30 Mbit per second Internet upload (roughly 3.75 MB per second), we will be replicating that in about six minutes.

According to what I have read about Veeam, it is deduplicating within a single backup, across VMs. What Windows Deduplication is adding is the deduplication across multiple backups. This means even a periodic full backup will take up very little space on the deduplicated volume and very little replication bandwidth.

The only replication job that should be somewhat large is the very first one, and our Replacador software supports replication to an external drive, which means you can seed the replication to a drive then send it to your DR site for immediate protection.

Every time I think of Veeam, I am saying WOW.  What an incredible product. If you haven’t tried it yet, spend an hour or two and set it up. And smile.

I am going to publish the settings and statistics for my first Veeam backup jobs in another post.


Roll Your Own Deduplication Appliance with Windows Server 2012 R2

We have been doing a lot of testing and implementation of Windows Deduplication and in the process we have come up with a basic roll-your-own dedupe appliance using Windows Server 2012 R2.

After testing Windows Deduplication on various hardware, we have come to use a simple business-class desktop with an external RAID array as our basic deduplication workhorse. Deduplication wants a fast single-core speed and fast memory, and the cheapest way, by far, is to fulfill these requirements with a desktop system.

A good place to begin is a Dell 7020 or 9020 Small Form Factor computer with an Intel i7 processor. As of March 2015, a system with 16 GB of 1600-speed memory is about $800. For storage, a USB 3.0 Western Digital Duo drive in either the 8 TB ($340) or 12 TB ($650) size can be a good choice. The Duo is actually two drives in an external enclosure. You can set the drives up with internal RAID 1 mirroring, so you get 4 TB or 6 TB of usable space.

Plug these together and install Windows Server 2012 R2 on the Dell and you have a deduplication system. Some people like HP instead of Dell. Some brave hearts swear by SuperMicro.

If you need more storage, or you want to put your deduplication appliance in a rack, you can use a rackmount RAID enclosure like the USB 3.0 Akitio MD4 U3B ($350) with four 3.5-inch drives. Set it up as RAID 10 for both speed and protection, using 4 TB, 6 TB, or 8 TB drives. Put it in the rack, flip the 7020 sideways, and put it on top.

At this time, 4 WD Red NAS drives are about $600. So for $1800 plus the price of Windows Server 2012 R2 (which can be anywhere up to about $700 retail), you have a deduplication appliance with 8 TB available.

Since Windows Deduplication is post-process, you will need a certain amount of that storage for the raw files before you deduplicate them. Since we are using these systems for deduplicating and replicating backups to our DR site, we need room for at least one day's full backup plus a 50% 'fudge factor' (a professional term of art from the 1960s; they may call it something else now). Our daily full backup is a bit over 1 TB, so 8 TB minus 1.5 TB leaves 6.5 TB of deduplication space. At 25 to 1, that is over 150 full backups.

This can be a useful low end SMB system, a Proof of Concept (POC) system, or a departmental system for an enterprise.

Many of our customers want ‘real’ servers and will install hardware that costs 2 to 4 times this much. That is okay too.

The 4 TB usable WD Duo system costs $350 for the WD Duo 8 TB, $800 for the Dell 7020, plus the cost of Windows Server 2012 R2.

Real deduplication for $1150 plus Windows.

By the way, I've clocked my 4 TB (8 TB raw) Duo system deduplicating at over 400 billion bytes an hour on full backups after the first day (because the first day is mostly compression). The Akitio-based system is a little slower, but still a respectable speed for what we are doing.

This is a roll-your-own price. When I sell similar systems they cost a lot more, because all my hippies quit and now I have to pay my employees.

You can't do everything with these that you can with the big-name deduplication appliances. They won't scale as high: Windows deduplication doesn't work on a volume over 64 TB in size, for example.

The big guys are claiming ever more dizzying ingest rates for their systems as well. Microsoft claims their R2 version of deduplication tops out at about 40 MB a second, but we generally see speeds two to three times that fast. 350 billion to 450 billion bytes an hour is typical with the systems we test. We expect this to continue to increase as processors and memory get faster.

If your full backups are up to about 3 TB a day, one of these systems may work for you. If you are doing periodic full backups and incremental backups the rest of the time, that number could be higher.

With the low cost of these systems, you can divide up the work and have two, four, or even eight deduplication appliances for different backups.

The other thing Microsoft Deduplication is missing is replication, but we have solved that problem. We will be releasing our own replication system, Replacador, very soon.  

How to set up Windows Deduplication

Setting up Windows Deduplication in Server 2012 or Server 2012 R2 is easy.

You can do this through Server Manager, but I find it is faster and simpler to do it using Windows PowerShell.

Click the PowerShell icon on your taskbar and when the PowerShell window opens, enter the following commands:

Import-Module ServerManager
Add-WindowsFeature -Name FS-Data-Deduplication   # install the Data Deduplication feature
Import-Module Deduplication                      # load the deduplication cmdlets

This installs deduplication.

Next, choose the volume that you want to enable deduplication on, such as volume D:, and enter:

Enable-DedupVolume D: -UsageType Default    # enable deduplication on the volume
Set-DedupVolume D: -MinimumFileAgeDays 0    # make files eligible immediately

This sets the deduplication usage type to Default, which is good for backup files. It also makes files eligible for deduplication immediately, regardless of age, which is also a good choice for a backup volume.

Once you write or copy files to the deduplication volume, you need to run a deduplication job to actually deduplicate the files. Windows sets up an automatic job to deduplicate the volume using the Windows Task Scheduler. You can also manually run a deduplication process, on demand.
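For example, assuming your deduplication volume is D:, a manual run and a look at the built-in schedule are one line each:

Start-DedupJob -Volume D: -Type Optimization   # deduplicate the volume now
Get-DedupSchedule                              # list the automatic jobs Windows created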

To get the best deduplication performance, read How to Make Windows Deduplication Go Faster.

How to make Windows Deduplication go faster

While investigating Windows 2012 R2 Deduplication for the benefit of my customers, I have been testing the Windows Deduplication ingest process (Start-DedupJob) on a number of servers and desktops.

Windows Deduplication is post-process deduplication, which means you copy the raw file on to a server volume that has deduplication enabled, and then run a deduplication job to compare the contents of the file with all the files that are already on the volume.

Two different files may have some content that is the same within them, and deduplication compares the blocks of data looking for matching data. It then stores this data as a series of ‘chunks’, and replaces the data with ‘reparse points’, which are indexes to the chunks.

With full backup files, this can reduce the new disk space taken by the next backup by 96% in many cases.

The deduplication job can be run manually, or by using the task scheduler.

The first thing that concerned me about Windows Deduplication was Microsoft's suggestion that the maximum speed we could expect was 100 gigabytes an hour. This is 107 billion bytes in real-world numbers, or about 30 million bytes a second. Fortunately, I could never get the process to run that slowly, even on older servers.

For my testing, I went through lots of different tweaks of the command line trying to get every last bit of performance out of the deduplication process.

As I tested different processor, drive, and memory combinations, I found different things that seemed to be the bottleneck for the process.

When I first tested deduplication, even after I figured out the fastest combination of deduplication job parameters, I could see in the Task Manager performance monitor that the disk drives were not heavily used, and none of the CPU cores were pegged near full usage.

My first thought was that the head movement on the drives during random access was slowing the process. So I switched to SSDs and saw a small performance boost, but the CPU was still not busy.

I scratched my head and said, let's try a server with faster memory. The first system had 667-speed memory, so I moved to a newer server with a newer processor and 1066 memory. The process sped up quite a bit, but the CPU core was still not saturated, and the SSD wasn't busy either.

I switched to a consumer desktop of recent vintage, a Dell 3647 i5 with 1600-speed memory. I installed Windows Server 2012 R2 on it so it would support deduplication.

Windows deduplication sped up a lot, and for the first time, a single core was saturated. Windows Deduplication seems to be doing most of its processing on a single core.

Since random access didn't seem to be a big factor, I switched back from SSDs to hard drives so I could process larger amounts of test data. The deduplication process seems to combine its random accesses and serialize them.

Next I got a Dell XPS desktop with an i7 at 4.0 GHz, also with 1600-speed memory.

This made deduplication even faster.

At this point I configured things as what I call a RackNStack server, using an Akitio rackmount 4-drive RAID array set up as RAID 10, connected to a desktop sitting on top of it in the rack through (GASP) USB 3.0.

I then switched to a Dell Small Form Factor business-class 7020 desktop, and I am continuing testing.

Along in here somewhere, I got the idea to go to Control Panel / Power Options and set the server to High performance. This instantly improved performance by about 30%. It works on both desktops and servers, so try it on your other Windows servers and see what it does for you. Windows is supposed to automatically ramp up your CPU under load, but that doesn't work well with deduplication.
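You can make the same change from the command line. As a sketch, the GUID below is the stock High performance plan on Windows, but verify it on your own system with the list command first:

powercfg /list                                             # show the available power plans
powercfg /setactive 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c   # switch to the High performance plan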

I also created what I call the Instant Dedupe Appliance: a Dell 7020 with a Western Digital 8 TB or 12 TB Duo drive connected over USB 3.0. The Western Digital Duo has two drives that can be used as a RAID 1 mirror, so you get 4 TB or 6 TB of usable deduplication space. Some of that will be a landing zone for the raw data files before you deduplicate them.

Of course, you are welcome to run deduplication on a ‘Real’ server if you prefer.

The parameters that have worked the best for me are:

Start-DedupJob -Volume F: -InputOutputThrottleLevel None -Priority High -Preempt -Type Optimization -Memory 80

Replace the volume letter with the volume you are deduplicating. The -Priority High parameter seems to do nothing at all; for testing, I went to Task Manager and manually increased the process priority to high.

-Memory 80 means use 80% of memory for the deduplication process. This is okay on a server that is dedicated to storing and deduplicating backup files.
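While a job is running, you can watch it from the same PowerShell session; Get-DedupJob reports each job's type, state, and progress percentage:

Get-DedupJob   # shows running and queued deduplication jobs with their progress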

In deduplicating backup files, you will find that the first day’s deduplication runs the slowest.  This is most likely because much of this processing is actually the compression of the data in the deduplication chunks before storing them. In the following backup cycles, most of the data is likely to be identical, so relatively fewer unique deduplication chunks are being compressed and stored.

Even though you tell Windows Deduplication to use 80% of memory, it won't at first, unless your server has a tiny amount of memory like 4 GB. Your second and following deduplications will use more memory.

Our deduplication test set of actual customer backup files is a little over a trillion bytes. Using an Akitio MD4 U3B rackmount RAID enclosure with 4 Western Digital Red 4 TB NAS drives, the first day's deduplication ran at 92.7 million bytes a second, or 334 billion bytes an hour.

The second day's deduplication ran at over 110 million bytes a second, or 400 billion bytes an hour.

Running deduplication with the WD Duo drive is a little faster than the Akitio, but it’s also half the useful storage.

Be sure to upgrade to Windows Server 2012 R2 if you are on Windows Server 2012, since the deduplication is up to twice as fast.

We combine Windows Deduplication with Replacador to do dedupe-aware replication of the deduplicated volume over the Internet or to an external drive.

The brand name deduplication appliances will be faster and sexier than using Windows Deduplication. They may have some features that you really want, particularly if you are using deduplication for other things besides backup files.

For deduplicating backups, Windows Deduplication is great, and it generally costs about a third as much as the leading entry level deduplication appliance.

You might even consider getting two deduplication appliances for each location, and clustering them.

The hidden value of deduplication

The term deduplication is generally used to refer to block-level hash-based processing of multiple data files to shrink them to the smallest possible representation on disk.

A cryptographic hash, such as the 20-byte SHA-1, is calculated for each unique block of data. The block of data is then compressed and stored with a hash index, and a pointer to the index takes the place of the raw data.

The original data file is replaced by a set of ‘reparse points’ that are the indexes of the ‘chunk storage’ that contains the compressed, hashed blocks.

When you want to read the file again, Windows transparently reassembles the original blocks using the reparse points and the chunk storage.
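Here is a toy PowerShell illustration of the hashing idea (not the actual dedup engine, which chunks files at variable boundaries): two identical blocks produce the same SHA-1 hash, so the data only needs to be stored once.

$sha1 = [System.Security.Cryptography.SHA1]::Create()
$blockA = [System.Text.Encoding]::UTF8.GetBytes('the same block of data')
$blockB = [System.Text.Encoding]::UTF8.GetBytes('the same block of data')
[BitConverter]::ToString($sha1.ComputeHash($blockA))   # 20-byte hash of block A
[BitConverter]::ToString($sha1.ComputeHash($blockB))   # identical hash, so block B is a duplicate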

Deduplication can be used with live data, but where it really shines is in the storage of backup files.

For example, if you take an uncompressed, raw backup file of 600 GB and deduplicate it, it might take 200 GB on disk, with most of the savings on the first day coming from the compression of the data.

When you take the next day’s backup of 600 GB, deduplication will replace most of the data with pointers to existing dedupe chunks, and the total new storage used might be 15 GB.

So a deduplicated volume that is 8 TB in size might hold 200 TB of raw backups, or about 330 backups of 600 GB each, at a 25-to-1 deduplication rate.

Deduplication makes it feasible to keep many backup cycles online at your fingertips.

This is one of two reasons why companies have spent billions of dollars on deduplication appliances.

The second reason is the hidden benefit of deduplication.

Deduplication reduces the size of each new backup cycle’s data footprint to a fraction of the amount taken by a complete backup.

If a 600 GB backup is reduced to 30 GB of reparse points and new chunk storage, it suddenly becomes reasonable to replicate that data across the Internet to a second deduplication volume. Or even a third, or a fourth.

Replication makes it possible to move just the changes to the offsite disaster recovery copy.

Windows Deduplication does not include replication, but there is a third-party solution called Replacador that runs the Windows Deduplication job and then replicates the changes across the Internet, a local area network, or to an external drive.

With the addition of replication, Windows Deduplication can offer the SMB a backup deduplication solution at a much lower price than a dedicated appliance.