The term deduplication generally refers to block-level, hash-based processing of data files that stores each distinct block only once, shrinking the files to a much smaller representation on disk.
A cryptographic hash, such as the 20-byte SHA-1, is calculated for each block of data. Each unique block is then compressed and stored under its hash in an index, and a pointer to that index entry takes the place of the raw data.
The original data file is replaced by a set of ‘reparse points’ that index into the ‘chunk storage’ containing the compressed, hashed blocks.
When you want to read the file again, Windows transparently reassembles the original blocks using the reparse points and the chunk storage.
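The hash, compress, and point-to-index cycle described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Windows implements it: the fixed 4 KB block size, the in-memory dictionary standing in for the chunk storage, and the names `dedupe` and `reassemble` are all assumptions made for the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def dedupe(data: bytes, chunk_store: dict) -> list:
    """Store each unique block once and return the list of hashes (the 'pointers')."""
    pointers = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha1(block).digest()  # 20-byte SHA-1 of the block
        if digest not in chunk_store:
            # First time this block is seen: compress it and store it under its hash.
            chunk_store[digest] = zlib.compress(block)
        pointers.append(digest)
    return pointers


def reassemble(pointers: list, chunk_store: dict) -> bytes:
    """Rebuild the original data by following each pointer into the chunk store."""
    return b"".join(zlib.decompress(chunk_store[d]) for d in pointers)
```

A file made of three identical blocks and one different block would produce four pointers but only two stored chunks, and `reassemble` would return the original bytes unchanged.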
Deduplication can be used with live data, but where it really shines is in the storage of backup files.
For example, if you take an uncompressed, raw backup file of 600 GB and deduplicate it, it might take 200 GB on disk, with most of the savings on the first day coming from the compression of the data.
When you take the next day’s backup of 600 GB, deduplication will replace most of the data with pointers to existing dedupe chunks, and the total new storage used might be 15 GB.
So an 8 TB deduplicated volume might hold 200 TB of raw backups, or roughly 330 backups of 600 GB each, at a 25-to-1 deduplication ratio.
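The arithmetic behind those figures is easy to check. The short Python sketch below assumes decimal units (1 TB = 1000 GB); the function names are made up for the example.

```python
GB_PER_TB = 1000  # assumption: decimal units


def raw_capacity_tb(volume_tb: float, dedupe_ratio: float) -> float:
    """Raw backup data a deduplicated volume can hold at a given ratio."""
    return volume_tb * dedupe_ratio


def backups_held(volume_tb: float, dedupe_ratio: float, backup_gb: float) -> int:
    """Whole backups of a given raw size that fit on the volume."""
    return int(raw_capacity_tb(volume_tb, dedupe_ratio) * GB_PER_TB // backup_gb)


print(raw_capacity_tb(8, 25))    # raw terabytes represented on an 8 TB volume
print(backups_held(8, 25, 600))  # number of 600 GB backups that fit
```

At 25 to 1, each 600 GB backup averages 24 GB of disk over the life of the volume, which sits between the large first-day footprint and the much smaller daily increments.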
Deduplication makes it feasible to keep many backup cycles online at your fingertips.
This is one of two reasons why companies have spent billions of dollars on deduplication appliances.
The second reason is the hidden benefit of deduplication.
Deduplication reduces the size of each new backup cycle’s data footprint to a fraction of the amount taken by a complete backup.
If a 600 GB backup is reduced to 30 GB of reparse points and new chunk storage, it suddenly becomes reasonable to replicate that data across the internet to a second deduplication volume. Or even a third, or a fourth.
Replication makes it possible to move just the changes to the offsite disaster recovery copy.
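The idea of moving just the changes can be illustrated with a minimal Python sketch. It assumes each site keeps a chunk store keyed by hash, modeled here as a plain dictionary, and the `replicate` function is hypothetical; it does not describe how any particular replication product works.

```python
def replicate(source_store: dict, target_store: dict) -> int:
    """Copy only the chunks the target lacks; return the number of bytes sent."""
    bytes_sent = 0
    for digest, chunk in source_store.items():
        if digest not in target_store:
            # The target has never seen this chunk, so it must cross the wire.
            target_store[digest] = chunk
            bytes_sent += len(chunk)
    return bytes_sent
```

Because chunks already present offsite are skipped, the traffic for each cycle is proportional to the new chunks rather than to the full size of the backup.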
Windows Deduplication does not include replication, but there is a third-party solution called Replacador that runs the Windows Deduplication job and then replicates the changes over the internet, over a local area network, or to an external drive.
With the addition of replication, Windows Deduplication can offer small and medium-sized businesses (SMBs) a backup deduplication solution at a much lower price than a dedicated appliance.