Data deduplication

Data deduplication ("dedup") is a role service that conserves storage space by storing only one copy of redundant chunks of files. Data duplication is appropriate to specific workloads, like backup volumes and file servers. It is not appropriate for database storage or operating system data or boot volumes.

Data deduplication originally required NTFS; ReFS has been supported since version 1709.

Data deduplication runs as a low-priority background process when the system is idle, by default; however, its behavior can be configured based on its intended usage. Deduplication works by scanning files and breaking them into unique chunks of varying sizes, which are collected in a chunk store; the original files are replaced with reparse points that reference those chunks. Newly written files are stored in the standard, unoptimized form, and the accumulation of such files is known as churn. Other jobs associated with deduplication include garbage collection, integrity scrubbing, and (when disabling deduplication) unoptimization.
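
These jobs can also be triggered manually rather than waiting for the background schedule. The sketch below assumes deduplication is already enabled on a volume mounted at E:; MinimumFileAgeDays controls how long newly written (churn) files are left unoptimized.

```powershell
# Run an optimization job immediately instead of waiting for the idle-time schedule
Start-DedupJob -Volume "E:" -Type Optimization

# Reclaim chunk-store space left behind by deleted or modified files
Start-DedupJob -Volume "E:" -Type GarbageCollection

# Validate the integrity of the chunk store
Start-DedupJob -Volume "E:" -Type Scrubbing

# Rehydrate all files back to their unoptimized form (run before disabling deduplication)
Start-DedupJob -Volume "E:" -Type Unoptimization

# Monitor running jobs and overall savings
Get-DedupJob
Get-DedupStatus -Volume "E:"

# Leave recent churn alone by only optimizing files older than a given number of days
Set-DedupVolume -Volume "E:" -MinimumFileAgeDays 3
```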

Several deployment scenarios are well suited to data deduplication:

  • General purpose file servers: Users often store multiple copies of the same, or similar, documents and files; 30-50% of this space can typically be reclaimed using deduplication.
  • Virtualized Desktop Infrastructure (VDI) deployments: Virtual hard disks used for remote desktops are essentially identical. Deduplication can also ameliorate the drop in storage performance when many users log in simultaneously at the start of the day, known as a VDI boot storm.
  • Backup targets: Backup snapshots are an ideal deployment scenario because the data is highly duplicative.

Deduplication is especially useful for disk drive backups, since snapshots typically differ little from each other.
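
These scenarios correspond to the -UsageType values accepted by Enable-DedupVolume, which preconfigure the schedule and file-exclusion settings for each workload; the drive letters below are placeholders.

```powershell
# General purpose file server (the default usage type)
Enable-DedupVolume -Volume "E:" -UsageType Default

# VDI deployments storing virtual hard disks
Enable-DedupVolume -Volume "F:" -UsageType HyperV

# Virtualized backup applications and backup targets
Enable-DedupVolume -Volume "G:" -UsageType Backup
```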
