Identify duplicate files on a Linux filesystem and generate a script to hardlink them.
- Scans the source folder and records each file's name, size, and index node (inode) on the drive
- Checks the scan folder for files with an identical name and file size
- This detection method is not as reliable as comparing file hashes, but for this use case it is likely sufficient (see case study)
- The inode of the file in the scan folder is compared to the inode recorded in the map (from the source folder)
- If the inode matches, the file is already a hard link and is ignored
- If the inode does not match, the file is noted as a duplicate
- For each identified duplicate, de-duplication instructions are written to the output file, similar to the following:
rm 'scanFilename'
ln 'sourceFilename' 'scanFilename'
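The detection steps above can be sketched as follows. This is an illustrative Python outline of the logic, not Spacelink's actual implementation (which is a compiled binary); the function names are invented for the example:

```python
import os

def build_map(source):
    """Map (file name, size) -> (inode, full path) for every file in source."""
    mapping = {}
    for root, _dirs, files in os.walk(source):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            mapping[(name, st.st_size)] = (st.st_ino, path)
    return mapping

def find_duplicates(source, scan):
    """Yield (source_path, scan_path) pairs for duplicates not yet hardlinked."""
    mapping = build_map(source)
    for root, _dirs, files in os.walk(scan):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            hit = mapping.get((name, st.st_size))
            if hit is None:
                continue              # no same-name, same-size file in source
            inode, src_path = hit
            if inode == st.st_ino:
                continue              # already a hard link to the source file
            yield src_path, path      # duplicate: same name and size, different inode

def write_script(source, scan, output):
    """Write rm/ln de-duplication instructions for every duplicate found."""
    with open(output, "w") as out:
        for src_path, dup_path in find_duplicates(source, scan):
            out.write(f"rm '{dup_path}'\n")
            out.write(f"ln '{src_path}' '{dup_path}'\n")
```

Note that the generated script is only written, never executed, so the results can be reviewed before any file is removed.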
- Use CMake to generate the Makefile:
cmake .
- Use the make utility to build the binary:
make
- Run the binary:
./spacelink --source <path> --scan <path> --output <file name>
- The source folder contains the original files you want to keep
- The scan folder contains the duplicates you want to remove and replace with hardlinks
- The output file is the path where the de-duplication script is written. It can be anywhere on your filesystem and doesn't have to be inside the folders you're scanning.
- Do not include trailing slashes on source or scan arguments
- Transmission (a download client) downloaded large files to
/downloads
- Plex Media Server had a library on the folder
/tv
- A third application (Sonarr) detects new files in
/downloads
and hard-links them to /tv
- Due to permission issues, Sonarr copied the files from downloads rather than hard-linking them
- This resulted in unnecessary duplicate data being written to the drive
- After consolidating the files using Spacelink, over 47 GB of disk space was saved
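The wasted space in the case study comes down to inodes: a copied file gets a fresh inode and its own data blocks, while a hard link reuses the original's. A quick way to check whether two paths already share storage (illustrative helper, not part of Spacelink) is to compare their device and inode numbers:

```python
import os

def same_file(path_a, path_b):
    """Return True if the two paths are hard links to the same data.

    Two paths refer to the same file only when both the device ID and the
    inode number match; a byte-for-byte copy gets a new inode and therefore
    occupies its own disk space.
    """
    st_a, st_b = os.stat(path_a), os.stat(path_b)
    return (st_a.st_dev, st_a.st_ino) == (st_b.st_dev, st_b.st_ino)
```

Running such a check across /downloads and /tv before and after consolidation is one way to confirm the hard links were actually created.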