Spacelink

Identify duplicate files on a Linux filesystem and generate a script to hardlink them.

How it works

  • Scans the source folder and builds a map of its files' names, sizes, and index nodes (inodes)
  • Checks the scan folder for files with an identical name and file size
    • Matching on name and size is less reliable than hashing file contents, but for the intended use case it is likely sufficient (see the case study below)
    • The inode of the file in the scan folder is compared to the inode recorded in the map (from the source folder)
      • If the inodes match, the file is already a hard link to the source file and is ignored
      • If the inodes do not match, the file is noted as a duplicate
  • For every identified duplicate, de-duplication commands are written to the output file (a minimal sketch of this process follows the list); they look similar to the following:
    • rm 'scanFilename'
    • ln 'sourceFilename' 'scanFilename'
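
A minimal C++ sketch of this detection and script-generation step, assuming std::filesystem and stat(); the function names and structure here are illustrative, not taken from the project's actual source:

    #include <sys/stat.h>
    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <map>
    #include <string>
    #include <utility>

    namespace fs = std::filesystem;

    // Key for the map built from the source folder: file name + file size.
    using Key = std::pair<std::string, std::uintmax_t>;
    struct SourceEntry { ino_t inode; fs::path path; };

    std::map<Key, SourceEntry> buildSourceMap(const fs::path& source) {
        std::map<Key, SourceEntry> m;
        for (const auto& e : fs::recursive_directory_iterator(source)) {
            if (!e.is_regular_file()) continue;
            struct stat st{};
            if (stat(e.path().c_str(), &st) == 0)
                m[{e.path().filename().string(),
                   static_cast<std::uintmax_t>(st.st_size)}] = {st.st_ino, e.path()};
        }
        return m;
    }

    void writeDedupScript(const fs::path& scan,
                          const std::map<Key, SourceEntry>& src,
                          std::ofstream& out) {
        for (const auto& e : fs::recursive_directory_iterator(scan)) {
            if (!e.is_regular_file()) continue;
            struct stat st{};
            if (stat(e.path().c_str(), &st) != 0) continue;
            auto it = src.find({e.path().filename().string(),
                                static_cast<std::uintmax_t>(st.st_size)});
            if (it == src.end()) continue;               // no name+size match in source
            if (it->second.inode == st.st_ino) continue; // already a hard link; ignore
            // Same name and size but a different inode: emit the rm + ln commands.
            out << "rm '" << e.path().string() << "'\n";
            out << "ln '" << it->second.path.string() << "' '"
                << e.path().string() << "'\n";
        }
    }

Note that comparing inodes only makes sense when both folders are on the same filesystem, which is also a requirement for hard links themselves.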

Usage Instructions

  1. Use cmake to generate the Makefile: cmake .
  2. Use make to build the binary: make
  3. Run the binary:

./spacelink --source <path> --scan <path> --output <file name>

  • The source folder contains the original files you want to keep
  • The scan folder contains the duplicates you want to remove and replace with hardlinks
  • The output file is the path the de-duplication script is written to; it can be anywhere on your filesystem and does not have to be inside the folders you are scanning
  • Do not include trailing slashes on the source or scan arguments (see the example invocation after this list)
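
For example, matching the case study below (the paths and the output name dedupe.sh are illustrative), keeping the originals in /downloads and replacing the copies under /tv:

    ./spacelink --source /downloads --scan /tv --output dedupe.sh
    sh dedupe.sh

Review the generated script before running it, since it removes each duplicate file before recreating it as a hard link.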

Case Study

  • Transmission (a download client) downloaded large files to /downloads
  • Plex Media Server had a library in the folder /tv
  • A third application (Sonarr) was set up to detect new files in /downloads and hard-link them into /tv
  • Due to permission issues, Sonarr copied the files from /downloads rather than hard-linking them
  • This resulted in unnecessary duplicate data being written to the drive
  • After consolidating the files using Spacelink, over 47 GB of disk space was saved
