# Python 3 duplicate file finder
#### Usage
`dedup.py [-v] path1 path2 ...`
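For example, `dedup.py -v ~/photos ~/backup` would scan both directory trees and report duplicates found across them (the paths here are just placeholders, and the `-v` flag is optional).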
Scans the given paths looking for duplicate files. The process is as follows (minimal sketches of each stage appear after the list):
- gather all files by size (`map[size] = [file list]`)
- read the first 1 KB of each same-size file and hash it with SHA-256
- group files that share both size and 1 KB hash
- if the files in such a group are 1 KB or smaller, the hash covers the whole file, so mark them as duplicates
- if the files in the group are larger than 1 KB, enqueue the group for further testing
- for each file in an enqueued group, compute a new SHA-256 hash over the entire file
- form new groups based on the full-file hashes
- files whose full-file hashes match are duplicates
- order groups by largest file first; within each group, mark the earliest file as the "original" and all others as duplicates
- report the results
- skip files smaller than 100 bytes
- skip thumbnail files
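The sketch below illustrates the first stage under some assumptions: it walks the given paths with `os.walk`, applies the size and thumbnail filters, and buckets surviving files by size. The `THUMBNAIL_NAMES` set and the name-based thumbnail test are illustrative assumptions, not the tool's actual filter.

```python
# Stage 1 sketch: walk the given paths and bucket files by size.
import os
from collections import defaultdict

THUMBNAIL_NAMES = {"Thumbs.db"}  # assumed name-based filter; adjust as needed

def group_by_size(paths):
    by_size = defaultdict(list)  # map[size] = [file list]
    for root_path in paths:
        for dirpath, _dirnames, filenames in os.walk(root_path):
            for name in filenames:
                if name in THUMBNAIL_NAMES:
                    continue  # skip thumbnails
                full = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(full)
                except OSError:
                    continue  # unreadable entry; skip it
                if size < 100:
                    continue  # skip files < 100 bytes
                by_size[size].append(full)
    # only sizes shared by two or more files can contain duplicates
    return {size: files for size, files in by_size.items() if len(files) > 1}
```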
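A possible second stage, again only a sketch: hash the first 1 KB of each same-size file with SHA-256, mark groups of files no larger than 1 KB as duplicates outright (the hash already covers the whole file), and enqueue larger groups for full hashing. The function names are illustrative.

```python
# Stage 2 sketch: group same-size files by a SHA-256 hash of their first 1 KB.
import hashlib
from collections import defaultdict

def partial_hash(path, chunk=1024):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(chunk)).hexdigest()

def group_by_partial_hash(by_size):
    confirmed, needs_full = [], []
    for size, files in by_size.items():
        groups = defaultdict(list)
        for path in files:
            try:
                groups[partial_hash(path)].append(path)
            except OSError:
                continue  # unreadable file; skip it
        for same in groups.values():
            if len(same) < 2:
                continue  # unique partial hash, not a duplicate
            if size <= 1024:
                confirmed.append(same)   # hash covered the whole file
            else:
                needs_full.append(same)  # enqueue for full-file hashing
    return confirmed, needs_full
```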
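And a sketch of the final stage: stream each remaining file through SHA-256 in chunks (so large files never load into memory at once), regroup on the full hash, sort groups largest-first, and report the first path in each group as the original. The 1 MB read buffer is an assumption; any chunk size works.

```python
# Stage 3 sketch: full-file hashing, then ordering and reporting.
import hashlib
import os
from collections import defaultdict

def full_hash(path, chunk=1 << 20):  # 1 MB chunks, an arbitrary choice
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def confirm_and_report(confirmed, needs_full):
    for candidates in needs_full:
        groups = defaultdict(list)
        for path in candidates:
            try:
                groups[full_hash(path)].append(path)
            except OSError:
                continue  # unreadable file; skip it
        confirmed.extend(g for g in groups.values() if len(g) > 1)
    # largest files first; all files in a group share a size
    confirmed.sort(key=lambda g: os.path.getsize(g[0]), reverse=True)
    for group in confirmed:
        original, *dupes = group  # earliest file is the "original"
        print(f"original: {original}")
        for d in dupes:
            print(f"  duplicate: {d}")
```

Wired together, the three sketches would run as `confirm_and_report(*group_by_partial_hash(group_by_size(paths)))`.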
This staged approach is efficient because most files are never fully read: only files that share a size are hashed at all, and only files that also share a 1 KB hash are hashed in full.
Sample run:

```
Scanned 54917 files, found 32420 duplicate files, time: 20 seconds
```