# dupfilefinder

Python 3 duplicate file finder

#### Usage:

    dedup.py [-v] path1 path2 ...

Scans through paths looking for duplicate files.
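A minimal sketch of this command-line interface, assuming `argparse`; only `-v` and the positional paths come from the usage line above, and the long option name and help text are illustrative:

```python
import argparse

def parse_args():
    # CLI sketch for dedup.py; only -v and the path arguments are
    # taken from the usage line, everything else is illustrative.
    parser = argparse.ArgumentParser(
        description="Scan paths for duplicate files.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress while scanning")
    parser.add_argument("paths", nargs="+",
                        help="one or more directories to scan")
    return parser.parse_args()
```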

The process is:

  1. Gather all files by size (map[size] = [file list]).
  2. Read the first 1k bytes of each same-size file and hash them with SHA-256.
  3. Gather files with the same size and the same hash.
  4. If the files in a gathered group are 1k or less in size, the 1k hash covered the entire file: mark them as duplicates.
  5. If the files in a gathered group are more than 1k in size, enqueue them for further testing.
  6. For each file larger than 1k in a group gathered by size and 1k-byte hash, create a new hash by reading through the entire file.
  7. Create new groups based on the new full-file hashes.
  8. If the full-file hashes are the same, mark the files as duplicates.
  9. Order by largest file first; within each group, mark the earliest file as the "original" and all others as duplicates.
  10. Report (a sketch of the whole pipeline follows this list).
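A minimal sketch of steps 1-9 in Python; the function and variable names are illustrative, not the script's actual internals:

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 1024  # bytes hashed in the quick first pass

def hash_file(path, limit=None):
    """SHA-256 of the first `limit` bytes, or of the whole file if limit is None."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if limit is not None:
            h.update(f.read(limit))
        else:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
    return h.hexdigest()

def find_duplicates(paths):
    # Step 1: gather all files by size (map[size] = [file list]).
    by_size = defaultdict(list)
    for top in paths:
        for dirpath, _, names in os.walk(top):
            for name in names:
                full = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(full)
                except OSError:
                    continue
                by_size[size].append(full)

    groups = []
    for size, files in by_size.items():
        if len(files) < 2:
            continue
        # Steps 2-3: group same-size files by a hash of their first 1k bytes.
        by_quick = defaultdict(list)
        for f in files:
            by_quick[hash_file(f, limit=CHUNK)].append(f)
        for quick_group in by_quick.values():
            if len(quick_group) < 2:
                continue
            if size <= CHUNK:
                # Step 4: the quick hash already covered the whole file.
                groups.append(quick_group)
            else:
                # Steps 5-8: confirm larger files with a full-file hash.
                by_full = defaultdict(list)
                for f in quick_group:
                    by_full[hash_file(f)].append(f)
                groups.extend(g for g in by_full.values() if len(g) > 1)

    # Step 9: largest files first; the first file in each group is the "original".
    groups.sort(key=lambda g: os.path.getsize(g[0]), reverse=True)
    return groups
```

Note that files of 1k or less never reach the full-file pass, since the quick hash already covers their entire contents.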

#### Skip patterns

  • Skip files smaller than 100 bytes.
  • Skip thumbnails (a sketch of this filter follows the list).
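A sketch of such a filter; the 100-byte floor comes from the list above, while the specific thumbnail checks (a `Thumbs.db` name or a `.thumbnails` path component) are assumptions about what counts as a thumbnail:

```python
import os

MIN_SIZE = 100  # bytes; files below this are skipped

def should_skip(path):
    """Return True for files the scan should ignore."""
    try:
        if os.path.getsize(path) < MIN_SIZE:
            return True
    except OSError:
        return True  # unreadable files are skipped too
    # Assumed thumbnail heuristics; the real script may differ.
    name = os.path.basename(path).lower()
    parts = path.lower().split(os.sep)
    return name == "thumbs.db" or ".thumbnails" in parts
```

A predicate like this would be applied during the directory walk, before files are grouped by size.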

#### Discussion

This process identifies duplicate files efficiently: full contents are read only for same-size files whose first 1k bytes also hash identically, so most files are never read past the first kilobyte. A sample run:

    Scanned 54917 files, found 32420 duplicate files, time: 20 seconds
