Got a lot of images with many duplicates, maybe at different sizes? imagedup uses perceptual hashing to find images that are close in appearance but not byte-for-byte identical. Once imagedup has finished, the verify tool reads the delete log and opens the images in pairs so you can double-check them before they are deleted. This step is necessary because perceptual hashing is not perfect and will occasionally pair two completely different images. A second tool, uniqdirs, takes the same options as nsquared and dedupes within directories, treating each directory as its own unique set. This is helpful for more organized directory layouts.
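For a sense of what "close in appearance" means, here is a minimal sketch of perceptual-hash comparison. It assumes the github.com/corona10/goimagehash library, JPEG inputs, and a hand-picked distance threshold; imagedup's actual hashing and threshold may differ.

```go
package main

import (
	"fmt"
	"image/jpeg"
	"os"

	"github.com/corona10/goimagehash"
)

// hashFile decodes a JPEG and returns its 64-bit perception hash.
func hashFile(path string) (*goimagehash.ImageHash, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	img, err := jpeg.Decode(f)
	if err != nil {
		return nil, err
	}
	return goimagehash.PerceptionHash(img)
}

func main() {
	a, err := hashFile("a.jpg")
	if err != nil {
		panic(err)
	}
	b, err := hashFile("b.jpg")
	if err != nil {
		panic(err)
	}

	// Hamming distance between the two hashes; small distances mean the
	// images look alike even if their bytes (or sizes) differ.
	dist, err := a.Distance(b)
	if err != nil {
		panic(err)
	}

	// The threshold here is an arbitrary example, not imagedup's value.
	if dist <= 10 {
		fmt.Printf("likely duplicates (distance %d)\n", dist)
	} else {
		fmt.Printf("probably different (distance %d)\n", dist)
	}
}
```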
./nsquared -cache-file cache.json -output-file delete.log -dir /path/to/images -threads 5 -dedup-file-pairs
# OR
./uniqdirs -cache-file cache.json -output-file delete.log -dir /path/to/images -threads 5 -dedup-file-pairs
# this will create delete.log which will be used by the verify tool.
./verify -delete-file delete.log
print help:
imagedup -h
The cache contains hashes that correspond to the images in -dir, so if -dir changes, -cache-file should change too, e.g.
-cache-file one.json -dir /path/to/one
-cache-file two.json -dir /path/to/two
Passing a -cache-file with a different -dir will result in an error, e.g.
-cache-file one.json -dir /path/to/two
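The check behind that error can be pictured roughly like this. This is a hypothetical sketch, not imagedup's actual cache format: it assumes the cache JSON records the directory it was built for and refuses to load when a different -dir is passed.

```go
package cache

import (
	"encoding/json"
	"fmt"
	"os"
)

// hashCache is a hypothetical cache layout: the directory the hashes were
// computed for, plus a map of file path -> perceptual hash.
type hashCache struct {
	Dir    string            `json:"dir"`
	Hashes map[string]uint64 `json:"hashes"`
}

// loadCache reads the cache file and errors out if it was built for a
// different -dir than the one being scanned now.
func loadCache(cacheFile, dir string) (*hashCache, error) {
	data, err := os.ReadFile(cacheFile)
	if err != nil {
		return nil, err
	}

	var c hashCache
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}

	if c.Dir != dir {
		return nil, fmt.Errorf("cache %s was built for %s, not %s", cacheFile, c.Dir, dir)
	}
	return &c, nil
}
```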
Pair deduping is done with a roaring bitmap, which cuts the number of comparisons roughly in half but increases memory usage, a tradeoff you will need to consider. This feature is disabled by default and can be enabled by passing -dedup-file-pairs.
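To see why the bitmap halves the work, here is a rough sketch of skipping already-seen file pairs. It uses github.com/RoaringBitmap/roaring/roaring64; the pair-to-key encoding and the helper names are illustrative assumptions, not imagedup's internals.

```go
package pairs

import "github.com/RoaringBitmap/roaring/roaring64"

// pairKey packs an (i, j) file-index pair into one uint64, ordering the
// indices so that (i, j) and (j, i) map to the same key.
func pairKey(i, j, numFiles uint64) uint64 {
	if i > j {
		i, j = j, i
	}
	return i*numFiles + j
}

// comparePairs walks every ordered pair of files but, when dedup is on,
// records each unordered pair in a roaring bitmap and skips it the second
// time it comes around, roughly halving the comparisons at the cost of
// holding the bitmap in memory.
func comparePairs(files []string, dedup bool, compare func(a, b string)) {
	seen := roaring64.New()
	n := uint64(len(files))

	for i := uint64(0); i < n; i++ {
		for j := uint64(0); j < n; j++ {
			if i == j {
				continue
			}
			if dedup {
				key := pairKey(i, j, n)
				if seen.Contains(key) {
					continue // already compared this pair in the other order
				}
				seen.Add(key)
			}
			compare(files[i], files[j])
		}
	}
}
```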
INFO[2022-09-15 11:29:32] Found 31722 dirs
INFO[2022-09-15 11:29:32] Started, go to grafana to monitor
INFO[2022-09-15 11:51:34] Shutting down
INFO[2022-09-15 11:51:34] Total time taken: 22m2.221316446s
INFO[2022-09-15 11:56:28] Found 31722 dirs
INFO[2022-09-15 11:56:28] Started, go to grafana to monitor
INFO[2022-09-15 12:13:52] Shutting down
INFO[2022-09-15 12:13:52] Total time taken: 17m24.991176074s
The first run is without file-pair deduping; the second is with it.