Add phash generation and dupe checking #1158
I've done extensive testing of this against a huge (40k) collection with many known dupes of different sorts. The UI needs to be tweaked to be similar to Stash's lists and features now: checkbox selection like cards have, pagination, filters, and an option alongside delete (most likely tagging). It was fast, and it's been really accurate: I haven't found a false positive yet (working through Accurate, it found 2k dupes), and it detected multiple resolutions, formats, bitrates, etc. Recommended for inclusion. It built fine without errors, and the schema upgraded easily.
The PR has been updated to decouple phash generation from sprites. This will make generation a lot slower, but it frees us up to change sprites in the future without having to worry about phashes. I've also changed the number of sprites to 5x5 and added offsets to avoid generating sprites for intros/outros. The match rate is significantly improved with these changes, with very few false positives except a few on the lowest accuracy level.

I experimented with different sprite grids: 3x3, 3x4, 4x3, 4x4, 5x5, and 6x6. 3x4/4x3 give very bad results, presumably because the image is essentially squished. 3x3 and 6x6 both yielded more false positives. 4x4 and 5x5 have similar results, with 5x5 having a slightly better match rate and fewer false positives. 5x5 is 50% slower to generate than 4x4, but for long-term scene identification purposes I prefer using the best possible algorithm even if it's a bit slower.
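To illustrate the offset idea described above, here is a minimal sketch of picking 25 frame timestamps for a 5x5 grid while skipping the start and end of a video. The function name `sampleTimestamps`, the `skip` parameter, and the 5% value are assumptions made for illustration; the PR does not state the actual offsets it uses.

```go
package main

import "fmt"

// sampleTimestamps returns cols*rows evenly spaced timestamps (in seconds)
// taken from the middle of a video. The fraction `skip` of the duration is
// ignored at both the start and the end, so intros and outros are not sampled.
// Hypothetical helper: the name and parameters are not from the PR.
func sampleTimestamps(duration float64, cols, rows int, skip float64) []float64 {
	n := cols * rows
	if n <= 0 || duration <= 0 || skip < 0 || skip >= 0.5 {
		return nil
	}
	start := duration * skip          // first sampled position
	span := duration * (1 - 2*skip)   // length of the sampled window
	step := span / float64(n)         // distance between samples
	out := make([]float64, n)
	for i := range out {
		out[i] = start + step*float64(i)
	}
	return out
}

func main() {
	// Assumed values: a 1000-second video, 5x5 grid, 5% skipped at each end.
	ts := sampleTimestamps(1000, 5, 5, 0.05)
	fmt.Println(len(ts), ts[0], ts[len(ts)-1])
}
```

Each timestamp would then be used to extract one frame, and the frames tiled into a single image before hashing.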
Overall I think it looks good. Can you please move the changelog entry to the 0.7 file (I don't have access to the branch to do it myself).
The dupe checker should probably be changed to be a view in the scenes page, but I'm happy to leave that to a later change.
Adds a task that generates perceptual hashes for all scenes with the phash algorithm. This stores a 64-bit integer in the scenes table, which can be queried for equality or for distance from equality. Generation takes about 15 minutes on my 11k scenes; querying is more or less instant, at least for the highest level of accuracy.
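"Distance from equality" between two phashes is normally measured as the Hamming distance, i.e. the number of differing bits. A minimal sketch of that comparison follows; the function names and the use of `uint64` are assumptions for illustration and not the code in this PR.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance returns the number of bit positions in which two
// 64-bit perceptual hashes differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// isDuplicate reports whether two hashes are within maxDistance bits of
// each other. A maxDistance of 0 means the hashes must be exactly equal.
// Hypothetical helper: the name and threshold semantics are assumptions.
func isDuplicate(a, b uint64, maxDistance int) bool {
	return hammingDistance(a, b) <= maxDistance
}

func main() {
	a := uint64(0xF0F0F0F0F0F0F0F0)
	b := a ^ 0b101 // flip two bits of a
	fmt.Println(hammingDistance(a, b))
	fmt.Println(isDuplicate(a, b, 0), isDuplicate(a, b, 4))
}
```

An exact-match query only needs an equality comparison on the stored column, while looser accuracy levels would compare pairs of hashes against a larger `maxDistance`.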
Also adds a settings view with metadata/sprite comparison, and a delete option.
This can hopefully be used for querying stash-box. The results will likely be an order of magnitude better than oshash/md5, even if we only look at exact matches. We ideally need error handling first though, so we don't process blank/corrupted sprites.
I also looked at using larger hashes, 8x16 and 16x16. This leads to a lot more granularity, but that's not necessarily useful, since it means far fewer matches will be exact and you need to use much larger Hamming distances to get similar results.
Todo:
sqlite/scenes.go