Add phash generation and dupe checking #1158

Merged
merged 25 commits into from
Apr 11, 2021

Conversation

@ghost ghost commented Feb 28, 2021

Adds a task that generates perceptual hashes for all scenes using the phash algorithm. Each hash is stored as a 64-bit integer in the scenes table, where it can be queried for exact equality or for distance from an exact match. Generation takes about 15 minutes for my 11k scenes; querying is more or less instant, at least at the highest accuracy level.
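As a rough illustration of what "distance from equality" can mean for a 64-bit hash, here is a minimal sketch that assumes the distance is the Hamming distance (the number of differing bits). The function names and the `maxDistance` threshold are hypothetical and are not taken from this PR's code.

```go
package main

import (
	"fmt"
	"math/bits"
)

// hammingDistance returns the number of bit positions in which two
// 64-bit perceptual hashes differ. 0 means the hashes are identical.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// isDuplicate reports whether two hashes differ in at most maxDistance bits.
// maxDistance is a hypothetical threshold; 0 accepts exact matches only.
func isDuplicate(a, b uint64, maxDistance int) bool {
	return hammingDistance(a, b) <= maxDistance
}

func main() {
	a := uint64(0xF0F0F0F0F0F0F0F0)
	b := a ^ 0x5 // flip two of the low bits (binary 101)

	fmt.Println(hammingDistance(a, a)) // 0
	fmt.Println(hammingDistance(a, b)) // 2
	fmt.Println(isDuplicate(a, b, 0))  // false
	fmt.Println(isDuplicate(a, b, 4))  // true
}
```

An exact-match query only needs an equality comparison on the stored integer, which is consistent with the near-instant lookups described above; looser accuracy levels would need a distance computation like this one for each candidate pair.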

Also adds a settings view with metadata/sprite comparison, and a delete option.

This can hopefully be used for querying stash-box. The results will likely be an order of magnitude better than oshash/md5, even if we only look at exact matches. Ideally we need error handling first, though, so we don't process blank or corrupted sprites.

I also looked at using larger hashes, 8x16 and 16x16. This gives a lot more granularity, but that's not necessarily useful, since far fewer matches will be exact and you need much larger Hamming distances to get similar results.
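To make the point about larger hashes concrete, here is a small sketch showing how a threshold that tolerates a fixed fraction of differing bits grows with hash size. The base threshold of 4 bits is an arbitrary illustrative value, not one used by this PR, and `scaledThreshold` is a hypothetical helper.

```go
package main

import "fmt"

// scaledThreshold converts a Hamming-distance threshold chosen for a
// baseBits-wide hash into the threshold that tolerates the same fraction
// of differing bits in a hashBits-wide hash (integer division, rounds down).
func scaledThreshold(baseThreshold, baseBits, hashBits int) int {
	return baseThreshold * hashBits / baseBits
}

func main() {
	// 8x8 = 64 bits, 8x16 = 128 bits, 16x16 = 256 bits.
	for _, hashBits := range []int{64, 128, 256} {
		fmt.Printf("%d-bit hash: threshold %d\n", hashBits, scaledThreshold(4, 64, hashBits))
	}
}
```

Under this assumption, a 4-bit tolerance on a 64-bit hash corresponds to 8 bits at 8x16 and 16 bits at 16x16.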

Todo:

  • During the task count stage, filter out scenes with existing phashes, or without sprites.
  • Clean up sqlite/scenes.go.
  • Fix dupe view table so it doesn't need a huge screen.

@scruffynerf

I've done some extensive testing of this against a huge (40k) collection with many known dupes of different sorts. The UI needs to be tweaked to be similar to the existing Stash lists and features, including checkbox selection like the cards have, pagination, filters, and an option besides delete (tagging, most likely).

It was fast and it's been really accurate. I haven't found a false positive yet (working through Accurate, it found 2k dupes), and it detected dupes across multiple resolutions, formats, bitrates, etc. Recommended for inclusion. It built fine without errors, and the schema upgraded easily.

@ghost ghost force-pushed the dupe-checker branch 2 times, most recently from 76271ec to ffb09f5 Compare March 10, 2021 23:14
@ghost
Author

ghost commented Mar 17, 2021

PR has been updated to decouple phash generation from sprites. This will make generation a lot slower, but will free us up to change sprites in the future without having to worry about phashes. I've also changed the number of sprites to 5x5 and added offsets to avoid generating sprites for intros/outros. The match rate is significantly improved with these changes, with very few false positives except a few on the lowest accuracy level.
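The idea of sampling a 5x5 grid of frames while skipping intros and outros can be sketched roughly as follows. This is not the PR's actual code: the 5% offset at each end and the `sampleTimes` helper are assumptions made for illustration, since the comment does not state the offset used.

```go
package main

import "fmt"

// sampleTimes returns count timestamps (in seconds) spread evenly across
// the middle of a video, skipping offsetFraction of the duration at both
// the start and the end.
func sampleTimes(duration float64, count int, offsetFraction float64) []float64 {
	if count <= 0 {
		return nil
	}
	start := duration * offsetFraction
	span := duration * (1 - 2*offsetFraction)
	step := span / float64(count)

	times := make([]float64, count)
	for i := range times {
		times[i] = start + step*float64(i)
	}
	return times
}

func main() {
	// 25 frames for a 5x5 grid, with a hypothetical 5% offset at each end.
	times := sampleTimes(1000, 25, 0.05)
	fmt.Println(len(times), times[0], times[len(times)-1])
}
```

Every sampled frame falls inside the middle 90% of the video here, so title cards and credits at either end would not contribute to the hash.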

I experimented with different sprite grids: 3x3, 3x4, 4x3, 4x4, 5x5, and 6x6. 3x4 and 4x3 give very bad results, presumably because the image is essentially squished. 3x3 and 6x6 both yielded more false positives. 4x4 and 5x5 have similar results, with 5x5 having a slightly better match rate and fewer false positives. 5x5 is 50% slower to generate than 4x4, but for long-term scene identification I prefer using the best possible algorithm even if it's a bit slower.

@WithoutPants WithoutPants added the feature Pull requests that add a new feature label Mar 19, 2021
@WithoutPants WithoutPants added this to the Version 0.7.0 milestone Mar 22, 2021
pkg/scene/import.go Outdated Show resolved Hide resolved
@ghost ghost force-pushed the dupe-checker branch from dd8a5c0 to 72d6881 Compare March 31, 2021 19:32
Collaborator

@WithoutPants WithoutPants left a comment

Overall I think it looks good. Can you please move the changelog entry to the 0.7 file? (I don't have access to the branch to do it myself.)

The dupe checker should probably be changed to be a view in the scenes page, but I'm happy to leave that to a later change.

Labels
feature Pull requests that add a new feature
3 participants