Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autotag scraper #1817

Merged
merged 9 commits into from
Oct 11, 2021
Merged

Conversation

WithoutPants
Copy link
Collaborator

  • Moved autotag matching logic, and scraped object matching logic into the new common package match.
  • Refactored the scraper code a little to make it easier to add the new scraper implementation.
  • Added new built-in scraper Auto Tag. This scraper finds performers, studios and tags only and is supported for scene and gallery scrapes. It operates using the same logic as the auto-tagger.

@WithoutPants WithoutPants added the feature Pull requests that add a new feature label Oct 7, 2021
@WithoutPants WithoutPants added this to the Version 0.11.0 milestone Oct 7, 2021
@kermieisinthehouse
Copy link
Collaborator

✓ Code reviewed
✓ Tested scraping auto-tag from scene
✓ Tested auto-tag a performer
✓ Tested auto-tag task, and selective auto-tag

Auto-tagging just one scene / image is a nice touch!

return ret, nil
}

func PathToStudios(path string, reader models.StudioReader) ([]*models.Studio, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These PathTo* functions should probably be refactored. Instead of doing 3 database hits for every path and then repeatedly compiling the REs for each candidate match, we should probably enumerate all 'names' (Tags+aliases, Studios+aliases, Performer names) at the beginning of an autotask job and memoize their compiled REs at a higher level. The queries and compiles add up, and I suspect narrowing down the candidate list through QueryForAutoTag doesn't save a ton of time comparatively.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some reflection and some rough experiments, the queries are probably fine, but we should still memoize the compiled RE objects

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did start implementing RE caching since initial benchmarks suggested that there would be a performance benefit. However, after analysing the way that this code is used, the benefit would be limited to the autotag task only, and would require creating the cache in task_autotag.go and passing it down through all of the subsequent function calls. This is fine, but well out of scope for this PR, and the effort involved is non-trivial.

@kermieisinthehouse
Copy link
Collaborator

Retested with new changes

@WithoutPants WithoutPants merged commit e9d4868 into stashapp:develop Oct 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Pull requests that add a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants