
Autotag optimisation #2368

Merged
merged 12 commits on Mar 9, 2022

Conversation

WithoutPants
Collaborator

Follow up to #2293, #2363, #2336.

Partial resolution to #2366.

  • Adds match.Cache. For now, this caches performers/studios/tags whose names start with a single-letter word. A query is used to fetch these objects explicitly, and the applicable regex condition is removed from the QueryForAutoTag query.
  • Adds an LRU cache to the regexp sqlite function, so that each regex pattern should only be compiled once per query.
  • Changes the behaviour of Query for scenes, images and galleries so that results are not sorted when no sort parameter is passed.
  • Moves regex pattern compilation out of getPathWords into a package-level variable, so the pattern is compiled once rather than on every call (see the sketch below).
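
A minimal sketch of that last point; the pattern and function body here are illustrative only, not the actual stash code:

package match

import "regexp"

// Compiled once when the package is initialised, instead of on every call.
// The pattern below is a stand-in for the real getPathWords pattern.
var pathWordRe = regexp.MustCompile(`[\p{L}\p{N}]+`)

// getPathWords reuses the pre-compiled pattern on each invocation.
func getPathWords(path string) []string {
	return pathWordRe.FindAllString(path, -1)
}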

From my testing on a production database, in a scenario where nothing needed to be updated, the autotag operation completed in about 35% of the time taken by the original, unoptimised code.

Analysis of the CPU profile for the autotag task now indicates that most of the time is spent querying the database rather than compiling regex patterns.
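
For anyone wanting to reproduce this kind of analysis, a CPU profile can be captured with Go's standard runtime/pprof package; this is a generic sketch, not how stash wires up its own profiling, and the file name and workload function are placeholders:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("autotag.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runTask() // stand-in for the autotag work being profiled
}

func runTask() {
	// placeholder workload
}

The resulting file can then be inspected with go tool pprof autotag.pprof (top, list, web).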

@WithoutPants WithoutPants added the improvement Something needed tweaking. label Mar 7, 2022
pkg/match/cache.go (outdated review comments, resolved)
lru "github.com/hashicorp/golang-lru"
)

const regexCacheSize = 10
Collaborator

Add a comment explaining the size? Is it in MB? (The library docs don't mention what the unit is.)

Collaborator

It's the number of entries, and I think it should be orders of magnitude higher. Also, all of the docs examples use powers of two here, but don't mention a requirement or a reason...
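
For reference, hashicorp/golang-lru sizes the cache by entry count, not bytes; a quick sketch of the eviction behaviour (keys and capacity here are arbitrary):

package main

import (
	"fmt"
	"regexp"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	// The argument is the maximum number of entries held before the least
	// recently used entry is evicted; it is not a size in megabytes.
	cache, err := lru.New(2)
	if err != nil {
		panic(err)
	}

	cache.Add("foo.*", regexp.MustCompile("foo.*"))
	cache.Add("bar.*", regexp.MustCompile("bar.*"))
	cache.Add("baz.*", regexp.MustCompile("baz.*")) // evicts "foo.*"

	fmt.Println(cache.Len(), cache.Contains("foo.*")) // 2 false
}

As far as I can tell there is no power-of-two requirement in the library either; the examples just happen to use round binary sizes.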

Collaborator

I thought that the number was too low to be actual entries... We might need to tweak that a bit?

Collaborator Author

I'm flexible on the actual number used. I chose a small number because its most likely use is for a single query - this function gets called for every row in the (filtered) results. It's likely to need no more than 1 or 2 in any given query. After that point, it's just sitting in the cache and is unlikely to be used again. I think increasing the number is probably going to give negligible performance improvements at the cost of more memory usage. I have no idea how much memory a regexp.Regexp instance takes up.

Collaborator

Increasing the number makes sense if we later add more threads/parallel tasks to the autotag process (out of scope for now, though). Another thing to check (probably together with parallel tasks, as by itself it might not lead to substantial performance improvements) would be to use WithReadTxn as much as possible instead of WithTxn; we should only need to write when/if something is matched.

I did some more testing and I get the same CPU usage/performance whether the cache size is 10 or 100. Maybe 10 is enough for now?
Everything tests out OK.
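
A rough sketch of that read-mostly pattern; all of the names below (withReadTxn, withWriteTxn, findMatches, saveMatches, the types) are hypothetical stand-ins rather than the real stash transaction API, and only the control flow is the point:

package main

import "context"

type scene struct{ ID int }
type tagMatch struct{ TagID int }

// Stand-ins for the real transaction and matching helpers.
func withReadTxn(ctx context.Context, fn func(ctx context.Context) error) error  { return fn(ctx) }
func withWriteTxn(ctx context.Context, fn func(ctx context.Context) error) error { return fn(ctx) }
func findMatches(ctx context.Context, s scene) ([]tagMatch, error)               { return nil, nil }
func saveMatches(ctx context.Context, s scene, ms []tagMatch) error              { return nil }

// autotagScene does all matching inside a read-only transaction and only
// opens a write transaction when there is something to persist.
func autotagScene(ctx context.Context, s scene) error {
	var matches []tagMatch

	if err := withReadTxn(ctx, func(ctx context.Context) error {
		var err error
		matches, err = findMatches(ctx, s)
		return err
	}); err != nil {
		return err
	}

	if len(matches) == 0 {
		return nil // nothing matched, so no write transaction is needed
	}

	return withWriteTxn(ctx, func(ctx context.Context) error {
		return saveMatches(ctx, s, matches)
	})
}

func main() {
	_ = autotagScene(context.Background(), scene{ID: 1})
}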

Collaborator

I've increased the cache size to 4096 and didn't see a change in memory usage (unless it was under 5MB), so there may be another issue somewhere...

Collaborator

@bnkai bnkai Mar 8, 2022

I just added a print when the cache is hit, and it seems that, as WithoutPants mentioned above, it is only needed once per query (size = 1).
I made the change below (just for testing) and left it running for a while longer, and it doesn't seem to grow beyond that.

diff --git a/pkg/database/regex.go b/pkg/database/regex.go
index dc7b5feb..cdce9766 100644
--- a/pkg/database/regex.go
+++ b/pkg/database/regex.go
@@ -1,6 +1,7 @@
 package database
 
 import (
+       "fmt"
        "regexp"
 
        lru "github.com/hashicorp/golang-lru"
@@ -12,7 +13,7 @@ import (
 // results. It's likely to only need no more than 1 or 2 in any given query.
 // After that point, it's just sitting in the cache and is unlikely to be used
 // again.
-const regexCacheSize = 10
+const regexCacheSize = 100
 
 var regexCache *lru.Cache
 
@@ -36,6 +37,10 @@ func regexFn(re, s string) (bool, error) {
                regexCache.Add(re, compiled)
        } else {
                compiled = entry.(*regexp.Regexp)
+               length := regexCache.Len()
+               if length > 1 {
+                       fmt.Printf("cache usage %d\n", length)
+               }
        }
 
        return compiled.MatchString(s), nil

@WithoutPants WithoutPants added this to the Version 0.14.0 milestone Mar 7, 2022
pkg/manager/task_autotag.go (outdated review comment, resolved)
@kermieisinthehouse
Collaborator

I've been running autotag from this branch for 45 minutes, and I have no log entries after "starting auto tagging". Is this normal? It's using a ton of CPU so I assume it's busy, but I can't confirm it's actually working.

@bnkai
Collaborator

bnkai commented Mar 7, 2022

> I've been running autotag from this branch for 45 minutes, and I have no log entries after "starting auto tagging". Is this normal? It's using a ton of CPU so I assume it's busy, but I can't confirm it's actually working.

It's working fine, and it is way faster than before.
A single performer autotag takes <1 sec, compared to the 4 secs it took before.
I finished a full autotag task (performers only, though) in ~30 minutes, which I had never run before because it was too slow.
As for CPU usage, it is CPU bound for me as well. I do have the stash db on an SSD and a lot of RAM, though.

@kermieisinthehouse
Collaborator

I think my issue is that, because it goes by insert order now, if you've already autotagged some of the library, it just spins with no log output for the majority of the time, since there are no new matches. All of the "useful" work is toward the end, on the media you've added since your last run (unless you've added new tags).

Some progress logging would be nice, perhaps every 10,000 media items, or when it switches from scenes to images or galleries?
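
Something like the sketch below would cover that kind of periodic progress logging; the interval, logger and function names are placeholders, not the actual stash task code:

package main

import "log"

const progressLogInterval = 10000 // placeholder interval

func main() {
	// sceneIDs stands in for the filtered media being auto-tagged.
	sceneIDs := make([]int, 25000)

	for i := range sceneIDs {
		processScene(sceneIDs[i]) // placeholder for the per-item autotag work

		// Emit a heartbeat periodically so long runs show activity in the
		// log even when nothing new is being matched.
		if (i+1)%progressLogInterval == 0 || i+1 == len(sceneIDs) {
			log.Printf("auto tagging: %d/%d scenes processed", i+1, len(sceneIDs))
		}
	}
}

func processScene(id int) {}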

@bnkai
Collaborator

bnkai commented Mar 7, 2022

In my case I just saw the percentage increasing in the task status.
I only have <100k scenes+images though, so I guess that wouldn't update as quickly for you.
A log print every fixed number of media items, or on a time interval, could be handy though.

@kermieisinthehouse
Collaborator

With this, I project I can tag my entire library in about 5 days instead of 2-3 weeks! There's still some performance left on the table, but it's a huge improvement.

Labels
improvement Something needed tweaking.
3 participants