Autotag optimisation #2368
Conversation
```go
	lru "github.com/hashicorp/golang-lru"
)

const regexCacheSize = 10
```
Add a comment for the size? Is it in MB? (The docs in the library don't mention anything.)
It's the number of entries, and I think it should be orders of magnitude higher. Also, all of the docs examples use powers of two here, but don't mention a requirement or a reason...
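To make the size semantics concrete, here is a minimal stdlib-only stand-in for the library's LRU (the real hashicorp/golang-lru API is richer and concurrency-safe). It shows that the size parameter bounds the number of entries, not bytes, and nothing requires it to be a power of two:

```go
package main

import (
	"container/list"
	"fmt"
)

// tinyLRU is a hypothetical, simplified stand-in for hashicorp/golang-lru,
// shown only to illustrate that "size" counts entries, not bytes.
type tinyLRU struct {
	max   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> list node
}

type pair struct {
	key string
	val interface{}
}

func newTinyLRU(max int) *tinyLRU {
	return &tinyLRU{max: max, order: list.New(), items: map[string]*list.Element{}}
}

func (c *tinyLRU) Add(key string, val interface{}) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		el.Value.(*pair).val = val
		return
	}
	c.items[key] = c.order.PushFront(&pair{key, val})
	if c.order.Len() > c.max { // evict the least recently used entry
		last := c.order.Back()
		c.order.Remove(last)
		delete(c.items, last.Value.(*pair).key)
	}
}

func (c *tinyLRU) Get(key string) (interface{}, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*pair).val, true
}

func main() {
	c := newTinyLRU(2) // capacity of two entries, regardless of value size
	c.Add("a", 1)
	c.Add("b", 2)
	c.Add("c", 3) // evicts "a", the least recently used
	_, ok := c.Get("a")
	fmt.Println(ok) // false: "a" was evicted
	v, _ := c.Get("c")
	fmt.Println(v) // 3
}
```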
I thought that the number was too low to be actual entries... We might need to tweak that a bit?
I'm flexible on the actual number used. I chose a small number because its most likely use is for a single query - this function gets called for every row in the (filtered) results. It's likely to need no more than 1 or 2 in any given query. After that point, it's just sitting in the cache and is unlikely to be used again. I think increasing the number is probably going to give negligible performance improvements at the cost of more memory usage. I have no idea how much memory a regexp.Regexp instance takes up.
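One way to get a rough answer is to compile many copies of a pattern and measure heap growth. This probe is hypothetical (not part of the PR), and the numbers vary with pattern complexity and Go version, so treat it as order-of-magnitude only:

```go
package main

import (
	"fmt"
	"regexp"
	"runtime"
)

// bytesPerRegexp estimates the heap cost of one compiled regexp by compiling
// n copies of the same pattern, keeping them all live, and dividing the heap
// growth by n. A probe, not a benchmark.
func bytesPerRegexp(pattern string, n int) uint64 {
	keep := make([]*regexp.Regexp, 0, n)

	var before, after runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&before)

	for i := 0; i < n; i++ {
		keep = append(keep, regexp.MustCompile(pattern))
	}

	runtime.GC()
	runtime.ReadMemStats(&after)

	_ = keep // keep the slice live until after the measurement
	return (after.HeapAlloc - before.HeapAlloc) / uint64(n)
}

func main() {
	// A word-boundary pattern of the kind autotag builds from path words
	// ("some_title" is a made-up example name).
	fmt.Printf("~%d bytes per compiled regexp\n", bytesPerRegexp(`(?i)\bsome_title\b`, 10000))
}
```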
Increasing the number makes sense if we later add more threads/parallel tasks to the autotag process? (Out of scope for now though.) Another thing to check (probably with parallel tasks, as by itself it might not lead to substantial performance improvements) would be to use WithReadTxn as much as possible instead of WithTxn (we should only need to write when/if something is matched).
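As a sketch of that read-vs-write transaction idea: the `WithTxn`/`WithReadTxn` names come from the comment above, but the manager below is a hypothetical stand-in, not stash's real implementation. The point is that the scan can run under a read transaction and only escalate to a write transaction when something actually matched:

```go
package main

import "fmt"

// txnManager is a made-up transaction manager, modeled loosely on the
// WithTxn/WithReadTxn helpers mentioned above; the counters exist purely
// to demonstrate how often each kind of transaction is opened.
type txnManager struct {
	writeTxns int
	readTxns  int
}

func (m *txnManager) WithReadTxn(fn func() error) error {
	m.readTxns++
	return fn()
}

func (m *txnManager) WithTxn(fn func() error) error {
	m.writeTxns++
	return fn()
}

// autotagBatch scans items under a read transaction and only opens a write
// transaction for items that actually matched something.
func autotagBatch(m *txnManager, items []string, matches func(string) bool) error {
	var matched []string
	if err := m.WithReadTxn(func() error {
		for _, it := range items {
			if matches(it) {
				matched = append(matched, it)
			}
		}
		return nil
	}); err != nil {
		return err
	}
	if len(matched) == 0 {
		return nil // nothing matched: no write transaction needed
	}
	return m.WithTxn(func() error {
		// persist tags for the matched items here
		return nil
	})
}

func main() {
	m := &txnManager{}
	items := []string{"a", "b", "c"}
	_ = autotagBatch(m, items, func(s string) bool { return false })
	fmt.Println(m.writeTxns) // 0: nothing matched, so no write txn
	_ = autotagBatch(m, items, func(s string) bool { return s == "b" })
	fmt.Println(m.writeTxns) // 1
}
```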
I did some more testing and I get the same CPU usage/performance whether the cache size is 10 or 100. Maybe 10 is enough for now?
Everything tests out OK.
I've increased the cache size to 4096 and didn't see a change in memory usage (unless it was under 5MB), so there may be another issue somewhere...
I just added a print when the cache is hit, and it seems to me that it is as withoutpants mentioned above: it is only needed once per query (size = 1).
I made the change below (just for testing) and left it running for some more time, and it doesn't seem to get more than that.
```diff
diff --git a/pkg/database/regex.go b/pkg/database/regex.go
index dc7b5feb..cdce9766 100644
--- a/pkg/database/regex.go
+++ b/pkg/database/regex.go
@@ -1,6 +1,7 @@
 package database

 import (
+	"fmt"
 	"regexp"

 	lru "github.com/hashicorp/golang-lru"
@@ -12,7 +13,7 @@ import (
 // results. It's likely to only need no more than 1 or 2 in any given query.
 // After that point, it's just sitting in the cache and is unlikely to be used
 // again.
-const regexCacheSize = 10
+const regexCacheSize = 100

 var regexCache *lru.Cache

@@ -36,6 +37,10 @@ func regexFn(re, s string) (bool, error) {
 		regexCache.Add(re, compiled)
 	} else {
 		compiled = entry.(*regexp.Regexp)
+		length := regexCache.Len()
+		if length > 1 {
+			fmt.Printf("cache usage %d\n", length)
+		}
 	}

 	return compiled.MatchString(s), nil
```
I've been running autotag from this branch for 45 minutes, and I have no log entries after "starting auto tagging". Is this normal? It's using a ton of CPU so I assume it's busy, but I can't confirm it's actually working.
It's working fine and it is way faster than before.
I think my issue is that because it goes by insert order now, if you've already autotagged some of it, it just spins with no log output for the majority of the time, since there are no new matches. All of the "useful" work is toward the end, on the media you've added since your last run (unless you've added new tags). Some progress logging would be nice, perhaps every 10,000 media steps, or when it switches from scenes to images or galleries?
In my case I just saw the percentage increasing in the task status.
With this, I project I can tag my entire library in about 5 days, instead of 2-3 weeks! Still some performance left on the table, but it's a huge improvement.
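The progress-logging suggestion above could look something like the following hypothetical sketch (the `progress` type, its `interval` field, and the injected logger are made-up names, not stash code):

```go
package main

import "fmt"

// progress emits a log line every `interval` processed items, so a
// long-running autotag pass shows signs of life even when nothing new
// matches. The logger is injected so it can be swapped for a real one.
type progress struct {
	interval int
	count    int
	logf     func(format string, args ...interface{})
}

func (p *progress) step() {
	p.count++
	if p.count%p.interval == 0 {
		p.logf("autotag progress: %d items processed", p.count)
	}
}

func main() {
	lines := 0
	p := &progress{
		interval: 10000,
		logf: func(format string, args ...interface{}) {
			lines++
			fmt.Printf(format+"\n", args...)
		},
	}
	for i := 0; i < 25000; i++ {
		p.step()
	}
	fmt.Println(lines) // 2: logged at items 10000 and 20000
}
```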
Follow up to #2293, #2363, #2336.
Partial resolution to #2366.

- Adds `match.Cache`. For now, this caches performers/studios/tags that start with a single-letter word. A query is used to get these objects explicitly, and the applicable regex condition is taken out of the `QueryForAutoTag` query.
- Caches compiled patterns in the `regexp` sqlite function. This means that a regex pattern should only be compiled once per query.
- Changes `Query` for scenes, images and galleries to not sort the results if no sort parameter is passed.
- Factors the regex used in `getPathWords` out as a static variable.

From my testing on a production database, in a scenario where nothing needed to be updated, the autotag operation would complete in 35% of the original, unoptimised code time.

Analysis of the CPU profile for the autotag task now indicates that most of the time is spent querying the database, and not compiling regex patterns.
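The compile-once-per-pattern behaviour of the cached `regexp` function can be sketched with the stdlib alone. The real code uses hashicorp/golang-lru and registers the function with sqlite; the map-plus-mutex cache and the `compiles` counter below are simplified stand-ins for illustration:

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// Simplified sketch of a cached regex-matching function. Not the PR's actual
// implementation: a plain map with a mutex replaces the LRU, and `compiles`
// exists only to demonstrate the compile-once behaviour.
var (
	regexCacheMu sync.Mutex
	regexCache   = map[string]*regexp.Regexp{}
	compiles     int // counts actual compilations
)

func regexFn(re, s string) (bool, error) {
	regexCacheMu.Lock()
	compiled, ok := regexCache[re]
	if !ok {
		var err error
		compiled, err = regexp.Compile(re)
		if err != nil {
			regexCacheMu.Unlock()
			return false, err
		}
		compiles++
		regexCache[re] = compiled
	}
	regexCacheMu.Unlock()
	return compiled.MatchString(s), nil
}

func main() {
	for _, s := range []string{"foo bar", "barfly", "foobar"} {
		ok, _ := regexFn(`\bbar\b`, s)
		fmt.Println(ok)
	}
	fmt.Println(compiles) // 1: the pattern was compiled once, then reused
}
```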