
Autotag optimisation #2368

Merged
merged 12 commits on Mar 9, 2022

Conversation

WithoutPants
Collaborator

Follow up to #2293, #2363, #2336.

Partial resolution to #2366.

  • Adds match.Cache. For now, this caches performers/studios/tags whose names start with a single-letter word. A query is used to fetch these objects explicitly, and the applicable regex condition is removed from the QueryForAutoTag query.
  • Adds an LRU cache to the regexp sqlite function, so that each regex pattern should only be compiled once per query.
  • Changes the behaviour of Query for scenes, images and galleries so that results are not sorted when no sort parameter is passed.
  • Moves regex pattern compilation out of getPathWords into a package-level variable, so the pattern is compiled once rather than on every call (see the sketch below).
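
A minimal sketch of that last point; the pattern and function body here are illustrative only, not the actual stash code:

package match

import "regexp"

// Compiled once when the package is initialised, instead of on every call.
// The pattern below is a stand-in for the real getPathWords pattern.
var pathWordRe = regexp.MustCompile(`[\p{L}\p{N}]+`)

// getPathWords reuses the pre-compiled pattern on each invocation.
func getPathWords(path string) []string {
	return pathWordRe.FindAllString(path, -1)
}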

From my testing on a production database, in a scenario where nothing needed to be updated, the autotag operation completed in about 35% of the time taken by the original, unoptimised code.

Analysis of the CPU profile for the autotag task now indicates that most of the time is spent querying the database rather than compiling regex patterns.
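
For anyone wanting to reproduce this kind of analysis, a CPU profile can be captured with Go's standard runtime/pprof package; this is a generic sketch, not how stash wires up its own profiling, and the file name and workload function are placeholders:

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("autotag.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runTask() // stand-in for the autotag work being profiled
}

func runTask() {
	// placeholder workload
}

The resulting file can then be inspected with go tool pprof autotag.pprof (top, list, web).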

@WithoutPants WithoutPants added the improvement Something needed tweaking. label Mar 7, 2022
pkg/match/cache.go (outdated review comments, resolved)
lru "github.com/hashicorp/golang-lru"
)

const regexCacheSize = 10
Collaborator

Add a comment explaining the size? Is it in MB? (The library docs don't mention what the unit is.)

Collaborator

It's the number of entries, and I think it should be orders of magnitude higher. Also, all of the docs examples use powers of two here, but don't mention a requirement or a reason...
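
For reference, hashicorp/golang-lru sizes the cache by entry count, not bytes; a quick sketch of the eviction behaviour (keys and capacity here are arbitrary):

package main

import (
	"fmt"
	"regexp"

	lru "github.com/hashicorp/golang-lru"
)

func main() {
	// The argument is the maximum number of entries held before the least
	// recently used entry is evicted; it is not a size in megabytes.
	cache, err := lru.New(2)
	if err != nil {
		panic(err)
	}

	cache.Add("foo.*", regexp.MustCompile("foo.*"))
	cache.Add("bar.*", regexp.MustCompile("bar.*"))
	cache.Add("baz.*", regexp.MustCompile("baz.*")) // evicts "foo.*"

	fmt.Println(cache.Len(), cache.Contains("foo.*")) // 2 false
}

As far as I can tell there is no power-of-two requirement in the library either; the examples just happen to use round binary sizes.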

Collaborator

I thought that the number was too low to be actual entries... We might need to tweak that a bit?

Collaborator Author

I'm flexible on the actual number used. I chose a small number because its most likely use is for a single query - this function gets called for every row in the (filtered) results. It's likely to need no more than 1 or 2 in any given query. After that point, it's just sitting in the cache and is unlikely to be used again. I think increasing the number is probably going to give negligible performance improvements at the cost of more memory usage. I have no idea how much memory a regexp.Regexp instance takes up.

Collaborator

Increasing the number makes sense if we later add more threads/parallel tasks to the autotag process (out of scope for now, though). Another thing to check (probably together with parallel tasks, as by itself it might not lead to substantial performance improvements) would be to use WithReadTxn as much as possible instead of WithTxn; we should only need to write when/if something is matched.

I did some more testing and I get the same CPU usage/performance whether the cache size is 10 or 100. Maybe 10 is enough for now?
Everything tests out OK.
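
A rough sketch of that read-mostly pattern; all of the names below (withReadTxn, withWriteTxn, findMatches, saveMatches, the types) are hypothetical stand-ins rather than the real stash transaction API, and only the control flow is the point:

package main

import "context"

type scene struct{ ID int }
type tagMatch struct{ TagID int }

// Stand-ins for the real transaction and matching helpers.
func withReadTxn(ctx context.Context, fn func(ctx context.Context) error) error  { return fn(ctx) }
func withWriteTxn(ctx context.Context, fn func(ctx context.Context) error) error { return fn(ctx) }
func findMatches(ctx context.Context, s scene) ([]tagMatch, error)               { return nil, nil }
func saveMatches(ctx context.Context, s scene, ms []tagMatch) error              { return nil }

// autotagScene does all matching inside a read-only transaction and only
// opens a write transaction when there is something to persist.
func autotagScene(ctx context.Context, s scene) error {
	var matches []tagMatch

	if err := withReadTxn(ctx, func(ctx context.Context) error {
		var err error
		matches, err = findMatches(ctx, s)
		return err
	}); err != nil {
		return err
	}

	if len(matches) == 0 {
		return nil // nothing matched, so no write transaction is needed
	}

	return withWriteTxn(ctx, func(ctx context.Context) error {
		return saveMatches(ctx, s, matches)
	})
}

func main() {
	_ = autotagScene(context.Background(), scene{ID: 1})
}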

Collaborator

I've increased the cache size to 4096 and didn't see a change in memory usage (unless it was under 5MB), so there may be another issue somewhere...

Collaborator

@bnkai bnkai Mar 8, 2022

I just added a print when the cache is hit, and it seems that, as WithoutPants mentioned above, it is only needed once per query (size = 1).
I made the change below (just for testing) and left it running for a while longer, and it doesn't seem to grow beyond that.

diff --git a/pkg/database/regex.go b/pkg/database/regex.go
index dc7b5feb..cdce9766 100644
--- a/pkg/database/regex.go
+++ b/pkg/database/regex.go
@@ -1,6 +1,7 @@
 package database
 
 import (
+       "fmt"
        "regexp"
 
        lru "github.com/hashicorp/golang-lru"
@@ -12,7 +13,7 @@ import (
 // results. It's likely to only need no more than 1 or 2 in any given query.
 // After that point, it's just sitting in the cache and is unlikely to be used
 // again.
-const regexCacheSize = 10
+const regexCacheSize = 100
 
 var regexCache *lru.Cache
 
@@ -36,6 +37,10 @@ func regexFn(re, s string) (bool, error) {
                regexCache.Add(re, compiled)
        } else {
                compiled = entry.(*regexp.Regexp)
+               length := regexCache.Len()
+               if length > 1 {
+                       fmt.Printf("cache usage %d\n", length)
+               }
        }
 
        return compiled.MatchString(s), nil

@WithoutPants WithoutPants added this to the Version 0.14.0 milestone Mar 7, 2022
pkg/manager/task_autotag.go (outdated review comment, resolved)
@kermieisinthehouse
Collaborator

I've been running autotag from this branch for 45 minutes, and I have no log entries after "starting auto tagging". Is this normal? It's using a ton of CPU so I assume it's busy, but I can't confirm it's actually working.

@bnkai
Collaborator

bnkai commented Mar 7, 2022

> I've been running autotag from this branch for 45 minutes, and I have no log entries after "starting auto tagging". Is this normal? It's using a ton of CPU so I assume it's busy, but I can't confirm it's actually working.

It's working fine, and it is way faster than before.
A single performer autotag takes <1 sec, compared to the 4 secs it took before.
I finished a full autotag task (performers only, though) in ~30 minutes, which I had never run before because it was too slow.
As for CPU usage, it is CPU bound for me as well. I do have the stash db on an SSD and a lot of RAM, though.

@kermieisinthehouse
Collaborator

I think my issue is that, because it goes by insert order now, if you've already autotagged some of the library, it just spins with no log output for the majority of the time, since there are no new matches. All of the "useful" work is toward the end, on the media you've added since your last run (unless you've added new tags).

Some progress logging would be nice, perhaps every 10,000 media items, or when it switches from scenes to images or galleries?
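
Something like the sketch below would cover that kind of periodic progress logging; the interval, logger and function names are placeholders, not the actual stash task code:

package main

import "log"

const progressLogInterval = 10000 // placeholder interval

func main() {
	// sceneIDs stands in for the filtered media being auto-tagged.
	sceneIDs := make([]int, 25000)

	for i := range sceneIDs {
		processScene(sceneIDs[i]) // placeholder for the per-item autotag work

		// Emit a heartbeat periodically so long runs show activity in the
		// log even when nothing new is being matched.
		if (i+1)%progressLogInterval == 0 || i+1 == len(sceneIDs) {
			log.Printf("auto tagging: %d/%d scenes processed", i+1, len(sceneIDs))
		}
	}
}

func processScene(id int) {}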

@bnkai
Collaborator

bnkai commented Mar 7, 2022

In my case I just saw the percentage increasing in the task status.
I only have <100k scenes+images though, so I guess that wouldn't update as quickly for you.
A log print every fixed number of media items, or on a time interval, could be handy though.

@kermieisinthehouse
Collaborator

With this, I project I can tag my entire library in about 5 days instead of 2-3 weeks! There's still some performance left on the table, but it's a huge improvement.

Labels
improvement Something needed tweaking.
3 participants