
Conversation

@calewis (Contributor) commented Sep 29, 2025

Add a thread-local cache in front of findTag to reduce lock contention.
findTag spends ~90% of its time contending on the lock; this PR lowers
that significantly, taking findTag from 3.9% of wall time to 0.8%.

There will be a small amount of Tag* leakage until the threads are
destroyed, at which point the thread-local caches will be automatically
cleaned up.
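For context, a minimal sketch of the shape of the change; the class, member, and helper names here (TagInterner, tl_cache, hashTag, tagsEqual, findOrInsertLocked) are illustrative assumptions rather than the actual OpenSTA identifiers:

```cpp
#include <cstddef>
#include <mutex>
#include <unordered_map>

struct Tag;  // interned arrival tag, as in OpenSTA's Search

// Illustrative interner: each thread consults a private cache first and only
// falls back to the shared, mutex-protected set on a miss.
class TagInterner
{
public:
  Tag *findTag(const Tag &probe)
  {
    // Per-thread cache: hits take no lock at all.
    thread_local std::unordered_map<std::size_t, Tag*> tl_cache;
    const std::size_t key = hashTag(probe);
    auto hit = tl_cache.find(key);
    if (hit != tl_cache.end() && tagsEqual(*hit->second, probe))
      return hit->second;

    // Miss (or hash collision): take the existing lock, consult the shared
    // interned set, and remember the result locally for next time.
    std::lock_guard<std::mutex> guard(lock_);
    Tag *tag = findOrInsertLocked(probe);
    tl_cache[key] = tag;
    return tag;
  }

private:
  std::size_t hashTag(const Tag &tag) const;         // assumed helper
  bool tagsEqual(const Tag &a, const Tag &b) const;  // assumed helper
  Tag *findOrInsertLocked(const Tag &probe);         // assumed helper
  std::mutex lock_;
};
```

The cached Tag* values are never freed from the per-thread maps, which is the small leak mentioned above; they go away when the threads (and their thread_local maps) are destroyed.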

Signed-off-by: Drew Lewis <cannada@google.com>
@tspyrou commented Sep 29, 2025

This cuts a full run from 8 hrs to 7.5 hrs.

@calewis (Contributor, Author) commented Sep 29, 2025

I wrote most of this PR in a code base where TagHash and TagEqual didn't store pointers to a StaState, and only updated that when making the PR. We'll need to be sure that the new StaState dependency doesn't make the thread_local hash and equal functions race.

@jjcherry56 (Collaborator) commented

I sort of get the idea of what this is doing, but I don't completely understand all of the complications with "clear_counts". My concern is that I get stuck fixing the non-trivial thread-related bugs that show up after this goes into the code.

It needs a LOT of work to look like the other code in OpenSTA, which is another prerequisite. I can't even begin to list the issues, there are so many, so I suggest you read doc/CodingGuidelines.txt, look around, and try to match the existing code.

Signed-off-by: Drew Lewis <cannada@google.com>
@calewis (Contributor, Author) commented Sep 30, 2025

@jjcherry56, I believe things should be formatted closer to what is in the docs now.

I don't completely understand all of the complications with "clear_counts".

When elements are removed from the interned set via deleteContentsClear or via eraseIf, the cache will be stale and will need to be purged. The counter being larger than our saved value tells us that there has been a deletion event since we last accessed the cache.
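In other words, the shared set carries a monotonically increasing deletion counter and each thread remembers the value it last saw. Roughly, as a sketch with made-up names (clear_count, purgeStaleTagCache, tl_saved_count), not the exact code in the PR:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

struct Tag;

// Bumped (under the existing lock) whenever deleteContentsClear or eraseIf
// removes tags from the shared interned set.
std::atomic<std::uint64_t> clear_count{0};

// Called at the top of the cached findTag path: if any deletion has happened
// since this thread last looked, every cached Tag* may dangle, so purge.
inline void
purgeStaleTagCache(std::unordered_map<std::size_t, Tag*> &tl_cache,
                   std::uint64_t &tl_saved_count)
{
  const std::uint64_t count = clear_count.load(std::memory_order_acquire);
  if (count != tl_saved_count) {
    tl_cache.clear();
    tl_saved_count = count;
  }
}
```

The deleting side would then bump the counter (for example clear_count.fetch_add(1, std::memory_order_release)) after freeing the tags, still under the lock, so a thread that observes the new count also observes the deletions.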

I agree: the design of Search, and especially the fact that TagHash and TagEqual aren't stateless, makes reasoning about thread safety difficult, and I would like your assistance to ensure this is safe. The performance benefits are fairly large (~6% by tspyrou's calculation, and 3%+ for my small test), so I believe it is worth the effort.

If you would be kind enough to point me towards instructions on how to do so, I'll test these changes in OR with tsan enabled to gain some confidence that there aren't lurking issues.

FWIW, lock contention in search accounts for a significant amount of total time in all the multithreaded runs we conduct; if you have less creative ways to reduce that contention, that would be awesome.

@jjcherry56 (Collaborator) commented

I think the idea of using a cache in front of the lock is a good one,
but I do not believe in using thread_local. I think it is cleaner to just
use the stack (e.g., avoiding the need for tl_cache_map). In this case the
ArrivalVisitor or TagGroupBldr could hold the cache there, with a copy for
each sta/thread.
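If I'm reading the suggestion right, the cache would become an ordinary member of the per-thread visitor object instead of a thread_local. A rough sketch under that assumption; the member and method names here (tag_cache_, findTagShared, clearTagCache, hashTag, tagsEqual) are hypothetical, not the existing ArrivalVisitor interface:

```cpp
#include <cstddef>
#include <unordered_map>

struct Tag;

// Each worker thread owns its own visitor copy, so a cache stored as a plain
// member is already thread-private; no thread_local is needed.
class ArrivalVisitor
{
public:
  Tag *findTag(const Tag &probe)
  {
    const std::size_t key = hashTag(probe);
    auto hit = tag_cache_.find(key);
    if (hit != tag_cache_.end() && tagsEqual(*hit->second, probe))
      return hit->second;
    Tag *tag = findTagShared(probe);  // the existing mutex-protected lookup
    tag_cache_[key] = tag;
    return tag;
  }

  // Invoked explicitly when tags are deleted, in place of the
  // "clear_counts" handshake.
  void clearTagCache() { tag_cache_.clear(); }

private:
  std::size_t hashTag(const Tag &tag) const;         // assumed helper
  bool tagsEqual(const Tag &a, const Tag &b) const;  // assumed helper
  Tag *findTagShared(const Tag &probe);              // assumed helper
  std::unordered_map<std::size_t, Tag*> tag_cache_;  // per-visitor, hence per-thread
};
```

One trade-off to weigh: whoever deletes tags has to be able to reach every live visitor to call clearTagCache, whereas the counter scheme lets each thread notice the deletion lazily on its next lookup.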

@calewis (Contributor, Author) commented Oct 17, 2025

Replaced by #316

@calewis closed this on Oct 17, 2025