Make call-counting, class probes, block counters cache-friendly #72387
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

When we start a multi-threaded application (e.g. any web workload), it seems that we pay some penalty for accessing common memory locations from different threads. Consider this method:

```csharp
void DoWork(IDoWork work) => work?.Do();
```

If on startup we call it from multiple threads (e.g. while processing incoming requests), we will most likely end up accessing the same 3 memory locations from multiple threads:

1. The call-counting cell for `DoWork` (and `Do`)
2. The class probe
3. The block counters (`DoWork` has a branch)

So we are basically going to do a lot of cache thrashing, and it's especially painful on NUMA nodes. We should consider/experiment with adding some quick random-based checks on top of all 3, something like:

```
if (rand & 1)
    dec [callCountingCell]
```

This should help slightly: it increases the chances of a given memory location being accessed from just one core and reduces the amount of cache thrashing in general. On x86 we can rely on `rdtsc` for that (and `cntvct_el0` on arm) to access perf counters.

One might say this isn't that important because we have low call-counting thresholds, but we need to take into account that we only start promoting methods to tier 1 if we didn't encounter new tier-0 compilations in the last 100ms.

category:proposal
theme:profile-feedback
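To make the proposal a bit more concrete, here is a minimal C++ sketch of the random-gating idea for the call-counting cell. The cell, threshold, and helper names are hypothetical (this is not the actual runtime stub code); the one twist over the pseudocode above is decrementing by 2 when we do count, so the expected count stays the same while the shared cache line is written roughly half as often.

```cpp
#include <atomic>
#include <cstdint>

#if defined(__x86_64__)
#include <x86intrin.h>   // __rdtsc
#endif

// Hypothetical call-counting cell shared by every thread calling a tier-0 method
// (the real threshold is configurable; 30 is used here purely for illustration).
std::atomic<int32_t> g_callCountingCell{30};

// Cheap "random" bit: the low bit of the time-stamp counter (rdtsc on x86,
// cntvct_el0 on arm64). Not a real RNG, just cheap entropy to decorrelate threads.
static inline bool CheapRandomBit()
{
#if defined(__x86_64__)
    return (__rdtsc() & 1) != 0;
#elif defined(__aarch64__)
    uint64_t cnt;
    asm volatile("mrs %0, cntvct_el0" : "=r"(cnt));
    return (cnt & 1) != 0;
#else
    return true; // fall back to counting every call
#endif
}

// Probabilistically updated call counter: only about half of the calls touch the
// shared cell, and those calls subtract 2 to keep the expected total unchanged.
bool CountCallAndCheckThreshold()
{
    if (CheapRandomBit())
    {
        if (g_callCountingCell.fetch_sub(2, std::memory_order_relaxed) <= 2)
            return true; // threshold reached: promote the method to tier 1
    }
    return false;
}
```

In the real stubs the counter update is a single cheap instruction emitted into the stub, so any gating would have to be equally cheap in the generated code; the sketch only illustrates the memory-traffic argument.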
There's also a good discussion of possible approaches at https://groups.google.com/g/llvm-dev/c/cDqYgnxNEhY.
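One family of approaches that comes up in this area is sharding hot counters per thread (or per core) and merging the shards only when the profile is read, so the hot path never writes a cache line shared with other threads. A minimal sketch of that shape, with hypothetical names and a fixed shard count; this is not something the runtime does today, and it trades memory for reduced contention:

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical sharded profile counter: each thread increments its own
// cache-line-padded slot, and readers sum the shards when the profile is consumed.
struct ShardedCounter
{
    static constexpr size_t kShards = 16;

    struct alignas(64) Shard            // one cache line per shard
    {
        std::atomic<uint64_t> value{0};
    };

    Shard shards[kShards];

    void Increment()
    {
        // Spread threads across shards by hashing the thread id.
        size_t idx = std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
        shards[idx].value.fetch_add(1, std::memory_order_relaxed);
    }

    uint64_t Read() const
    {
        uint64_t sum = 0;
        for (const Shard& s : shards)
            sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }
};
```

The obvious downside is footprint (a cache line per shard per counter), which is one reason sampling-based schemes tend to look more attractive for per-block counters.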
cc @AndyAyersMS — this could probably be added to the Dynamic PGO plan for 8.0 to experiment with. I wonder whether a quick experiment would show some improvement in time-to-first-request (and P99).
I think that currently, with the fairly low tiering thresholds, we can't rely on getting lots of samples. So we have tension between reducing sampling overhead and wanting to make sure we get enough samples.

Class Probes

Ideally the reservoir sampling rate would diminish over time, so it should be self-correcting and not lead to severe contention problems. However, I did not implement the sampling that way initially; once the reservoir fills up, the sampling rate is fixed. We should revisit this. Some variants of reservoir sampling pre-compute how many samples to skip until the next sample (more work when you sample, but less in between); we might consider adopting one of those.

There is also a shared state update for the random source. This random state is small and could perhaps be made more granular (say, per thread) and suitably padded to avoid false sharing. With those adjustments, class probes should not cause contention issues. And (as we've noted elsewhere) we should try to remove the fully redundant probes.

Counter Probes

For count samples it seems like it will be tricky to sub-sample, because you want to sample related counters in a similar manner. Perhaps what's needed is some sort of bursty counting, but it's hard to see how to do this without either bulking up Tier0 code with conditional counter updates, producing two Tier0 bodies and switching between them every so often, or making instrumented Tier0 a "tier 0.5" and switching out of it quickly once we have enough samples.
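To illustrate the two class-probe adjustments mentioned above — pre-computing how many observations to skip until the next sample, and making the random state per-thread and padded so it is never falsely shared — here is a hedged C++ sketch. The structure and names are hypothetical and deliberately simplified (a fixed sampling probability rather than a diminishing one, and no handling of concurrent writers to the reservoir); it is not the actual JIT profiling helper:

```cpp
#include <random>

// Hypothetical class-probe reservoir: a small table of observed class handles.
constexpr int kReservoirSize = 8;

struct Reservoir
{
    void* entries[kReservoirSize];
    int   count = 0;                  // grows only while the table is filling
};

// Per-thread sampling state, padded to its own cache line so the random source
// is never falsely shared between threads.
struct alignas(64) ThreadSampleState
{
    std::minstd_rand rng{std::random_device{}()};
    int              skip = 0;        // observations to ignore before the next sample
};

thread_local ThreadSampleState t_state;

// Draw how many observations to skip for a fixed sampling probability p
// (a probability that diminishes over time would be the fuller fix).
static int NextSkip(ThreadSampleState& s, double p)
{
    std::geometric_distribution<int> dist(p);
    return dist(s.rng);
}

void ObserveClass(Reservoir& r, void* classHandle, double samplingProbability = 1.0 / 64)
{
    if (r.count < kReservoirSize)
    {
        r.entries[r.count++] = classHandle;   // filling phase: record everything
        return;
    }

    // Steady state: most observations only touch thread-local state, so the
    // shared reservoir line is written rarely.
    if (t_state.skip-- > 0)
        return;

    t_state.skip = NextSkip(t_state, samplingProbability);
    std::uniform_int_distribution<int> slot(0, kReservoirSize - 1);
    r.entries[slot(t_state.rng)] = classHandle;   // overwrite a random slot
}
```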
Did a couple of experiments on ARM64, where it's cheaper: EgorBo@d5d2e9a. Unfortunately, the other metrics are too noisy on arm64 in our PerfLab (no obvious regressions/improvements); only "methods jitted" is stable. Just FYI @AndyAyersMS @janvorli @jkotas.

upd: FullPGO mode:
@EgorBo it seems that for the call-counting cells we could possibly move the count into the stub data. Then it would be on the same or an adjacent cache line as the target address that we need to read anyway. I was actually trying to make it that way originally, but it turned out to be a non-trivial change due to the way the …
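A rough sketch of the layout being described, with hypothetical names (the real call-counting stub and its data block are defined in the VM and look different): the point is just that the remaining-call count sits on the same cache line as the target address the stub has to load anyway.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-method stub data: the count shares a cache line with the
// target pointer that every call through the stub reads regardless.
struct alignas(64) CallCountingStubData
{
    void*                TargetCode;      // where the stub tail-calls; read on every call
    std::atomic<int32_t> RemainingCalls;  // decremented on every call until it hits zero
    void*                OnThresholdStub; // taken once RemainingCalls reaches zero
};

static_assert(sizeof(CallCountingStubData) == 64,
              "count and target address fit on a single cache line");

// Rough shape of what the stub does, written in C++ rather than the real asm stub.
void* DispatchThroughStub(CallCountingStubData* data)
{
    if (data->RemainingCalls.fetch_sub(1, std::memory_order_relaxed) <= 1)
        return data->OnThresholdStub;    // time to schedule tier-1 promotion
    return data->TargetCode;
}
```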
If we're not going to use the reservoir sampler's count to reduce update likelihoods, we can stop incrementing it once the table is full; that would remove a frequent shared memory update and likely cut down the scaling impact.
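A minimal sketch of that guard, using a hypothetical class-profile layout rather than the actual runtime struct: once the table is full, the shared count is no longer read for anything, so the hot path simply stops writing it.

```cpp
#include <cstdint>

// Hypothetical shape of a class-profile block: a small reservoir table plus a
// shared observation count (only meaningful while the table is filling).
struct ClassProfile
{
    static constexpr uint32_t kTableSize = 8;
    void*    Table[kTableSize];
    uint32_t Count = 0;
};

// Record an observation; randomSlot is assumed to come from some cheap
// per-thread random source.
void RecordObservation(ClassProfile* profile, void* classHandle, uint32_t randomSlot)
{
    if (profile->Count < ClassProfile::kTableSize)
    {
        profile->Table[profile->Count++] = classHandle;   // still filling
        return;
    }

    // Table is full: skip the Count++ that used to happen here and just do a
    // probabilistic overwrite, avoiding one frequently shared memory update.
    profile->Table[randomSlot % ClassProfile::kTableSize] = classHandle;
}
```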
…ull (#82723)

The counter in the class profile is used to determine when to switch to probabilistic updates of the reservoir table, but it is currently not used to determine the probability of an update, so there's no need to keep incrementing it once the table is full. Since the count is mutable shared state, bypassing those updates should reduce cache contention somewhat. Contributes to #72387.
I think we can close this one, actually; we did what we could to reduce contention. The contention in the call-counting stubs is less important because we only do 30 calls (and then the call-counting stub becomes inactive). It was far more important to fix the contention in the PGO counters, which we addressed in later changes (e.g. #82723 above).