
Make call-counting, class probes, block counters cache-friendly #72387

Closed
EgorBo opened this issue Jul 18, 2022 · 9 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone: 8.0.0

Comments

@EgorBo
Member

EgorBo commented Jul 18, 2022

When we start a multi-threaded application (e.g. any web workload), it seems to me that we pay some penalty for accessing some common memory locations from different threads. Consider this method:

void DoWork(IDoWork work) => work?.Do();

If on startup we call it from multiple threads (e.g. while processing incoming requests), we will most likely end up accessing the same three memory locations from multiple threads:

  1. call counting cell in the callCountingStub for DoWork (and Do)
  2. BB counter in case of PGO (DoWork has a branch)
  3. Class probe

So we are basically going to do a lot of cache-line thrashing, and it's especially painful across NUMA nodes.

We should consider/experiment with adding a quick random-based check on top of all three, something like:

if (rand & 1)
    dec [callCountingCell]

It should help slightly: it increases the chances that a given memory location is touched by just one core and reduces cache thrashing in general.
On x86 we can rely on rdtsc for the randomness (and cntvct_el0 on arm) to read a cheap hardware counter.
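
A minimal sketch of the idea in C++ (the real call-counting stubs are emitted as raw machine code, so this helper and its name are purely illustrative):

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc
#else
#include <x86intrin.h>   // __rdtsc
#endif

// Only touch the shared counter on roughly half of the calls, using the
// low bit of the time-stamp counter as a cheap pseudo-random source
// (cntvct_el0 would play the same role on arm64).
inline void UpdateCallCount(int32_t* callCountingCell)
{
    if (__rdtsc() & 1)
    {
        // Plain (non-atomic) decrement, as in the existing stubs: losing
        // an occasional update is acceptable for call counting.
        --(*callCountingCell);
    }
}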

One might say that it's not that important because we have low call-counting thresholds, but we need to take into account that we only start promoting methods to tier 1 once we haven't seen any new tier-0 compilations in the last 100 ms.

category:proposal
theme:profile-feedback

@EgorBo EgorBo added the tenet-performance Performance related issue label Jul 18, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 18, 2022
@EgorBo EgorBo added this to the 8.0.0 milestone Jul 18, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 18, 2022
@EgorBo EgorBo added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 18, 2022
@ghost

ghost commented Jul 18, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member

There's also a good discussion of possible approaches at https://groups.google.com/g/llvm-dev/c/cDqYgnxNEhY.

@EgorBo
Member Author

EgorBo commented Jul 18, 2022

cc @AndyAyersMS — this could probably be added to the Dynamic PGO plan for 8.0 to experiment with; I wonder if a quick experiment would show some improvement in time-to-first-request (and P99).

@AndyAyersMS
Member

I think rdtsc is also pretty slow (and serializing, perhaps).

Currently with the fairly low tiering thresholds we can't rely on getting lots of samples. So, we have tension between reducing sampling overhead and wanting to make sure we get enough samples.

Class Probes

Ideally the reservoir sampling rate would diminish over time, so it should be self-correcting and not lead to severe contention problems. However, I did not implement the sampling that way initially; once the reservoir fills up, the sampling rate is fixed. We should revisit this.

Some variants of reservoir sampling pre-compute how many samples to skip until the next sample (more work when you sample but less in between). We might consider adopting one of those.
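
For reference, a minimal sketch of the skip-count variant (Vitter's "Algorithm X" for reservoir sampling); the function name and the use of std::mt19937 are placeholders, not the runtime's actual probe code:

#include <cstdint>
#include <random>

// Given that 'seen' items have been observed so far and the reservoir holds
// 'reservoirSize' items, compute how many upcoming items can be skipped
// before the next one must be sampled (Vitter's Algorithm X). Only the
// sampled item touches the shared reservoir; it replaces a uniformly random
// slot, and the skip count is then recomputed.
int64_t ComputeSkip(int64_t seen, int64_t reservoirSize, std::mt19937& rng)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double v = uniform(rng);

    int64_t skip = 0;
    int64_t t = seen + 1;
    double quot = double(t - reservoirSize) / double(t);
    while (quot > v)
    {
        skip++;
        t++;
        quot *= double(t - reservoirSize) / double(t);
    }
    return skip;   // skip this many items, then sample the next one
}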

There is a shared state update for the random source. This random state is small and perhaps could be more granular (say per thread or something) and suitably padded to avoid false sharing.
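
A sketch of what per-thread, padded random state could look like (purely illustrative; the runtime's actual state and seeding are different):

#include <cstdint>

// One random state per thread, aligned and padded out to a full cache line
// so two threads never update the same line (no false sharing).
struct alignas(64) PerThreadRandomState
{
    uint64_t state = 0x9E3779B97F4A7C15ull;   // arbitrary non-zero seed
};

thread_local PerThreadRandomState t_probeRandom;

// xorshift64: a cheap thread-local pseudo-random source.
inline uint64_t NextProbeRandom()
{
    uint64_t x = t_probeRandom.state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    t_probeRandom.state = x;
    return x;
}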

With those adjustments, class probes should not cause contention issues. And (as we've noted elsewhere) we should try and remove the fully redundant probes.

Counter Probes

For count samples it seems like it will be tricky to sub-sample, because you want to sample related counters in a similar manner. Perhaps what's needed is some sort of bursty counting, but it's hard to see how to do this without either bulking up Tier 0 code with conditional counter updates, or producing two Tier 0 bodies and switching between them every so often (or making instrumented Tier 0 a "tier 0.5" and switching out of it quickly once we have enough samples, etc.).
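
As a purely hypothetical illustration of the first option (conditional counter updates), each instrumented Tier 0 body could check a per-method flag before every counter bump, at the cost of an extra load and branch per probe:

#include <atomic>
#include <cstdint>

// Illustrative per-method profile data: a flag that can be cleared once
// enough samples have been collected, plus the block counters themselves.
struct MethodProfileData
{
    std::atomic<bool> countingEnabled { true };
    uint32_t          blockCounts[16];   // one slot per basic block (size arbitrary here)
};

// What a gated Tier 0 block-count update might look like.
inline void BumpBlockCount(MethodProfileData& data, int blockIndex)
{
    if (data.countingEnabled.load(std::memory_order_relaxed))
    {
        // Plain racy increment, matching how block counters are updated today.
        data.blockCounts[blockIndex]++;
    }
}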

@EgorBo
Member Author

EgorBo commented Oct 2, 2022

Did a couple of experiments on ARM64, where it's cheaper: EgorBo@d5d2e9a, via cntvct_el0, a counter that should be readable from user space on all arm64 devices we support.

[image: PerfLab benchmark results]

Unfortunately, other metrics are too noisy on arm64 on our PerfLab (no obvious regressions/improvements); only "methods jitted" is stable.

just FYI @AndyAyersMS @janvorli @jkotas

upd: FullPGO mode:

[image: FullPGO mode results]

@janvorli
Member

janvorli commented Oct 3, 2022

@EgorBo it seems that for the call counting cells we could possibly move the count into the stub data. Then it would be on the same or an adjacent cache line as the target address that we need to read anyway. I was actually trying to do it that way originally, but it turned out to be a non-trivial change because of how the CallCountingInfo is looked up from the CallCount, and I didn't have more time to spend on it. The CallCount is part of the CallCountingInfo, so the lookup is a simple offset subtraction from the CallCount. If the call count were part of the CallCountingStub, we would either have to expand the data stored in the stub to add a back pointer to the CallCountingInfo (which would lower the density of the stubs in memory) or figure out some more advanced way of doing the lookup.
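
A simplified illustration of that trade-off (the field layouts below are made up for the sketch and do not reflect the actual CoreCLR definitions of CallCountingInfo or CallCountingStub):

#include <cstddef>
#include <cstdint>

// Today (simplified): the counter lives in the info, so the info can be
// recovered from the counter's address by subtracting the field's offset.
struct CallCountingInfoSketch
{
    // ... other bookkeeping fields ...
    int32_t callCount;
};

inline CallCountingInfoSketch* InfoFromCount(int32_t* callCount)
{
    return reinterpret_cast<CallCountingInfoSketch*>(
        reinterpret_cast<uint8_t*>(callCount) - offsetof(CallCountingInfoSketch, callCount));
}

// Alternative (sketch): the counter moves into the stub data, next to the
// target address that every call reads anyway, but the stub data then needs
// a back pointer to the info, making each stub larger and lowering density.
struct CallCountingStubDataSketch
{
    void*                   targetAddress;   // read on every call
    int32_t                 callCount;       // same/adjacent cache line as the target
    CallCountingInfoSketch* info;            // extra back pointer
};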

@AndyAyersMS
Member

Ideally the reservoir sampling rate would diminish over time and so should be self-correcting and not lead to severe contention problems. However, I did not implement the sampling that way initially; once the reservoir fills up then the sampling rate is fixed. We should revisit this.

If we're not going to use the reservoir sampler's count to reduce update likelihoods, we can stop incrementing it once the table is full; that would remove a frequent shared memory update and likely cut down the scaling impact.
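
A rough sketch of what that early-out could look like (names and the table size are illustrative, not the runtime's actual class-profile helper):

#include <cstdint>

constexpr int32_t ReservoirTableSize = 8;   // illustrative size

// Illustrative class-probe prologue: the shared sample count only needs to
// grow until the reservoir table is full; after that it is not used to scale
// the update probability, so the shared-memory write can be skipped.
inline void RecordSampleCount(volatile int32_t* sampleCount)
{
    if (*sampleCount < ReservoirTableSize)
    {
        // Racy increment is fine here, matching the existing probe code.
        *sampleCount = *sampleCount + 1;
    }
    // else: table already full; fall through to the probabilistic reservoir
    //       replacement without touching the shared counter.
}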

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 27, 2023
The counter in the class profile is used to determine when to
switch to probabilistic updates of the reservoir table, but it is currently
not used to determine the probability of an update. So there's no need to keep
incrementing it once the table is full.

Since the count is mutable shared state, bypassing updates should reduce cache
contention somewhat.

Contributes to dotnet#72387.
AndyAyersMS added a commit that referenced this issue Feb 28, 2023
…ull (#82723)

The counter in the class profile is used to determine when to
switch to probabilistic updates of the reservoir table, but it is currently
not used to determine the probability of an update. So there's no need to keep
incrementing it once the table is full.

Since the count is mutable shared state, bypassing updates should reduce cache
contention somewhat.

Contributes to #72387.
@EgorBo
Member Author

EgorBo commented May 18, 2023

I think we can actually close this one; we did what we could to reduce contention.

The contention in the call counting stubs is less important because we only do 30 calls (and then the call counting stub becomes inactive). It was way more important to fix the contention in the PGO counters, which we did in:

@EgorBo EgorBo closed this as completed May 18, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 18, 2023