
Make call-counting, class probes, block counters cache-friendly #72387

Closed
EgorBo opened this issue Jul 18, 2022 · 9 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)
Milestone: 8.0.0

Comments

@EgorBo
Member

EgorBo commented Jul 18, 2022

When we start a multi-threaded application (e.g. any web workload), it seems to me that we pay some penalty for accessing some common memory locations from different threads. Consider this method:

void DoWork(IDoWork work) => work?.Do();

If on startup we call it from multiple threads (e.g. while processing incoming requests), we will most likely end up accessing the same three memory locations from multiple threads:

  1. call counting cell in the callCountingStub for DoWork (and Do)
  2. BB counter in case of PGO (DoWork has a branch)
  3. Class probe

So we are basically going to do a lot of cache-line thrashing, and it's especially painful across NUMA nodes.

We should consider/experiment with adding a quick random-based check on top of all three, something like:

if (rand & 1)
    dec [callCountingCell]

It should help slightly: it increases the chances that a given memory location is touched by just one core and reduces cache thrashing in general.
On x86 we can rely on rdtsc for the randomness (and cntvct_el0 on arm) to read a cheap hardware counter.
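
A minimal sketch of the idea in C++ (the real call-counting stubs are emitted as raw machine code, so this helper and its name are purely illustrative):

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc
#else
#include <x86intrin.h>   // __rdtsc
#endif

// Only touch the shared counter on roughly half of the calls, using the
// low bit of the time-stamp counter as a cheap pseudo-random source
// (cntvct_el0 would play the same role on arm64).
inline void UpdateCallCount(int32_t* callCountingCell)
{
    if (__rdtsc() & 1)
    {
        // Plain (non-atomic) decrement, as in the existing stubs: losing
        // an occasional update is acceptable for call counting.
        --(*callCountingCell);
    }
}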

One might say that it's not that important because we have low call-counting thresholds, but we need to take into account that we only start promoting methods to tier 1 once we haven't seen any new tier-0 compilations in the last 100 ms.

category:proposal
theme:profile-feedback

@EgorBo EgorBo added the tenet-performance Performance related issue label Jul 18, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 18, 2022
@EgorBo EgorBo added this to the 8.0.0 milestone Jul 18, 2022
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Jul 18, 2022
@EgorBo EgorBo added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 18, 2022
@ghost

ghost commented Jul 18, 2022

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member

There's also a good discussion of possible approaches at https://groups.google.com/g/llvm-dev/c/cDqYgnxNEhY.

@EgorBo
Member Author

EgorBo commented Jul 18, 2022

cc @AndyAyersMS — this could probably be added to the Dynamic PGO plan for 8.0 to experiment with; I wonder if a quick experiment would show some improvement in time-to-first-request (and P99).

@AndyAyersMS
Member

I think rdtsc is also pretty slow (and serializing, perhaps).

Currently with the fairly low tiering thresholds we can't rely on getting lots of samples. So, we have tension between reducing sampling overhead and wanting to make sure we get enough samples.

Class Probes

Ideally the reservoir sampling rate would diminish over time, so it should be self-correcting and not lead to severe contention problems. However, I did not implement the sampling that way initially; once the reservoir fills up, the sampling rate is fixed. We should revisit this.

Some variants of reservoir sampling pre-compute how many samples to skip until the next sample (more work when you sample but less in between). We might consider adopting one of those.
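
For reference, a minimal sketch of the skip-count variant (Vitter's "Algorithm X" for reservoir sampling); the function name and the use of std::mt19937 are placeholders, not the runtime's actual probe code:

#include <cstdint>
#include <random>

// Given that 'seen' items have been observed so far and the reservoir holds
// 'reservoirSize' items, compute how many upcoming items can be skipped
// before the next one must be sampled (Vitter's Algorithm X). Only the
// sampled item touches the shared reservoir; it replaces a uniformly random
// slot, and the skip count is then recomputed.
int64_t ComputeSkip(int64_t seen, int64_t reservoirSize, std::mt19937& rng)
{
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    double v = uniform(rng);

    int64_t skip = 0;
    int64_t t = seen + 1;
    double quot = double(t - reservoirSize) / double(t);
    while (quot > v)
    {
        skip++;
        t++;
        quot *= double(t - reservoirSize) / double(t);
    }
    return skip;   // skip this many items, then sample the next one
}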

There is a shared state update for the random source. This random state is small and perhaps could be more granular (say per thread or something) and suitably padded to avoid false sharing.
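
A sketch of what per-thread, padded random state could look like (purely illustrative; the runtime's actual state and seeding are different):

#include <cstdint>

// One random state per thread, aligned and padded out to a full cache line
// so two threads never update the same line (no false sharing).
struct alignas(64) PerThreadRandomState
{
    uint64_t state = 0x9E3779B97F4A7C15ull;   // arbitrary non-zero seed
};

thread_local PerThreadRandomState t_probeRandom;

// xorshift64: a cheap thread-local pseudo-random source.
inline uint64_t NextProbeRandom()
{
    uint64_t x = t_probeRandom.state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    t_probeRandom.state = x;
    return x;
}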

With those adjustments, class probes should not cause contention issues. And (as we've noted elsewhere) we should try and remove the fully redundant probes.

Counter Probes

For count samples it seems like it will be tricky to sub-sample, because you want to sample related counters in a similar manner. Perhaps what's needed is some sort of bursty counting, but it's hard to see how to do this without either bulking up Tier 0 code with conditional counter updates, or producing two Tier 0 bodies and switching between them every so often (or making instrumented Tier 0 a "tier 0.5" and switching out of it quickly once we have enough samples, etc.).
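
As a purely hypothetical illustration of the first option (conditional counter updates), each instrumented Tier 0 body could check a per-method flag before every counter bump, at the cost of an extra load and branch per probe:

#include <atomic>
#include <cstdint>

// Illustrative per-method profile data: a flag that can be cleared once
// enough samples have been collected, plus the block counters themselves.
struct MethodProfileData
{
    std::atomic<bool> countingEnabled { true };
    uint32_t          blockCounts[16];   // one slot per basic block (size arbitrary here)
};

// What a gated Tier 0 block-count update might look like.
inline void BumpBlockCount(MethodProfileData& data, int blockIndex)
{
    if (data.countingEnabled.load(std::memory_order_relaxed))
    {
        // Plain racy increment, matching how block counters are updated today.
        data.blockCounts[blockIndex]++;
    }
}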

@EgorBo
Member Author

EgorBo commented Oct 2, 2022

Did a couple of experiments on ARM64, where it's cheaper: EgorBo@d5d2e9a, via cntvct_el0, a counter that should be readable from user space on all arm64 devices we support.

[image: PerfLab benchmark results]

Unfortunately, other metrics are too noisy on arm64 on our PerfLab (no obvious regressions/improvements); only "methods jitted" is stable.

just FYI @AndyAyersMS @janvorli @jkotas

upd: FullPGO mode:

[image: FullPGO mode results]

@janvorli
Member

janvorli commented Oct 3, 2022

@EgorBo it seems that for the call counting cells we could possibly move the count into the stub data. Then it would be on the same or an adjacent cache line as the target address that we need to read anyway. I was actually trying to do it that way originally, but it turned out to be a non-trivial change because of how the CallCountingInfo is looked up from the CallCount, and I didn't have more time to spend on it. The CallCount is part of the CallCountingInfo, so the lookup is a simple offset subtraction from the CallCount. If the call count were part of the CallCountingStub, we would either have to expand the data stored in the stub to add a back pointer to the CallCountingInfo (which would lower the density of the stubs in memory) or figure out some more advanced way of doing the lookup.
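
A simplified illustration of that trade-off (the field layouts below are made up for the sketch and do not reflect the actual CoreCLR definitions of CallCountingInfo or CallCountingStub):

#include <cstddef>
#include <cstdint>

// Today (simplified): the counter lives in the info, so the info can be
// recovered from the counter's address by subtracting the field's offset.
struct CallCountingInfoSketch
{
    // ... other bookkeeping fields ...
    int32_t callCount;
};

inline CallCountingInfoSketch* InfoFromCount(int32_t* callCount)
{
    return reinterpret_cast<CallCountingInfoSketch*>(
        reinterpret_cast<uint8_t*>(callCount) - offsetof(CallCountingInfoSketch, callCount));
}

// Alternative (sketch): the counter moves into the stub data, next to the
// target address that every call reads anyway, but the stub data then needs
// a back pointer to the info, making each stub larger and lowering density.
struct CallCountingStubDataSketch
{
    void*                   targetAddress;   // read on every call
    int32_t                 callCount;       // same/adjacent cache line as the target
    CallCountingInfoSketch* info;            // extra back pointer
};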

@AndyAyersMS
Member

Ideally the reservoir sampling rate would diminish over time and so should be self-correcting and not lead to severe contention problems. However, I did not implement the sampling that way initially; once the reservoir fills up then the sampling rate is fixed. We should revisit this.

If we're not going to use the reservoir sampler's count to reduce update likelihoods, we can stop incrementing it once the table is full; that would remove a frequent shared memory update and likely cut down the scaling impact.
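
A rough sketch of what that early-out could look like (names and the table size are illustrative, not the runtime's actual class-profile helper):

#include <cstdint>

constexpr int32_t ReservoirTableSize = 8;   // illustrative size

// Illustrative class-probe prologue: the shared sample count only needs to
// grow until the reservoir table is full; after that it is not used to scale
// the update probability, so the shared-memory write can be skipped.
inline void RecordSampleCount(volatile int32_t* sampleCount)
{
    if (*sampleCount < ReservoirTableSize)
    {
        // Racy increment is fine here, matching the existing probe code.
        *sampleCount = *sampleCount + 1;
    }
    // else: table already full; fall through to the probabilistic reservoir
    //       replacement without touching the shared counter.
}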

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Feb 27, 2023
The counter in the class profile is used to determine when to
switch to probabilistic updates of the reservoir table, but it is currently
not used to determine the probability of an update. So there's no need to keep
incrementing it once the table is full.

Since the count is mutable shared state, bypassing updates should reduce cache
contention somewhat.

Contributes to dotnet#72387.
AndyAyersMS added a commit that referenced this issue Feb 28, 2023
…ull (#82723)

The counter in the class profile is used to determine when to
switch to probabilistic updates of the reservoir table, but it is currently
not used to determine the probability of an update. So there's no need to keep
incrementing it once the table is full.

Since the count is mutable shared state, bypassing updates should reduce cache
contention somewhat.

Contributes to #72387.
@EgorBo
Member Author

EgorBo commented May 18, 2023

I think we can actually close this one; we did what we could to reduce contention.

The contention in the call counting stubs is less important because we only do 30 calls (and then the call counting stub becomes inactive). It was way more important to fix the contention in the PGO counters, which we did in:

@EgorBo EgorBo closed this as completed May 18, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 18, 2023