JIT: scalable profile counter mode #84427

Merged
merged 8 commits into from
Apr 9, 2023
185 changes: 185 additions & 0 deletions docs/design/features/ScalableApproximateCounting.md
@@ -0,0 +1,185 @@
# Scalable Approximate Counting

With particular relevance to Dynamic PGO

Andy Ayers
February 2023

## Introduction

Dynamic PGO works by instrumenting early versions of methods (aka Tier0 codegen) to produce a profile data set that the JIT can use to optimize subsequent versions of those methods (aka Tier1 codegen).

In this note we focus on just one aspect of this instrumentation: *counter based* instrumentation. With PGO enabled, the JIT adds code to each Tier0 method to count how often the parts of the methods execute so that later on in Tier1 it can focus optimization efforts on the parts of the method that seem to be most important for performance. More details on Dynamic PGO can be found [here](https://github.com/dotnet/runtime/blob/main/docs/design/features/DynamicPgo.md).

From the outset, Dynamic PGO has used fairly simplistic methods of counting—for each distinct counter, the code the JIT adds will simply increment a shared memory location. Let's call that aspect of counting the counter *implementation* and this particular way of counting *racing*.

(A related aspect of counting, not covered here, is that the JIT tries to be efficient with counter *placement*, relying on an approach first pioneered by [Ball](https://dl.acm.org/doi/10.1145/143165.143180) to try and reduce the total number of counters needed to a minimum, and to place them in the less often executed parts of the method.)

Recently we made two surprising observations about the *racing* implementation:
* We started measuring the runtime cost of counting in heavily multithreaded applications by forcing them to run only the Tier0 instrumented code. We saw that compared to uninstrumented Tier0 code, instrumented code was slower by a factor of 2 or 3. Some further experimentation revealed that cache contention (both true and false sharing) was a major contributor to the high instrumentation overhead.
* We started looking at how accurate the counts were in the Tier1 compilation and discovered widespread inaccuracies. The major contributor here was lost counter updates because of the unsynchronized access across many threads of execution (hence the name *racing*).

This was doubly bad news: not only were we paying a lot at runtime for our *racing* counter implementation, but it was also doing a pretty bad job of counting.

It seemed like we should be able to do better, and indeed we can. The rest of this note explores this problem space in more detail.

## Precision

The obvious fix for the lost counter updates is to stop racing and start synchronizing the updates. Our various platforms provide nice atomic counter updates in the form of `InterlockedIncrement` and similar, and the JIT can emit the proper machine code forms (say `lock inc [mem]`) that lie at the heart of these in place of the unsynchronized (`inc [mem]`) *racing* variant.

So let's call this new version the *interlocked* implementation scheme for a counter. It is simple to update the JIT to emit this variant and make various measurements. *Interlocked* largely fixes the accuracy problem we'd been seeing with *racing* counters, but creates even more runtime overhead. So while *interlocked* might serve as a component of a solution, it was not on its own a solution.
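
As a small illustration of the difference between the two implementations (a sketch only, not the code the JIT emits), the following C# program increments one *racing* and one *interlocked* counter from several threads. The worker and iteration counts are arbitrary assumptions chosen for the example.

```C#
using System;
using System.Threading;
using System.Threading.Tasks;

const int Workers = 8;            // hypothetical number of parallel workers
const int PerWorker = 200_000;    // hypothetical increments per worker

int racing = 0;       // updated with a plain read-modify-write
int interlocked = 0;  // updated with an atomic read-modify-write

Parallel.For(0, Workers, _ =>
{
    for (int i = 0; i < PerWorker; i++)
    {
        racing++;                                // unsynchronized: concurrent updates can be lost
        Interlocked.Increment(ref interlocked);  // atomic: no update is ever lost
    }
});

Console.WriteLine($"expected:    {Workers * PerWorker}");
Console.WriteLine($"racing:      {racing}");
Console.WriteLine($"interlocked: {interlocked}");
```

The *interlocked* total always equals the expected value; the *racing* total can come up short, by an amount that varies from run to run and machine to machine.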

## Scalability

The counts gathered by Dynamic PGO have very wide dynamic range, and even when we are checking accuracy we are usually interested in relatively self-consistent counts. For example, if we have a simple `if / then / else` construct we would expect the count for the `if` block to equal the sum of the counts for the `then` and `else` portions, and given a flow graph one can formulate this sort of conservation law more broadly (a good analogy here is Kirchhoff's current law for current flowing through a circuit). But there are some variances that we must tolerate: the method could throw an exception or the thread could be asynchronously stopped. So we typically are happy if the profile flows are accurate at each conservation point to say 1%, and there is a diminishing return on having results more accurate than this.
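
To make the conservation idea concrete, here is a minimal sketch of such a check for a single `if / then / else`. The helper name `IsFlowConsistent` and its signature are hypothetical; only the 1% tolerance comes from the text above.

```C#
using System;

// Returns true when the inflow and the summed outflow agree to within the
// given relative tolerance (1% by default). Hypothetical helper for illustration.
static bool IsFlowConsistent(ulong inflow, ulong outflowSum, double tolerance = 0.01)
{
    if (inflow == 0) return outflowSum == 0;
    double diff = Math.Abs((double)inflow - (double)outflowSum);
    return diff / inflow <= tolerance;
}

// if-block count vs. then-count + else-count
Console.WriteLine(IsFlowConsistent(10_000, 6_000 + 3_950)); // off by 0.5%: True
Console.WriteLine(IsFlowConsistent(10_000, 6_000 + 3_000)); // off by 10%:  False
```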

Given that we already need to tolerate some inaccuracy, the idea came about to see if we could leverage that to produce an intentionally approximate counting scheme that was nearly as accurate as *interlocked* but with less overhead.

A bit of research turned up an interesting paper by Dice, Lev, and Moir: [Scalable Statistics Counters](https://dl.acm.org/doi/10.1145/2486159.2486182) which was concerned with a similar problem. Here the key insight is that we can leverage randomization to count probabilistically: once the counter value is sufficiently large, instead of always trying to update its value by 1 for each increment, we update it by $N$ with probability $1/N$. So as the counter value grows we update it less and less often, thus reducing the contention that counter updates cause, and the counter's expected value is the correct count, with some controllable likelihood of the count being too large or too small. Because the total number of updates is limited, we can use *interlocked* operations for the updates without unduly harming scalability.
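
To see why the expected value stays correct, write $\Delta$ (our notation, not the paper's) for the amount added by a single probabilistic increment, which is $N$ with probability $1/N$ and $0$ otherwise:

$$
\mathbb{E}[\Delta] = N \cdot \frac{1}{N} + 0 \cdot \left(1 - \frac{1}{N}\right) = 1,
\qquad
\operatorname{Var}[\Delta] = N^2 \cdot \frac{1}{N} - 1^2 = N - 1 .
$$

Each increment still adds $1$ on average, so the counter is unbiased; the price is a per-increment variance that grows with $N$.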

The approach in the paper relies either on using standard floating-point representations for counts or a "software float" representation where part of the counter storage is an exponent and the remainder a mantissa. Those representations were not well suited for our work, so we set about to find a similar scheme that would work for regular integer data types.

The key benefit of the exponent/mantissa form is that it makes it relatively simple to estimate the magnitude of the counter: the exponent is the integer part of the $\log_2$ of the counter value. In particular, if we restrict $N$ to a power of 2, we can quickly compute an approximately correct $1/N$ without needing to divide. For normal integer data we can compute the same thing by simply finding the highest set bit of the data, and because $N$ is a power of 2 we can make the probability check with simple masks and compares instead of computing the remainder of a random value divided by $N$.
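
The two integer tricks can be sketched as follows. This is an illustration under assumed values: the counter value and the choice $N = 8$ are arbitrary, and `BitOperations.Log2` is used here as one way of finding the highest set bit.

```C#
using System;
using System.Numerics;

uint count = 50_000;                        // hypothetical counter value
int logCount = BitOperations.Log2(count);   // index of the highest set bit

uint n = 8;                                 // hypothetical power-of-two N

// For a power-of-two N, "rand % N == 0" and "(rand & (N - 1)) == 0" always agree,
// and a uniformly distributed rand passes the check with probability 1/N.
int hits = 0;
for (uint rand = 0; rand < (1u << 16); rand++)
{
    bool viaMask = (rand & (n - 1)) == 0;
    bool viaMod = rand % n == 0;
    if (viaMask != viaMod) throw new InvalidOperationException("mask and modulo disagree");
    if (viaMask) hits++;
}

Console.WriteLine($"logCount={logCount}, N={n}, hits={hits} of 65536");
// prints "logCount=15, N=8, hits=8192 of 65536"
```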

## The Scalable Counter

Here is a C# implementation of our counter:
```C#
static void ScalableProfileCount(ref uint counter)
{
    uint count = counter;
    uint delta = 1;

    if (count > 0)
    {
        int logCount = 31 - (int) uint.LeadingZeroCount(count);

        if (logCount >= 13)
Member

@EgorBo EgorBo Apr 7, 2023

can we have a sort of fast path here? E.g.

if (count & 0x3FFFF) // is counter already large enough?
{
    int logCount = 31 - (int) uint.LeadingZeroCount(count);
    ..
}

e.g. LZC is not cheap on arm since there is no scalar version

Member

Moreover, we can probably inline it in JIT codegen if needed

Member Author

You confused me for a second commenting on the version in the doc and not the code in the runtime itself.

Yes we can do some sort of check like this.

        {
            delta = 1u << (logCount - 12);
            uint rand = random();
            bool update = (rand & (delta - 1)) == 0;
            if (!update)
            {
                return;
            }
        }
    }

    Interlocked.Add(ref counter, delta);
}
```
Here `random` is a fast source of `uint` sized random numbers, ideally obtainable without synchronization (so say from a `thread_local` producer).
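
One possible shape for such a `random` source is sketched below. It is an assumption for illustration, not the runtime's actual generator: it uses Marsaglia's 32-bit xorshift step with the (13, 17, 5) shift triple and keeps one state per thread in a `ThreadLocal<uint>`, seeded from the managed thread id so that each thread starts from a different state.

```C#
using System;
using System.Threading;

// Marsaglia's 32-bit xorshift step. The state must never be zero.
static uint NextXorShift32(ref uint state)
{
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

// One RNG state per thread. The multiplier is an arbitrary odd constant used to
// spread the seeds; OR-ing with 1 guarantees a nonzero starting state.
var rngState = new ThreadLocal<uint>(
    () => (0x9E3779B9u * (uint)Environment.CurrentManagedThreadId) | 1u);

// Synchronization-free source of uint sized random numbers for the calling thread.
uint random()
{
    uint s = rngState.Value;
    uint r = NextXorShift32(ref s);
    rngState.Value = s;
    return r;
}

Console.WriteLine(random());
```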

When the count value is small (less than $2^{13} = 8192$) the counter counts exactly by $1$. Once the value reaches $8192$ the counter counts randomly, first by $2$, then $4$, then $8$, ... The only parameter value here is $13$, which controls the relative accuracy: a higher number would give more accurate (but less scalable) counts; a smaller number more scalable (but less accurate) counts.
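
The resulting schedule of increments can be tabulated with a few lines of code. The helper `DeltaFor` is hypothetical; it only repeats the arithmetic of the listing above, with the threshold exposed as a parameter.

```C#
using System;
using System.Numerics;

// Increment size used for a given counter value (hypothetical helper).
static uint DeltaFor(uint count, int threshold = 13)
{
    if (count == 0) return 1;
    int logCount = BitOperations.Log2(count);   // index of the highest set bit
    return logCount >= threshold ? 1u << (logCount - threshold + 1) : 1u;
}

foreach (uint count in new uint[] { 100, 8191, 8192, 16384, 1u << 20 })
{
    uint delta = DeltaFor(count);
    Console.WriteLine($"count={count,8}  delta={delta,4}  P(update)=1/{delta}");
}
```

For these sample values the increments are 1, 1, 2, 4 and 256 respectively.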

Let's call this implementation *scalable*.

A value of $13$ with the simple and fast `xorshift` RNG we've historically used for PGO empirically gives the *scalable* count an overall $2\sigma$ accuracy of around 2%, with worst case deviation around 3% (see below on how we measured this). A better quality RNG might improve accuracy but would create more overhead.

## Results From Simulation

We first worked up a C# simulation to compare the three approaches introduced above: *racing*, *interlocked*, and *scalable*. We've already mentioned some of these results in text above.

### Accuracy

As noted, *interlocked* is perfectly accurate. If we compare *racing* and *scalable* to interlocked we see results similar to the following:

![Counter Accuracy](counter-accuracy.png)

Here the X axis is the intended final value of the count, where counting is done by #processors (here 12) threads each doing nothing more than incrementing the counter, and the Y axis shows the mean counter value along with max and min values, scaled by the total expected count.

As expected, *Interlocked* is just a horizontal line at Y=1.0.

*Racing* is losing counts even with a small final count of 12, and the loss percentage gets worse and worse the higher we try to count, ending up with at best 20% of the final count value.

*Scalable* is perfectly accurate up to 8192, then starts to deviate somewhat, but the total relative deviation remains constrained.
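
A much simplified, single-threaded version of this kind of experiment can be sketched as follows. This is not the simulation used to produce the plots; the xorshift variant, the seed, and the helper names are assumptions made for the example, and with only one thread no interlocked operation is needed.

```C#
using System;
using System.Numerics;

// Marsaglia's 32-bit xorshift step (assumed RNG; state must be nonzero).
static uint NextXorShift32(ref uint state)
{
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

// Single-threaded form of the scalable counter update.
static void ScalableCount(ref uint counter, ref uint rngState)
{
    uint delta = 1;
    if (counter >= 8192)
    {
        delta = 1u << (BitOperations.Log2(counter) - 12);
        if ((NextXorShift32(ref rngState) & (delta - 1)) != 0) return;  // skip this update
    }
    counter += delta;
}

// Applies the requested number of increments and returns the counter value.
static uint Simulate(uint increments, uint seed)
{
    uint counter = 0, rng = seed;
    for (uint i = 0; i < increments; i++) ScalableCount(ref counter, ref rng);
    return counter;
}

foreach (uint n in new uint[] { 8_000, 100_000, 1_000_000 })
{
    uint got = Simulate(n, seed: 12345);
    Console.WriteLine($"intended={n}  counted={got}  ratio={(double)got / n:F4}");
}
```

Counts below the switchover come out exact; larger counts should land close to, but generally not exactly on, the intended value.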

If we remove *Racing* from the picture and zoom in, we see that the error is indeed 2% for the 5/95 spread, and 3% for worst case, and that on average the count is very accurate.

![Counter Accuracy Detail](counter-accuracy-detail.png)

We can also adjust the accuracy by changing when we switch to probabilistic counting; the higher the switchover point, the more accurate the results are overall. In the plot below we vary the switchover point from $2^{10}$ to $2^{15}$ and plot the 5th and 95th percentiles relative to the accurate count:

![Alt text](counter-accuracy-tunable.png)

### Scalability

Benchmarking the same set of computations, we end up with the following (here the switchover is at 8192):

![Counter Scaling](counter-scaling.png)

So *scalable* and *racing* have similar costs, while *interlocked* becomes quite a bit more costly at high counts.

### Notes

* Quite likely the benchmark has too much fixed overhead and could be improved, and if so we might see distinctions even for lower total counts.
* It was critical for *scalable* to use per-thread RNGs and to seed them each differently.

Mathematically we can model the *scalable* update distribution as a binomial distribution. If we make $N$ counter updates with probability $P$, the expected number of updates is $NP$ and the standard deviation in the number of updates is $\sqrt{NP(1-P)}$.

So if we start probabilistically incrementing by $2$ with probability $1/2$ at $8192$, then after $8192$ probabilistic updates we have added an expected value of $8192 \cdot 2 \cdot 1/2 = 8192$ to the counter.

The standard deviation in the actual number of updates is $\sqrt{2^{13} \cdot 1/2 \cdot (1-1/2)} = \sqrt{2^{11}} \approx 45$. Each update is by 2, so the two standard deviation expected range for the change in the counter value is $2 \cdot 2 \cdot 45 \approx 180$. The relative error range is thus $\pm 180 / 8192 \approx \pm 0.022$. This is in reasonable agreement with the empirical study above.

Member

I think there's an interesting thing (coincidence?) going on here. The empirical study is going to include the effects of perfect measurement from 0-8192, so the reported empirical value for 16k is going to be half this relative error, or 0.011. This matches the graph very well.

This calculation for each additional section quickly goes to 0.03-0.031. However, again, the measured cumulative error is going to be smaller. It turns out to be close to that initial calculation.

One thing I can't figure out is the graph where you vary the starting point. If I do the calculation for 10, I get 0.0625 (and even halving it for the cumulative effect I get ~0.03). However, the first data point off the center line is 1.005. (Or maybe I should be looking at the next one, which is close to 1.03? The x-axis seems confusing here because that "first" point appears to be over 1k when it should be over 2k since 1k is the crossover point. Maybe? And I guess that I need to point out that the non-powers-of-2 on the x-axis are a little worrying :))

Member Author

Ah, good catch. I think there's a bug in my simulation code, the log should be

   int logCount = 32 - (int) uint.LeadingZeroCount(count);

So the simulation starts probabilistic mode sooner than it should. Let me rerun this and see if the graph makes more sense.

Member Author

No, that's not the issue, the threshold is OK. When I added a parallel mode to the simulation, I wanted to make sure the counting was evenly divisible, so I rounded up the count to the next multiple of the number of CPUs. So that skews the data a bit as you noted, and so yes you should look at "the next one".

Here is a revised plot without the extra counts:

image

Member

That chart matches what I found. It's fascinating how such a seemingly simple idea works out so well.


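As we count higher the standard deviation in the number of updates is limited by $\sigma \approx \sqrt{NP}$, so when we double $N$ and halve $P$, $\sigma$ remains roughly the same overall. Each update is then twice as large, but so is the range being counted, so the relative error stays roughly constant.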
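
The same estimate can be written out for every doubling interval. The notation below is ours and only for illustration: for $k \ge 13$, interval $k$ covers counter values in $[2^k, 2^{k+1})$, where the increment is $d_k = 2^{k-12}$ and the update probability is $p_k = 1/d_k$. As in the calculation above, we take $n_k = 2^k$ attempted updates to cross the interval.

$$
n_k p_k = 2^{k} \cdot 2^{12-k} = 4096,
\qquad
\sigma_k = \sqrt{n_k p_k (1 - p_k)} = 64\sqrt{1 - p_k},
$$

$$
\frac{2\, d_k\, \sigma_k}{2^{k}}
= \frac{2 \cdot 2^{k-12} \cdot 64 \sqrt{1 - p_k}}{2^{k}}
= \frac{\sqrt{1 - p_k}}{32}.
$$

For $k = 13$ this gives $\sqrt{1/2}/32 \approx 0.022$, matching the figure above, and as $k$ grows the per-interval bound approaches $1/32 \approx 0.031$.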

If (via the benchmark) we look at how tunable the scalability is, we see that the higher the threshold for switching to probabilistic counting, the higher the cost (but of course the better the accuracy):
Member

@markples markples Apr 7, 2023

Discussion of the threshold (both in your brief talk and here) has been the most confusing point for me. Essentially, why does switching earlier (say 512 vs 8192) cause counts 1-2 million to be so different?

The answer appears to be that the cutoff is associated with the beginning of the -scaling- of N, which is a byproduct of using a single tuning variable. If one used N=2 for 512-16k and then started increasing N, then the graph would differ between 512 and 8k (with the error being worse in that interval at 512) and then match afterwards (aside from any cumulative differences from 512-8k though those would eventually become insignificant).

I don't know if you want to change the descriptions at all for this though. Maybe it only confused me.

I guess it does suggest that if profiling goals dictate, the start/scaling could be adjusted separately to meet those goals.

Member Author

Right, there is considerable flexibility in deciding when to change the increment/probability and how much to change it by and the change points don't need to be powers of two or spaced in any orderly way.

The paper I referenced describes a mode where once the increments are large enough and probability of updates are low enough, they just plateau there, assuming the overhead is now so insignificant that further decreases in the probability of updates won't matter much, so counting ends up being relatively more accurate for very large counts than for smaller counts.


![Details on scaling vs threshold](counter-scaling-detail.png)

## Results From Dynamic PGO

We next implemented this same technique within the runtime and JIT, encoding the logic above as a new helper call.

Trying this out on some Tech Empower benchmarks where we force the benchmark to execute Tier0 instrumented code, we see significant improvements in server metrics (here on mvc/plaintext, windows, citrine, switchover at 8192):

| Metric (Tier0 instr) | Racing | Scalable | Change |
| ---------------------- | ---------------- | -------------------- | ------- |
| First Request (ms) | 155 | 155 | +0.22% |
| Requests/sec | 223,665 | 352,592 | +57.64% |
| Requests | 3,372,373 | 5,308,812 | +57.42% |
| Mean latency (ms) | 10.60 | 7.83 | -26.08% |
| Max latency (ms) | 237.25 | 279.12 | +17.65% |
| Read throughput (MB/s) | 34.77 | 54.81 | +57.64% |
| Latency 50th (ms) | 10.03 | 6.68 | -33.42% |
| Latency 75th (ms) | 14.69 | 9.70 | -33.98% |
| Latency 90th (ms) | 17.65 | 11.76 | -33.38% |
| Latency 99th (ms) | 29.59 | 0.00 | |

And in normal processing, the extra accuracy from *Scalable* seems to provide some small benefits to the Tier1 code as well:

| Metric (Tiered PGO) | Racing | Scalable | Change |
| ---------------------- | ----------- | --------------- | ------- |
| First Request (ms) | 90 | 89 | -1.12% |
| Requests/sec | 3,083,453 | 3,187,157 | +3.36% |
| Requests | 46,558,728 | 48,125,121 | +3.36% |
| Mean latency (ms) | 0.97 | 0.92 | -4.61% |
| Max latency (ms) | 98.20 | 109.77 | +11.77% |
| Read throughput (MB/s) | 479.32 | 495.44 | +3.36% |
| Latency 50th (ms) | 0.76 | 0.74 | -2.76% |
| Latency 75th (ms) | 1.08 | 1.05 | -2.77% |
| Latency 90th (ms) | 1.30 | 1.28 | -2.05% |
| Latency 99th (ms) | 0.00 | 0.00 | |

We haven't yet explored the impact across a wider set of benchmarks or tried varying the switch-over point.

## Summary

We have presented a technique to implement approximately correct counters with relatively low overhead and good scaling properties. The counter accuracy can be tuned by trading off some of the scalability benefit. The counting process uses minimal runtime state: one RNG per physical thread, plus one storage location per counter. The value of the counter is readily available without any post-processing.

This sort of counter seems well-suited for use in our Dynamic PGO instrumentation.

It may be that approximate counting will be useful in other application areas where scala
Member

Sentence got cut off here - presumably you just mean to say where scalability is needed but small errors are acceptable

Member Author

yep, thanks.


## End Notes

The Dice *et al.* paper suggests that at high count values one might retain more accuracy by capping the granularity of updates. That is, once the likelihood of an update reaches say $1/1024$ there is not much scalability benefit to be had with further decreases, as the contention for the counter should be minimal. We have not tried this.
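
A capped schedule along these lines might look like the sketch below. It is untested in the runtime; the helper name `CappedDeltaFor` and the `maxShift` parameter are hypothetical, with `maxShift = 10` corresponding to the $1/1024$ example.

```C#
using System;
using System.Numerics;

// Like the uncapped schedule, but the increment never exceeds 2^maxShift,
// so the update probability never drops below 1/2^maxShift.
static uint CappedDeltaFor(uint count, int threshold = 13, int maxShift = 10)
{
    if (count == 0) return 1;
    int logCount = BitOperations.Log2(count);
    if (logCount < threshold) return 1;
    int shift = Math.Min(logCount - threshold + 1, maxShift);
    return 1u << shift;
}

foreach (uint count in new uint[] { 8192, 1u << 21, 1u << 22, 1u << 31 })
{
    Console.WriteLine($"count={count,10}  delta={CappedDeltaFor(count)}");
}
```

For these sample values the increments are 2, 512, 1024 and 1024.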

It might also be possible to continue with increment by one counting so long as there is no contention for the counter, so that a single-threaded application gets more precision. One could for example replace the `Interlocked.Add` with a compare exchange, and only switch over to probabilistic counting when the exchange fails.
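
One way this idea could be sketched is shown below. This is speculative and has not been tried in the runtime; the method name, the xorshift RNG, and passing the RNG state by reference are all assumptions made to keep the example self-contained.

```C#
using System;
using System.Numerics;
using System.Threading;

// Marsaglia's 32-bit xorshift step (assumed RNG; state must be nonzero).
static uint NextXorShift32(ref uint state)
{
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

static void ContentionAwareCount(ref uint counter, ref uint rngState)
{
    uint seen = Volatile.Read(ref counter);

    // Uncontended case: the exchange succeeds and we have counted exactly by one.
    if (Interlocked.CompareExchange(ref counter, seen + 1, seen) == seen) return;

    // Another thread changed the counter in between: count probabilistically.
    uint delta = 1;
    if (seen >= 8192)
    {
        delta = 1u << (BitOperations.Log2(seen) - 12);
        if ((NextXorShift32(ref rngState) & (delta - 1)) != 0) return;
    }
    Interlocked.Add(ref counter, delta);
}

uint counter = 0, rng = 2463534242;
for (int i = 0; i < 100_000; i++) ContentionAwareCount(ref counter, ref rng);
Console.WriteLine(counter); // 100000: with a single thread the exchange never fails
```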

Another approach not considered here is to shard the entire counter array into per-thread counters. Aside from a potentially large size increase, a sharded counter requires more work to query the counter's current value, as one must sum across the shards.
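
For comparison, a sharded counter might be sketched as follows, here for a single logical counter rather than a whole counter array. The shard selection by managed thread id and the 64-byte cache line size are assumptions; threads can collide on a shard, so the increments remain interlocked.

```C#
using System;
using System.Threading;
using System.Threading.Tasks;

const int Stride = 8;  // 8 longs = 64 bytes between shards (assumes 64-byte cache lines)
int shardCount = Environment.ProcessorCount;
long[] shards = new long[shardCount * Stride];

// Each thread updates the shard selected by its managed thread id.
void Increment()
{
    int shard = Environment.CurrentManagedThreadId % shardCount;
    Interlocked.Increment(ref shards[shard * Stride]);
}

// Reading the counter requires summing across all shards.
long Read()
{
    long total = 0;
    for (int s = 0; s < shardCount; s++) total += Interlocked.Read(ref shards[s * Stride]);
    return total;
}

Parallel.For(0, 100_000, _ => Increment());
Console.WriteLine(Read()); // 100000
```

The example makes both costs visible: storage grows by a factor of `shardCount * Stride`, and `Read` has to walk every shard.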

Our *scalable* scheme still suffers from false sharing; this seems more difficult to fix.

The paper by Dice *et al.* refers to a seminal paper by Morris, [Counting large numbers of events in small registers](https://dl.acm.org/doi/10.1145/359619.359627) where similar techniques are used to reduce the size of each counter. This might prove interesting in other contexts, for example compressing the size of some of our other DynamicPGO data, like the dynamic class profile histograms.
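
For readers unfamiliar with the technique, a base-2 Morris-style counter can be sketched as below. This is the textbook form rather than anything proposed for the runtime: only an exponent $c$ is stored, it is incremented with probability $2^{-c}$, and $2^c - 1$ serves as the estimate of the number of events. The helper names and the xorshift RNG are assumptions.

```C#
using System;

// Marsaglia's 32-bit xorshift step (assumed RNG; state must be nonzero).
static uint NextXorShift32(ref uint state)
{
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

// Record one event: bump the stored exponent with probability 2^-exponent.
static void MorrisIncrement(ref byte exponent, ref uint rngState)
{
    uint mask = (1u << exponent) - 1;
    if ((NextXorShift32(ref rngState) & mask) == 0 && exponent < 31) exponent++;
}

// Estimated number of events recorded so far.
static uint MorrisEstimate(byte exponent) => (1u << exponent) - 1;

byte c = 0;
uint rng = 12345;
for (int i = 0; i < 100_000; i++) MorrisIncrement(ref c, ref rng);
Console.WriteLine($"exponent={c}  estimate={MorrisEstimate(c)}  actual=100000");
```

A single byte is enough to cover the full 32-bit range, at the cost of a much coarser estimate than the *scalable* counter provides.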
Binary file added docs/design/features/counter-accuracy-detail.png
Binary file added docs/design/features/counter-accuracy.png
Binary file added docs/design/features/counter-scaling-detail.png
Binary file added docs/design/features/counter-scaling.png
2 changes: 2 additions & 0 deletions src/coreclr/inc/corinfo.h
@@ -653,6 +653,8 @@ enum CorInfoHelpFunc
CORINFO_HELP_DELEGATEPROFILE64, // Update 64-bit method profile for a delegate call site
CORINFO_HELP_VTABLEPROFILE32, // Update 32-bit method profile for a vtable call site
CORINFO_HELP_VTABLEPROFILE64, // Update 64-bit method profile for a vtable call site
CORINFO_HELP_COUNTPROFILE32, // Update 32-bit block or edge count profile
CORINFO_HELP_COUNTPROFILE64, // Update 64-bit block or edge count profile

CORINFO_HELP_VALIDATE_INDIRECT_CALL, // CFG: Validate function pointer
CORINFO_HELP_DISPATCH_INDIRECT_CALL, // CFG: Validate and dispatch to pointer
10 changes: 5 additions & 5 deletions src/coreclr/inc/jiteeversionguid.h
@@ -43,11 +43,11 @@ typedef const GUID *LPCGUID;
#define GUID_DEFINED
#endif // !GUID_DEFINED

constexpr GUID JITEEVersionIdentifier = { /* C1A00D6C-2B60-4511-8AD2-6DB109224E37 */
0xc1a00d6c,
0x2b60,
0x4511,
{ 0x8a, 0xd2, 0x6d, 0xb1, 0x9, 0x22, 0x4e, 0x37 }
constexpr GUID JITEEVersionIdentifier = { /* 3054e9ba-bcfe-417c-9043-92ccc8738b80 */
0x3054e9ba,
0xbcfe,
0x417c,
{0x90, 0x43, 0x92, 0xcc, 0xc8, 0x73, 0x8b, 0x80}
};

//////////////////////////////////////////////////////////////////////////////////////////////////////////
2 changes: 2 additions & 0 deletions src/coreclr/inc/jithelpers.h
@@ -341,6 +341,8 @@
JITHELPER(CORINFO_HELP_DELEGATEPROFILE64, JIT_DelegateProfile64, CORINFO_HELP_SIG_REG_ONLY)
JITHELPER(CORINFO_HELP_VTABLEPROFILE32, JIT_VTableProfile32, CORINFO_HELP_SIG_4_STACK)
JITHELPER(CORINFO_HELP_VTABLEPROFILE64, JIT_VTableProfile64, CORINFO_HELP_SIG_4_STACK)
JITHELPER(CORINFO_HELP_COUNTPROFILE32, JIT_CountProfile32, CORINFO_HELP_SIG_REG_ONLY)
JITHELPER(CORINFO_HELP_COUNTPROFILE64, JIT_CountProfile64, CORINFO_HELP_SIG_REG_ONLY)

#if defined(TARGET_AMD64) || defined(TARGET_ARM64)
JITHELPER(CORINFO_HELP_VALIDATE_INDIRECT_CALL, JIT_ValidateIndirectCall, CORINFO_HELP_SIG_REG_ONLY)