diff --git a/docs/design/features/RandomizedAllocationSampling.md b/docs/design/features/RandomizedAllocationSampling.md new file mode 100644 index 00000000000000..61d41c0ffd10c9 --- /dev/null +++ b/docs/design/features/RandomizedAllocationSampling.md @@ -0,0 +1,317 @@ +# Randomized Allocation Sampling + +Christophe Nasarre (@chrisnas), Noah Falk (@noahfalk) - 2024 + +## Introduction + +.NET developers want to understand the GC allocation behavior of their programs, both for general observability and specifically to better understand performance costs. Although the runtime has a very high-performance GC, reducing the number of bytes allocated in a scenario can have a notable impact on the total execution time and the frequency of GC pauses. Some of the ways developers understand these costs are by measuring allocated bytes in: +1. Microbenchmarks such as BenchmarkDotNet +2. .NET APIs such as [GC.GetAllocatedBytesForCurrentThread()](https://learn.microsoft.com/dotnet/api/system.gc.getallocatedbytesforcurrentthread) +3. Memory profiling tools such as VS profiler, PerfView, and dotTrace +4. Metrics or other production telemetry + +Analysis of allocation behavior often starts simple, using the total bytes allocated while executing a block of code or during some time duration. However, for any non-trivial scenario, gaining a deeper understanding requires attributing allocations to specific lines of source code, callstacks, types, and object sizes. .NET's current state-of-the-art technique for doing this is using a profiling tool to sample via the [AllocationTick](https://learn.microsoft.com/en-us/dotnet/fundamentals/diagnostics/runtime-garbage-collection-events#gcallocationtick_v3-event) event. When enabled, this event triggers approximately every time 100KB has been allocated. However, this is not a random sample: it has a fixed starting point and stride, which can lead to significant sampling error for allocation patterns that are periodic. This has been observed in practice, so it isn't merely a theoretical concern. The new randomized allocation sampling feature is intended to address the shortcomings of AllocationTick and offer more rigorous estimations of allocation behavior and probabilistic error bounds. We do this by creating a new `AllocationSampled` event that profilers can opt into via any of our standard event tracing technologies (ETW, EventPipe, LTTng, EventListener). The new event is completely independent of AllocationTick and we expect profilers will prefer to use the AllocationSampled event on runtime versions where it is available. + +The initial part of this document describes the conceptual sampling model and how we suggest the data be interpreted by profilers. The latter portion describes how the sampling model is implemented efficiently in runtime code. + +## The sampling model + +When the new AllocationSampled event is enabled, each managed thread starts sampling independently of the others. For a given thread there will be a sequence of allocated objects Object_1, Object_2, etc. that may continue indefinitely. Each object has a corresponding .NET type and size. The size of an object includes the object header, method table, object fields, and trailing padding that aligns the size to be a multiple of the pointer size. It does not include any additional memory the GC may optionally allocate for more than pointer-sized alignment, filling gaps that are impossible/inefficient to use for objects, or other GC bookkeeping data structures.
Also note that .NET does have a non-GC heap where some objects that stay alive for the process lifetime are allocated. Those non-GC heap objects are ignored by this feature. + +When each new object is allocated, conceptually the runtime starts doing independent [Bernoulli Trials](https://en.wikipedia.org/wiki/Bernoulli_trial) (weighted coin flips) for every byte in the object. Each trial has probability p = 1/102,400 of being considered a success. As soon as one successful trial is observed, no more trials are performed for that object and an AllocationSampled event is emitted. This event contains the object type, its size, and the 0-based offset of the byte where the successful trial occurred. This means that for a given object, if an event was generated, `offset` failed trials occurred followed by a successful trial, and if no event was generated, `size` failed trials occurred. This process continues indefinitely for each newly allocated object. + +This sampling process is closely related to the [Bernoulli process](https://en.wikipedia.org/wiki/Bernoulli_process) and is a well-studied area of statistics. Skipping ahead to the end of an object once a successful sample has been produced does require some accommodations in the analysis, but many key results are still applicable. + +## Using the feature + +### Enabling sample events + +The allocation sampled events are enabled on the `Microsoft-Windows-DotNetRuntime` provider using keyword `0x80000000000` at informational or higher level. For more details on how to do this using different event tracing technologies see [here](https://learn.microsoft.com/en-us/dotnet/core/diagnostics/eventsource-collect-and-view-traces). + +### Interpreting sample events + +Although diagnostic tools are free to interpret the data in whatever way they choose, we have some recommendations for analysis that we expect are useful and statistically sound. + +#### Definitions + +For all of this section, assume that we enabled the AllocationSampled events and observed that `s` such sample events were generated from a specific thread - `event_1`, `event_2`, ... `event_s`. Each `event_i` contains corresponding fields `type_i`, `size_i`, and `offset_i`. Let `u_i = size_i - offset_i`. `u_i` represents the successful trial byte + the number of bytes which remained after it in the same object. Let `u` = the sum of all the `u_i`, `i` = 1 to `s`. `p` is the constant 1/102400, the probability that each trial is successful. `q` is the complement 1 - 1/102400. + +#### Estimation strategies + +We have explored two different mathematical approaches for [estimating](https://en.wikipedia.org/wiki/Estimator) the number of bytes that were allocated given a set of observed sample events. Both approaches are unbiased, which means that if we repeated the same sampling procedure many times we would expect the average of the estimates to match the number of bytes allocated. Where the approaches differ is in the particular distribution of the estimates. + +#### Estimation Approach 1: Weighted samples + +We believe this approach gives estimates with lower [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error), but the exact shape of the distribution is hard to calculate so we don't know a good way to produce [confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval) based on small numbers of samples.
The distribution does approach a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) as the number of samples increases ([Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem)), but we haven't done any analysis attempting to define how rapidly that convergence occurs. + +To estimate the number of bytes using this technique, let `estimate_i = size_i/(1 - q^size_i)` for each sample `i`. Then sum `estimate_i` over all samples to get a total estimate of the allocated bytes. With sufficiently many samples the estimate distribution should converge to a normal distribution with variance at most `N*q/p` for `N` total bytes of allocated objects. + +##### Statistics stuff + +Understanding this part isn't necessary to use the estimation formula above but may be helpful. + +Proving the weighted sample estimator is unbiased: +Consider the sequence of all objects allocated on a thread. Let `X_j` be a random variable that has value `size_j/(1 - q^size_j)` if the `j`th object is sampled, otherwise zero. Our estimation formula above is the sum of all `X_j` because only the sampled objects will contribute a non-zero term. Based on our sampling procedure, the probability for an object to be sampled is `1-q^size_j`, which means `E(X_j) = size_j/(1 - q^size_j) * Pr(object j is sampled) = size_j/(1 - q^size_j) * (1 - q^size_j) = size_j`. By linearity of expectation, the expected value of the sum is the sum of the expected values = sum of `size_j` for all `j` = total size of allocated objects. + +The variance for this estimation is the sum of the variances for each `X_j` term, `(size_j^2)*(q^size_j)/(1-q^size_j)`. If we assume there are `N` total bytes of objects divided up into `N/n` objects of size `n`, then the total variance for that set of objects would be `(N/n)*n^2*q^n/(1-q^n) = N*n*q^n/(1-q^n)`. That expression is maximized when `n=1`, so the maximum variance for any collection of objects with total size `N` is `N*1*q^1/(1-q^1) = N*q/(1-q) = N*q/p`. + +#### Estimation Approach 2: Estimating failed trials + +This is an alternate estimate that has a more predictable distribution, but potentially higher [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error). You could use this approach to produce both estimates and confidence intervals, or use the weighted sample formula to produce estimates and use this one solely as a conservative confidence interval for the estimate. + +The estimation formula is `sq/p + u`. + +This estimate is based on the [Negative Binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution) with `s` successes and `p` chance of success. The `sq/p` term is the mean of this distribution and represents the expected number of failed trials necessary to observe `s` successful trials. The `u` term then adds in the number of successful trials (1 per sample) and the number of bytes for which no trials were performed (`u_i-1` per sample). + +Here is an approach to calculate a [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) estimate based on this distribution: + +1. Decide on some probability `C` that the actual number of allocated bytes `N` should fall within the interval. You can pick a probability arbitrarily close to 1; however, the higher the probability, the wider the estimated interval will be.
For the remaining `1-C` probability that `N` is not within the interval we will pick the upper and lower bounds so that there is a `(1-C)/2` chance that `N` is below the interval and `(1-C)/2` chance that `N` is above the interval. We think `C=0.95` would be a reasonable choice for many tools which means there would be a 2.5% chance the actual value is below the lower bound, a 95% chance it is between the lower and upper bound, and a 2.5% chance it is above the upper bound. + +2. Implement some method to calculate the Negative Binomial [CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function). Unfortunately there is no trivial formula for this but there are a couple potential approaches: + a. The Negative Binomial Distribution has a CDF defined based on the [regularized incomplete beta function](https://en.wikipedia.org/wiki/Beta_function#Incomplete_beta_function). There are various numerical libraries such as scipy in Python that will calculate this for you. Alternately you could directly implement numerical approximation techniques to evaluate the function, either approximating the integral form or approximating the continued fraction expansion. + b. The Camp-Paulson approximation described in [Barko(66)](https://www.stat.cmu.edu/technometrics/59-69/VOL-08-02/v0802345.pdf). We validated that for p=0.00001 this approximation was within ~0.01 of the true CDF for any number of failures at s=1, within ~0.001 of the true CDF at s=5, and continues to get more accurate as the sample count increases. + +3. Do binary search on the CDF to locate the input number of failures for which `CDF(failures, s, p)` is closest to `(1-C)/2` and `C + (1-C)/2`. Assuming that `CDF(low_failures, s, p) = (1-C)/2` and `CDF(high_failures, s, p) = C + (1-C)/2` then the confidence interval for `N` is `[low_failures+u, high_failures+u]`. + +For example if we select `C=0.95`, observed 8 samples and `u=10,908` then we'd use binary search to find `CDF(353666, 8, 1/102400) ~= 0.025` and `CDF(1476870, 8, 1/102400) ~= 0.975`. Our interval estimate for the number of bytes allocated would be `[353666 + 10908, 1476870 + 10908]`. + +To get a rough idea of the error in proportion to the number of samples, here is table of calculated Negative Binomial failed trials for the 0.025 and 0.975 thresholds of the CDF: + +| # of samples (successes) | CDF() = 0.025 | CDF() = 0.975 | +| ---------------------------| ------------------------| --------------------------- | +| 1 | 2591 | 377738 | +| 2 | 24800 | 570531 | +| 3 | 63349 | 739802 | +| 4 | 111599 | 897761 | +| 5 | 166241 | 1048730 | +| 6 | 225469 | 1194827 | +| 7 | 288185 | 1337279 | +| 8 | 353666 | 1476870 | +| 9 | 421407 | 1614137 | +| 10 | 491039 | 1749469 | +| 20 | 1250954 | 3038270 | +| 30 | 2072639 | 4264804 | +| 40 | 2926207 | 5459335 | +| 50 | 3800118 | 6633475 | +| 100 | 8331581 | 12342053 | +| 200 | 17739679 | 23413825 | +| 300 | 27341465 | 34291862 | +| 400 | 37043463 | 45069676 | +| 500 | 46809487 | 55783459 | +| 1000 | 96149867 | 108842093 | +| 2000 | 195919830 | 213870137 | +| 3000 | 296301551 | 318286418 | +| 4000 | 396999923 | 422386047 | +| 5000 | 497900649 | 526283322 | +| 10000 | 1004017229 | 1044156743 | + +Notice that if we compare the expected total number of trials (102400 * # of samples) to the estimated ranges, at 10 samples the error bars extend more than 50% in each direction showing the predictions on so few samples are very imprecise. However at 1,000 samples the error is ~6% in each direction and at 10,000 samples ~2% in each direction. 
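+
+As a concrete illustration, the worked example above (8 samples, `u=10,908`, `C=0.95`) can be reproduced with a short Python sketch using scipy, one of the numerical libraries suggested in step 2. This is purely illustrative and not part of the runtime or the event payload; `nbinom.ppf` inverts the Negative Binomial CDF, taking the place of the binary search described in step 3.
+
+```
+from scipy.stats import nbinom
+
+p = 1 / 102400   # per-byte probability of a successful trial
+s = 8            # number of AllocationSampled events observed
+u = 10908        # sum of u_i = size_i - offset_i over all samples
+C = 0.95         # desired confidence level
+
+# Point estimate of allocated bytes: sq/p + u
+estimate = s * (1 - p) / p + u                     # ~830,100 bytes
+
+# Interval estimate: invert the Negative Binomial CDF at the two tail probabilities
+low_failures = nbinom.ppf((1 - C) / 2, s, p)       # ~353,666 failed trials
+high_failures = nbinom.ppf(1 - (1 - C) / 2, s, p)  # ~1,476,870 failed trials
+interval = (low_failures + u, high_failures + u)   # ~[364,574, 1,487,778] bytes
+```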
+ +The variance for the Negative Binomial Distribution is `sq/p^2`. In the limit where all allocated objects have size 1 byte, `E(s)=Np`, which gives an expected variance of `Nq/p`, the same as with the weighted sample approach. However, as object sizes increase, the variance of approach 1 decreases more rapidly than the variance of this approach. + +#### Compensating for bytes allocated on a thread in between events + +It is likely you want to estimate allocations starting and ending at arbitrary points in time that do not correspond exactly with the moment a sampling event was emitted. This means the initial sampling event covered more time than the allocation period we are interested in, and the allocations at the end aren't included in any sampling event. You can conservatively adjust the error bounds to account for the uncertainty in the starting and ending allocations. If the starting point is not aligned with a sampling event, calculate the lower bound of allocated bytes as if one fewer sample had been received. If the ending point is not aligned with a sampling event, calculate the upper bound as if one more sample had been received. + +#### Estimating the total number of bytes allocated on all threads + +The per-thread estimations can be repeated for all threads and summed up. + +#### Estimating the number of bytes allocated for objects of a specific type, size or other characteristic + +Select from the sampling events only those events which occurred in objects matching your criteria. For example, if you want to estimate the number of bytes allocated for Foo-typed objects, select the samples in Foo-typed objects. Using this reduced set of samples, apply the same estimation technique as above. The error on this estimation will also be based on the number of samples in your filtered subset. If there were 1000 initial samples but only 3 of those samples were in Foo-typed objects, that might generate an estimate of 310K bytes of Foo objects, but beware that the potential sampling error for a small number of samples is very large. + +## Implementation design + +Overall, the implementation needs to do a few steps: +1. Determine if sampling events are enabled. If no, there is nothing else to do, but if yes, we need to do steps (2) and (3). +2. Use a random number generator to simulate random trials for each allocated byte and determine which objects contain the successful trials. +3. When a successful trial occurs, emit the AllocationSampled event. + +Steps (1) and (3) are straightforward, but step (2) is non-trivial to do correctly and performantly. For step (1) we use the existing macro ETW_TRACING_CATEGORY_ENABLED(), which despite its name works for all our event tracing technologies. For step (3) we defined a method FireAllocationSampled() in gchelpers.cpp and the code to emit the event is in there. Like all runtime events, the definition for the event itself is in ClrEtwAll.man. All the remaining discussion is about how we accomplish step (2). + +Our conceptual sampling model involves doing Bernoulli trials for every byte of an object. In theory we could implement that very literally: each object allocation would run a for loop of n iterations for an n-byte object and generate random coin flips with a pseudo-random number generator (PRNG). However, doing this would be incredibly slow. A good way to understand the actual implementation is to imagine we started with this simple, slow approach and then did several iterative transformations to make it run faster while maintaining the same output.
Imagine that we have some function `bool GetNextTrialResult(CLRRandom* pRandom)` that takes a PRNG and should randomly return true with probability 1 in 102,400. It might be implemented: + +``` +bool GetNextTrialResult(CLRRandom* pRandom) +{ + return pRandom->NextDouble() < 1.0/102400; +} +``` + +We don't have to generate random numbers at the instant we need them, however; we are allowed to cache a batch of them at a time and dispense them later. For simplicity, treat all the apparent global variables in these examples as being thread-local. In pseudo-code that looks like: + +``` +List _cachedTrials = PopulateTrials(pRandom); +List PopulateTrials(CLRRandom* pRandom) +{ + List trials = new List(); + for(int i = 0; i < 100; i++) + { + trials.Push(pRandom->NextDouble() < 1.0/102400); + } + return trials; +} +bool GetNextTrialResult(CLRRandom* pRandom) +{ + bool ret = _cachedTrials.Pop(); + // if we are out of trials, cache some more for next time + if(_cachedTrials.Count == 0) + { + _cachedTrials = PopulateTrials(pRandom); + } + return ret; +} +``` + +Notice that almost every entry in the cached list will be false, so this is an inefficient way to store the trials. Rather than storing a large number of false bits we could store a single number that represents a run of zero or more contiguous false bools followed by a single true bool. There is also no requirement that our cached batches of trials are the same size, so we could cache exactly one run of false results. In pseudo-code that looks like: + +``` +BigInteger _cachedFailedTrials = PopulateTrials(pRandom); +BigInteger PopulateTrials(CLRRandom* pRandom) +{ + BigInteger failedTrials = 0; + while(pRandom->NextDouble() >= 1.0/102400) + { + failedTrials++; + } + return failedTrials; +} +bool GetNextTrialResult(CLRRandom* pRandom) +{ + bool ret = (_cachedFailedTrials == 0); + _cachedFailedTrials--; + // if we are out of trials, cache some more for next time + if(_cachedFailedTrials < 0) + { + _cachedFailedTrials = PopulateTrials(pRandom); + } + return ret; +} +``` + +Rather than generating `_cachedFailedTrials` by doing many independent queries to a random number generator, we can use some math to speed this up. The probability that `_cachedFailedTrials` has some particular value `X` is given by the [Geometric distribution](https://en.wikipedia.org/wiki/Geometric_distribution). We can use [Inverse Transform Sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling) to generate random values for this distribution directly. The CDF for the Geometric distribution is `1-(1-p)^(floor(x)+1)`, which means the inverse is `floor(ln(1-y)/ln(1-p))`. + +We've been using BigInteger so far because mathematically there is a non-zero probability of getting an arbitrarily large number of failed trials in a row. In practice, however, our PRNG has its outputs constrained to return a floating-point number with value k/MAX_INT for an integer value of k between 0 and MAX_INT-1. The largest value PopulateTrials() can return under these constraints is ~2.148M, which means a 32-bit integer can easily accommodate the value. The perfect mathematical model of the Geometric distribution has a 0.00000005% chance of getting a larger run of failed trials, but our PRNG rounds that incredibly unlikely case to zero probability.
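+
+As a quick sanity check on that inverse transform formula (this is illustrative Python, not runtime code), we can draw values with it and confirm they behave like runs of failed Bernoulli trials: the mean comes out near q/p ~= 102,399 and the chance of a run of at least 102,400 failures comes out near q^102400 ~= 0.368.
+
+```
+import math
+import random
+
+P = 1.0 / 102400
+rng = random.Random(12345)
+
+def geometric_failed_trials():
+    # Inverse transform sampling: floor(ln(1-y)/ln(1-p)) for uniform y in [0,1)
+    return math.floor(math.log(1.0 - rng.random()) / math.log(1.0 - P))
+
+draws = [geometric_failed_trials() for _ in range(1_000_000)]
+
+print(sum(draws) / len(draws))                            # ~102,399, the Geometric mean q/p
+print(sum(1 for d in draws if d >= 102400) / len(draws))  # ~0.368, i.e. q^102400
+```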
+ +Both of these changes combined give the following pseudo-code: + +``` +int _cachedFailedTrials = CalculateGeometricRandom(pRandom); +// Previously this method was called PopulateTrials() +// Use Inverse Transform Sampling to calculate a random value from the Geometric distribution +int CalculateGeometricRandom(CLRRandom* pRandom) +{ + return floor(log(1-pRandom->NextDouble())/log(1-1.0/102400)); +} +bool GetNextTrialResult(CLRRandom* pRandom) +{ + bool ret = (_cachedFailedTrials == 0); + _cachedFailedTrials--; + // if we are out of trials, cache some more for next time + if(_cachedFailedTrials < 0) + { + _cachedFailedTrials = CalculateGeometricRandom(pRandom); + } + return ret; +} +``` + +When allocating an object we need to do many trials at once, one for each byte. A naive implementation of that would look like: + +``` +bool DoesAnyTrialSucceed(CLRRandom* pRandom, int countOfTrials) +{ + for(int i = 0; i < countOfTrials; i++) + { + if(GetNextTrialResult(pRandom)) return true; + } + return false; +} +``` + +However, the `_cachedFailedTrials` representation lets us speed this up by checking whether the number of failed trials in the cache covers the number of trials we need to perform, without iterating through them one at a time: + +``` +bool DoesAnyTrialSucceed(CLRRandom* pRandom, int countOfTrials) +{ + bool ret = _cachedFailedTrials < countOfTrials; + _cachedFailedTrials -= countOfTrials; + // if we are out of trials, cache some more for next time + if(ret) + { + _cachedFailedTrials = CalculateGeometricRandom(pRandom); + } + return ret; +} +``` + + +We are getting closer to mapping our pseudo-code implementation to the real CLR code. The current CLR implementation for memory allocation has the GC hand out blocks of memory 8KB in size which the runtime is allowed to sub-allocate from. The GC gives out an `alloc_context` to each thread which has `alloc_ptr` and `alloc_limit` fields. These fields define the memory range [alloc_ptr, alloc_limit) which can be used to sub-allocate objects. The runtime has optimized assembly code helper functions that increment `alloc_ptr` directly for objects that are small enough to fit in the current range and don't require any special handling. For all other objects the runtime invokes a slower allocation path that ultimately calls the GC's Alloc() function. If the alloc_context is exhausted, calling GC Alloc() also allocates a new 8KB block for future fast object allocations to use. In order to allocate objects we could naively do this: + +``` +void* FastAssemblyAllocate(int objectSize) +{ + Thread* pThread = GetThread(); + CLRRandom* pRandom = pThread->GetRandom(); + alloc_context* pAllocContext = pThread->GetAllocContext(); + void* alloc_end = pAllocContext->alloc_ptr + objectSize; + if(IsSamplingEnabled() && DoesAnyTrialSucceed(pRandom, objectSize)) + PublishSamplingEvent(); + if(pAllocContext->alloc_limit < alloc_end) + { + return SlowAlloc(objectSize); + } + else + { + void* objectAddr = pAllocContext->alloc_ptr; + pAllocContext->alloc_ptr = alloc_end; + *objectAddr = methodTable; + return objectAddr; + } +} +``` + +Although orders of magnitude faster than where we started, this is still too slow. We don't want to put extra conditional checks for IsSamplingEnabled() and DoesAnyTrialSucceed() in the fast path of every allocation. Instead we want to combine the two if conditions down to a single compare and jump, then handle publishing a sample event as part of the slow allocation path.
Note that the value of the expression `alloc_ptr + _cachedFailedTrials` doesn't change across repeated calls to FastAssemblyAllocate() as long as we don't go down the SlowAlloc path or the PublishSamplingEvent() path. Each invocation increments `alloc_ptr` by `objectSize` and decrements `_cachedFailedTrials` by the same amount, leaving the sum unchanged. Let's define that sum `alloc_ptr + _cachedFailedTrials = sampling_limit`. You can imagine that if we started allocating objects contiguously from `alloc_ptr`, `sampling_limit` represents the point in the memory range where whatever object overlaps it contains the successful trial and emits the sampling event. A little more rigorously, `DoesAnyTrialSucceed()` returns true when `_cachedFailedTrials < objectSize`. Adding `alloc_ptr` to each side shows this is the same as the condition `sampling_limit < alloc_end`: + +``` +_cachedFailedTrials < objectSize +_cachedFailedTrials + alloc_ptr < objectSize + alloc_ptr +sampling_limit < alloc_end +``` + +Lastly, to combine the two if conditions we can define a new field `combined_limit = min(sampling_limit, alloc_limit)`. If sampling events aren't enabled then we define `combined_limit = alloc_limit`. This means that a single check, `if(combined_limit < alloc_end)`, detects when the object either exceeds `alloc_limit` or overlaps `sampling_limit`. The runtime actually has a bunch of different fast paths depending on the type of the object being allocated and the CPU architecture, but converted to pseudo-code they all look approximately like this: + +``` +void* FastAssemblyAllocate(int objectSize) +{ + Thread* pThread = GetThread(); + alloc_context* pAllocContext = pThread->GetAllocContext(); + void* alloc_end = pAllocContext->alloc_ptr + objectSize; + if(combined_limit < alloc_end) + { + return SlowAlloc(objectSize); + } + else + { + void* objectAddr = pAllocContext->alloc_ptr; + pAllocContext->alloc_ptr = alloc_end; + *objectAddr = methodTable; + return objectAddr; + } +} +``` + +The only change we've made in the assembly helpers is doing a comparison against `combined_limit` instead of `alloc_limit`, which should have no performance impact. Look at [JIT_TrialAllocSFastMP_InlineGetThread](https://github.com/dotnet/runtime/blob/5c8bb402e6a8274e8135dd00eda2248b4f57102f/src/coreclr/vm/amd64/JitHelpers_InlineGetThread.asm#L38) for an example of what one of these helpers looks like in assembly code. + +The pseudo-code and concepts we've been describing here are now close to matching the runtime code, but there are still some important differences to call out to map it more exactly: + +1. In the real runtime code, the assembly helpers call a variety of different C++ helpers depending on object type, and all of those helpers in turn call into [Alloc()](https://github.com/dotnet/runtime/blob/5c8bb402e6a8274e8135dd00eda2248b4f57102f/src/coreclr/vm/gchelpers.cpp#L201). Here we've omitted the different per-type intermediate functions and represented all of them as the SlowAlloc() function in the pseudo-code. + +2. The combined_limit field is a member of ee_alloc_context rather than alloc_context. This was done to avoid creating a breaking change in the EE<->GC interface. The ee_alloc_context contains an alloc_context within it as well as any additional fields we want to add that are only visible to the EE. + +3. In order to reduce the number of per-thread fields being managed, the real implementation doesn't have an explicit `sampling_limit`.
Instead, this only exists as the transient calculation of `alloc_ptr + CalculateGeometricRandom()` that is used when computing an updated value for `combined_limit`. Whenever `combined_limit < alloc_limit`, it is implied that `sampling_limit = combined_limit` and `_cachedFailedTrials = combined_limit - alloc_ptr`. However, if `combined_limit == alloc_limit`, that represents one of two possible states: +- Sampling is disabled +- Sampling is enabled and we have a batch of cached failed trials with size `alloc_limit - alloc_ptr`. In the examples above our batches were N failures followed by a success, but this is just N failures without any success at the end. This means no objects allocated in the current AC are going to be sampled, and whenever we allocate the N+1st byte we'll need to generate a new batch of trial results to determine whether that byte was sampled. +If it turns out to be easier to track `sampling_limit` with an explicit field when sampling is enabled we could do that; it just requires an extra pointer per thread. As memory overhead it's not much, but it will probably land in the L1 cache and wind up evicting some other field on the Thread object that now no longer fits in the cache line. The current implementation tries to minimize this cache impact. We never did any perf testing on alternative implementations that do track sampling_limit explicitly, so it's possible the difference isn't that meaningful. + +4. When we generated batches of trial results in the examples above we always used all the results before generating a new batch; however, the real implementation sometimes discards part of a batch. Implicitly this happens when we calculate a value for `sampling_limit=alloc_ptr+CalculateGeometricRandom()`, determine that `alloc_limit` is smaller than `sampling_limit`, and then set `combined_limit=alloc_limit`. Discarding also happens any time we recompute the `sampling_limit` based on a new random value without having fully allocated bytes up to `combined_limit`. It may seem suspicious that we can do this and still generate the correct distribution of samples, but it is OK if done properly. Bernoulli trials are independent of one another, so it is legal to discard trials from our cache as long as the decision to discard a given trial result is independent of what that trial result is. For example, in the very first pseudo-code sample with the List, it would be legal to generate 100 boolean trials and then arbitrarily truncate the list to size 50. The first 50 values in the list are still valid Bernoulli trials with the original probability p=1/102,400 of being true, as will be all the future ones from the batches that are populated later. However, if we scanned the list and conditionally discarded any trials that we observed had a success result, that would be problematic. This type of selective removal changes the probability distribution for the items that remain. + +5. The GC Alloc() function isn't the only place where the GC updates alloc_ptr and alloc_limit. They also get updated during a GC in the callback inside of GCToEEInterface::GcEnumAllocContexts(). This is another place where combined_limit needs to be updated to ensure it stays synchronized with alloc_ptr and alloc_limit. + + +## Thanks + +Thanks to Christophe Nasarre (@chrisnas) at DataDog for implementing this feature and to Mikelle Rogers for doing the investigation of the Camp-Paulson approximation.
\ No newline at end of file diff --git a/src/coreclr/inc/eventtracebase.h b/src/coreclr/inc/eventtracebase.h index 316104f649a1d8..dbcb2e08302d50 100644 --- a/src/coreclr/inc/eventtracebase.h +++ b/src/coreclr/inc/eventtracebase.h @@ -1333,17 +1333,19 @@ namespace ETW #define ETWLoaderStaticLoad 0 // Static reference load #define ETWLoaderDynamicLoad 1 // Dynamic assembly load +#if defined (FEATURE_EVENT_TRACE) +EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context; +EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_DOTNET_Context; +EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_DOTNET_Context; +EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_STRESS_PROVIDER_DOTNET_Context; +#endif // FEATURE_EVENT_TRACE + #if defined(FEATURE_EVENT_TRACE) && !defined(HOST_UNIX) // // The ONE and only ONE global instantiation of this class // extern ETW::CEtwTracer * g_pEtwTracer; -EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context; -EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_DOTNET_Context; -EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_DOTNET_Context; -EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_STRESS_PROVIDER_DOTNET_Context; - // // Special Handling of Startup events // diff --git a/src/coreclr/minipal/Windows/CMakeLists.txt b/src/coreclr/minipal/Windows/CMakeLists.txt index b1f1cb88a3e6ac..4af6841c11ea4c 100644 --- a/src/coreclr/minipal/Windows/CMakeLists.txt +++ b/src/coreclr/minipal/Windows/CMakeLists.txt @@ -3,6 +3,7 @@ set(SOURCES dn-u16.cpp ${CLR_SRC_NATIVE_DIR}/minipal/utf8.c ${CLR_SRC_NATIVE_DIR}/minipal/time.c + ${CLR_SRC_NATIVE_DIR}/minipal/xoshiro128pp.c ) if(NOT CLR_CROSS_COMPONENTS_BUILD) diff --git a/src/coreclr/nativeaot/Runtime/GCHelpers.cpp b/src/coreclr/nativeaot/Runtime/GCHelpers.cpp index 41b8aa8463c2ea..bc0c5698a08266 100644 --- a/src/coreclr/nativeaot/Runtime/GCHelpers.cpp +++ b/src/coreclr/nativeaot/Runtime/GCHelpers.cpp @@ -29,6 +29,12 @@ #include "gcdesc.h" +#ifdef FEATURE_EVENT_TRACE + #include "clretwallmain.h" +#else // FEATURE_EVENT_TRACE + #include "etmdummy.h" +#endif // FEATURE_EVENT_TRACE + #define RH_LARGE_OBJECT_SIZE 85000 MethodTable g_FreeObjectEEType; @@ -471,6 +477,29 @@ EXTERN_C int64_t QCALLTYPE RhGetTotalAllocatedBytesPrecise() return allocated; } +void FireAllocationSampled(GC_ALLOC_FLAGS flags, size_t size, size_t samplingBudgetOffset, Object* orObject) +{ +#ifdef FEATURE_EVENT_TRACE + void* typeId = GetLastAllocEEType(); + // Note: Just as for AllocationTick, the type name cannot be retrieved + WCHAR* name = nullptr; + + if (typeId != nullptr) + { + unsigned int allocKind = + (flags & GC_ALLOC_PINNED_OBJECT_HEAP) ? 2 : + (flags & GC_ALLOC_LARGE_OBJECT_HEAP) ? 1 : + 0; // SOH + unsigned int heapIndex = 0; +#ifdef BACKGROUND_GC + gc_heap* hp = gc_heap::heap_of((BYTE*)orObject); + heapIndex = hp->heap_number; +#endif + FireEtwAllocationSampled(allocKind, GetClrInstanceId(), typeId, name, heapIndex, (BYTE*)orObject, size, samplingBudgetOffset); + } +#endif +} + static Object* GcAllocInternal(MethodTable* pEEType, uint32_t uFlags, uintptr_t numElements, Thread* pThread) { ASSERT(!pThread->IsDoNotTriggerGcSet()); @@ -539,8 +568,47 @@ static Object* GcAllocInternal(MethodTable* pEEType, uint32_t uFlags, uintptr_t // Save the MethodTable for instrumentation purposes. 
tls_pLastAllocationEEType = pEEType; - Object* pObject = GCHeapUtilities::GetGCHeap()->Alloc(pThread->GetAllocContext(), cbSize, uFlags); - pThread->GetEEAllocContext()->UpdateCombinedLimit(); + // check for dynamic allocation sampling + ee_alloc_context* pEEAllocContext = pThread->GetEEAllocContext(); + gc_alloc_context* pAllocContext = pEEAllocContext->GetGCAllocContext(); + bool isSampled = false; + size_t availableSpace = 0; + size_t samplingBudget = 0; + + bool isRandomizedSamplingEnabled = ee_alloc_context::IsRandomizedSamplingEnabled(); + if (isRandomizedSamplingEnabled) + { + // The number bytes we can allocate before we need to emit a sampling event. + // This calculation is only valid if combined_limit < alloc_limit. + samplingBudget = (size_t)(pEEAllocContext->combined_limit - pAllocContext->alloc_ptr); + + // The number of bytes available in the current allocation context + availableSpace = (size_t)(pAllocContext->alloc_limit - pAllocContext->alloc_ptr); + + // Check to see if the allocated object overlaps a sampled byte + // in this AC. This happens when both: + // 1) The AC contains a sampled byte (combined_limit < alloc_limit) + // 2) The object is large enough to overlap it (samplingBudget < aligned_size) + // + // Note that the AC could have no remaining space for allocations (alloc_ptr = + // alloc_limit = combined_limit). When a thread hasn't done any SOH allocations + // yet it also starts in an empty state where alloc_ptr = alloc_limit = + // combined_limit = nullptr. The (1) check handles both of these situations + // properly as an empty AC can not have a sampled byte inside of it. + isSampled = + (pEEAllocContext->combined_limit < pAllocContext->alloc_limit) && + (samplingBudget < cbSize); + + // if the object overflows the AC, we need to sample the remaining bytes + // the sampling budget only included at most the bytes inside the AC + if (cbSize > availableSpace && !isSampled) + { + samplingBudget = ee_alloc_context::ComputeGeometricRandom() + availableSpace; + isSampled = (samplingBudget < cbSize); + } + } + + Object* pObject = GCHeapUtilities::GetGCHeap()->Alloc(pAllocContext, cbSize, uFlags); if (pObject == NULL) return NULL; @@ -551,6 +619,19 @@ static Object* GcAllocInternal(MethodTable* pEEType, uint32_t uFlags, uintptr_t ((Array*)pObject)->InitArrayLength((uint32_t)numElements); } + if (isSampled) + { + FireAllocationSampled((GC_ALLOC_FLAGS)uFlags, cbSize, samplingBudget, pObject); + } + + // There are a variety of conditions that may have invalidated the previous combined_limit value + // such as not allocating the object in the AC memory region (UOH allocations), moving the AC, adding + // extra alignment padding, allocating a new AC, or allocating an object that consumed the sampling budget. + // Rather than test for all the different invalidation conditions individually we conservatively always + // recompute it. If sampling isn't enabled this inlined function is just trivially setting + // combined_limit=alloc_limit. 
+ pEEAllocContext->UpdateCombinedLimit(isRandomizedSamplingEnabled); + if (uFlags & GC_ALLOC_USER_OLD_HEAP) GCHeapUtilities::GetGCHeap()->PublishObject((uint8_t*)pObject); diff --git a/src/coreclr/nativeaot/Runtime/disabledeventtrace.cpp b/src/coreclr/nativeaot/Runtime/disabledeventtrace.cpp index f0944fdf295179..7ed5c94e4eb963 100644 --- a/src/coreclr/nativeaot/Runtime/disabledeventtrace.cpp +++ b/src/coreclr/nativeaot/Runtime/disabledeventtrace.cpp @@ -12,6 +12,11 @@ void EventTracing_Initialize() { } +bool IsRuntimeProviderEnabled(uint8_t level, uint64_t keyword) +{ + return false; +} + void ETW::GCLog::FireGcStart(ETW_GC_INFO * pGcInfo) { } #ifdef FEATURE_ETW diff --git a/src/coreclr/nativeaot/Runtime/eventpipe/gen-eventing-event-inc.lst b/src/coreclr/nativeaot/Runtime/eventpipe/gen-eventing-event-inc.lst index 901af659ff84b6..77c9d8cb15a3da 100644 --- a/src/coreclr/nativeaot/Runtime/eventpipe/gen-eventing-event-inc.lst +++ b/src/coreclr/nativeaot/Runtime/eventpipe/gen-eventing-event-inc.lst @@ -1,5 +1,6 @@ # Native runtime events supported by aot runtime. +AllocationSampled BGC1stConEnd BGC1stNonConEnd BGC1stSweepEnd diff --git a/src/coreclr/nativeaot/Runtime/eventtrace.cpp b/src/coreclr/nativeaot/Runtime/eventtrace.cpp index 8b3d134f5c4f24..4e3c80ed1f75d4 100644 --- a/src/coreclr/nativeaot/Runtime/eventtrace.cpp +++ b/src/coreclr/nativeaot/Runtime/eventtrace.cpp @@ -37,6 +37,11 @@ DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_DOTNET_Con MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_EVENTPIPE_Context }; +bool IsRuntimeProviderEnabled(uint8_t level, uint64_t keyword) +{ + return RUNTIME_PROVIDER_CATEGORY_ENABLED(level, keyword); +} + volatile LONGLONG ETW::GCLog::s_l64LastClientSequenceNumber = 0; //--------------------------------------------------------------------------------------- @@ -245,4 +250,4 @@ void EventPipeEtwCallbackDotNETRuntimePrivate( _Inout_opt_ PVOID CallbackContext) { EtwCallbackCommon(DotNETRuntimePrivate, ControlCode, Level, MatchAnyKeyword, FilterData, true); -} \ No newline at end of file +} diff --git a/src/coreclr/nativeaot/Runtime/eventtracebase.h b/src/coreclr/nativeaot/Runtime/eventtracebase.h index 241c795c0d02fc..0e778cafb76571 100644 --- a/src/coreclr/nativeaot/Runtime/eventtracebase.h +++ b/src/coreclr/nativeaot/Runtime/eventtracebase.h @@ -30,6 +30,8 @@ void InitializeEventTracing(); #ifdef FEATURE_EVENT_TRACE +bool IsRuntimeProviderEnabled(uint8_t level, uint64_t keyword); + // !!!!!!! NOTE !!!!!!!! // The flags must match those in the ETW manifest exactly // !!!!!!! NOTE !!!!!!!! 
@@ -102,6 +104,7 @@ struct ProfilingScanContext; #define CLR_GCHEAPSURVIVALANDMOVEMENT_KEYWORD 0x400000 #define CLR_MANAGEDHEAPCOLLECT_KEYWORD 0x800000 #define CLR_GCHEAPANDTYPENAMES_KEYWORD 0x1000000 +#define CLR_ALLOCATIONSAMPLING_KEYWORD 0x80000000000 // // Using KEYWORDZERO means when checking the events category ignore the keyword diff --git a/src/coreclr/nativeaot/Runtime/gctoclreventsink.cpp b/src/coreclr/nativeaot/Runtime/gctoclreventsink.cpp index b8ffaad1ffe88c..3709a414fdf0e2 100644 --- a/src/coreclr/nativeaot/Runtime/gctoclreventsink.cpp +++ b/src/coreclr/nativeaot/Runtime/gctoclreventsink.cpp @@ -4,6 +4,7 @@ #include "common.h" #include "gctoclreventsink.h" #include "thread.h" +#include "eventtracebase.h" GCToCLREventSink g_gcToClrEventSink; @@ -174,6 +175,14 @@ void GCToCLREventSink::FireGCAllocationTick_V4(uint64_t allocationAmount, { LIMITED_METHOD_CONTRACT; +#ifdef FEATURE_EVENT_TRACE + if (IsRuntimeProviderEnabled(TRACE_LEVEL_INFORMATION, CLR_ALLOCATIONSAMPLING_KEYWORD)) + { + // skip AllocationTick if AllocationSampled is emitted + return; + } +#endif // FEATURE_EVENT_TRACE + void * typeId = GetLastAllocEEType(); WCHAR * name = nullptr; diff --git a/src/coreclr/nativeaot/Runtime/thread.cpp b/src/coreclr/nativeaot/Runtime/thread.cpp index b796b052182260..5437edd62b32ed 100644 --- a/src/coreclr/nativeaot/Runtime/thread.cpp +++ b/src/coreclr/nativeaot/Runtime/thread.cpp @@ -35,6 +35,13 @@ static Thread* g_RuntimeInitializingThread; #endif //!DACCESS_COMPILE +ee_alloc_context::PerThreadRandom::PerThreadRandom() +{ + minipal_xoshiro128pp_init(&random_state, (uint32_t)PalGetTickCount64()); +} + +thread_local ee_alloc_context::PerThreadRandom ee_alloc_context::t_random = PerThreadRandom(); + PInvokeTransitionFrame* Thread::GetTransitionFrame() { if (ThreadStore::GetSuspendingThread() == this) diff --git a/src/coreclr/nativeaot/Runtime/thread.h b/src/coreclr/nativeaot/Runtime/thread.h index 70f776de2ee9a1..f289332500f093 100644 --- a/src/coreclr/nativeaot/Runtime/thread.h +++ b/src/coreclr/nativeaot/Runtime/thread.h @@ -6,6 +6,7 @@ #include "StackFrameIterator.h" #include "slist.h" // DefaultSListTraits +#include struct gc_alloc_context; class RuntimeInstance; @@ -113,7 +114,19 @@ struct ee_alloc_context gc_alloc_context* GetGCAllocContext(); uint8_t* GetCombinedLimit(); - void UpdateCombinedLimit(); + void UpdateCombinedLimit(bool samplingEnabled); + static bool IsRandomizedSamplingEnabled(); + static int ComputeGeometricRandom(); + + struct PerThreadRandom + { + minipal_xoshiro128pp random_state; + + PerThreadRandom(); + float NextFloat(); + }; + + static thread_local PerThreadRandom t_random; }; diff --git a/src/coreclr/nativeaot/Runtime/thread.inl b/src/coreclr/nativeaot/Runtime/thread.inl index 5c17da3e61f3f3..5e113809c5642e 100644 --- a/src/coreclr/nativeaot/Runtime/thread.inl +++ b/src/coreclr/nativeaot/Runtime/thread.inl @@ -3,7 +3,9 @@ #ifndef DACCESS_COMPILE +#include "eventtracebase.h" +const uint32_t SamplingDistributionMean = (100 * 1024); inline gc_alloc_context* ee_alloc_context::GetGCAllocContext() { @@ -22,11 +24,43 @@ struct _thread_inl_gc_alloc_context uint8_t* alloc_limit; }; -inline void ee_alloc_context::UpdateCombinedLimit() + +inline bool ee_alloc_context::IsRandomizedSamplingEnabled() +{ +#ifdef FEATURE_EVENT_TRACE + return IsRuntimeProviderEnabled(TRACE_LEVEL_INFORMATION, CLR_ALLOCATIONSAMPLING_KEYWORD); +#else + return false; +#endif // FEATURE_EVENT_TRACE +} + +inline void ee_alloc_context::UpdateCombinedLimit(bool samplingEnabled) +{ + if 
(!samplingEnabled) + { + combined_limit = ((_thread_inl_gc_alloc_context*)GetGCAllocContext())->alloc_limit; + } + else + { + // compute the next sampling limit based on a geometric distribution + uint8_t* sampling_limit = ((_thread_inl_gc_alloc_context*)GetGCAllocContext())->alloc_ptr + ComputeGeometricRandom(); + + // if the sampling limit is larger than the allocation context, no sampling will occur in this AC + combined_limit = min(sampling_limit, ((_thread_inl_gc_alloc_context*)GetGCAllocContext())->alloc_limit); + } +} + +inline int ee_alloc_context::ComputeGeometricRandom() +{ + // compute a random sample from the Geometric distribution + float probability = t_random.NextFloat(); + int threshold = (int)(-log(1 - probability) * SamplingDistributionMean); + return threshold; +} + +inline float ee_alloc_context::PerThreadRandom::NextFloat() { - // The randomized allocation sampling feature is being submitted in stages. For now sampling is never enabled so - // combined_limit is always the same as alloc_limit. - combined_limit = ((_thread_inl_gc_alloc_context*)GetGCAllocContext())->alloc_limit; + return (float)minipal_xoshiro128pp_next(&random_state) * (1.0F/(float)UINT32_MAX); } // Set the m_pDeferredTransitionFrame field for GC allocation helpers that setup transition frame diff --git a/src/coreclr/vm/ClrEtwAll.man b/src/coreclr/vm/ClrEtwAll.man index 265d7a07726cf6..1760a55766d54c 100644 --- a/src/coreclr/vm/ClrEtwAll.man +++ b/src/coreclr/vm/ClrEtwAll.man @@ -91,6 +91,8 @@ message="$(string.RuntimePublisher.ProfilerKeywordMessage)" symbol="CLR_PROFILER_KEYWORD" /> + @@ -461,7 +463,13 @@ - + + + + + @@ -3164,6 +3172,30 @@ + + @@ -4256,7 +4288,11 @@ keywords="WaitHandleKeyword" opcode="win:Stop" task="WaitHandleWait" symbol="WaitHandleWaitStop" message="$(string.RuntimePublisher.WaitHandleWaitStopEventMessage)"/> - + + @@ -8616,7 +8652,8 @@ - + + @@ -8791,7 +8828,8 @@ - + + @@ -9155,7 +9193,8 @@ - + + diff --git a/src/coreclr/vm/gcheaputilities.cpp b/src/coreclr/vm/gcheaputilities.cpp index 32ea33a91cc3ce..cffe9185d5e61b 100644 --- a/src/coreclr/vm/gcheaputilities.cpp +++ b/src/coreclr/vm/gcheaputilities.cpp @@ -43,6 +43,8 @@ bool g_sw_ww_enabled_for_gc_heap = false; GVAL_IMPL_INIT(ee_alloc_context, g_global_alloc_context, {}); +thread_local ee_alloc_context::PerThreadRandom ee_alloc_context::t_random = PerThreadRandom(); + enum GC_LOAD_STATUS { GC_LOAD_STATUS_BEFORE_START, GC_LOAD_STATUS_START, diff --git a/src/coreclr/vm/gcheaputilities.h b/src/coreclr/vm/gcheaputilities.h index bbefdc9cd7bc32..cde662653c28be 100644 --- a/src/coreclr/vm/gcheaputilities.h +++ b/src/coreclr/vm/gcheaputilities.h @@ -4,7 +4,10 @@ #ifndef _GCHEAPUTILITIES_H_ #define _GCHEAPUTILITIES_H_ +#include "eventtracebase.h" #include "gcinterface.h" +#include "math.h" +#include // The singular heap instance. GPTR_DECL(IGCHeap, g_pGCHeap); @@ -13,6 +16,8 @@ GPTR_DECL(IGCHeap, g_pGCHeap); extern "C" { #endif // !DACCESS_COMPILE +const DWORD SamplingDistributionMean = (100 * 1024); + // This struct allows adding some state that is only visible to the EE onto the standard gc_alloc_context struct ee_alloc_context { @@ -66,13 +71,57 @@ struct ee_alloc_context return offsetof(ee_alloc_context, combined_limit); } - // Regenerate the randomized sampling limit and update the combined_limit field. 
- inline void UpdateCombinedLimit() + static inline bool IsRandomizedSamplingEnabled() + { +#ifdef FEATURE_EVENT_TRACE + return ETW_TRACING_CATEGORY_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, + TRACE_LEVEL_INFORMATION, + CLR_ALLOCATIONSAMPLING_KEYWORD); +#else + return false; +#endif // FEATURE_EVENT_TRACE + } + + inline void UpdateCombinedLimit(bool samplingEnabled) { - // The randomized sampling feature is being submitted in stages. At this point the sampling is never - // activated so combined_limit is always equal to alloc_limit. - combined_limit = gc_allocation_context.alloc_limit; + if (!samplingEnabled) + { + combined_limit = gc_allocation_context.alloc_limit; + } + else + { + // compute the next sampling limit based on a geometric distribution + uint8_t* sampling_limit = gc_allocation_context.alloc_ptr + ComputeGeometricRandom(); + + // if the sampling limit is larger than the allocation context, no sampling will occur in this AC + combined_limit = Min(sampling_limit, gc_allocation_context.alloc_limit); + } } + + static inline int ComputeGeometricRandom() + { + // compute a random sample from the Geometric distribution + float probability = t_random.NextFloat(); + int threshold = (int)(-log(1 - probability) * SamplingDistributionMean); + return threshold; + } + + struct PerThreadRandom + { + minipal_xoshiro128pp random_state; + + PerThreadRandom() + { + minipal_xoshiro128pp_init(&random_state, GetRandomInt(INT_MAX)); + } + + float NextFloat() + { + return (float)minipal_xoshiro128pp_next(&random_state) * (1.0F/(float)UINT32_MAX); + } + }; + + static thread_local PerThreadRandom t_random; }; GPTR_DECL(uint8_t,g_lowest_address); diff --git a/src/coreclr/vm/gchelpers.cpp b/src/coreclr/vm/gchelpers.cpp index cab5986ecd66a3..6e482b77e99f3e 100644 --- a/src/coreclr/vm/gchelpers.cpp +++ b/src/coreclr/vm/gchelpers.cpp @@ -183,6 +183,118 @@ inline void CheckObjectSize(size_t alloc_size) } } +void FireAllocationSampled(GC_ALLOC_FLAGS flags, size_t size, size_t samplingBudgetOffset, Object* orObject) +{ + // Note: this code is duplicated from GCToCLREventSink::FireGCAllocationTick_V4 + void* typeId = nullptr; + const WCHAR* name = nullptr; + InlineSString strTypeName; + EX_TRY + { + TypeHandle th = GetThread()->GetTHAllocContextObj(); + + if (th != 0) + { + th.GetName(strTypeName); + name = strTypeName.GetUnicode(); + typeId = th.GetMethodTable(); + } + } + EX_CATCH{} + EX_END_CATCH(SwallowAllExceptions) + // end of duplication + + if (typeId != nullptr) + { + unsigned int allocKind = + (flags & GC_ALLOC_PINNED_OBJECT_HEAP) ? 2 : + (flags & GC_ALLOC_LARGE_OBJECT_HEAP) ? 
1 : + 0; // SOH + unsigned int heapIndex = 0; +#ifdef BACKGROUND_GC + gc_heap* hp = gc_heap::heap_of((BYTE*)orObject); + heapIndex = hp->heap_number; +#endif + FireEtwAllocationSampled(allocKind, GetClrInstanceId(), typeId, name, heapIndex, (BYTE*)orObject, size, samplingBudgetOffset); + } +} + +inline Object* Alloc(ee_alloc_context* pEEAllocContext, size_t size, GC_ALLOC_FLAGS flags) +{ + CONTRACTL { + THROWS; + GC_TRIGGERS; + MODE_COOPERATIVE; // returns an objref without pinning it => cooperative + } CONTRACTL_END; + + Object* retVal = nullptr; + gc_alloc_context* pAllocContext = &pEEAllocContext->gc_allocation_context; + bool isSampled = false; + size_t availableSpace = 0; + size_t aligned_size = 0; + size_t samplingBudget = 0; + bool isRandomizedSamplingEnabled = ee_alloc_context::IsRandomizedSamplingEnabled(); + if (isRandomizedSamplingEnabled) + { + // object allocations are always padded up to pointer size + aligned_size = AlignUp(size, sizeof(uintptr_t)); + + // The number bytes we can allocate before we need to emit a sampling event. + // This calculation is only valid if combined_limit < alloc_limit. + samplingBudget = (size_t)(pEEAllocContext->combined_limit - pAllocContext->alloc_ptr); + + // The number of bytes available in the current allocation context + availableSpace = (size_t)(pAllocContext->alloc_limit - pAllocContext->alloc_ptr); + + // Check to see if the allocated object overlaps a sampled byte + // in this AC. This happens when both: + // 1) The AC contains a sampled byte (combined_limit < alloc_limit) + // 2) The object is large enough to overlap it (samplingBudget < aligned_size) + // + // Note that the AC could have no remaining space for allocations (alloc_ptr = + // alloc_limit = combined_limit). When a thread hasn't done any SOH allocations + // yet it also starts in an empty state where alloc_ptr = alloc_limit = + // combined_limit = nullptr. The (1) check handles both of these situations + // properly as an empty AC can not have a sampled byte inside of it. + isSampled = + (pEEAllocContext->combined_limit < pAllocContext->alloc_limit) && + (samplingBudget < aligned_size); + + // if the object overflows the AC, we need to sample the remaining bytes + // the sampling budget only included at most the bytes inside the AC + if (aligned_size > availableSpace && !isSampled) + { + samplingBudget = ee_alloc_context::ComputeGeometricRandom() + availableSpace; + isSampled = (samplingBudget < aligned_size); + } + } + + GCStress::MaybeTrigger(pAllocContext); + + // for SOH, if there is enough space in the current allocation context, then + // the allocation will be done in place (like in the fast path), + // otherwise a new allocation context will be provided + retVal = GCHeapUtilities::GetGCHeap()->Alloc(pAllocContext, size, flags); + + if (isSampled) + { + // At this point the object methodtable isn't initialized yet but it doesn't matter when we are + // just emitting an ETW/EventPipe event. If we want this event to be more useful from ICorProfiler + // in the future we probably want to pass the isSampled flag back to callers so that the event + // can be raised after the MethodTable is initialized. 
+ FireAllocationSampled(flags, aligned_size, samplingBudget, retVal); + } + + // There are a variety of conditions that may have invalidated the previous combined_limit value + // such as not allocating the object in the AC memory region (UOH allocations), moving the AC, adding + // extra alignment padding, allocating a new AC, or allocating an object that consumed the sampling budget. + // Rather than test for all the different invalidation conditions individually we conservatively always + // recompute it. If sampling isn't enabled this inlined function is just trivially setting + // combined_limit=alloc_limit. + pEEAllocContext->UpdateCombinedLimit(isRandomizedSamplingEnabled); + + return retVal; +} // There are only two ways to allocate an object. // * Call optimized helpers that were generated on the fly. This is how JIT compiled code does most @@ -222,19 +334,13 @@ inline Object* Alloc(size_t size, GC_ALLOC_FLAGS flags) if (GCHeapUtilities::UseThreadAllocationContexts()) { - ee_alloc_context *threadContext = GetThreadEEAllocContext(); - GCStress::MaybeTrigger(&threadContext->gc_allocation_context); - retVal = GCHeapUtilities::GetGCHeap()->Alloc(&threadContext->gc_allocation_context, size, flags); - threadContext->UpdateCombinedLimit(); + retVal = Alloc(GetThreadEEAllocContext(), size, flags); } else { GlobalAllocLockHolder holder(&g_global_alloc_lock); - ee_alloc_context *globalContext = &g_global_alloc_context; - GCStress::MaybeTrigger(&globalContext->gc_allocation_context); - retVal = GCHeapUtilities::GetGCHeap()->Alloc(&globalContext->gc_allocation_context, size, flags); - globalContext->UpdateCombinedLimit(); + retVal = Alloc(&g_global_alloc_context, size, flags); } diff --git a/src/coreclr/vm/gctoclreventsink.cpp b/src/coreclr/vm/gctoclreventsink.cpp index fff929d51567a5..ce75e4cc661830 100644 --- a/src/coreclr/vm/gctoclreventsink.cpp +++ b/src/coreclr/vm/gctoclreventsink.cpp @@ -162,6 +162,16 @@ void GCToCLREventSink::FireGCAllocationTick_V4(uint64_t allocationAmount, { LIMITED_METHOD_CONTRACT; +#ifdef FEATURE_EVENT_TRACE + if (ETW_TRACING_CATEGORY_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, + TRACE_LEVEL_INFORMATION, + CLR_ALLOCATIONSAMPLING_KEYWORD)) + { + // skip AllocationTick if AllocationSampled is emitted + return; + } +#endif // FEATURE_EVENT_TRACE + void * typeId = nullptr; const WCHAR * name = nullptr; InlineSString strTypeName; diff --git a/src/native/minipal/xoshiro128pp.c b/src/native/minipal/xoshiro128pp.c new file mode 100644 index 00000000000000..01e8e80696b3bd --- /dev/null +++ b/src/native/minipal/xoshiro128pp.c @@ -0,0 +1,81 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. + +#include + +// This code is a slightly modified version of the xoshiro128++ generator from http://prng.di.unimi.it/xoshiro128plusplus.c + +/* Written in 2019 by David Blackman and Sebastiano Vigna (vigna@acm.org) +To the extent possible under law, the author has dedicated all copyright +and related and neighboring rights to this software to the public domain +worldwide. + +See . +*/ + +static inline uint32_t rotl(const uint32_t x, int k) { + return (x << k) | (x >> (32 - k)); +} + +/* This is the jump function for the generator. It is equivalent + to 2^64 calls to next(); it can be used to generate 2^64 + non-overlapping subsequences for parallel computations. 
*/ + +static void jump(struct minipal_xoshiro128pp* pState) { + static const uint32_t JUMP[] = { 0x8764000b, 0xf542d2d3, 0x6fa035c3, 0x77f2db5b }; + + uint32_t* s = pState->s; + uint32_t s0 = 0; + uint32_t s1 = 0; + uint32_t s2 = 0; + uint32_t s3 = 0; + for (int i = 0; i < sizeof JUMP / sizeof * JUMP; i++) + for (int b = 0; b < 32; b++) { + if (JUMP[i] & UINT32_C(1) << b) { + s0 ^= s[0]; + s1 ^= s[1]; + s2 ^= s[2]; + s3 ^= s[3]; + } + minipal_xoshiro128pp_next(pState); + } + + s[0] = s0; + s[1] = s1; + s[2] = s2; + s[3] = s3; +} + +void minipal_xoshiro128pp_init(struct minipal_xoshiro128pp* pState, uint32_t seed) { + uint32_t* s = pState->s; + if (seed == 0) + { + seed = 997; + } + + s[0] = seed; + s[1] = seed; + s[2] = seed; + s[3] = seed; + jump(pState); +} + +uint32_t minipal_xoshiro128pp_next(struct minipal_xoshiro128pp* pState) { + uint32_t* s = pState->s; + const uint32_t result = rotl(s[0] + s[3], 7) + s[0]; + + const uint32_t t = s[1] << 9; + + s[2] ^= s[0]; + s[3] ^= s[1]; + s[1] ^= s[2]; + s[0] ^= s[3]; + + s[2] ^= t; + + s[3] = rotl(s[3], 11); + + return result; +} + + diff --git a/src/native/minipal/xoshiro128pp.h b/src/native/minipal/xoshiro128pp.h new file mode 100644 index 00000000000000..485705115ca5b8 --- /dev/null +++ b/src/native/minipal/xoshiro128pp.h @@ -0,0 +1,26 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. + +#ifndef HAVE_MINIPAL_XOSHIRO128PP_H +#define HAVE_MINIPAL_XOSHIRO128PP_H + +#include + +#ifdef __cplusplus +extern "C" +{ +#endif // __cplusplus + +struct minipal_xoshiro128pp +{ + uint32_t s[4]; +}; + +void minipal_xoshiro128pp_init(struct minipal_xoshiro128pp* pState, uint32_t seed); + +uint32_t minipal_xoshiro128pp_next(struct minipal_xoshiro128pp* pState); + +#ifdef __cplusplus +} +#endif // __cplusplus +#endif /* HAVE_MINIPAL_XOSHIRO128PP_H */