From 91b51bebf970fc1ba60b26c82c64028e1f252680 Mon Sep 17 00:00:00 2001 From: Noah Falk Date: Tue, 16 Jul 2024 04:37:52 -0700 Subject: [PATCH] Add the randomized allocation sampling feature This feature allows profilers to do allocation profiling based off randomized samples. It has better theoretical and empirically observed accuracy than our current allocation profiling approaches while also maintaining low performance overhead. It is designed for use in production profiling scenarios. For more information about usage and implementation, see the included doc docs/design/features/RandomizedAllocationSampling.md --- .../features/RandomizedAllocationSampling.md | 317 ++++++++++++ src/coreclr/inc/eventtracebase.h | 12 +- src/coreclr/minipal/Unix/CMakeLists.txt | 1 + src/coreclr/minipal/Windows/CMakeLists.txt | 1 + src/coreclr/nativeaot/Runtime/CMakeLists.txt | 1 + src/coreclr/nativeaot/Runtime/GCHelpers.cpp | 85 +++- .../nativeaot/Runtime/disabledeventtrace.cpp | 5 + .../eventpipe/gen-eventing-event-inc.lst | 1 + src/coreclr/nativeaot/Runtime/eventtrace.cpp | 7 +- .../nativeaot/Runtime/eventtracebase.h | 3 + .../nativeaot/Runtime/gctoclreventsink.cpp | 9 + src/coreclr/nativeaot/Runtime/thread.cpp | 7 + src/coreclr/nativeaot/Runtime/thread.h | 15 +- src/coreclr/nativeaot/Runtime/thread.inl | 52 +- src/coreclr/vm/ClrEtwAll.man | 49 +- src/coreclr/vm/gcheaputilities.cpp | 2 + src/coreclr/vm/gcheaputilities.h | 67 ++- src/coreclr/vm/gchelpers.cpp | 122 ++++- src/coreclr/vm/gctoclreventsink.cpp | 10 + src/native/minipal/xoshiro128pp.c | 81 +++ src/native/minipal/xoshiro128pp.h | 26 + .../allocationsampling.cs | 174 +++++++ .../allocationsampling.csproj | 25 + .../manual/Allocate/Allocate.csproj | 8 + .../Allocate/AllocateArraysOfDoubles.cs | 21 + .../manual/Allocate/AllocateDifferentTypes.cs | 49 ++ .../Allocate/AllocateRatioSizedArrays.cs | 44 ++ .../manual/Allocate/AllocateSmallAndBig.cs | 180 +++++++ .../Allocate/AllocationsRunEventSource.cs | 34 ++ .../manual/Allocate/IAllocations.cs | 8 + .../manual/Allocate/Program.cs | 133 +++++ .../manual/Allocate/ThreadedAllocations.cs | 176 +++++++ .../manual/AllocationProfiler.sln | 51 ++ .../AllocationProfiler.csproj | 13 + .../manual/AllocationProfiler/Program.cs | 474 ++++++++++++++++++ .../manual/README.md | 110 ++++ 36 files changed, 2342 insertions(+), 31 deletions(-) create mode 100644 docs/design/features/RandomizedAllocationSampling.md create mode 100644 src/native/minipal/xoshiro128pp.c create mode 100644 src/native/minipal/xoshiro128pp.h create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/allocationsampling.cs create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/allocationsampling.csproj create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/Allocate.csproj create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateArraysOfDoubles.cs create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateDifferentTypes.cs create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateRatioSizedArrays.cs create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateSmallAndBig.cs create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocationsRunEventSource.cs create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/IAllocations.cs create mode 
100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/Program.cs
 create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/ThreadedAllocations.cs
 create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler.sln
 create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler/AllocationProfiler.csproj
 create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler/Program.cs
 create mode 100644 src/tests/tracing/eventpipe/randomizedallocationsampling/manual/README.md

diff --git a/docs/design/features/RandomizedAllocationSampling.md b/docs/design/features/RandomizedAllocationSampling.md
new file mode 100644
index 00000000000000..61d41c0ffd10c9
--- /dev/null
+++ b/docs/design/features/RandomizedAllocationSampling.md
@@ -0,0 +1,317 @@
+# Randomized Allocation Sampling
+
+Christophe Nasarre (@chrisnas), Noah Falk (@noahfalk) - 2024
+
+## Introduction
+
+.NET developers want to understand the GC allocation behavior of their programs, both for general observability and specifically to better understand performance costs. Although the runtime has a very high performance GC, reducing the number of bytes allocated in a scenario can have a notable impact on the total execution time and frequency of GC pauses. Some of the ways developers understand these costs are by measuring allocated bytes in:
+1. Microbenchmarks such as Benchmark.DotNet
+2. .NET APIs such as [GC.GetAllocatedBytesForCurrentThread()](https://learn.microsoft.com/dotnet/api/system.gc.getallocatedbytesforcurrentthread)
+3. Memory profiling tools such as VS profiler, PerfView, and dotTrace
+4. Metrics or other production telemetry
+
+Analysis of allocation behavior often starts simple, using the total bytes allocated while executing a block of code or during some time duration. However, for any non-trivial scenario gaining a deeper understanding requires attributing allocations to specific lines of source code, callstacks, types, and object sizes. .NET's current state-of-the-art technique for doing this is using a profiling tool to sample using the [AllocationTick](https://learn.microsoft.com/en-us/dotnet/fundamentals/diagnostics/runtime-garbage-collection-events#gcallocationtick_v3-event) event. When enabled, this event triggers approximately every time 100KB has been allocated. However, this sampling is not a random sample: it has a fixed starting point and stride, which can lead to significant sampling error for allocation patterns that are periodic. This has been observed in practice, so it isn't merely a theoretical concern. The new randomized allocation sampling feature is intended to address the shortcomings of AllocationTick and offer more rigorous estimations of allocation behavior and probabilistic error bounds. We do this by creating a new `AllocationSampled` event that profilers can opt into via any of our standard event tracing technologies (ETW, EventPipe, LTTng, EventListener). The new event is completely independent of AllocationTick and we expect profilers will prefer to use the AllocationSampled event on runtime versions where it is available.
+
+The initial part of this document describes the conceptual sampling model and how we suggest the data be interpreted by profilers. The latter portion describes how the sampling model is implemented efficiently in runtime code.
+
+## The sampling model
+
+When the new AllocationSampled event is enabled, each managed thread starts sampling independently of the others. For a given thread there will be a sequence of allocated objects Object_1, Object_2, etc. that may continue indefinitely. Each object has a corresponding .NET type and size. The size of an object includes the object header, method table, object fields, and trailing padding that aligns the size to be a multiple of the pointer size. It does not include any additional memory the GC may optionally allocate for more than pointer-sized alignment, filling gaps that are impossible/inefficient to use for objects, or other GC bookkeeping data structures. Also note that .NET has a non-GC heap where some objects that stay alive for the process lifetime are allocated. Those non-GC heap objects are ignored by this feature.
+
+When each new object is allocated, conceptually the runtime starts doing independent [Bernoulli Trials](https://en.wikipedia.org/wiki/Bernoulli_trial) (weighted coin flips) for every byte in the object. Each trial has probability p = 1/102,400 of being considered a success. As soon as one successful trial is observed, no more trials are performed for that object and an AllocationSampled event is emitted. This event contains the object type, its size, and the 0-based offset of the byte where the successful trial occurred. This means for a given object, if an event was generated then `offset` failed trials occurred followed by a successful trial, and if no event was generated then `size` failed trials occurred. This process continues indefinitely for each newly allocated object.
+
+This sampling process is closely related to the [Bernoulli process](https://en.wikipedia.org/wiki/Bernoulli_process) and is a well-studied area of statistics. Skipping ahead to the end of an object once a successful sample has been produced does require some accommodations in the analysis, but many key results are still applicable.
+
+## Using the feature
+
+### Enabling sample events
+
+The allocation sampled events are enabled on the `Microsoft-Windows-DotNetRuntime` provider using keyword `0x80000000000` at informational or higher level. For more details on how to do this using different event tracing technologies see [here](https://learn.microsoft.com/en-us/dotnet/core/diagnostics/eventsource-collect-and-view-traces).
+
+### Interpreting sample events
+
+Although diagnostic tools are free to interpret the data in whatever way they choose, we have some recommendations for analysis that we expect are useful and statistically sound.
+
+#### Definitions
+
+For all of this section assume that we enabled the AllocationSampled events and have observed that `s` such sample events were generated from a specific thread: `event_1`, `event_2`, ... `event_s`. Each `event_i` contains corresponding fields `type_i`, `size_i`, and `offset_i`. Let `u_i = size_i - offset_i`; `u_i` represents the successful trial byte plus the number of bytes which remained after it in the same object. Let `u` be the sum of all the `u_i`, `i` = 1 to `s`. `p` is the constant 1/102400, the probability that each trial is successful. `q` is the complement 1 - 1/102400.
+
+#### Estimation strategies
+
+We have explored two different mathematical approaches for [estimating](https://en.wikipedia.org/wiki/Estimator) the number of bytes that were allocated given a set of observed sample events. Both approaches are unbiased, which means that if we repeated the same sampling procedure many times we would expect the average of the estimates to match the number of bytes allocated. Where the approaches differ is in the particular distribution of the estimates.
+
+#### Estimation Approach 1: Weighted samples
+
+We believe this approach gives estimates with lower [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error), but the exact shape of the distribution is hard to calculate so we don't know a good way to produce [confidence intervals](https://en.wikipedia.org/wiki/Confidence_interval) based on small numbers of samples. The distribution does approach a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution) as the number of samples increases ([Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem)), but we haven't done any analysis attempting to define how rapidly that convergence occurs.
+
+To estimate the number of bytes using this technique, let `estimate_i = size_i/(1 - q^size_i)` for each sample `i`. Then sum `estimate_i` over all samples to get a total estimate of the allocated bytes. With sufficiently many samples the estimate distribution should converge to a normal distribution with variance at most `N*q/p` for `N` total bytes of allocated objects.
+
+##### Statistics stuff
+
+Understanding this part isn't necessary to use the estimation formula above but may be helpful.
+
+Proving the weighted sample estimator is unbiased:
+Consider the sequence of all objects allocated on a thread. Let `X_j` be a random variable that has value `size_j/(1 - q^size_j)` if the `j`th object is sampled, and zero otherwise. Our estimation formula above is the sum of all `X_j` because only the sampled objects will contribute a non-zero term. Based on our sampling procedure the probability for an object to be sampled is `1-q^size_j`, which means `E(X_j) = size_j/(1 - q^size_j) * Pr(object j is sampled) = size_j/(1 - q^size_j) * (1 - q^size_j) = size_j`. By linearity of expectation, the expected value of the sum is the sum of the expected values = sum of `size_j` for all `j` = total size of allocated objects.
+
+The variance for this estimation is the sum of variances for each `X_j` term, `(size_j^2)*(q^size_j)/(1-q^size_j)`. If we assume there are `N` total bytes of objects divided up into `N/n` objects of size `n`, then the total variance for that set of objects would be `(N/n)*n^2*q^n/(1-q^n) = N*n*q^n/(1-q^n)`. That expression is maximized when `n=1`, so the maximum variance for any collection of objects with total size `N` is `N*1*q^1/(1-q^1) = N*q/(1-q) = N*q/p`.
+
+#### Estimation Approach 2: Estimating failed trials
+
+This is an alternate estimate that has a more predictable distribution, but potentially higher [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error). You could use this approach to produce both estimates and confidence intervals, or use the weighted sample formula to produce estimates and use this one solely as a conservative confidence interval for the estimate.
+
+The estimation formula is `sq/p + u`.
+
+This estimate is based on the [Negative Binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution) with `s` successes and `p` chance of success. The `sq/p` term is the mean of this distribution and represents the expected number of failed trials necessary to observe `s` successful trials. The `u` term then adds in the number of successful trials (1 per sample) and the number of bytes for which no trials were performed (`u_i-1` per sample).
+
+Here is an approach to calculate a [confidence interval](https://en.wikipedia.org/wiki/Confidence_interval) estimate based on this distribution:
+
+1. Decide on some probability `C` that the actual number of allocated bytes `N` should fall within the interval. You can pick a probability arbitrarily close to 1; however, the higher the probability, the wider the estimated interval will be. For the remaining `1-C` probability that `N` is not within the interval we will pick the upper and lower bounds so that there is a `(1-C)/2` chance that `N` is below the interval and a `(1-C)/2` chance that `N` is above the interval. We think `C=0.95` would be a reasonable choice for many tools, which means there would be a 2.5% chance the actual value is below the lower bound, a 95% chance it is between the lower and upper bound, and a 2.5% chance it is above the upper bound.
+
+2. Implement some method to calculate the Negative Binomial [CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function). Unfortunately there is no trivial formula for this, but there are a couple of potential approaches:
+   a. The Negative Binomial Distribution has a CDF defined based on the [regularized incomplete beta function](https://en.wikipedia.org/wiki/Beta_function#Incomplete_beta_function). There are various numerical libraries such as scipy in Python that will calculate this for you. Alternatively, you could directly implement numerical approximation techniques to evaluate the function, either approximating the integral form or approximating the continued fraction expansion.
+   b. The Camp-Paulson approximation described in [Bartko (1966)](https://www.stat.cmu.edu/technometrics/59-69/VOL-08-02/v0802345.pdf). We validated that for p=0.00001 this approximation was within ~0.01 of the true CDF for any number of failures at s=1, within ~0.001 of the true CDF at s=5, and continues to get more accurate as the sample count increases.
+
+3. Do binary search on the CDF to locate the input number of failures for which `CDF(failures, s, p)` is closest to `(1-C)/2` and `C + (1-C)/2`. Assuming that `CDF(low_failures, s, p) = (1-C)/2` and `CDF(high_failures, s, p) = C + (1-C)/2`, then the confidence interval for `N` is `[low_failures+u, high_failures+u]`.
+
+For example, if we select `C=0.95` and observed 8 samples with `u=10,908`, then we'd use binary search to find `CDF(353666, 8, 1/102400) ~= 0.025` and `CDF(1476870, 8, 1/102400) ~= 0.975`. Our interval estimate for the number of bytes allocated would be `[353666 + 10908, 1476870 + 10908]`.
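+
+To make the recipe above concrete, here is a short C# sketch of both estimators and the binary search from step 3. It is a minimal illustration rather than a reference implementation: the type and method names are hypothetical, and the Negative Binomial CDF is assumed to be supplied by a numerical library or an approximation as discussed in step 2.
+
+```
+using System;
+using System.Collections.Generic;
+
+static class AllocationEstimator
+{
+    const double P = 1.0 / 102400; // probability each per-byte trial succeeds
+    const double Q = 1.0 - P;      // probability each per-byte trial fails
+
+    // Approach 1: weighted samples. 'sizes' holds size_i from each sample event.
+    public static double WeightedEstimate(IEnumerable<long> sizes)
+    {
+        double total = 0;
+        foreach (long size in sizes)
+        {
+            // estimate_i = size_i / (1 - q^size_i)
+            total += size / (1.0 - Math.Pow(Q, size));
+        }
+        return total;
+    }
+
+    // Approach 2: estimating failed trials. s = number of samples, u = sum of all u_i.
+    public static double FailedTrialsEstimate(long s, long u) => s * Q / P + u;
+
+    // Step 3: binary search for the smallest failure count f where cdf(f) >= target.
+    // 'cdf' computes CDF(failures, s, p) and must be nondecreasing; 'hi' is any
+    // generous upper bound on the failure count.
+    public static long InvertCdf(Func<long, double> cdf, double target, long hi)
+    {
+        long lo = 0;
+        while (lo < hi)
+        {
+            long mid = lo + (hi - lo) / 2;
+            if (cdf(mid) < target) lo = mid + 1; else hi = mid;
+        }
+        return lo;
+    }
+}
+```
+
+With `C = 0.95` the interval is `[InvertCdf(cdf, 0.025, hi) + u, InvertCdf(cdf, 0.975, hi) + u]`, which for `s = 8` and `u = 10,908` reproduces the worked example above.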
+
+To get a rough idea of the error in proportion to the number of samples, here is a table of calculated Negative Binomial failed trials for the 0.025 and 0.975 thresholds of the CDF:
+
+| # of samples (successes) | CDF() = 0.025 | CDF() = 0.975 |
+| ---------------------------| ------------------------| --------------------------- |
+| 1 | 2591 | 377738 |
+| 2 | 24800 | 570531 |
+| 3 | 63349 | 739802 |
+| 4 | 111599 | 897761 |
+| 5 | 166241 | 1048730 |
+| 6 | 225469 | 1194827 |
+| 7 | 288185 | 1337279 |
+| 8 | 353666 | 1476870 |
+| 9 | 421407 | 1614137 |
+| 10 | 491039 | 1749469 |
+| 20 | 1250954 | 3038270 |
+| 30 | 2072639 | 4264804 |
+| 40 | 2926207 | 5459335 |
+| 50 | 3800118 | 6633475 |
+| 100 | 8331581 | 12342053 |
+| 200 | 17739679 | 23413825 |
+| 300 | 27341465 | 34291862 |
+| 400 | 37043463 | 45069676 |
+| 500 | 46809487 | 55783459 |
+| 1000 | 96149867 | 108842093 |
+| 2000 | 195919830 | 213870137 |
+| 3000 | 296301551 | 318286418 |
+| 4000 | 396999923 | 422386047 |
+| 5000 | 497900649 | 526283322 |
+| 10000 | 1004017229 | 1044156743 |
+
+Notice that if we compare the expected total number of trials (102400 * # of samples) to the estimated ranges, at 10 samples the error bars extend more than 50% in each direction, showing that predictions based on so few samples are very imprecise. However at 1,000 samples the error is ~6% in each direction, and at 10,000 samples ~2% in each direction.
+
+The variance for the Negative Binomial Distribution is `sq/p^2`. In the limit where all allocated objects have size 1 byte, `E(s)=Np`, which gives an expected variance of `Nq/p`, the same as with the weighted samples approach. However, as object sizes increase the variance of approach 1 decreases more rapidly than that of this approach.
+
+#### Compensating for bytes allocated on a thread in between events
+
+It is likely you will want to estimate allocations starting and ending at arbitrary points in time that do not correspond exactly with the moment a sampling event was emitted. This means the initial sampling event covered more time than the allocation period we are interested in, and the allocations at the end aren't included in any sampling event. You can conservatively adjust the error bounds to account for the uncertainty in the starting and ending allocations. If the starting point is not aligned with a sampling event, calculate the lower bound of allocated bytes as if there was one fewer sample received. If the ending point is not aligned with a sampling event, calculate the upper bound as if there was one more sample received.
+
+#### Estimating the total number of bytes allocated on all threads
+
+The per-thread estimations can be repeated for all threads and summed up.
+
+#### Estimating the number of bytes allocated for objects of a specific type, size or other characteristic
+
+Select from the sampling events only those events which occurred in objects matching your criteria. For example, if you want to estimate the number of bytes allocated for Foo-typed objects, select the samples in Foo-typed objects. Using this reduced set of samples, apply the same estimation technique as above. The error on this estimation will also be based on the number of samples in your filtered subset. If there were 1000 initial samples but only 3 of those samples were in Foo-typed objects, that might generate an estimate of 310K bytes of Foo objects, but beware that the potential sampling error for a small number of samples is very large.
+
+## Implementation design
+
+Overall the implementation needs to do a few steps:
+1. Determine if sampling events are enabled. If not, there is nothing else to do; if so, we need to do steps (2) and (3).
+2. Use a random number generator to simulate random trials for each allocated byte and determine which objects contain the successful trials
+3. When a successful trial occurs, emit the AllocationSampled event
+
+Steps (1) and (3) are straightforward, but step (2) is non-trivial to do correctly and performantly. For step (1) we use the existing macro ETW_TRACING_CATEGORY_ENABLED(), which despite its name works for all our event tracing technologies. For step (3) we defined a method FireAllocationSampled() in gchelpers.cpp and the code to emit the event is in there. Like all runtime events, the definition for the event itself is in ClrEtwAll.man. All the remaining discussion covers how we accomplish step (2).
+
+Our conceptual sampling model involves doing Bernoulli trials for every byte of an object. In theory we could implement that very literally. Each object allocation would run a for loop for n iterations for an n byte object and generate random coin flips with a pseudo-random number generator (PRNG). However, doing this would be incredibly slow. A good way to understand the actual implementation is to imagine we started with this simple slow approach and then did several iterative transformations to make it run faster while maintaining the same output. Imagine that we have some function `bool GetNextTrialResult(CLRRandom* pRandom)` that takes a PRNG and should randomly return true with probability 1 in 102,400. It might be implemented:
+
+```
+bool GetNextTrialResult(CLRRandom* pRandom)
+{
+    return pRandom->NextDouble() < 1.0/102400;
+}
+```
+
+We don't have to generate random numbers at the instant we need them, however; we are allowed to cache a batch of them at a time and dispense them later. For simplicity, treat all the apparent global variables in these examples as being thread-local. In pseudo-code that looks like:
+
+```
+List _cachedTrials = PopulateTrials(pRandom);
+List PopulateTrials(CLRRandom* pRandom)
+{
+    List trials = new List();
+    for(int i = 0; i < 100; i++)
+    {
+        trials.Push(pRandom->NextDouble() < 1.0/102400);
+    }
+    return trials;
+}
+bool GetNextTrialResult(CLRRandom* pRandom)
+{
+    bool ret = _cachedTrials.Pop();
+    // if we are out of trials, cache some more for next time
+    if(_cachedTrials.Count == 0)
+    {
+        _cachedTrials = PopulateTrials(pRandom);
+    }
+    return ret;
+}
+```
+
+Notice that almost every entry in the cached list will be false, so this is an inefficient way to store it. Rather than storing a large number of false bits we could store a single number that represents a run of zero or more contiguous false bools followed by a single true bool. There is also no requirement that our cached batches of trials are the same size, so we could cache exactly one run of false results. In pseudo-code that looks like:
+
+```
+BigInteger _cachedFailedTrials = PopulateTrials(pRandom);
+BigInteger PopulateTrials(CLRRandom* pRandom)
+{
+    BigInteger failedTrials = 0;
+    while(pRandom->NextDouble() >= 1.0/102400)
+    {
+        failedTrials++;
+    }
+    return failedTrials;
+}
+bool GetNextTrialResult(CLRRandom* pRandom)
+{
+    bool ret = (_cachedFailedTrials == 0);
+    _cachedFailedTrials--;
+    // if we are out of trials, cache some more for next time
+    if(_cachedFailedTrials < 0)
+    {
+        _cachedFailedTrials = PopulateTrials(pRandom);
+    }
+    return ret;
+}
+```
+
+Rather than generating `_cachedFailedTrials` by doing many independent queries to a random number generator, we can use some math to speed this up. The probability that `_cachedFailedTrials` has some particular value `X` is given by the [Geometric distribution](https://en.wikipedia.org/wiki/Geometric_distribution). We can use [Inverse Transform Sampling](https://en.wikipedia.org/wiki/Inverse_transform_sampling) to generate random values for this distribution directly. The CDF for the Geometric distribution is `1-(1-p)^(floor(x)+1)`, which means the inverse is `floor(ln(1-y)/ln(1-p))`.
+
+We've been using BigInteger so far because mathematically there is a non-zero probability of getting an arbitrarily large number of failed trials in a row. In practice, however, our PRNG has its outputs constrained to return a floating point number with value k/MAX_INT for an integer value of k between 0 and MAX_INT-1. The largest value PopulateTrials() can return under these constraints is ~2.148M, which means a 32-bit integer can easily accommodate the value. The perfect mathematical model of the Geometric distribution has a 0.00000005% chance of getting a larger run of failed trials, but our PRNG rounds that incredibly unlikely case to zero probability.
+
+Both of these changes combined give the pseudo-code:
+
+```
+int _cachedFailedTrials = CalculateGeometricRandom(pRandom);
+// Previously this method was called PopulateTrials()
+// Use Inverse Transform Sampling to calculate a random value from the Geometric distribution
+int CalculateGeometricRandom(CLRRandom* pRandom)
+{
+    return floor(log(1-pRandom->NextDouble())/log(1-1.0/102400));
+}
+bool GetNextTrialResult(CLRRandom* pRandom)
+{
+    bool ret = (_cachedFailedTrials == 0);
+    _cachedFailedTrials--;
+    // if we are out of trials, cache some more for next time
+    if(_cachedFailedTrials < 0)
+    {
+        _cachedFailedTrials = CalculateGeometricRandom(pRandom);
+    }
+    return ret;
+}
+```
+
+When allocating an object we need to do many trials at once, one for each byte. A naive implementation of that would look like:
+
+```
+bool DoesAnyTrialSucceed(CLRRandom* pRandom, int countOfTrials)
+{
+    for(int i = 0; i < countOfTrials; i++)
+    {
+        if(GetNextTrialResult(pRandom)) return true;
+    }
+    return false;
+}
+```
+
+However, the `_cachedFailedTrials` representation lets us speed this up by checking whether the number of failed trials in the cache covers the number of trials we need to perform, without iterating through them one at a time:
+
+```
+bool DoesAnyTrialSucceed(CLRRandom* pRandom, int countOfTrials)
+{
+    bool ret = _cachedFailedTrials < countOfTrials;
+    _cachedFailedTrials -= countOfTrials;
+    // if we are out of trials, cache some more for next time
+    if(ret)
+    {
+        _cachedFailedTrials = CalculateGeometricRandom(pRandom);
+    }
+    return ret;
+}
+```
+
+
+We are getting closer to mapping our pseudo-code implementation to the real CLR code. The current CLR implementation for memory allocation has the GC hand out blocks of memory 8KB in size, which the runtime is allowed to sub-allocate from. The GC gives each thread an `alloc_context`, which has `alloc_ptr` and `alloc_limit` fields. These fields define the memory range [alloc_ptr, alloc_limit) which can be used to sub-allocate objects. The runtime has optimized assembly code helper functions that increment `alloc_ptr` directly for objects that are small enough to fit in the current range and don't require any special handling. For all other objects the runtime invokes a slower allocation path that ultimately calls the GC's Alloc() function. If the alloc_context is exhausted, calling GC Alloc() also allocates a new 8KB block for future fast object allocations to use. In order to allocate objects we could naively do this:
+
+```
+void* FastAssemblyAllocate(int objectSize)
+{
+    Thread* pThread = GetThread();
+    CLRRandom* pRandom = pThread->GetRandom();
+    alloc_context* pAllocContext = pThread->GetAllocContext();
+    void* alloc_end = pAllocContext->alloc_ptr + objectSize;
+    if(IsSamplingEnabled() && DoesAnyTrialSucceed(pRandom, objectSize))
+        PublishSamplingEvent();
+    if(pAllocContext->alloc_limit < alloc_end)
+    {
+        return SlowAlloc(objectSize);
+    }
+    else
+    {
+        void* objectAddr = pAllocContext->alloc_ptr;
+        pAllocContext->alloc_ptr = alloc_end;
+        *objectAddr = methodTable;
+        return objectAddr;
+    }
+}
+```
+
+Although orders of magnitude faster than where we started, this is still too slow. We don't want to put extra conditional checks for IsSamplingEnabled() and DoesAnyTrialSucceed() in the fast path of every allocation. Instead we want to combine the two if conditions down to a single compare and jump, then handle publishing a sample event as part of the slow allocation path. Note that the value of the expression `alloc_ptr + _cachedFailedTrials` doesn't change across repeated calls to FastAssemblyAllocate() as long as we don't go down the SlowAlloc() path or the PublishSamplingEvent() path. Each invocation increments `alloc_ptr` by `objectSize` and decrements `_cachedFailedTrials` by the same amount, leaving the sum unchanged. Let's define that sum `alloc_ptr + _cachedFailedTrials = sampling_limit`. You can imagine that if we started allocating objects contiguously from `alloc_ptr`, `sampling_limit` represents the point in the memory range where whatever object overlaps it contains the successful trial and emits the sampling event. A little more rigorously, `DoesAnyTrialSucceed()` returns true when `_cachedFailedTrials < objectSize`. Adding `alloc_ptr` to each side shows this is the same as the condition `sampling_limit < alloc_end`:
+
+```
+_cachedFailedTrials < objectSize
+_cachedFailedTrials + alloc_ptr < objectSize + alloc_ptr
+sampling_limit < alloc_end
+```
+
+Lastly, to combine the two if conditions we can define a new field `combined_limit = min(sampling_limit, alloc_limit)`. If sampling events aren't enabled then we define `combined_limit = alloc_limit`. This means that a single check `if(combined_limit < alloc_end)` detects when either the object exceeds `alloc_limit` or it overlaps `sampling_limit`. The runtime actually has a bunch of different fast paths depending on the type of the object being allocated and the CPU architecture, but converted to pseudo-code they all look approximately like this:
+
+```
+void* FastAssemblyAllocate(int objectSize)
+{
+    Thread* pThread = GetThread();
+    alloc_context* pAllocContext = pThread->GetAllocContext();
+    void* alloc_end = pAllocContext->alloc_ptr + objectSize;
+    if(combined_limit < alloc_end)
+    {
+        return SlowAlloc(objectSize);
+    }
+    else
+    {
+        void* objectAddr = pAllocContext->alloc_ptr;
+        pAllocContext->alloc_ptr = alloc_end;
+        *objectAddr = methodTable;
+        return objectAddr;
+    }
+}
+```
+
+The only change we've made in the assembly helpers is doing a comparison against combined_limit instead of alloc_limit, which should have no performance impact. Look at [JIT_TrialAllocSFastMP_InlineGetThread](https://github.com/dotnet/runtime/blob/5c8bb402e6a8274e8135dd00eda2248b4f57102f/src/coreclr/vm/amd64/JitHelpers_InlineGetThread.asm#L38) for an example of what one of these helpers looks like in assembly code.
+
+The pseudo-code and concepts we've been describing here are now close to matching the runtime code, but there are still some important differences to call out to map it more exactly:
+
+1. In the real runtime code the assembly helpers call a variety of different C++ helpers depending on object type, and all of those helpers in turn call into [Alloc()](https://github.com/dotnet/runtime/blob/5c8bb402e6a8274e8135dd00eda2248b4f57102f/src/coreclr/vm/gchelpers.cpp#L201). Here we've omitted the different per-type intermediate functions and represented all of them as the SlowAlloc() function in the pseudo-code.
+
+2. The combined_limit field is a member of ee_alloc_context rather than alloc_context. This was done to avoid creating a breaking change in the EE<->GC interface. The ee_alloc_context contains an alloc_context within it as well as any additional fields we want to add that are only visible to the EE.
+
+3. In order to reduce the number of per-thread fields being managed, the real implementation doesn't have an explicit `sampling_limit`. Instead this only exists as the transient calculation of `alloc_ptr + CalculateGeometricRandom()` that is used when computing an updated value for `combined_limit`. Whenever `combined_limit < alloc_limit` then it is implied that `sampling_limit = combined_limit` and `_cachedFailedTrials = combined_limit - alloc_ptr`. However if `combined_limit == alloc_limit` that represents one of two possible states:
+- Sampling is disabled
+- Sampling is enabled and we have a batch of cached failed trials with size `alloc_limit - alloc_ptr`. In the examples above our batches were N failures followed by a success, but this is just N failures without any success at the end. This means no objects allocated in the current AC are going to be sampled, and whenever we allocate the N+1st byte we'll need to generate a new batch of trial results to determine whether that byte was sampled.
+If it turns out to be easier to track `sampling_limit` with an explicit field when sampling is enabled we could do that; it just requires an extra pointer per thread. As memory overhead it's not much, but it will probably land in the L1 cache and wind up evicting some other field on the Thread object that now no longer fits in the cache line. The current implementation tries to minimize this cache impact. We never did any perf testing on alternative implementations that do track sampling_limit explicitly, so it's possible the difference isn't that meaningful.
+
+4. When we generate batches of trial results in the examples above we always used all the results before generating a new batch; however, the real implementation sometimes discards part of a batch. Implicitly this happens when we calculate a value for `sampling_limit=alloc_ptr+CalculateGeometricRandom()`, determine that `alloc_limit` is smaller than `sampling_limit`, and then set `combined_limit=alloc_limit`. Discarding also happens any time we recompute the `sampling_limit` based on a new random value without having fully allocated bytes up to `combined_limit`. It may seem suspicious that we can do this and still generate the correct distribution of samples, but it is OK if done properly. Bernoulli trials are independent from one another, so it is legal to discard trials from our cache as long as the decision to discard a given trial result is independent of what that trial result is. For example, in the very first pseudo-code sample with the List, it would be legal to generate 100 boolean trials and then arbitrarily truncate the list to size 50. The first 50 values in the list are still valid Bernoulli trials with the original probability p=1/102,400 of being true, as will be all the future ones from the batches that are populated later. However, if we scanned the list and conditionally discarded any trials that we observed had a success result, that would be problematic. This type of selective removal changes the probability distribution for the items that remain.
+
+5. The GC Alloc() function isn't the only time that the GC updates alloc_ptr and alloc_limit. They also get updated during a GC in the callback inside of GCToEEInterface::GcEnumAllocContexts(). This is another place where combined_limit needs to be updated to ensure it stays synchronized with alloc_ptr and alloc_limit.
+
+
+## Thanks
+
+Thanks to Christophe Nasarre (@chrisnas) at DataDog for implementing this feature and Mikelle Rogers for doing the investigation of the Camp-Paulson approximation.
\ No newline at end of file diff --git a/src/coreclr/inc/eventtracebase.h b/src/coreclr/inc/eventtracebase.h index 316104f649a1d8..dbcb2e08302d50 100644 --- a/src/coreclr/inc/eventtracebase.h +++ b/src/coreclr/inc/eventtracebase.h @@ -1333,17 +1333,19 @@ namespace ETW #define ETWLoaderStaticLoad 0 // Static reference load #define ETWLoaderDynamicLoad 1 // Dynamic assembly load +#if defined (FEATURE_EVENT_TRACE) +EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context; +EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_DOTNET_Context; +EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_DOTNET_Context; +EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_STRESS_PROVIDER_DOTNET_Context; +#endif // FEATURE_EVENT_TRACE + #if defined(FEATURE_EVENT_TRACE) && !defined(HOST_UNIX) // // The ONE and only ONE global instantiation of this class // extern ETW::CEtwTracer * g_pEtwTracer; -EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context; -EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_DOTNET_Context; -EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_RUNDOWN_PROVIDER_DOTNET_Context; -EXTERN_C DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_STRESS_PROVIDER_DOTNET_Context; - // // Special Handling of Startup events // diff --git a/src/coreclr/minipal/Unix/CMakeLists.txt b/src/coreclr/minipal/Unix/CMakeLists.txt index 6cb70d1e344697..4ca2e5a1c54110 100644 --- a/src/coreclr/minipal/Unix/CMakeLists.txt +++ b/src/coreclr/minipal/Unix/CMakeLists.txt @@ -2,6 +2,7 @@ set(SOURCES doublemapping.cpp dn-u16.cpp ${CLR_SRC_NATIVE_DIR}/minipal/time.c + ${CLR_SRC_NATIVE_DIR}/minipal/xoshiro128pp.c ) if(NOT CLR_CROSS_COMPONENTS_BUILD) diff --git a/src/coreclr/minipal/Windows/CMakeLists.txt b/src/coreclr/minipal/Windows/CMakeLists.txt index b1f1cb88a3e6ac..4af6841c11ea4c 100644 --- a/src/coreclr/minipal/Windows/CMakeLists.txt +++ b/src/coreclr/minipal/Windows/CMakeLists.txt @@ -3,6 +3,7 @@ set(SOURCES dn-u16.cpp ${CLR_SRC_NATIVE_DIR}/minipal/utf8.c ${CLR_SRC_NATIVE_DIR}/minipal/time.c + ${CLR_SRC_NATIVE_DIR}/minipal/xoshiro128pp.c ) if(NOT CLR_CROSS_COMPONENTS_BUILD) diff --git a/src/coreclr/nativeaot/Runtime/CMakeLists.txt b/src/coreclr/nativeaot/Runtime/CMakeLists.txt index b763e76af80e96..5608f0f0ec4bfc 100644 --- a/src/coreclr/nativeaot/Runtime/CMakeLists.txt +++ b/src/coreclr/nativeaot/Runtime/CMakeLists.txt @@ -52,6 +52,7 @@ set(COMMON_RUNTIME_SOURCES ${CLR_SRC_NATIVE_DIR}/minipal/cpufeatures.c ${CLR_SRC_NATIVE_DIR}/minipal/time.c + ${CLR_SRC_NATIVE_DIR}/minipal/xoshiro128pp.c ) set(SERVER_GC_SOURCES diff --git a/src/coreclr/nativeaot/Runtime/GCHelpers.cpp b/src/coreclr/nativeaot/Runtime/GCHelpers.cpp index 41b8aa8463c2ea..bc0c5698a08266 100644 --- a/src/coreclr/nativeaot/Runtime/GCHelpers.cpp +++ b/src/coreclr/nativeaot/Runtime/GCHelpers.cpp @@ -29,6 +29,12 @@ #include "gcdesc.h" +#ifdef FEATURE_EVENT_TRACE + #include "clretwallmain.h" +#else // FEATURE_EVENT_TRACE + #include "etmdummy.h" +#endif // FEATURE_EVENT_TRACE + #define RH_LARGE_OBJECT_SIZE 85000 MethodTable g_FreeObjectEEType; @@ -471,6 +477,29 @@ EXTERN_C int64_t QCALLTYPE RhGetTotalAllocatedBytesPrecise() return allocated; } +void FireAllocationSampled(GC_ALLOC_FLAGS flags, size_t size, size_t samplingBudgetOffset, Object* orObject) +{ +#ifdef FEATURE_EVENT_TRACE + void* typeId = GetLastAllocEEType(); + // Note: Just as for AllocationTick, the type name cannot be retrieved + 
WCHAR* name = nullptr; + + if (typeId != nullptr) + { + unsigned int allocKind = + (flags & GC_ALLOC_PINNED_OBJECT_HEAP) ? 2 : + (flags & GC_ALLOC_LARGE_OBJECT_HEAP) ? 1 : + 0; // SOH + unsigned int heapIndex = 0; +#ifdef BACKGROUND_GC + gc_heap* hp = gc_heap::heap_of((BYTE*)orObject); + heapIndex = hp->heap_number; +#endif + FireEtwAllocationSampled(allocKind, GetClrInstanceId(), typeId, name, heapIndex, (BYTE*)orObject, size, samplingBudgetOffset); + } +#endif +} + static Object* GcAllocInternal(MethodTable* pEEType, uint32_t uFlags, uintptr_t numElements, Thread* pThread) { ASSERT(!pThread->IsDoNotTriggerGcSet()); @@ -539,8 +568,47 @@ static Object* GcAllocInternal(MethodTable* pEEType, uint32_t uFlags, uintptr_t // Save the MethodTable for instrumentation purposes. tls_pLastAllocationEEType = pEEType; - Object* pObject = GCHeapUtilities::GetGCHeap()->Alloc(pThread->GetAllocContext(), cbSize, uFlags); - pThread->GetEEAllocContext()->UpdateCombinedLimit(); + // check for dynamic allocation sampling + ee_alloc_context* pEEAllocContext = pThread->GetEEAllocContext(); + gc_alloc_context* pAllocContext = pEEAllocContext->GetGCAllocContext(); + bool isSampled = false; + size_t availableSpace = 0; + size_t samplingBudget = 0; + + bool isRandomizedSamplingEnabled = ee_alloc_context::IsRandomizedSamplingEnabled(); + if (isRandomizedSamplingEnabled) + { + // The number bytes we can allocate before we need to emit a sampling event. + // This calculation is only valid if combined_limit < alloc_limit. + samplingBudget = (size_t)(pEEAllocContext->combined_limit - pAllocContext->alloc_ptr); + + // The number of bytes available in the current allocation context + availableSpace = (size_t)(pAllocContext->alloc_limit - pAllocContext->alloc_ptr); + + // Check to see if the allocated object overlaps a sampled byte + // in this AC. This happens when both: + // 1) The AC contains a sampled byte (combined_limit < alloc_limit) + // 2) The object is large enough to overlap it (samplingBudget < aligned_size) + // + // Note that the AC could have no remaining space for allocations (alloc_ptr = + // alloc_limit = combined_limit). When a thread hasn't done any SOH allocations + // yet it also starts in an empty state where alloc_ptr = alloc_limit = + // combined_limit = nullptr. The (1) check handles both of these situations + // properly as an empty AC can not have a sampled byte inside of it. 
+ isSampled = + (pEEAllocContext->combined_limit < pAllocContext->alloc_limit) && + (samplingBudget < cbSize); + + // if the object overflows the AC, we need to sample the remaining bytes + // the sampling budget only included at most the bytes inside the AC + if (cbSize > availableSpace && !isSampled) + { + samplingBudget = ee_alloc_context::ComputeGeometricRandom() + availableSpace; + isSampled = (samplingBudget < cbSize); + } + } + + Object* pObject = GCHeapUtilities::GetGCHeap()->Alloc(pAllocContext, cbSize, uFlags); if (pObject == NULL) return NULL; @@ -551,6 +619,19 @@ static Object* GcAllocInternal(MethodTable* pEEType, uint32_t uFlags, uintptr_t ((Array*)pObject)->InitArrayLength((uint32_t)numElements); } + if (isSampled) + { + FireAllocationSampled((GC_ALLOC_FLAGS)uFlags, cbSize, samplingBudget, pObject); + } + + // There are a variety of conditions that may have invalidated the previous combined_limit value + // such as not allocating the object in the AC memory region (UOH allocations), moving the AC, adding + // extra alignment padding, allocating a new AC, or allocating an object that consumed the sampling budget. + // Rather than test for all the different invalidation conditions individually we conservatively always + // recompute it. If sampling isn't enabled this inlined function is just trivially setting + // combined_limit=alloc_limit. + pEEAllocContext->UpdateCombinedLimit(isRandomizedSamplingEnabled); + if (uFlags & GC_ALLOC_USER_OLD_HEAP) GCHeapUtilities::GetGCHeap()->PublishObject((uint8_t*)pObject); diff --git a/src/coreclr/nativeaot/Runtime/disabledeventtrace.cpp b/src/coreclr/nativeaot/Runtime/disabledeventtrace.cpp index f0944fdf295179..7ed5c94e4eb963 100644 --- a/src/coreclr/nativeaot/Runtime/disabledeventtrace.cpp +++ b/src/coreclr/nativeaot/Runtime/disabledeventtrace.cpp @@ -12,6 +12,11 @@ void EventTracing_Initialize() { } +bool IsRuntimeProviderEnabled(uint8_t level, uint64_t keyword) +{ + return false; +} + void ETW::GCLog::FireGcStart(ETW_GC_INFO * pGcInfo) { } #ifdef FEATURE_ETW diff --git a/src/coreclr/nativeaot/Runtime/eventpipe/gen-eventing-event-inc.lst b/src/coreclr/nativeaot/Runtime/eventpipe/gen-eventing-event-inc.lst index 901af659ff84b6..77c9d8cb15a3da 100644 --- a/src/coreclr/nativeaot/Runtime/eventpipe/gen-eventing-event-inc.lst +++ b/src/coreclr/nativeaot/Runtime/eventpipe/gen-eventing-event-inc.lst @@ -1,5 +1,6 @@ # Native runtime events supported by aot runtime. 
+AllocationSampled BGC1stConEnd BGC1stNonConEnd BGC1stSweepEnd diff --git a/src/coreclr/nativeaot/Runtime/eventtrace.cpp b/src/coreclr/nativeaot/Runtime/eventtrace.cpp index 8b3d134f5c4f24..4e3c80ed1f75d4 100644 --- a/src/coreclr/nativeaot/Runtime/eventtrace.cpp +++ b/src/coreclr/nativeaot/Runtime/eventtrace.cpp @@ -37,6 +37,11 @@ DOTNET_TRACE_CONTEXT MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_DOTNET_Con MICROSOFT_WINDOWS_DOTNETRUNTIME_PRIVATE_PROVIDER_EVENTPIPE_Context }; +bool IsRuntimeProviderEnabled(uint8_t level, uint64_t keyword) +{ + return RUNTIME_PROVIDER_CATEGORY_ENABLED(level, keyword); +} + volatile LONGLONG ETW::GCLog::s_l64LastClientSequenceNumber = 0; //--------------------------------------------------------------------------------------- @@ -245,4 +250,4 @@ void EventPipeEtwCallbackDotNETRuntimePrivate( _Inout_opt_ PVOID CallbackContext) { EtwCallbackCommon(DotNETRuntimePrivate, ControlCode, Level, MatchAnyKeyword, FilterData, true); -} \ No newline at end of file +} diff --git a/src/coreclr/nativeaot/Runtime/eventtracebase.h b/src/coreclr/nativeaot/Runtime/eventtracebase.h index 241c795c0d02fc..0e778cafb76571 100644 --- a/src/coreclr/nativeaot/Runtime/eventtracebase.h +++ b/src/coreclr/nativeaot/Runtime/eventtracebase.h @@ -30,6 +30,8 @@ void InitializeEventTracing(); #ifdef FEATURE_EVENT_TRACE +bool IsRuntimeProviderEnabled(uint8_t level, uint64_t keyword); + // !!!!!!! NOTE !!!!!!!! // The flags must match those in the ETW manifest exactly // !!!!!!! NOTE !!!!!!!! @@ -102,6 +104,7 @@ struct ProfilingScanContext; #define CLR_GCHEAPSURVIVALANDMOVEMENT_KEYWORD 0x400000 #define CLR_MANAGEDHEAPCOLLECT_KEYWORD 0x800000 #define CLR_GCHEAPANDTYPENAMES_KEYWORD 0x1000000 +#define CLR_ALLOCATIONSAMPLING_KEYWORD 0x80000000000 // // Using KEYWORDZERO means when checking the events category ignore the keyword diff --git a/src/coreclr/nativeaot/Runtime/gctoclreventsink.cpp b/src/coreclr/nativeaot/Runtime/gctoclreventsink.cpp index b8ffaad1ffe88c..3709a414fdf0e2 100644 --- a/src/coreclr/nativeaot/Runtime/gctoclreventsink.cpp +++ b/src/coreclr/nativeaot/Runtime/gctoclreventsink.cpp @@ -4,6 +4,7 @@ #include "common.h" #include "gctoclreventsink.h" #include "thread.h" +#include "eventtracebase.h" GCToCLREventSink g_gcToClrEventSink; @@ -174,6 +175,14 @@ void GCToCLREventSink::FireGCAllocationTick_V4(uint64_t allocationAmount, { LIMITED_METHOD_CONTRACT; +#ifdef FEATURE_EVENT_TRACE + if (IsRuntimeProviderEnabled(TRACE_LEVEL_INFORMATION, CLR_ALLOCATIONSAMPLING_KEYWORD)) + { + // skip AllocationTick if AllocationSampled is emitted + return; + } +#endif // FEATURE_EVENT_TRACE + void * typeId = GetLastAllocEEType(); WCHAR * name = nullptr; diff --git a/src/coreclr/nativeaot/Runtime/thread.cpp b/src/coreclr/nativeaot/Runtime/thread.cpp index b796b052182260..5437edd62b32ed 100644 --- a/src/coreclr/nativeaot/Runtime/thread.cpp +++ b/src/coreclr/nativeaot/Runtime/thread.cpp @@ -35,6 +35,13 @@ static Thread* g_RuntimeInitializingThread; #endif //!DACCESS_COMPILE +ee_alloc_context::PerThreadRandom::PerThreadRandom() +{ + minipal_xoshiro128pp_init(&random_state, (uint32_t)PalGetTickCount64()); +} + +thread_local ee_alloc_context::PerThreadRandom ee_alloc_context::t_random = PerThreadRandom(); + PInvokeTransitionFrame* Thread::GetTransitionFrame() { if (ThreadStore::GetSuspendingThread() == this) diff --git a/src/coreclr/nativeaot/Runtime/thread.h b/src/coreclr/nativeaot/Runtime/thread.h index 70f776de2ee9a1..c75ec9a5c4a722 100644 --- a/src/coreclr/nativeaot/Runtime/thread.h +++ 
b/src/coreclr/nativeaot/Runtime/thread.h
@@ -6,6 +6,7 @@
 
 #include "StackFrameIterator.h"
 #include "slist.h" // DefaultSListTraits
+#include <minipal/xoshiro128pp.h>
 
 struct gc_alloc_context;
 class RuntimeInstance;
@@ -113,7 +114,19 @@ struct ee_alloc_context
     gc_alloc_context* GetGCAllocContext();
     uint8_t* GetCombinedLimit();
-    void UpdateCombinedLimit();
+    void UpdateCombinedLimit(bool samplingEnabled);
+    static bool IsRandomizedSamplingEnabled();
+    static uint32_t ComputeGeometricRandom();
+
+    struct PerThreadRandom
+    {
+        minipal_xoshiro128pp random_state;
+
+        PerThreadRandom();
+        double NextDouble();
+    };
+
+    static thread_local PerThreadRandom t_random;
 };
diff --git a/src/coreclr/nativeaot/Runtime/thread.inl b/src/coreclr/nativeaot/Runtime/thread.inl
index 5c17da3e61f3f3..c692bbdeff95af 100644
--- a/src/coreclr/nativeaot/Runtime/thread.inl
+++ b/src/coreclr/nativeaot/Runtime/thread.inl
@@ -3,7 +3,9 @@
 
 #ifndef DACCESS_COMPILE
 
+#include "eventtracebase.h"
+const uint32_t SamplingDistributionMean = (100 * 1024);
 
 inline gc_alloc_context* ee_alloc_context::GetGCAllocContext()
 {
@@ -22,11 +24,53 @@ struct _thread_inl_gc_alloc_context
     uint8_t* alloc_limit;
 };
 
-inline void ee_alloc_context::UpdateCombinedLimit()
+
+inline bool ee_alloc_context::IsRandomizedSamplingEnabled()
+{
+#ifdef FEATURE_EVENT_TRACE
+    return IsRuntimeProviderEnabled(TRACE_LEVEL_INFORMATION, CLR_ALLOCATIONSAMPLING_KEYWORD);
+#else
+    return false;
+#endif // FEATURE_EVENT_TRACE
+}
+
+inline void ee_alloc_context::UpdateCombinedLimit(bool samplingEnabled)
+{
+    _thread_inl_gc_alloc_context* gc_alloc_context = (_thread_inl_gc_alloc_context*)GetGCAllocContext();
+    if (!samplingEnabled)
+    {
+        combined_limit = gc_alloc_context->alloc_limit;
+    }
+    else
+    {
+        // compute the next sampling budget based on a geometric distribution
+        size_t samplingBudget = ComputeGeometricRandom();
+
+        // if the sampling limit is larger than the allocation context, no sampling will occur in this AC
+        // We do Min() prior to adding to alloc_ptr to ensure alloc_ptr+samplingBudget doesn't cause an overflow.
+
+        size_t size = gc_alloc_context->alloc_limit - gc_alloc_context->alloc_ptr;
+        combined_limit = gc_alloc_context->alloc_ptr + min(samplingBudget, size);
+    }
+}
+
+inline uint32_t ee_alloc_context::ComputeGeometricRandom()
+{
+    // compute a random sample from the Geometric distribution.
+    double probability = t_random.NextDouble();
+    uint32_t threshold = (uint32_t)(-log(1 - probability) * SamplingDistributionMean);
+    return threshold;
+}
+
+// Returns a random double in the range [0, 1).
+inline double ee_alloc_context::PerThreadRandom::NextDouble()
 {
-    // The randomized allocation sampling feature is being submitted in stages. For now sampling is never enabled so
-    // combined_limit is always the same as alloc_limit.
- combined_limit = ((_thread_inl_gc_alloc_context*)GetGCAllocContext())->alloc_limit; + uint32_t value = minipal_xoshiro128pp_next(&random_state); + if(value == UINT32_MAX) + { + value--; + } + return value * (1.0/UINT32_MAX); } // Set the m_pDeferredTransitionFrame field for GC allocation helpers that setup transition frame diff --git a/src/coreclr/vm/ClrEtwAll.man b/src/coreclr/vm/ClrEtwAll.man index 265d7a07726cf6..1760a55766d54c 100644 --- a/src/coreclr/vm/ClrEtwAll.man +++ b/src/coreclr/vm/ClrEtwAll.man @@ -91,6 +91,8 @@ message="$(string.RuntimePublisher.ProfilerKeywordMessage)" symbol="CLR_PROFILER_KEYWORD" /> + @@ -461,7 +463,13 @@ - + + + + + @@ -3164,6 +3172,30 @@ + + @@ -4256,7 +4288,11 @@ keywords="WaitHandleKeyword" opcode="win:Stop" task="WaitHandleWait" symbol="WaitHandleWaitStop" message="$(string.RuntimePublisher.WaitHandleWaitStopEventMessage)"/> - + + @@ -8616,7 +8652,8 @@ - + + @@ -8791,7 +8828,8 @@ - + + @@ -9155,7 +9193,8 @@ - + + diff --git a/src/coreclr/vm/gcheaputilities.cpp b/src/coreclr/vm/gcheaputilities.cpp index 32ea33a91cc3ce..cffe9185d5e61b 100644 --- a/src/coreclr/vm/gcheaputilities.cpp +++ b/src/coreclr/vm/gcheaputilities.cpp @@ -43,6 +43,8 @@ bool g_sw_ww_enabled_for_gc_heap = false; GVAL_IMPL_INIT(ee_alloc_context, g_global_alloc_context, {}); +thread_local ee_alloc_context::PerThreadRandom ee_alloc_context::t_random = PerThreadRandom(); + enum GC_LOAD_STATUS { GC_LOAD_STATUS_BEFORE_START, GC_LOAD_STATUS_START, diff --git a/src/coreclr/vm/gcheaputilities.h b/src/coreclr/vm/gcheaputilities.h index bbefdc9cd7bc32..2a94d86d995ddf 100644 --- a/src/coreclr/vm/gcheaputilities.h +++ b/src/coreclr/vm/gcheaputilities.h @@ -4,7 +4,10 @@ #ifndef _GCHEAPUTILITIES_H_ #define _GCHEAPUTILITIES_H_ +#include "eventtracebase.h" #include "gcinterface.h" +#include "math.h" +#include // The singular heap instance. GPTR_DECL(IGCHeap, g_pGCHeap); @@ -13,6 +16,8 @@ GPTR_DECL(IGCHeap, g_pGCHeap); extern "C" { #endif // !DACCESS_COMPILE +const DWORD SamplingDistributionMean = (100 * 1024); + // This struct allows adding some state that is only visible to the EE onto the standard gc_alloc_context struct ee_alloc_context { @@ -66,13 +71,65 @@ struct ee_alloc_context return offsetof(ee_alloc_context, combined_limit); } - // Regenerate the randomized sampling limit and update the combined_limit field. - inline void UpdateCombinedLimit() + static inline bool IsRandomizedSamplingEnabled() + { +#ifdef FEATURE_EVENT_TRACE + return ETW_TRACING_CATEGORY_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, + TRACE_LEVEL_INFORMATION, + CLR_ALLOCATIONSAMPLING_KEYWORD); +#else + return false; +#endif // FEATURE_EVENT_TRACE + } + + inline void UpdateCombinedLimit(bool samplingEnabled) { - // The randomized sampling feature is being submitted in stages. At this point the sampling is never - // activated so combined_limit is always equal to alloc_limit. - combined_limit = gc_allocation_context.alloc_limit; + if (!samplingEnabled) + { + combined_limit = gc_allocation_context.alloc_limit; + } + else + { + // compute the next sampling budget based on a geometric distribution + size_t samplingBudget = ComputeGeometricRandom(); + + // if the sampling limit is larger than the allocation context, no sampling will occur in this AC + // We do Min() prior to adding to alloc_ptr to ensure alloc_ptr+samplingBudget doesn't cause an overflow. 
+        size_t size = gc_allocation_context.alloc_limit - gc_allocation_context.alloc_ptr;
+        combined_limit = gc_allocation_context.alloc_ptr + Min(samplingBudget, size);
+        }
+    }
+
+    static inline uint32_t ComputeGeometricRandom()
+    {
+        // compute a random sample from the Geometric distribution.
+        double probability = t_random.NextDouble();
+        uint32_t threshold = (uint32_t)(-log(1 - probability) * SamplingDistributionMean);
+        return threshold;
+    }
+
+    struct PerThreadRandom
+    {
+        minipal_xoshiro128pp random_state;
+
+        PerThreadRandom()
+        {
+            minipal_xoshiro128pp_init(&random_state, GetRandomInt(INT_MAX));
+        }
+
+        // Returns a random double in the range [0, 1).
+        double NextDouble()
+        {
+            uint32_t value = minipal_xoshiro128pp_next(&random_state);
+            if(value == UINT32_MAX)
+            {
+                value--;
+            }
+            return value * (1.0/UINT32_MAX);
+        }
+    };
+
+    static thread_local PerThreadRandom t_random;
 };
 
 GPTR_DECL(uint8_t,g_lowest_address);
diff --git a/src/coreclr/vm/gchelpers.cpp b/src/coreclr/vm/gchelpers.cpp
index cab5986ecd66a3..6e482b77e99f3e 100644
--- a/src/coreclr/vm/gchelpers.cpp
+++ b/src/coreclr/vm/gchelpers.cpp
@@ -183,6 +183,118 @@ inline void CheckObjectSize(size_t alloc_size)
     }
 }
 
+void FireAllocationSampled(GC_ALLOC_FLAGS flags, size_t size, size_t samplingBudgetOffset, Object* orObject)
+{
+    // Note: this code is duplicated from GCToCLREventSink::FireGCAllocationTick_V4
+    void* typeId = nullptr;
+    const WCHAR* name = nullptr;
+    InlineSString<MAX_CLASSNAME_LENGTH> strTypeName;
+    EX_TRY
+    {
+        TypeHandle th = GetThread()->GetTHAllocContextObj();
+
+        if (th != 0)
+        {
+            th.GetName(strTypeName);
+            name = strTypeName.GetUnicode();
+            typeId = th.GetMethodTable();
+        }
+    }
+    EX_CATCH{}
+    EX_END_CATCH(SwallowAllExceptions)
+    // end of duplication
+
+    if (typeId != nullptr)
+    {
+        unsigned int allocKind =
+            (flags & GC_ALLOC_PINNED_OBJECT_HEAP) ? 2 :
+            (flags & GC_ALLOC_LARGE_OBJECT_HEAP) ? 1 :
+            0; // SOH
+        unsigned int heapIndex = 0;
+#ifdef BACKGROUND_GC
+        gc_heap* hp = gc_heap::heap_of((BYTE*)orObject);
+        heapIndex = hp->heap_number;
+#endif
+        FireEtwAllocationSampled(allocKind, GetClrInstanceId(), typeId, name, heapIndex, (BYTE*)orObject, size, samplingBudgetOffset);
+    }
+}
+
+inline Object* Alloc(ee_alloc_context* pEEAllocContext, size_t size, GC_ALLOC_FLAGS flags)
+{
+    CONTRACTL {
+        THROWS;
+        GC_TRIGGERS;
+        MODE_COOPERATIVE; // returns an objref without pinning it => cooperative
+    } CONTRACTL_END;
+
+    Object* retVal = nullptr;
+    gc_alloc_context* pAllocContext = &pEEAllocContext->gc_allocation_context;
+    bool isSampled = false;
+    size_t availableSpace = 0;
+    size_t aligned_size = 0;
+    size_t samplingBudget = 0;
+    bool isRandomizedSamplingEnabled = ee_alloc_context::IsRandomizedSamplingEnabled();
+    if (isRandomizedSamplingEnabled)
+    {
+        // object allocations are always padded up to pointer size
+        aligned_size = AlignUp(size, sizeof(uintptr_t));
+
+        // The number bytes we can allocate before we need to emit a sampling event.
+        // This calculation is only valid if combined_limit < alloc_limit.
+        samplingBudget = (size_t)(pEEAllocContext->combined_limit - pAllocContext->alloc_ptr);
+
+        // The number of bytes available in the current allocation context
+        availableSpace = (size_t)(pAllocContext->alloc_limit - pAllocContext->alloc_ptr);
+
+        // Check to see if the allocated object overlaps a sampled byte
+        // in this AC. This happens when both:
+        // 1) The AC contains a sampled byte (combined_limit < alloc_limit)
+        // 2) The object is large enough to overlap it (samplingBudget < aligned_size)
+        //
+        // Note that the AC could have no remaining space for allocations (alloc_ptr =
+        // alloc_limit = combined_limit). When a thread hasn't done any SOH allocations
+        // yet it also starts in an empty state where alloc_ptr = alloc_limit =
+        // combined_limit = nullptr. The (1) check handles both of these situations
+        // properly as an empty AC can not have a sampled byte inside of it.
+        isSampled =
+            (pEEAllocContext->combined_limit < pAllocContext->alloc_limit) &&
+            (samplingBudget < aligned_size);
+
+        // if the object overflows the AC, we need to sample the remaining bytes
+        // the sampling budget only included at most the bytes inside the AC
+        if (aligned_size > availableSpace && !isSampled)
+        {
+            samplingBudget = ee_alloc_context::ComputeGeometricRandom() + availableSpace;
+            isSampled = (samplingBudget < aligned_size);
+        }
+    }
+
+    GCStress<gc_on_alloc>::MaybeTrigger(pAllocContext);
+
+    // for SOH, if there is enough space in the current allocation context, then
+    // the allocation will be done in place (like in the fast path),
+    // otherwise a new allocation context will be provided
+    retVal = GCHeapUtilities::GetGCHeap()->Alloc(pAllocContext, size, flags);
+
+    if (isSampled)
+    {
+        // At this point the object methodtable isn't initialized yet but it doesn't matter when we are
+        // just emitting an ETW/EventPipe event. If we want this event to be more useful from ICorProfiler
+        // in the future we probably want to pass the isSampled flag back to callers so that the event
+        // can be raised after the MethodTable is initialized.
+        FireAllocationSampled(flags, aligned_size, samplingBudget, retVal);
+    }
+
+    // There are a variety of conditions that may have invalidated the previous combined_limit value
+    // such as not allocating the object in the AC memory region (UOH allocations), moving the AC, adding
+    // extra alignment padding, allocating a new AC, or allocating an object that consumed the sampling budget.
+    // Rather than test for all the different invalidation conditions individually we conservatively always
+    // recompute it. If sampling isn't enabled this inlined function is just trivially setting
+    // combined_limit=alloc_limit.
+    pEEAllocContext->UpdateCombinedLimit(isRandomizedSamplingEnabled);
+
+    return retVal;
+}
 
 // There are only two ways to allocate an object.
 // * Call optimized helpers that were generated on the fly. This is how JIT compiled code does most
This is how JIT compiled code does most @@ -222,19 +334,13 @@ inline Object* Alloc(size_t size, GC_ALLOC_FLAGS flags) if (GCHeapUtilities::UseThreadAllocationContexts()) { - ee_alloc_context *threadContext = GetThreadEEAllocContext(); - GCStress::MaybeTrigger(&threadContext->gc_allocation_context); - retVal = GCHeapUtilities::GetGCHeap()->Alloc(&threadContext->gc_allocation_context, size, flags); - threadContext->UpdateCombinedLimit(); + retVal = Alloc(GetThreadEEAllocContext(), size, flags); } else { GlobalAllocLockHolder holder(&g_global_alloc_lock); - ee_alloc_context *globalContext = &g_global_alloc_context; - GCStress::MaybeTrigger(&globalContext->gc_allocation_context); - retVal = GCHeapUtilities::GetGCHeap()->Alloc(&globalContext->gc_allocation_context, size, flags); - globalContext->UpdateCombinedLimit(); + retVal = Alloc(&g_global_alloc_context, size, flags); } diff --git a/src/coreclr/vm/gctoclreventsink.cpp b/src/coreclr/vm/gctoclreventsink.cpp index fff929d51567a5..ce75e4cc661830 100644 --- a/src/coreclr/vm/gctoclreventsink.cpp +++ b/src/coreclr/vm/gctoclreventsink.cpp @@ -162,6 +162,16 @@ void GCToCLREventSink::FireGCAllocationTick_V4(uint64_t allocationAmount, { LIMITED_METHOD_CONTRACT; +#ifdef FEATURE_EVENT_TRACE + if (ETW_TRACING_CATEGORY_ENABLED(MICROSOFT_WINDOWS_DOTNETRUNTIME_PROVIDER_DOTNET_Context, + TRACE_LEVEL_INFORMATION, + CLR_ALLOCATIONSAMPLING_KEYWORD)) + { + // skip AllocationTick if AllocationSampled is emitted + return; + } +#endif // FEATURE_EVENT_TRACE + void * typeId = nullptr; const WCHAR * name = nullptr; InlineSString strTypeName; diff --git a/src/native/minipal/xoshiro128pp.c b/src/native/minipal/xoshiro128pp.c new file mode 100644 index 00000000000000..01e8e80696b3bd --- /dev/null +++ b/src/native/minipal/xoshiro128pp.c @@ -0,0 +1,81 @@ +// Licensed to the .NET Foundation under one or more agreements. +// The .NET Foundation licenses this file to you under the MIT license. + +#include + +// This code is a slightly modified version of the xoshiro128++ generator from http://prng.di.unimi.it/xoshiro128plusplus.c + +/* Written in 2019 by David Blackman and Sebastiano Vigna (vigna@acm.org) +To the extent possible under law, the author has dedicated all copyright +and related and neighboring rights to this software to the public domain +worldwide. + +See . +*/ + +static inline uint32_t rotl(const uint32_t x, int k) { + return (x << k) | (x >> (32 - k)); +} + +/* This is the jump function for the generator. It is equivalent + to 2^64 calls to next(); it can be used to generate 2^64 + non-overlapping subsequences for parallel computations. 
+*/
+
+static void jump(struct minipal_xoshiro128pp* pState) {
+    static const uint32_t JUMP[] = { 0x8764000b, 0xf542d2d3, 0x6fa035c3, 0x77f2db5b };
+
+    uint32_t* s = pState->s;
+    uint32_t s0 = 0;
+    uint32_t s1 = 0;
+    uint32_t s2 = 0;
+    uint32_t s3 = 0;
+    for (int i = 0; i < sizeof JUMP / sizeof * JUMP; i++)
+        for (int b = 0; b < 32; b++) {
+            if (JUMP[i] & UINT32_C(1) << b) {
+                s0 ^= s[0];
+                s1 ^= s[1];
+                s2 ^= s[2];
+                s3 ^= s[3];
+            }
+            minipal_xoshiro128pp_next(pState);
+        }
+
+    s[0] = s0;
+    s[1] = s1;
+    s[2] = s2;
+    s[3] = s3;
+}
+
+void minipal_xoshiro128pp_init(struct minipal_xoshiro128pp* pState, uint32_t seed) {
+    uint32_t* s = pState->s;
+    if (seed == 0)
+    {
+        seed = 997;
+    }
+
+    s[0] = seed;
+    s[1] = seed;
+    s[2] = seed;
+    s[3] = seed;
+    jump(pState);
+}
+
+uint32_t minipal_xoshiro128pp_next(struct minipal_xoshiro128pp* pState) {
+    uint32_t* s = pState->s;
+    const uint32_t result = rotl(s[0] + s[3], 7) + s[0];
+
+    const uint32_t t = s[1] << 9;
+
+    s[2] ^= s[0];
+    s[3] ^= s[1];
+    s[1] ^= s[2];
+    s[0] ^= s[3];
+
+    s[2] ^= t;
+
+    s[3] = rotl(s[3], 11);
+
+    return result;
+}
+
+
diff --git a/src/native/minipal/xoshiro128pp.h b/src/native/minipal/xoshiro128pp.h
new file mode 100644
index 00000000000000..485705115ca5b8
--- /dev/null
+++ b/src/native/minipal/xoshiro128pp.h
@@ -0,0 +1,26 @@
+// Licensed to the .NET Foundation under one or more agreements.
+// The .NET Foundation licenses this file to you under the MIT license.
+
+#ifndef HAVE_MINIPAL_XOSHIRO128PP_H
+#define HAVE_MINIPAL_XOSHIRO128PP_H
+
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C"
+{
+#endif // __cplusplus
+
+struct minipal_xoshiro128pp
+{
+    uint32_t s[4];
+};
+
+void minipal_xoshiro128pp_init(struct minipal_xoshiro128pp* pState, uint32_t seed);
+
+uint32_t minipal_xoshiro128pp_next(struct minipal_xoshiro128pp* pState);
+
+#ifdef __cplusplus
+}
+#endif // __cplusplus
+#endif /* HAVE_MINIPAL_XOSHIRO128PP_H */
diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/allocationsampling.cs b/src/tests/tracing/eventpipe/randomizedallocationsampling/allocationsampling.cs
new file mode 100644
index 00000000000000..1856bfc082cff9
--- /dev/null
+++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/allocationsampling.cs
@@ -0,0 +1,174 @@
+// Licensed to the .NET Foundation under one or more agreements.
+// The .NET Foundation licenses this file to you under the MIT license.
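+
+// Test outline: enable the AllocationSampling keyword (0x80000000000) on the
+// Microsoft-Windows-DotNETRuntime provider, allocate InstanceCount instances of Object128,
+// and verify that at least one AllocationSampled event (event ID 303) is received whose
+// decoded TypeName is "Tracing.Tests.Object128".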
+
+using System;
+using System.Collections.Generic;
+using System.Diagnostics;
+using System.Diagnostics.Tracing;
+using System.IO;
+using System.Linq;
+using System.Text;
+using System.Threading;
+using System.Threading.Tasks;
+using Microsoft.Diagnostics.Tracing;
+using Microsoft.Diagnostics.Tracing.Parsers.Clr;
+using Microsoft.Diagnostics.NETCore.Client;
+using Tracing.Tests.Common;
+using Xunit;
+
+namespace Tracing.Tests
+{
+    public class AllocationSamplingValidation
+    {
+        [Fact]
+        public static int TestEntryPoint()
+        {
+            // check that AllocationSampled events are generated and size + type name are correct
+            var ret = IpcTraceTest.RunAndValidateEventCounts(
+                new Dictionary<string, ExpectedEventCount>() { { "Microsoft-Windows-DotNETRuntime", -1 } },
+                _eventGeneratingActionForAllocations,
+                // AllocationSamplingKeyword (0x80000000000): 0b1000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000
+                new List<EventPipeProvider>() { new EventPipeProvider("Microsoft-Windows-DotNETRuntime", EventLevel.Informational, 0x80000000000) },
+                1024, _DoesTraceContainEnoughAllocationSampledEvents, enableRundownProvider: false);
+            if (ret != 100)
+                return ret;
+
+            return 100;
+        }
+
+        const int InstanceCount = 2000000;
+        const int MinExpectedEvents = 1;
+        static List<Object128> _objects128s = new List<Object128>(InstanceCount);
+
+        // allocate objects to trigger dynamic allocation sampling events
+        private static Action _eventGeneratingActionForAllocations = () =>
+        {
+            _objects128s.Clear();
+            for (int i = 0; i < InstanceCount; i++)
+            {
+                if ((i != 0) && (i % (InstanceCount/5) == 0))
+                    Logger.logger.Log($"Allocated {i} instances...");
+
+                Object128 obj = new Object128();
+                _objects128s.Add(obj);
+            }
+
+            Logger.logger.Log($"{_objects128s.Count} instances allocated");
+        };
+
+        private static Func<EventPipeEventSource, Func<int>> _DoesTraceContainEnoughAllocationSampledEvents = (source) =>
+        {
+            int AllocationSampledEvents = 0;
+            int Object128Count = 0;
+            source.Dynamic.All += (eventData) =>
+            {
+                if (eventData.ID == (TraceEventID)303) // AllocationSampled is not defined in TraceEvent yet
+                {
+                    AllocationSampledEvents++;
+
+                    AllocationSampledData payload = new AllocationSampledData(eventData, source.PointerSize);
+                    // uncomment to see the allocation events payload
+                    //Logger.logger.Log($"{payload.HeapIndex} - {payload.AllocationKind} | ({payload.ObjectSize}) {payload.TypeName} = 0x{payload.Address}");
+                    if (payload.TypeName == "Tracing.Tests.Object128")
+                    {
+                        Object128Count++;
+                    }
+                }
+            };
+            return () => {
+                Logger.logger.Log("AllocationSampled counts validation");
+                Logger.logger.Log("Nb events: " + AllocationSampledEvents);
+                Logger.logger.Log("Nb object128: " + Object128Count);
+                return (AllocationSampledEvents >= MinExpectedEvents) && (Object128Count != 0) ?
+                    100 : -1;
+            };
+        };
+    }
+
+    internal class Object0
+    {
+    }
+
+    internal class Object128 : Object0
+    {
+        private readonly UInt64 _x1;
+        private readonly UInt64 _x2;
+        private readonly UInt64 _x3;
+        private readonly UInt64 _x4;
+        private readonly UInt64 _x5;
+        private readonly UInt64 _x6;
+        private readonly UInt64 _x7;
+        private readonly UInt64 _x8;
+        private readonly UInt64 _x9;
+        private readonly UInt64 _x10;
+        private readonly UInt64 _x11;
+        private readonly UInt64 _x12;
+        private readonly UInt64 _x13;
+        private readonly UInt64 _x14;
+        private readonly UInt64 _x15;
+        private readonly UInt64 _x16;
+    }
+
+    // AllocationSampled is not defined in TraceEvent yet. Payload layout (in order):
+    //     AllocationKind      win:UInt32
+    //     ClrInstanceID       win:UInt16
+    //     TypeID              win:Pointer
+    //     TypeName            win:UnicodeString
+    //     HeapIndex           win:UInt32
+    //     Address             win:Pointer
+    //     ObjectSize          win:UInt64
+    //     SampledByteOffset   win:UInt64
+    class AllocationSampledData
+    {
+        const int EndOfStringCharLength = 2;
+        private TraceEvent _payload;
+        private int _pointerSize;
+        public AllocationSampledData(TraceEvent payload, int pointerSize)
+        {
+            _payload = payload;
+            _pointerSize = pointerSize;
+            TypeName = "?";
+
+            ComputeFields();
+        }
+
+        public GCAllocationKind AllocationKind;
+        public int ClrInstanceID;
+        public UInt64 TypeID;
+        public string TypeName;
+        public int HeapIndex;
+        public UInt64 Address;
+        public long ObjectSize;
+        public long SampledByteOffset;
+
+        private void ComputeFields()
+        {
+            int offsetBeforeString = 4 + 2 + _pointerSize;
+
+            Span<byte> data = _payload.EventData().AsSpan();
+            AllocationKind = (GCAllocationKind)BitConverter.ToInt32(data.Slice(0, 4));
+            ClrInstanceID = BitConverter.ToInt16(data.Slice(4, 2));
+            if (_pointerSize == 4)
+            {
+                TypeID = BitConverter.ToUInt32(data.Slice(6, _pointerSize));
+            }
+            else
+            {
+                TypeID = BitConverter.ToUInt64(data.Slice(6, _pointerSize));
+            }
+            TypeName = Encoding.Unicode.GetString(data.Slice(offsetBeforeString, _payload.EventDataLength - offsetBeforeString - EndOfStringCharLength - 4 - _pointerSize - 8 - 8));
+            HeapIndex = BitConverter.ToInt32(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength, 4));
+            if (_pointerSize == 4)
+            {
+                Address = BitConverter.ToUInt32(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength + 4, _pointerSize));
+            }
+            else
+            {
+                Address = BitConverter.ToUInt64(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength + 4, _pointerSize));
+            }
+            ObjectSize = BitConverter.ToInt64(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength + 4 + _pointerSize, 8));
+            SampledByteOffset = BitConverter.ToInt64(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength + 4 + _pointerSize + 8, 8));
+        }
+    }
+}
diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/allocationsampling.csproj b/src/tests/tracing/eventpipe/randomizedallocationsampling/allocationsampling.csproj
new file mode 100644
index 00000000000000..d89955bb638d4c
--- /dev/null
+++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/allocationsampling.csproj
@@ -0,0 +1,25 @@
+ + + true + .NETCoreApp + true + true + + true + + + true + + + + + guard + + + + + + + +
diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/Allocate.csproj b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/Allocate.csproj
new file mode 100644
index 00000000000000..21114b11099d68
--- /dev/null
+++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/Allocate.csproj
@@ -0,0 +1,8 @@
+<Project Sdk="Microsoft.NET.Sdk">
+
+  <PropertyGroup>
+    <OutputType>Exe</OutputType>
+    <TargetFramework>net8.0</TargetFramework>
+  </PropertyGroup>
+
+</Project>
diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateArraysOfDoubles.cs
b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateArraysOfDoubles.cs new file mode 100644 index 00000000000000..ee18309c547acf --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateArraysOfDoubles.cs @@ -0,0 +1,21 @@ +using System; +using System.Collections.Generic; +using System.Linq; + +namespace Allocate +{ + public class AllocateArraysOfDoubles : IAllocations + { + public void Allocate(int count) + { + List arrays = new List(count); + + for (int i = 0; i < count; i++) + { + arrays.Add(new double[1] { i }); + } + + Console.WriteLine($"Sum {arrays.Count} arrays of one double = {arrays.Sum(doubles => doubles[0])}"); + } + } +} diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateDifferentTypes.cs b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateDifferentTypes.cs new file mode 100644 index 00000000000000..8dfaecb0cf3509 --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateDifferentTypes.cs @@ -0,0 +1,49 @@ +using System; +using System.Collections.Generic; + +namespace Allocate +{ + public class AllocateDifferentTypes : IAllocations + { + public void Allocate(int count) + { + List objects = new List(count); + + for (int i = 0; i < count; i++) + { + objects.Add(new string('c', 37)); + objects.Add(new WithFinalizer(i)); + objects.Add(new byte[173]); + int[,] matrix = { { 1, 2 }, { 3, 4 }, { 5, 6 }, { 7, 8 } }; + objects.Add(matrix); + } + + Console.WriteLine($"{objects.Count} objects"); + } + } + + public class WithFinalizer + { + private static int _counter; + + private readonly UInt16 _x1; + private readonly UInt16 _x2; + private readonly UInt16 _x3; + + public static int Counter => _counter; + + public WithFinalizer(int id) + { + _counter++; + + _x1 = (UInt16)(id % 10); + _x2 = (UInt16)(id % 100); + _x3 = (UInt16)(id % 1000); + } + + ~WithFinalizer() + { + _counter--; + } + } +} diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateRatioSizedArrays.cs b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateRatioSizedArrays.cs new file mode 100644 index 00000000000000..5af08b3991593f --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateRatioSizedArrays.cs @@ -0,0 +1,44 @@ +using System; +using System.Collections.Generic; + +namespace Allocate +{ + public class AllocateRatioSizedArrays : IAllocations + { + public void Allocate(int count) + { + // We can't keep the objects in memory, just keep their size + List sizes= new List(count * 5); + + var gcCount = GC.CollectionCount(0); + + for (int i = 0; i < count; i++) + { + var bytes1 = new byte[1024]; + bytes1[1] = 1; + sizes.Add(bytes1.Length); + var bytes2 = new byte[10240]; + bytes2[2] = 2; + sizes.Add(bytes2.Length); + var bytes3 = new byte[102400]; + bytes3[3] = 3; + sizes.Add(bytes3.Length); + var bytes4 = new byte[1024000]; + bytes4[4] = 4; + sizes.Add(bytes4.Length); + var bytes5 = new byte[10240000]; + bytes5[5] = 5; + sizes.Add(bytes5.Length); + } + + Console.WriteLine($"+ {GC.CollectionCount(0) - gcCount} collections"); + + long totalAllocated = 0; + foreach (int size in sizes) + { + totalAllocated += size; + } + Console.WriteLine($"{sizes.Count} arrays for {totalAllocated / 1024} KB"); + } + } +} diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateSmallAndBig.cs 
b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateSmallAndBig.cs new file mode 100644 index 00000000000000..5f8660be6a74d3 --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocateSmallAndBig.cs @@ -0,0 +1,180 @@ +#pragma warning disable CS0169 // Remove unused private members +#pragma warning disable IDE0049 // Simplify Names + +using System; +using System.Collections.Generic; + +namespace Allocate +{ + public class AllocateSmallAndBig : IAllocations + { + public void Allocate(int count) + { + Dictionary allocations = Initialize(); + List objects = new List(1024 * 1024); + + AllocateSmallThenBig(count/2, objects, allocations); + Console.WriteLine(); + AllocateBigThenSmall(count/2, objects, allocations); + Console.WriteLine(); + } + + private void AllocateSmallThenBig(int count, List objects, Dictionary allocations) + { + for (int i = 0; i < count; i++) + { + // allocate from smaller to larger + objects.Add(new Object24()); + objects.Add(new Object32()); + objects.Add(new Object48()); + objects.Add(new Object80()); + objects.Add(new Object144()); + } + + allocations[nameof(Object24)].Count = count; + allocations[nameof(Object24)].Size = count * 24; + allocations[nameof(Object32)].Count = count; + allocations[nameof(Object32)].Size = count * 32; + allocations[nameof(Object48)].Count = count; + allocations[nameof(Object48)].Size = count * 48; + allocations[nameof(Object80)].Count = count; + allocations[nameof(Object80)].Size = count * 80; + allocations[nameof(Object144)].Count = count; + allocations[nameof(Object144)].Size = count * 144; + + DumpAllocations(allocations); + Clear(allocations); + objects.Clear(); + } + + private void AllocateBigThenSmall(int count, List objects, Dictionary allocations) + { + for (int i = 0; i < count; i++) + { + // allocate from larger to smaller + objects.Add(new Object144()); + objects.Add(new Object80()); + objects.Add(new Object48()); + objects.Add(new Object32()); + objects.Add(new Object24()); + } + + allocations[nameof(Object24)].Count = count; + allocations[nameof(Object24)].Size = count * 24; + allocations[nameof(Object32)].Count = count; + allocations[nameof(Object32)].Size = count * 32; + allocations[nameof(Object48)].Count = count; + allocations[nameof(Object48)].Size = count * 48; + allocations[nameof(Object80)].Count = count; + allocations[nameof(Object80)].Size = count * 80; + allocations[nameof(Object144)].Count = count; + allocations[nameof(Object144)].Size = count * 144; + + DumpAllocations(allocations); + Clear(allocations); + objects.Clear(); + } + + private Dictionary Initialize() + { + var allocations = new Dictionary(16); + allocations[nameof(Object24)] = new AllocStats(); + allocations[nameof(Object32)] = new AllocStats(); + allocations[nameof(Object48)] = new AllocStats(); + allocations[nameof(Object80)] = new AllocStats(); + allocations[nameof(Object144)] = new AllocStats(); + + Clear(allocations); + return allocations; + } + + private void Clear(Dictionary allocations) + { + allocations[nameof(Object24)].Count = 0; + allocations[nameof(Object24)].Size = 0; + allocations[nameof(Object32)].Count = 0; + allocations[nameof(Object32)].Size = 0; + allocations[nameof(Object48)].Count = 0; + allocations[nameof(Object48)].Size = 0; + allocations[nameof(Object80)].Count = 0; + allocations[nameof(Object80)].Size = 0; + allocations[nameof(Object144)].Count = 0; + allocations[nameof(Object144)].Size = 0; + } + + private void DumpAllocations(Dictionary objects) + { + 
Console.WriteLine("Allocations start"); + foreach (var allocation in objects) + { + Console.WriteLine($"{allocation.Key}={allocation.Value.Count},{allocation.Value.Size}"); + } + + Console.WriteLine("Allocations end"); + } + + internal class AllocStats + { + public int Count { get; set; } + public long Size { get; set; } + } + + internal class Object0 + { + } + + internal class Object24 : Object0 + { + private readonly UInt32 _x1; + private readonly UInt32 _x2; + } + + internal class Object32 : Object0 + { + private readonly UInt64 _x1; + private readonly UInt64 _x2; + } + + internal class Object48 : Object0 + { + private readonly UInt64 _x1; + private readonly UInt64 _x2; + private readonly UInt64 _x3; + private readonly UInt64 _x4; + } + + internal class Object80 : Object0 + { + private readonly UInt64 _x1; + private readonly UInt64 _x2; + private readonly UInt64 _x3; + private readonly UInt64 _x4; + private readonly UInt64 _x5; + private readonly UInt64 _x6; + private readonly UInt64 _x7; + private readonly UInt64 _x8; + } + + internal class Object144 : Object0 + { + private readonly UInt64 _x1; + private readonly UInt64 _x2; + private readonly UInt64 _x3; + private readonly UInt64 _x4; + private readonly UInt64 _x5; + private readonly UInt64 _x6; + private readonly UInt64 _x7; + private readonly UInt64 _x8; + private readonly UInt64 _x9; + private readonly UInt64 _x10; + private readonly UInt64 _x11; + private readonly UInt64 _x12; + private readonly UInt64 _x13; + private readonly UInt64 _x14; + private readonly UInt64 _x15; + private readonly UInt64 _x16; + } + } +} +#pragma warning restore IDE0049 // Simplify Names +#pragma warning restore CS0169 // Remove unused private members \ No newline at end of file diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocationsRunEventSource.cs b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocationsRunEventSource.cs new file mode 100644 index 00000000000000..ee21414d2c2ddb --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/AllocationsRunEventSource.cs @@ -0,0 +1,34 @@ +using System.Diagnostics.Tracing; + +namespace Allocate +{ + [EventSource(Name = "Allocations-Run")] + public class AllocationsRunEventSource : EventSource + { + public static readonly AllocationsRunEventSource Log = new AllocationsRunEventSource(); + + [Event(600, Level = EventLevel.Informational)] + public void StartRun(int iterationsCount, int allocationCount, string listOfTypes) + { + WriteEvent(eventId: 600, iterationsCount, allocationCount, listOfTypes); + } + + [Event(601, Level = EventLevel.Informational)] + public void StopRun() + { + WriteEvent(eventId: 601); + } + + [Event(602, Level = EventLevel.Informational)] + public void StartIteration(int iteration) + { + WriteEvent(eventId: 602, iteration); + } + + [Event(603, Level = EventLevel.Informational)] + public void StopIteration(int iteration) + { + WriteEvent(eventId: 603, iteration); + } + } +} diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/IAllocations.cs b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/IAllocations.cs new file mode 100644 index 00000000000000..3ee00f39adcdfe --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/IAllocations.cs @@ -0,0 +1,8 @@ + +namespace Allocate +{ + public interface IAllocations + { + public void Allocate(int count); + } +} diff --git 
a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/Program.cs b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/Program.cs new file mode 100644 index 00000000000000..f7220a11289752 --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/Program.cs @@ -0,0 +1,133 @@ +using System; +using System.Diagnostics; + +namespace Allocate +{ + public enum Scenario + { + SmallAndBig = 1, + PerThread = 2, + ArrayOfDouble = 3, + FinalizerAndArraysAndStrings = 4, + RatioSizedArrays = 5, + } + + + internal class Program + { + static void Main(string[] args) + { + if (args.Length < 1) + { + Console.WriteLine("Usage: Allocate --scenario (1|2|3|4|5) [--iterations (number of iterations)] [--allocations (allocations count)]"); + Console.WriteLine(" 1: small and big allocations"); + Console.WriteLine(" 2: allocations per thread"); + Console.WriteLine(" 3: arrays of double (for x86)"); + Console.WriteLine(" 4: different types of objects"); + Console.WriteLine(" 5: ratio sized arrays"); + return; + } + ParseCommandLine(args, out Scenario scenario, out int allocationsCount, out int iterations); + + IAllocations allocationsRun = null; + string allocatedTypes = string.Empty; + + switch(scenario) + { + case Scenario.SmallAndBig: + allocationsRun = new AllocateSmallAndBig(); + allocatedTypes = "Object24;Object32;Object48;Object80;Object144"; + break; + case Scenario.PerThread: + allocationsRun = new ThreadedAllocations(); + allocatedTypes = "Object24;Object48;Object72;Object32;Object64;Object96"; + break; + case Scenario.ArrayOfDouble: + allocationsRun = new AllocateArraysOfDoubles(); + allocatedTypes = "System.Double[]"; + break; + case Scenario.FinalizerAndArraysAndStrings: + allocationsRun = new AllocateDifferentTypes(); + allocatedTypes = "System.String;Allocate.WithFinalizer;System.Byte[]"; + break; + case Scenario.RatioSizedArrays: + allocationsRun = new AllocateRatioSizedArrays(); + allocatedTypes = "System.Byte[]"; + break; + default: + Console.WriteLine($"Invalid scenario: '{scenario}'"); + return; + } + + Console.WriteLine($"pid = {Process.GetCurrentProcess().Id}"); + Console.ReadLine(); + + if (allocationsRun != null) + { + Stopwatch clock = new Stopwatch(); + clock.Start(); + + AllocationsRunEventSource.Log.StartRun(iterations, allocationsCount, allocatedTypes); + for (int i = 0; i < iterations; i++) + { + AllocationsRunEventSource.Log.StartIteration(i); + allocationsRun.Allocate(allocationsCount); + AllocationsRunEventSource.Log.StopIteration(i); + } + AllocationsRunEventSource.Log.StopRun(); + + clock.Stop(); + Console.WriteLine($"Duration = {clock.ElapsedMilliseconds} ms"); + } + } + + private static void ParseCommandLine(string[] args, out Scenario scenario, out int allocationsCount, out int iterations) + { + iterations = 100; + allocationsCount = 1_000_000; + scenario = Scenario.SmallAndBig; + + for (int i = 0; i < args.Length; i++) + { + string arg = args[i]; + + if ("--scenario".Equals(arg, StringComparison.OrdinalIgnoreCase)) + { + int valueOffset = i + 1; + if (valueOffset < args.Length && int.TryParse(args[valueOffset], out var number)) + { + scenario = (Scenario)number; + } + } + else + if ("--iterations".Equals(arg, StringComparison.OrdinalIgnoreCase)) + { + int valueOffset = i + 1; + if (valueOffset < args.Length && int.TryParse(args[valueOffset], out var number)) + { + if (number <= 0) + { + throw new ArgumentOutOfRangeException($"Invalid iterations count '{number}': must be > 0"); + } + + iterations = 
number; + } + } + else + if ("--allocations".Equals(arg, StringComparison.OrdinalIgnoreCase)) + { + int valueOffset = i + 1; + if (valueOffset < args.Length && int.TryParse(args[valueOffset], out var number)) + { + if (number <= 0) + { + throw new ArgumentOutOfRangeException($"Invalid numbers of allocations '{number}: must be > 0"); + } + + allocationsCount = number; + } + } + } + } + } +} diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/ThreadedAllocations.cs b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/ThreadedAllocations.cs new file mode 100644 index 00000000000000..8172a19a9fa822 --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/Allocate/ThreadedAllocations.cs @@ -0,0 +1,176 @@ +#pragma warning disable CS0169 // Remove unused private members +#pragma warning disable IDE0049 // Simplify Names + +using System; +using System.Collections.Generic; +using System.Threading; + +namespace Allocate +{ + public class ThreadedAllocations : IAllocations + { + public void Allocate(int count) + { + List objects1 = new List(1024 * 1024); + List objects2 = new List(1024 * 1024); + + Thread[] threads = new Thread[2]; + threads[0] = new Thread(() => Allocate1(count, objects1)); + threads[1] = new Thread(() => Allocate2(count, objects2)); + + for (int i = 0; i < threads.Length; i++) { threads[i].Start(); } + for (int i = 0; i < threads.Length; i++) { threads[i].Join(); } + + Console.WriteLine($"Allocated {objects1.Count + objects2.Count} objects"); + } + + private void Allocate1(int count, List objects) + { + for (int i = 0; i < count; i++) + { + objects.Add(new Object24()); + objects.Add(new Object48()); + objects.Add(new Object72()); + } + } + + private void Allocate2(int count, List objects) + { + for (int i = 0; i < count; i++) + { + objects.Add(new Object32()); + objects.Add(new Object64()); + objects.Add(new Object96()); + } + } + + internal class Object0 + { + } + + internal class Object24 : Object0 + { + private readonly UInt16 _x1; + private readonly UInt16 _x2; + private readonly UInt16 _x3; + } + + internal class Object32 : Object0 + { + private readonly UInt16 _x1; + private readonly UInt16 _x2; + private readonly UInt16 _x3; + private readonly UInt16 _x4; + private readonly UInt16 _x5; + private readonly UInt16 _x6; + private readonly UInt16 _x7; + } + + internal class Object48 : Object0 + { + private readonly UInt16 _x1; + private readonly UInt16 _x2; + private readonly UInt16 _x3; + private readonly UInt16 _x4; + private readonly UInt16 _x5; + private readonly UInt16 _x6; + private readonly UInt16 _x7; + private readonly UInt16 _x8; + private readonly UInt16 _x9; + private readonly UInt16 _x10; + private readonly UInt16 _x11; + private readonly UInt16 _x12; + private readonly UInt16 _x13; + private readonly UInt16 _x14; + private readonly UInt16 _x15; + } + + internal class Object64 : Object0 + { + private readonly UInt16 _x1; + private readonly UInt16 _x2; + private readonly UInt16 _x3; + private readonly UInt16 _x4; + private readonly UInt16 _x5; + private readonly UInt16 _x6; + private readonly UInt16 _x7; + private readonly UInt16 _x8; + private readonly UInt16 _x9; + private readonly UInt16 _x10; + private readonly UInt16 _x11; + private readonly UInt16 _x12; + private readonly UInt16 _x13; + private readonly UInt16 _x14; + private readonly UInt16 _x15; + private readonly UInt16 _x16; + private readonly UInt16 _x17; + private readonly UInt16 _x18; + private readonly UInt16 _x19; + 
private readonly UInt16 _x20; + private readonly UInt16 _x21; + private readonly UInt16 _x22; + private readonly UInt16 _x23; + private readonly UInt16 _x24; + } + + internal class Object72 : Object0 + { + private readonly UInt16 _x1; + private readonly UInt16 _x2; + private readonly UInt16 _x3; + private readonly UInt16 _x4; + private readonly UInt16 _x5; + private readonly UInt16 _x6; + private readonly UInt16 _x7; + private readonly UInt16 _x8; + private readonly UInt16 _x9; + private readonly UInt16 _x10; + private readonly UInt16 _x11; + private readonly UInt16 _x12; + private readonly UInt16 _x13; + private readonly UInt16 _x14; + private readonly UInt16 _x15; + private readonly UInt16 _x16; + private readonly UInt16 _x17; + private readonly UInt16 _x18; + private readonly UInt16 _x19; + private readonly UInt16 _x20; + private readonly UInt16 _x21; + private readonly UInt16 _x22; + private readonly UInt16 _x23; + private readonly UInt16 _x24; + private readonly UInt16 _x25; + private readonly UInt16 _x26; + private readonly UInt16 _x27; + private readonly UInt16 _x28; + } + + internal class Object96 : Object0 + { + private readonly UInt32 _x1; + private readonly UInt32 _x2; + private readonly UInt32 _x3; + private readonly UInt32 _x4; + private readonly UInt32 _x5; + private readonly UInt32 _x6; + private readonly UInt32 _x7; + private readonly UInt32 _x8; + private readonly UInt32 _x9; + private readonly UInt32 _x10; + private readonly UInt32 _x11; + private readonly UInt32 _x12; + private readonly UInt32 _x13; + private readonly UInt32 _x14; + private readonly UInt32 _x15; + private readonly UInt32 _x16; + private readonly UInt32 _x17; + private readonly UInt32 _x18; + private readonly UInt32 _x19; + private readonly UInt32 _x20; + } + } +} + + +#pragma warning restore IDE0049 // Simplify Names +#pragma warning restore CS0169 // Remove unused private members \ No newline at end of file diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler.sln b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler.sln new file mode 100644 index 00000000000000..5da918e8dd3ea6 --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler.sln @@ -0,0 +1,51 @@ + +Microsoft Visual Studio Solution File, Format Version 12.00 +# Visual Studio Version 17 +VisualStudioVersion = 17.9.34616.47 +MinimumVisualStudioVersion = 10.0.40219.1 +Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "Allocate", "Allocate\Allocate.csproj", "{883FD439-6B92-421F-A68B-D22FFC21BF0A}" +EndProject +Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "AllocationProfiler", "AllocationProfiler\AllocationProfiler.csproj", "{D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}" +EndProject +Global + GlobalSection(SolutionConfigurationPlatforms) = preSolution + Debug|Any CPU = Debug|Any CPU + Debug|x64 = Debug|x64 + Debug|x86 = Debug|x86 + Release|Any CPU = Release|Any CPU + Release|x64 = Release|x64 + Release|x86 = Release|x86 + EndGlobalSection + GlobalSection(ProjectConfigurationPlatforms) = postSolution + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Debug|Any CPU.ActiveCfg = Debug|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Debug|Any CPU.Build.0 = Debug|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Debug|x64.ActiveCfg = Debug|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Debug|x64.Build.0 = Debug|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Debug|x86.ActiveCfg = Debug|Any CPU + 
{883FD439-6B92-421F-A68B-D22FFC21BF0A}.Debug|x86.Build.0 = Debug|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Release|Any CPU.ActiveCfg = Release|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Release|Any CPU.Build.0 = Release|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Release|x64.ActiveCfg = Release|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Release|x64.Build.0 = Release|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Release|x86.ActiveCfg = Release|Any CPU + {883FD439-6B92-421F-A68B-D22FFC21BF0A}.Release|x86.Build.0 = Release|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Debug|Any CPU.ActiveCfg = Debug|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Debug|Any CPU.Build.0 = Debug|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Debug|x64.ActiveCfg = Debug|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Debug|x64.Build.0 = Debug|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Debug|x86.ActiveCfg = Debug|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Debug|x86.Build.0 = Debug|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Release|Any CPU.ActiveCfg = Release|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Release|Any CPU.Build.0 = Release|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Release|x64.ActiveCfg = Release|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Release|x64.Build.0 = Release|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Release|x86.ActiveCfg = Release|Any CPU + {D1BD9F0F-CEF9-4EED-9730-AFF7902FC0BA}.Release|x86.Build.0 = Release|Any CPU + EndGlobalSection + GlobalSection(SolutionProperties) = preSolution + HideSolutionNode = FALSE + EndGlobalSection + GlobalSection(ExtensibilityGlobals) = postSolution + SolutionGuid = {64F6D2D8-C43C-41D5-8CEA-8F45ADF2EC6C} + EndGlobalSection +EndGlobal diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler/AllocationProfiler.csproj b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler/AllocationProfiler.csproj new file mode 100644 index 00000000000000..b8bb76d4257cc4 --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler/AllocationProfiler.csproj @@ -0,0 +1,13 @@ + + + + Exe + net8.0 + + + + + + + + diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler/Program.cs b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler/Program.cs new file mode 100644 index 00000000000000..7d0978c0159c2c --- /dev/null +++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/AllocationProfiler/Program.cs @@ -0,0 +1,474 @@ +using Microsoft.Diagnostics.NETCore.Client; +using Microsoft.Diagnostics.Tracing.Parsers; +using Microsoft.Diagnostics.Tracing; +using System.Diagnostics.Tracing; +using Microsoft.Diagnostics.Tracing.Parsers.Clr; +using System.Text; +using System.Runtime.CompilerServices; +using System.Collections.Generic; +using System; +using System.Threading.Tasks; +using System.Threading; +using System.Linq; + +namespace DynamicAllocationSampling +{ + internal class TypeInfo + { + public string TypeName = "?"; + public int Count; + public long Size; + public long TotalSize; + public long RemainderSize; + + public override int GetHashCode() + { + return (TypeName+Size).GetHashCode(); + } + + public override bool Equals(object obj) + { + if (obj == null) + { + return false; + } + + if (!(obj is TypeInfo)) + { + return false; + } + + return (TypeName+Size).Equals(((TypeInfo)obj).TypeName+Size); + } + } + + internal class 
Program + { + private static Dictionary _sampledTypes = new Dictionary(); + private static Dictionary _tickTypes = new Dictionary(); + private static List> _sampledTypesInRun = null; + private static List> _tickTypesInRun = null; + private static int _allocationsCount = 0; + private static List _allocatedTypes = new List(); + private static EventPipeEventSource _source; + + static void Main(string[] args) + { + if (args.Length == 0) + { + Console.WriteLine("No process ID specified"); + return; + } + + int pid = -1; + if (!int.TryParse(args[0], out pid)) + { + Console.WriteLine($"Invalid specified process ID '{args[0]}'"); + return; + } + + try + { + PrintEventsLive(pid); + } + catch (Exception x) + { + Console.WriteLine(x.Message); + } + } + + + public static void PrintEventsLive(int processId) + { + var providers = new List() + { + new EventPipeProvider( + "Microsoft-Windows-DotNETRuntime", + EventLevel.Verbose, // verbose is required for AllocationTick + (long)0x80000000001 // new AllocationSamplingKeyword + GCKeyword + ), + new EventPipeProvider( + "Allocations-Run", + EventLevel.Informational + ), + }; + var client = new DiagnosticsClient(processId); + + using (var session = client.StartEventPipeSession(providers, false)) + { + Console.WriteLine(); + + Task streamTask = Task.Run(() => + { + var source = new EventPipeEventSource(session.EventStream); + _source = source; + + ClrTraceEventParser clrParser = new ClrTraceEventParser(source); + clrParser.GCAllocationTick += OnAllocationTick; + source.Dynamic.All += OnEvents; + + try + { + source.Process(); + } + catch (Exception e) + { + Console.WriteLine($"Error encountered while processing events: {e.Message}"); + } + }); + + Task inputTask = Task.Run(() => + { + while (Console.ReadKey().Key != ConsoleKey.Enter) + { + Thread.Sleep(100); + } + session.Stop(); + }); + + Task.WaitAny(streamTask, inputTask); + } + + // not all cases are emitting allocations run events + if ((_sampledTypesInRun == null) && (_sampledTypes.Count > 0)) + { + ShowIterationResults(); + } + } + + private const long SAMPLING_MEAN = 100 * 1024; + private const double SAMPLING_RATIO = 0.999990234375 / 0.000009765625; + private static long UpscaleSize(long totalSize, int count, long mean, long sizeRemainder) + { + //// This is the Poisson process based scaling + //var averageSize = (double)totalSize / (double)count; + //var scale = 1 / (1 - Math.Exp(-averageSize / mean)); + //return (long)(totalSize * scale); + + // use the upscaling method detailed in the PR + // = sq/p + u + // s = # of samples for a type + // q = 1 - 1/102400 + // p = 1/102400 + // u = sum of object remainders = Sum(object_size - sampledByteOffset) for all samples + return (long)(SAMPLING_RATIO * count + sizeRemainder); + } + + private static void OnAllocationTick(GCAllocationTickTraceData payload) + { + // skip unexpected types + if (!_allocatedTypes.Contains(payload.TypeName)) return; + + if (!_tickTypes.TryGetValue(payload.TypeName + payload.ObjectSize, out TypeInfo typeInfo)) + { + typeInfo = new TypeInfo() { TypeName = payload.TypeName, Count = 0, Size = payload.ObjectSize, TotalSize = 0 }; + _tickTypes.Add(payload.TypeName + payload.ObjectSize, typeInfo); + } + typeInfo.Count++; + typeInfo.TotalSize += (int)payload.ObjectSize; + } + + private static void OnEvents(TraceEvent eventData) + { + if (eventData.ID == (TraceEventID)303) + { + AllocationSampledData payload = new AllocationSampledData(eventData, _source.PointerSize); + + // skip unexpected types + if 
(!_allocatedTypes.Contains(payload.TypeName)) return; + + if (!_sampledTypes.TryGetValue(payload.TypeName+payload.ObjectSize, out TypeInfo typeInfo)) + { + typeInfo = new TypeInfo() { TypeName = payload.TypeName, Count = 0, Size = (int)payload.ObjectSize, TotalSize = 0, RemainderSize = payload.ObjectSize - payload.SampledByteOffset }; + _sampledTypes.Add(payload.TypeName + payload.ObjectSize, typeInfo); + } + typeInfo.Count++; + typeInfo.TotalSize += (int)payload.ObjectSize; + typeInfo.RemainderSize += (payload.ObjectSize - payload.SampledByteOffset); + + return; + } + + if (eventData.ID == (TraceEventID)600) + { + AllocationsRunData payload = new AllocationsRunData(eventData); + Console.WriteLine($"> starts {payload.Iterations} iterations allocating {payload.Count} instances"); + + _sampledTypesInRun = new List>(payload.Iterations); + _tickTypesInRun = new List>(payload.Iterations); + _allocationsCount = payload.Count; + string allocatedTypes = payload.AllocatedTypes; + if (allocatedTypes.Length > 0) + { + _allocatedTypes = allocatedTypes.Split(';').ToList(); + } + + return; + } + + if (eventData.ID == (TraceEventID)601) + { + Console.WriteLine("\n< run stops\n"); + + ShowRunResults(); + return; + } + + if (eventData.ID == (TraceEventID)602) + { + AllocationsRunIterationData payload = new AllocationsRunIterationData(eventData); + Console.Write($"{payload.Iteration}"); + + _sampledTypes.Clear(); + _tickTypes.Clear(); + return; + } + + if (eventData.ID == (TraceEventID)603) + { + Console.WriteLine("|"); + ShowIterationResults(); + + _sampledTypesInRun.Add(_sampledTypes); + _sampledTypes = new Dictionary(); + _tickTypesInRun.Add(_tickTypes); + _tickTypes = new Dictionary(); + return; + } + } + + private static void ShowRunResults() + { + var iterations = _sampledTypesInRun.Count; + + // for each type, get the percent diff between upscaled count and expected _allocationsCount + Dictionary> typeDistribution = new Dictionary>(); + foreach (var iteration in _sampledTypesInRun) + { + foreach (var info in iteration.Values) + { + // ignore types outside of the allocations run + if (info.Count < 16) continue; + + if (!typeDistribution.TryGetValue(info, out List distribution)) + { + distribution = new List(iterations); + typeDistribution.Add(info, distribution); + } + + var upscaledCount = (long)info.Count * UpscaleSize(info.TotalSize, info.Count, SAMPLING_MEAN, info.RemainderSize) / info.TotalSize; + var percentDiff = (double)(upscaledCount - _allocationsCount) / (double)_allocationsCount; + distribution.Add(percentDiff); + } + } + + foreach (var type in typeDistribution.Keys.OrderBy(t => t.Size)) + { + var distribution = typeDistribution[type]; + + string typeName = type.TypeName; + if (typeName.Contains("[]")) + { + typeName += $" ({type.Size} bytes)"; + } + Console.WriteLine(typeName); + Console.WriteLine("-------------------------"); + int current = 1; + foreach (var diff in distribution.OrderBy(v => v)) + { + if (iterations > 20) + { + if ((current <= 5) || ((current >= 49) && (current < 52)) || (current >= 96)) + { + Console.WriteLine($"{current,4} {diff,8:0.0 %}"); + } + else + if ((current == 6) || (current == 95)) + { + Console.WriteLine(" ..."); + } + } + else + { + Console.WriteLine($"{current,4} {diff,8:0.0 %}"); + } + + current++; + } + Console.WriteLine(); + } + } + + private static void ShowIterationResults() + { + // NOTE: need to take the size into account for array types + // print the sampled types for both AllocationTick and AllocationSampled + Console.WriteLine("Tag SCount 
TCount SSize TSize UnitSize UpscaledSize UpscaledCount Name"); + Console.WriteLine("--------------------------------------------------------------------------------------------------"); + foreach (var type in _sampledTypes.Values.OrderBy(v => v.Size)) + { + string tag = "S"; + if (_tickTypes.TryGetValue(type.TypeName + type.Size, out TypeInfo tickType)) + { + tag += "T"; + } + + Console.Write($"{tag,3} {type.Count,6}"); + if (tag == "S") + { + Console.Write($" {0,6}"); + } + else + { + Console.Write($" {tickType.Count,6}"); + } + + Console.Write($" {type.TotalSize,13}"); + if (tag == "S") + { + Console.Write($" {0,13}"); + } + else + { + Console.Write($" {tickType.TotalSize,13}"); + } + + string typeName = type.TypeName; + if (typeName.Contains("[]")) + { + typeName += $" ({type.Size} bytes)"; + } + + if (type.Count != 0) + { + Console.WriteLine($" {type.TotalSize / type.Count,9} {UpscaleSize(type.TotalSize, type.Count, SAMPLING_MEAN, type.RemainderSize),13} {(long)type.Count * UpscaleSize(type.TotalSize, type.Count, SAMPLING_MEAN, type.RemainderSize) / type.TotalSize,10} {typeName}"); + } + } + + foreach (var type in _tickTypes.Values) + { + string tag = "T"; + + if (!_sampledTypes.ContainsKey(type.TypeName + type.Size)) + { + string typeName = type.TypeName; + if (typeName.Contains("[]")) + { + typeName += $" ({type.Size} bytes)"; + } + + Console.WriteLine($"{tag,3} {"0",6} {type.Count,6} {"0",13} {type.TotalSize,13} {type.TotalSize / type.Count,9} {"0",13} {"0",10} {typeName}"); + } + } + } + } + + + // + // + // + // + // + // + // + // + class AllocationSampledData + { + const int EndOfStringCharLength = 2; + private TraceEvent _payload; + private int _pointerSize; + public AllocationSampledData(TraceEvent payload, int pointerSize) + { + _payload = payload; + _pointerSize = pointerSize; + TypeName = "?"; + + ComputeFields(); + } + + public GCAllocationKind AllocationKind; + public int ClrInstanceID; + public UInt64 TypeID; + public string TypeName; + public int HeapIndex; + public UInt64 Address; + public long ObjectSize; + public long SampledByteOffset; + + private void ComputeFields() + { + int offsetBeforeString = 4 + 2 + _pointerSize; + + Span data = _payload.EventData().AsSpan(); + AllocationKind = (GCAllocationKind)BitConverter.ToInt32(data.Slice(0, 4)); + ClrInstanceID = BitConverter.ToInt16(data.Slice(4, 2)); + if (_pointerSize == 4) + { + TypeID = BitConverter.ToUInt32(data.Slice(6, _pointerSize)); + } + else + { + TypeID = BitConverter.ToUInt64(data.Slice(6, _pointerSize)); + } + // \0 should not be included for GetString to work + TypeName = Encoding.Unicode.GetString(data.Slice(offsetBeforeString, _payload.EventDataLength - offsetBeforeString - EndOfStringCharLength - 4 - _pointerSize - 8 - 8)); + HeapIndex = BitConverter.ToInt32(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength, 4)); + if (_pointerSize == 4) + { + Address = BitConverter.ToUInt32(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength + 4, _pointerSize)); + } + else + { + Address = BitConverter.ToUInt64(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength + 4, _pointerSize)); + } + ObjectSize = BitConverter.ToInt64(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength + 4 + _pointerSize, 8)); + SampledByteOffset = BitConverter.ToInt64(data.Slice(offsetBeforeString + TypeName.Length * 2 + EndOfStringCharLength + 4 + _pointerSize + 8, 8)); + } + } + + class AllocationsRunData + { + const int EndOfStringCharLength = 2; 
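+        // EndOfStringCharLength accounts for the 2-byte UTF-16 null terminator at the end of the
+        // AllocatedTypes string; it must be excluded when decoding with Encoding.Unicode.GetString.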
+        private TraceEvent _payload;
+
+        public AllocationsRunData(TraceEvent payload)
+        {
+            _payload = payload;
+
+            ComputeFields();
+        }
+
+        public int Iterations;
+        public int Count;
+        public string AllocatedTypes;
+
+        private void ComputeFields()
+        {
+            int offsetBeforeString = 4 + 4;
+
+            Span<byte> data = _payload.EventData().AsSpan();
+            Iterations = BitConverter.ToInt32(data.Slice(0, 4));
+            Count = BitConverter.ToInt32(data.Slice(4, 4));
+            AllocatedTypes = Encoding.Unicode.GetString(data.Slice(offsetBeforeString, _payload.EventDataLength - offsetBeforeString - EndOfStringCharLength));
+        }
+    }
+
+    class AllocationsRunIterationData
+    {
+        private TraceEvent _payload;
+        public AllocationsRunIterationData(TraceEvent payload)
+        {
+            _payload = payload;
+
+            ComputeFields();
+        }
+
+        public int Iteration;
+
+        private void ComputeFields()
+        {
+            Span<byte> data = _payload.EventData().AsSpan();
+            Iteration = BitConverter.ToInt32(data.Slice(0, 4));
+        }
+    }
+}
diff --git a/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/README.md b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/README.md
new file mode 100644
index 00000000000000..6e426c36609580
--- /dev/null
+++ b/src/tests/tracing/eventpipe/randomizedallocationsampling/manual/README.md
@@ -0,0 +1,110 @@
+# Manual Testing for Randomized Allocation Sampling
+
+This folder has a test app (Allocate sub-folder) and a profiler (AllocationProfiler sub-folder) that together can be used to experimentally
+observe the distribution of sampling events that are generated for different allocation scenarios. To run it:
+
+1. Build both projects.
+2. Run the Allocate app with corerun, using the --scenario argument to select the allocation scenario you want to validate.
+3. The Allocate app will print its own PID to the console and wait.
+4. Run the AllocationProfiler app, passing the Allocate app's PID as an argument.
+5. Hit Enter in the Allocate app to begin the allocations. You will see output in the profiler app's console showing the measurements. For example:
+
+```
+Tag SCount TCount SSize TSize UnitSize UpscaledSize UpscaledCount Name
+-------------------------------------------------------------------------------------------
+  S 1 0 24 0 24 102412 4267 System.Int16
+ ST 44 61 1056 1464 24 4506128 187755 Object8
+ ST 1 1 32 32 32 102416 3200 System.Reflection.MetadataImport
+ ST 67 30 2144 960 32 6861872 214433 Object16
+ ST 80 169 3840 8112 48 8193920 170706 Object32
+  S 1 0 56 0 56 102428 1829 MemberInfoCache`1[System.Reflection.RuntimeMethodInfo]
+ ST 2 3 160 240 80 204880 2561 System.String
+  S 2 0 128 0 64 204864 3201 System.Reflection.RuntimeMethodBody
+  S 1 0 80 0 80 102440 1280 System.Signature
+ ST 143 86 11440 6880 80 14648920 183111 Object64
+  S 2 0 222 0 111 204911 1846 System.Byte[]
+  S 1 0 96 0 96 102448 1067 System.Reflection.RuntimeParameterInfo
+  S 1 0 112 0 112 102456 914 System.Reflection.ParameterInfo[]
+ ST 280 272 40320 39168 144 28692164 199251 Object128
+  S 2 0 58224 0 29112 235289 8 EventMetadata[]
+ ST 1 1 8388632 8388640 8388632 8388632 1 Object0[]
+  T 0 1 0 336 336 0 0 System.Reflection.RuntimeFieldInfo[]
+  T 0 1 0 48 48 0 0 System.Text.StringBuilder
+```
+
+- The **Tag** column shows whether Allocation**T**ick and/or Allocation**S**ampled events were received for instances of a given type
+- The **S**-prefixed columns refer to data from AllocationSampled event payloads
+- The **T**-prefixed columns refer to data from AllocationTick event payloads
+- The final **Upscaled**XXX columns are computed from AllocationSampled event payloads
+
+In this first case, the same number of instances (200000) was created for each type, which can be checked in the **UpscaledCount** column.
+
+In a second case, two threads allocate 200000 instances of objects with a 1x/2x/3x size ratio to check that the relative size distribution is preserved:
+
+```
+Tag SCount TCount SSize TSize UnitSize UpscaledSize UpscaledCount Name
+-------------------------------------------------------------------------------------------
+ ST 47 67 1128 1608 24 4813364 200556 Object24
+ ST 65 48 2080 1536 32 6657040 208032 Object32
+ ST 108 94 5184 4512 48 11061792 230454 Object48
+ ST 132 145 8448 9280 64 13521024 211266 Object64
+ ST 155 87 11160 6264 72 15877580 220521 Object72
+ ST 191 192 18336 18432 96 19567569 203828 Object96
+ ST 2 2 16777264 16777280 8388632 16777264 2 Object0[]
+```
+
+
+A dedicated `AllocationsRunEventSource` has been created to allow monitoring multiple allocation runs and computing percentiles:
+```
+> starts 10 iterations allocating 1000000 instances
+0|
+Tag SCount TCount SSize TSize UnitSize UpscaledSize UpscaledCount Name
+-------------------------------------------------------------------------------------------
+ ST 246 224 5904 5376 24 25193352 1049723 Allocate.WithFinalizer
+ ST 5 7 320 448 64 512160 8002 System.RuntimeFieldInfoStub
+ ST 702 719 50544 51768 72 71910074 998751 System.Int32[,]
+ ST 946 859 90816 82464 96 96915815 1009539 System.String
+ ST 1842 1887 362874 377400 197 188802295 958387 System.Byte[]
+ ST 3 3 56000072 56000096 18666690 56000072 3 System.Object[]
+1|
+Tag SCount TCount SSize TSize UnitSize UpscaledSize UpscaledCount Name
+-------------------------------------------------------------------------------------------
+ ST 283 224 6792 5376 24 28982596 1207608 Allocate.WithFinalizer
+ ST 675 711 48600 51192 72 69144302 960337 System.Int32[,]
+ ST 974 867 93504 83232 96 99784359 1039420 System.String
+ ST 1861 1888 366617 377600 197 190749767 968272 System.Byte[]
+ ST 3 3 56000072 56000096 18666690 56000072 3 System.Object[]
+2|
+Tag SCount TCount SSize TSize UnitSize UpscaledSize UpscaledCount Name
+-------------------------------------------------------------------------------------------
+ ST 215 236 5160 5664 24 22018580 917440 Allocate.WithFinalizer
+ ST 1 1 64 64 64 102432 1600 System.RuntimeFieldInfoStub
+ ST 697 650 50184 46800 72 71397894 991637 System.Int32[,]
+ ST 927 917 88992 88032 96 94969302 989263 System.String
+ ST 1895 1886 373315 377200 197 194234717 985963 System.Byte[]
+ ST 3 3 56000072 56000096 18666690 56000072 3 System.Object[]
+  T 0 1 0 288 288 0 0 System.GCMemoryInfoData
+3|
+...
+8|
+Tag SCount TCount SSize TSize UnitSize UpscaledSize UpscaledCount Name
+-------------------------------------------------------------------------------------------
+ ST 244 213 5856 5112 24 24988528 1041188 Allocate.WithFinalizer
+ ST 710 681 51120 49032 72 72729562 1010132 System.Int32[,]
+ ST 974 918 93504 88128 96 99784359 1039420 System.String
+ ST 1920 1875 378240 375000 197 196797180 998970 System.Byte[]
+ ST 3 3 56000072 56000096 18666690 56000072 3 System.Object[]
+9|
+Tag SCount TCount SSize TSize UnitSize UpscaledSize UpscaledCount Name
+-------------------------------------------------------------------------------------------
+ ST 236 219 5664 5256 24 24169232 1007051 Allocate.WithFinalizer
+ ST 698 682 50256 49104 72 71500330 993060 System.Int32[,]
+ ST 940 913 90240 87648 96 96301127 1003136 System.String
+ ST 1982 1874 390454 374800 197 203152089 1031228 System.Byte[]
+ ST 3 3 56000072 56000096 18666690 56000072 3 System.Object[]
+
+< run stops
+```
+
+
+Feel free to allocate the patterns you want in other methods of the **_Allocate_** project and use the _DynamicAllocationSampling_ event listener to get a summary view of the different allocation events.
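+
+For reference, the upscaling performed by the profiler boils down to a few lines. The sketch below is a minimal standalone version of the estimator (the method names are illustrative, not part of any runtime or TraceEvent API); it assumes the default sampling mean of 100 KB, i.e. p = 1/102400:
+
+```
+// Estimate the total number of bytes allocated for a (type, size) bucket from:
+//   sampleCount  - the number of AllocationSampled events received for the bucket (s)
+//   remainderSum - Sum(ObjectSize - SampledByteOffset) over those events (u)
+// using the estimator shown in the profiler source: s*q/p + u, with p = 1/102400 and q = 1 - p.
+static long EstimateAllocatedBytes(int sampleCount, long remainderSum)
+{
+    const double p = 1.0 / 102400.0; // probability that any given byte is sampled
+    const double q = 1.0 - p;        // 0.999990234375
+    return (long)(sampleCount * (q / p) + remainderSum);
+}
+
+// For a fixed-size type, the estimated instance count follows by dividing by the instance size.
+static long EstimateAllocatedCount(int sampleCount, long remainderSum, long instanceSize)
+{
+    return EstimateAllocatedBytes(sampleCount, remainderSum) / instanceSize;
+}
+```
+
+For example, the `Object128` row of the first table above (280 samples, UnitSize 144) upscales to roughly 28.7 MB and about 199000 instances, matching the **UpscaledSize** and **UpscaledCount** columns.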