Make memory alignment more random #1513
Comments
I can put together a list of places where we are seeing this and link it here. |
@adamsitnik what do you think about customizing the LaunchCount that controls the number of started benchmark processes? I designed this property exactly for such cases. As a bonus, we can add some additional memory randomization before the GlobalSetup. I do not like the idea of "GlobalSetup-like" initialization between iterations because it can destroy benchmark stability in the case of microbenchmarks. The LaunchCount approach may be expensive from the total benchmarking time point of view, but it should be a reliable and stable way that solves the problem (at least, I don't know other acceptable solutions). P.S. In the good old days, the default value for LaunchCount was 3, so it was easier to detect such problems. However, it didn't provide benefits in most "simple" cases, so I changed it to 1 in order to reduce the total benchmarking time. |
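As a rough illustration of the LaunchCount suggestion above, here is a minimal sketch of a benchmark that asks BenchmarkDotNet to start three separate processes. The class name, buffer size, and seed are hypothetical and chosen only for the example; the only part that matters is `launchCount: 3`.

```csharp
using System;
using BenchmarkDotNet.Attributes;

// Hypothetical benchmark class used only to illustrate the setting.
// launchCount: 3 asks BenchmarkDotNet to start three benchmark processes,
// so [GlobalSetup] runs (and allocates) once in each of them.
[SimpleJob(launchCount: 3)]
public class ArrayCopyBenchmarks
{
    private byte[] _source = Array.Empty<byte>();
    private byte[] _destination = Array.Empty<byte>();

    [Params(2048)] // arbitrary size, an assumption for this sketch
    public int Size;

    [GlobalSetup]
    public void Setup()
    {
        // Allocated once per process; each launch may end up with a
        // different alignment for these two arrays.
        _source = new byte[Size];
        _destination = new byte[Size];
        new Random(42).NextBytes(_source);
    }

    [Benchmark]
    public byte[] CopyTo()
    {
        _source.CopyTo(_destination, 0);
        return _destination;
    }
}
```

Running it with `BenchmarkRunner.Run<ArrayCopyBenchmarks>()` from `BenchmarkDotNet.Running` should then report results aggregated over the three launches, at roughly three times the total run time.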
Also, I have a more "adaptive" idea. We can introduce an additional
@adamsitnik what do you think? |
The example at the top (stable over several days) is interesting, but the most common case I see (which I assume is the general case of the above issue) is more like this. Of course, CopyTo is pure memcpy, so it's very alignment sensitive, but I see this all over benchmarks for collections that are backed by arrays. I wonder if this is something my team can help with (possibly @adamsitnik, or not), since it apparently affects us quite a bit. |
This may be worse on ARM64. Another example: dotnet/runtime#41741 |
I have also been working on some filters that I think should do a good job of filtering out this kind of test. So that should at least help from a reporting side. |
@DrewScoggins could you say more? I wouldn't want us to filter out these tests - they're core tests. It seems to me this is a problem ideally fixed in BDN or else possibly with workarounds in the benchmarks. |
I mean that when we go to report regressions in the auto-filing we would not report a jump between bimodal points as a regression unless we see change in the character of the bimodal behavior. We won't be getting rid of them or stop running them, and certainly it is better to fix these to not have this behavior, but in the meantime I don't want tests like this to take up our triaging time. |
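A minimal sketch of what such a filter could look like, assuming positive timing values and a simple gap-based grouping of the history into modes. This is not the actual auto-filing logic; the type name, method names, and thresholds are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: not the real filter used by the auto-filing bot.
public static class BimodalFilter
{
    // Sorts the history and starts a new mode whenever the gap to the
    // previous value exceeds relativeGap times that value.
    // Returns the mean of each mode. Assumes positive measurements.
    public static List<double> FindModeCenters(IEnumerable<double> history, double relativeGap)
    {
        var centers = new List<double>();
        var current = new List<double>();
        foreach (double value in history.OrderBy(v => v))
        {
            if (current.Count > 0)
            {
                double previous = current[current.Count - 1];
                if (value - previous > relativeGap * previous)
                {
                    centers.Add(current.Average());
                    current.Clear();
                }
            }
            current.Add(value);
        }
        if (current.Count > 0)
        {
            centers.Add(current.Average());
        }
        return centers;
    }

    // A new measurement that lands close to any known mode is treated as a
    // jump between modes rather than a change in the distribution itself.
    public static bool IsNearKnownMode(double value, IReadOnlyList<double> centers, double relativeTolerance)
    {
        return centers.Any(c => Math.Abs(value - c) <= relativeTolerance * c);
    }
}
```

With a history that alternates between roughly 100 and 130, a new result near 130 would not be reported, while a result near 160 would. A real filter would also need to notice when one mode disappears or the mix between modes changes, which this sketch does not attempt.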
We've talked about this a fair bit in perf triage -- basically if the underlying distribution shifts, then that is a significant event, even if the distributions themselves are broad (noisy) or bimodal. The main questions are how to determine what data points logically belong to the "same" distribution and how much data you need to accurately characterize the distributions (especially when bimodal or multimodal). As for fixing this behavior, we can either try to regularize alignments or randomize them -- long term we should perhaps pursue both. Regularization helps us and also our customers, who will experience the same sort of uncontrolled perf fluctuations in their code. For regularization of code alignment we've started a little ways down this path but need to go further, and consider controlling loop alignment. But it is a tricky thing to get right. Randomization helps benchmarking by ensuring that each set of runs visits a wide variety of possible behaviors, so we can more quickly get a sense of the true distribution. Currently in benchmarks with broad / complex distributions we can only effectively detect regressions after some time has passed and the new behavior becomes clear. Randomization would speed up this process and perhaps get us back to the point where we could reliably detect regressions with just one set of base/diff runs in most tests. |
@AndyAyersMS by regularize/randomize - are you referring to something the runtime could do (e.g. an opt-in mode where the GC adds a little random padding before array allocations) or something that BDN could potentially do (e.g. @AndreyAkinshin's suggestion of increasing launch counts and adding some random allocation on each launch)? |
The runtime would regularize (or the jit would). BDN would randomize. |
Is there an issue tracking the runtime/JIT work, @AndyAyersMS, or would you mind creating one? It might be a good candidate for work before 6.0 features. |
On the jit side we need to implement loop head alignment (dotnet/runtime#8108); as a prerequisite we need to do method entry alignment more broadly (dotnet/runtime#9912). x64 alignment is now 32 bytes for Tier1 methods with loops, see dotnet/runtime#2249. I had hoped this would reduce/eliminate some of the bimodal behavior but our benchmark results seemingly say otherwise (though I haven't done a systematic search...) [edit: fixed links] |
@kunalspathak will look into this with Andy and others. |
A few benchmarks that are suspected to be affected by memory alignment: DrewScoggins/performance-2#2290 (comment) |
Fixed by #1587 |
While working on a new bot for auto-filing performance regressions in the dotnet/runtime repository (sample issue), we have found that quite a few microbenchmarks from the dotnet/performance repository are bimodal, and the modality tends to be stable for a few days before it switches to the other mode.
Example:
So it's very often something like:
A while ago @AndyAyersMS mentioned Stabilizer, which performs full randomization.
.NET does not allow for full control of memory alignment, but we could at least try to make it more random.
In dotnet/runtime#37814 @jkotas has provided a small repro that shows "the many modal nature of memory copying":
So the first step could be to allocate a variable-size byte array between iterations to have more randomized memory alignment of the objects allocated during benchmarking.
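A minimal sketch of that first step, assuming a throwaway byte array of random length is allocated right before the benchmark input. The helper name and the 4096-byte upper bound are hypothetical, and nothing here is BenchmarkDotNet API. The GC may still move or align objects in ways this does not control, and large arrays that go to the large object heap are unlikely to be affected.

```csharp
using System;

// Hypothetical helper, shown only to illustrate the idea.
public static class AlignmentRandomizer
{
    // Arbitrary upper bound for the padding size (an assumption).
    public const int MaxPaddingBytes = 4096;

    // Allocates a throwaway array with a random length between 1 and
    // MaxPaddingBytes, so that objects allocated right afterwards are
    // likely to start at a different offset than in the previous iteration.
    public static byte[] AllocatePadding(Random random)
    {
        return new byte[random.Next(1, MaxPaddingBytes + 1)];
    }

    // Allocates the padding first, then runs the caller's allocation.
    public static T AllocateAfterPadding<T>(Random random, Func<T> allocate)
    {
        byte[] padding = AllocatePadding(random);
        T result = allocate();
        GC.KeepAlive(padding); // keep the padding alive until the input exists
        return result;
    }
}
```

A setup method could then call, for example, `AlignmentRandomizer.AllocateAfterPadding(random, () => new byte[2048])` instead of allocating the input directly.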
The problem is that very often the input is allocated in a [GlobalSetup] method (to exclude the cost of allocation from the benchmark, which is good), which we promise to call only once during benchmarking ;) Perhaps we could add an optional config flag to invoke it once per iteration (but somehow avoid the [IterationSetup] hell)?
Some benchmarks are initialized in constructors, so we might also consider allocating a new instance of the benchmarked type for every iteration.
@AndreyAkinshin what do you think?
@DrewScoggins is there any chance you could provide a list of such benchmarks to use them for experimenting?
/cc @billwert @tannergooding @kunalspathak