Try changing alignment #76451
Tagging subscribers to this area: @dotnet/gc
Issue Details
Looking at CPU traces for microbenchmarks, I noticed a hotspot in memset (the flavor that uses AVX2 instructions) for the instruction that clears the very last double quadword at the end of an allocation context. Also, the buffer being cleared is not aligned on a 32-byte boundary.
Two tiny changes address this:
Why change 2 helps is not clear, but the measurements say it does: the Perf_String.Replace_Char_Custom benchmark regressed by about 1.8% for regions vs. segments without these changes, but shows the same or slightly better performance with them.
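As a rough illustration of what "not aligned on a 32-byte boundary" means for a clear that uses 32-byte AVX2 stores, here is a minimal C++ sketch. It is not taken from the runtime source: the helper names (`misalignment`, `align_up`, `blocks_touched`) and the example addresses are hypothetical, and the code only does the address arithmetic, not the actual memset.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only; not the GC's actual code.
constexpr std::size_t kVectorBytes = 32;  // width of one AVX2 (256-bit) store

// Bytes by which addr sits past the previous 32-byte boundary (0 = aligned).
constexpr std::size_t misalignment(std::uintptr_t addr) {
    return addr & (kVectorBytes - 1);
}

// Round addr up to the next 32-byte boundary (no change if already aligned).
constexpr std::uintptr_t align_up(std::uintptr_t addr) {
    return (addr + kVectorBytes - 1) & ~static_cast<std::uintptr_t>(kVectorBytes - 1);
}

// Number of distinct 32-byte blocks covered by the range [addr, addr + size).
constexpr std::size_t blocks_touched(std::uintptr_t addr, std::size_t size) {
    if (size == 0) return 0;
    std::uintptr_t mask  = ~static_cast<std::uintptr_t>(kVectorBytes - 1);
    std::uintptr_t first = addr & mask;
    std::uintptr_t last  = (addr + size - 1) & mask;
    return (last - first) / kVectorBytes + 1;
}
```

With these helpers, an 8 KB range starting at the assumed address `0x1008` is 8 bytes past a boundary and covers 257 blocks, while the same range starting at `0x1000` covers exactly 256, so a buffer that starts 8 bytes off forces partial handling at both ends of the clear.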
Hmm, interesting finding. My worry here is whether we are "over-fitting" the perf benefits to the microbenchmarks without being sure whether real scenarios would see benefits or regressions.
This applies to all scenarios, not just microbenchmarks; this is how we clear memory in general. We are using 8 bytes more per region, which is a tiny amount, but we avoid the unaligned memsets for the most part, so 1) is definitely a good thing. I'm also unclear why 2) matters. Maybe the latency of going to the next page is now partially hidden by the last clear, since the second-to-last clear hits it now? We can take a look at the individual instruction cost in the loop when you are back.
One reason I found why 2) is beneficial is that making the size 8k+32 minimizes the number of |