
Try changing alignment #76451

Merged · 4 commits · Oct 14, 2022
Conversation

PeterSolMS
Contributor

Looking at CPU traces for microbenchmarks, I noticed a hotspot in memset (the flavor that uses AVX2 instructions) for the instruction that clears the very last double quadword at the end of an allocation context. Also, the buffer being cleared is not aligned on a 32-byte boundary.

Two tiny changes address this:

  1. Adding additional padding at the start of regions aligns the allocation context for the microbenchmark cases.
  2. Increasing CLR_SIZE slightly ensures the end of an allocation context doesn't consistently fall on a page boundary.

Why change 2 helps is not clear, but the measurements say it does: the Perf_String.Replace_Char_Custom benchmark regressed by about 1.8% for regions vs. segments without these changes, but shows the same or slightly better performance with them.

@ghost

ghost commented Sep 30, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.


Author: PeterSolMS
Assignees: PeterSolMS
Labels: area-GC-coreclr
Milestone: -

@mangod9
Member

mangod9 commented Sep 30, 2022

Hmm, interesting finding. My worry here would be whether we are "over-fitting" the perf benefits to the microbenchmarks; we probably aren't quite sure whether real scenarios would see benefits or regressions?

@Maoni0
Member

Maoni0 commented Sep 30, 2022

this applies to all scenarios, not just microbenchmarks. this is how we clear memory in general. we are using 8 bytes more per region, which is a tiny amount, but we avoid the unaligned memsets for the most part, so 1) is definitely a good thing.

I'm also unclear why 2) matters - maybe now the latency of going to the next page is partially hidden by the last clear, since the 2nd-to-last clear hits it now? we can take a look at the individual instruction cost in the loop when you are back.

@PeterSolMS
Contributor Author
Contributor Author

One reason I found that 2) is beneficial is that making the size 8k+32 minimizes the number of vmovdqu instructions: at 8k we execute 9 of them, while at 8k+32 we execute only 1. Traces I collected with a C++ replica of the allocator's behavior also show that hitting a new page with a vmovdqu to an unaligned address is particularly costly, presumably because an access straddling two pages is more complicated to handle.

3 participants