GC's VXSort question #64164

Open
EgorBo opened this issue Jan 23, 2022 · 6 comments
Labels: area-GC-coreclr, question

Comments

@EgorBo
Member

EgorBo commented Jan 23, 2022

I noticed that the GC uses VXSort (AVX2/AVX512), but only on Windows x64. So I assume it would have to be enabled for Linux x64 and rewritten with NEON for Arm64?
[image]
(a screenshot, because it's not possible to reference lines in gc.cpp via GitHub 😄)

I only tested it on the Plaintext-MVC benchmark (which allocates a lot of short-lived objects) in our perf lab, and it seems that VXSort regresses P90 across 7 runs and has no effect on RPS. Also, it adds 227 KB to the native size (for coreclr.dll + clrgc.dll combined).
[image]

Is there a scenario I can simulate in our perf lab to see benefits from it, or does it only target real-world large workloads?
I am asking because I am wondering whether it is worth porting to NEON SIMD for Arm64.

@dotnet-issue-labeler bot added the area-Diagnostics-coreclr and untriaged labels Jan 23, 2022
@EgorBo added the area-GC-coreclr and question labels and removed the area-Diagnostics-coreclr label Jan 23, 2022
@danmoseley
Member

@PeterSolMS

Linking #37159

@kunalspathak
Member

Some notes about INTROSORT are in #60166 (comment).

@mangod9 removed the untriaged label Feb 28, 2022
@mangod9 added this to the 7.0.0 milestone Feb 28, 2022
@Maoni0
Member

Maoni0 commented Jul 7, 2022

Thanks @EgorBo for the data.

That's interesting, because if you are just allocating some temp objects you shouldn't even hit the vectorized sorting code path. If you took a trace with CPU samples, do you see any samples captured in do_vxsort at all? Or, if you set a breakpoint on do_vxsort, do you see it get hit? Just checking whether it's a matter of "the benchmark is so small that any code change could disturb it" or whether it's really caused by the vectorized sorting.

@Maoni0 Maoni0 modified the milestones: 7.0.0, Future Jul 7, 2022
@Maoni0
Member

Maoni0 commented Jul 7, 2022

I don't think this needs to be 7.0 but let me know if you don't agree.

5 participants