@ylpoonlg ylpoonlg commented Nov 5, 2025

The intrinsic codepath was temporarily disabled (#42052) for AdvSimd due to a performance regression (#41699), but it was never re-enabled. Since AdvSimd has no equivalent instruction for ExtractMostSignificantBits, using the same algorithm as SSE2 is inherently slower.

This PR adds optimizations specific to arm64 AdvSimd, based on the generic Vector128 algorithm. It uses the Unzip instruction to convert vectors into scalars for processing, which offers some speedup over the generic Vector128 path.
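To illustrate the general idea of turning a vector comparison mask into a scalar with an unzip step, here is a minimal scalar emulation in Python. It is not the code from this PR: the helper names (`uzp1_bytes`, `compare_mask_u16`, `mask_to_scalar`, `count_non_ascii`) are hypothetical, and the choice of "non-ASCII" as the predicate is an assumption made only for the example. It relies on the documented behaviour of UZP1 on byte lanes, which keeps the even-indexed bytes of its two inputs.

```python
def uzp1_bytes(a: bytes, b: bytes) -> bytes:
    """Emulate AdvSimd UZP1 on 16 byte lanes: even-indexed bytes of a, then of b."""
    assert len(a) == 16 and len(b) == 16
    return a[0::2] + b[0::2]


def compare_mask_u16(chars, predicate) -> bytes:
    """Build a 128-bit mask of eight 16-bit lanes: 0xFFFF where predicate holds, else 0."""
    assert len(chars) == 8
    out = bytearray()
    for c in chars:
        out += (0xFFFF if predicate(c) else 0x0000).to_bytes(2, "little")
    return bytes(out)


def mask_to_scalar(mask: bytes) -> int:
    """Narrow the lane mask to 64 bits (8 bits per original lane) and read it as a scalar."""
    narrowed = uzp1_bytes(mask, mask)
    return int.from_bytes(narrowed[:8], "little")


def count_non_ascii(chars) -> int:
    """Count UTF-16 code units >= 0x80 among eight inputs (illustrative predicate)."""
    scalar = mask_to_scalar(compare_mask_u16(chars, lambda c: c >= 0x80))
    # Each matching lane contributes exactly eight set bits.
    return bin(scalar).count("1") // 8


def first_non_ascii_index(chars) -> int:
    """Index of the first matching lane, or -1 if none match."""
    scalar = mask_to_scalar(compare_mask_u16(chars, lambda c: c >= 0x80))
    if scalar == 0:
        return -1
    # Position of the lowest set bit, divided by 8 bits per lane.
    return ((scalar & -scalar).bit_length() - 1) // 8
```

Once the mask is a 64-bit scalar, ordinary integer operations (population count, trailing-zero count) stand in for the role that ExtractMostSignificantBits plays on SSE2, at 8 bits per lane instead of 1.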

Performance results

Run on Neoverse-V2 (lower is better)

| Method       | Input              | Version | Mean      | Error     | Ratio |
|------------- |------------------- |-------- |----------:|----------:|------:|
| GetByteCount | EnglishAllAscii    | Before  |  4.414 us | 0.0852 us | 1.000 |
| GetByteCount | EnglishAllAscii    | After   |  4.420 us | 0.0858 us | 1.001 |
| GetByteCount | EnglishMostlyAscii | Before  | 20.168 us | 0.0613 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After   | 18.366 us | 0.1092 us | 0.911 |
| GetByteCount | Chinese            | Before  |  9.132 us | 0.0052 us | 1.000 |
| GetByteCount | Chinese            | After   |  8.157 us | 0.0342 us | 0.893 |
| GetByteCount | Cyrillic           | Before  |  7.929 us | 0.0069 us | 1.000 |
| GetByteCount | Cyrillic           | After   |  7.157 us | 0.0521 us | 0.903 |
| GetByteCount | Greek              | Before  | 10.042 us | 0.0093 us | 1.000 |
| GetByteCount | Greek              | After   |  9.109 us | 0.0646 us | 0.907 |
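The Ratio column appears to be the After mean divided by the Before mean for the same input. The short Python check below recomputes it from the reported means; the dictionary and the `ratio` helper are illustrative names, not part of the benchmark harness.

```python
# Reported means in microseconds: input -> (before, after)
MEANS_US = {
    "EnglishAllAscii": (4.414, 4.420),
    "EnglishMostlyAscii": (20.168, 18.366),
    "Chinese": (9.132, 8.157),
    "Cyrillic": (7.929, 7.157),
    "Greek": (10.042, 9.109),
}


def ratio(before_us: float, after_us: float) -> float:
    """After/Before, rounded to three decimals like the table."""
    return round(after_us / before_us, 3)


RATIOS = {name: ratio(b, a) for name, (b, a) in MEANS_US.items()}
```

Under that assumption the recomputed values match the table: the all-ASCII input is effectively unchanged, and the other inputs improve by roughly 9 to 11 percent.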

cc @dotnet/arm64-contrib @SwapnilGaikwad

The intrinsic codepath was disabled for AdvSimd due to slower
performance than the Vector128 codepath. Since there is no equivalent
instruction for ExtractMostSignificantBits in AdvSimd, using the SSE2
algorithm will be slow. Therefore a new specialized algorithm is added
to optimise for AdvSimd, based on the generic Vector128 algorithm.

Fixes dotnet#41699 and dotnet#42052.

Performance results on Neoverse-V2 (lower is better):

| Method       | Input              | Version | Mean      | Error     | Ratio |
|------------- |------------------- |-------- |----------:|----------:|------:|
| GetByteCount | EnglishAllAscii    | Before  |  4.414 us | 0.0852 us | 1.000 |
| GetByteCount | EnglishAllAscii    | After   |  4.420 us | 0.0858 us | 1.001 |
| GetByteCount | EnglishMostlyAscii | Before  | 20.168 us | 0.0613 us | 1.000 |
| GetByteCount | EnglishMostlyAscii | After   | 18.366 us | 0.1092 us | 0.911 |
| GetByteCount | Chinese            | Before  |  9.132 us | 0.0052 us | 1.000 |
| GetByteCount | Chinese            | After   |  8.157 us | 0.0342 us | 0.893 |
| GetByteCount | Cyrillic           | Before  |  7.929 us | 0.0069 us | 1.000 |
| GetByteCount | Cyrillic           | After   |  7.157 us | 0.0521 us | 0.903 |
| GetByteCount | Greek              | Before  | 10.042 us | 0.0093 us | 1.000 |
| GetByteCount | Greek              | After   |  9.109 us | 0.0646 us | 0.907 |

@dotnet-policy-service dotnet-policy-service bot added the community-contribution label (Indicates that the PR has been added by a community member) Nov 5, 2025

@github-actions github-actions bot added the needs-area-label label (An area label is needed to ensure this gets routed to the appropriate area owners) Nov 5, 2025

@SwapnilGaikwad
Contributor

cc: @EgorBo @a74nh @JulieLeeMSFT

@EgorBo
Member

EgorBo commented Nov 6, 2025

I think @tannergooding is a better person to review this as it's Libraries code.

Can any of this be replaced with the newly added "IndexOf"-like Vector APIs?

@EgorBo EgorBo requested a review from tannergooding November 6, 2025 15:53
@tannergooding
Member

Can any of this be replaced with the newly added "IndexOf"-like Vector APIs?

👍. This is not a large improvement and makes the code harder to maintain. I'd personally lean more towards removing the Sse2 specialized path and instead just having the Vector128.IsHardwareAccelerated path. Ideally utilizing the newer xplat APIs, like Count(...) and IndexOf(...), to handle such optimizations (and tweaking the JIT if things aren't quite right).

The general goal is to reduce the amount of platform specific code we have to maintain over time. This is a goal even if there are relatively minor regressions in doing so. We really only want to have platform specific paths if we know there is "significant" advantage such as much higher throughput numbers and/or real world benchmarks (not microbenchmarks) showing the gains.

-- Not all 10% gains are created equal. 10% of 1us (100ns) is significantly less than 10% of 1ms (100us), for example. And while a 10% gain on 10us can be impactful to some apps, it is unlikely to be the bottleneck or to have any measurable impact on typical workloads. This is particularly true when things like Tiered Compilation don't kick in until there's a 100ms gap between Tier 0 compilations after startup. So we typically want to see either a reduction in code complexity or evidence that the complexity increase is worthwhile.
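The arithmetic behind this point can be written out as a small Python sketch. The helper names are hypothetical, and the only measured figures used are the EnglishMostlyAscii means reported in the PR description.

```python
def saved_ns(baseline_ns: int, gain_percent: int) -> int:
    """Absolute time saved, in nanoseconds, for a relative gain on a baseline."""
    return baseline_ns * gain_percent // 100


# The two examples from the comment: the same 10% gain on different baselines.
TEN_PERCENT_OF_1US = saved_ns(1_000, 10)        # 1 us baseline
TEN_PERCENT_OF_1MS = saved_ns(1_000_000, 10)    # 1 ms baseline


def absolute_saving_us(before_us: float, after_us: float) -> float:
    """Absolute saving per benchmark invocation, in microseconds."""
    return round(before_us - after_us, 3)


# Largest improvement reported in this PR (EnglishMostlyAscii).
LARGEST_SAVING_US = absolute_saving_us(20.168, 18.366)
```

The same relative gain is 100 ns on a 1 us baseline and 100 us on a 1 ms baseline, a factor of 1000 apart. For this PR, the largest reported improvement works out to about 1.8 us per benchmark invocation.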

@SwapnilGaikwad
Contributor

SwapnilGaikwad commented Nov 7, 2025

This is not a large improvement and makes the code harder to maintain.

Thanks @tannergooding for your comment and explanation. I certainly agree, we will close this PR.

@ylpoonlg
Author

ylpoonlg commented Nov 7, 2025

Thanks all, that makes sense, I will bear that in mind and explore the suggestions of using IndexOf APIs. Closing this for now.

@ylpoonlg ylpoonlg closed this Nov 7, 2025