-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar #39050
AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar #39050
Conversation
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
Tagging subscribers to this area: @tannergooding |
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
Improve Arm64MoveMask.
d91e3f3
to
7591589
Compare
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Show resolved
Hide resolved
8435583
to
b754c31
Compare
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
Outdated
Show resolved
Hide resolved
@kunalspathak The System.Text.Json string performance tests seem to be touching the code from my 3 PRs. x64 PC - SSE2 turned off vs SSE2 turned on
ARM64 laptop - Before and after Utf16Utility.GetPointerToFirstInvalidChar changesThis PR.
ARM64 laptop - Before and after Utf8Utility.TranscodeToUtf8 changes
ARM64 laptop - Before and after Utf8Utility.GetPointerToFirstInvalidByte changes
ARM64 laptop - All 3 PRs merged togetherThis result combines the changes from all 3 PRs and shows the final performance results.
|
Hmm that's peculiar... Do you happen to have the codegen for the methods? :) |
Just for reference : The performance benchmarks code is here. Questions/Observations:
The first line of action is to see if you can repro the regression on your local machine (just take the relevant code from the benchmark and experiment if you see the slowness). The next thing to do is compare the JIT code before and after your changes as @TamarChristinaArm pointed out. It might not be that something that you introduced, but possibly that those methods already run decently with existing vectorization and the operations that we are doing to find mask are turning out expensive than their counter part in default implementation. |
I verified the code of |
@kunalspathak the tool that compares the benchmark results before and after will only show in the output table the results that had a performance difference above the specified threshold. In the results I shared, this is true for the Sse2 comparison, but not for the AdvSimd results (they diff didn't reach the threshold so they are not shown). |
…tInvalidChar (dotnet#39050) * AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar * Move using directive outside #if. Improve Arm64MoveMask. * Change overloads * UIn64 in Arm64MoveMask * Build error implicit conversion fix * Rename method and use simpler version * Use ShiftRightArithmetic instead of CompareEqual + And. * Remove unnecessary comment * Add missing shims causing Linux build to fail
I created 3 benchmark tests that touch the code from each of my 3 PRs. You can see the code in this commit in my fork. I ran these tests in my x64 PC first, with SSE2 disabled then enabled:
Then I ran them in my ARM64 WSL2 VM, with AdvSimd disabbled then enabled:
If a test didn't show up in the table, it's because there was no difference above the specified threshold. @kunalspathak @pgovind @eiriktsarpalis @tannergooding @jeffhandley |
…#39738) * AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar (#39050) * AdvSimd support for System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar * Move using directive outside #if. Improve Arm64MoveMask. * Change overloads * UIn64 in Arm64MoveMask * Build error implicit conversion fix * Rename method and use simpler version * Use ShiftRightArithmetic instead of CompareEqual + And. * Remove unnecessary comment * Add missing shims causing Linux build to fail * AdvSimd support for System.Text.Unicode.Utf8Utility.TranscodeToUtf8 (#39041) * AdvSimd support for System.Text.Unicode.Utf8Utility.TranscodeToUtf8 * Readd using to prevent build failure. Add AdvSimd equivalent operation to TestZ. * Inverted condition * Address IsSupported order, improve use ExtractNarrowingSaturated usage * Rename source to result, second argument utf16Data * Improve CompareTest * Add shims causing failures in Linux * Use unsigned version of ExtractNarrowingSaturate, avoid using MinAcross and use MaxPairwise instead * Missing support check for Sse2.X64 * Add missing case for AdvSimd * Use MinPairwise for short * AdvSimd support for System.Text.Unicode.Utf8Utility.GetPointerToFirstInvalidByte (#38653) * AdvSimd support for System.Text.Unicode.Utf8Utility.GetPointerToFirstInvalidByte * Move comment to the top, add shims. * Little endian checks * Use custom MoveMask method for AdvSimd * Address suggestions to improve the AdvSimdMoveMask method * Define initialMask outside MoveMask method * UInt64 in Arm64MoveMask * Add unit test case to verify intrinsics improvement * Avoid casting to smaller integer type * Typo and comment * Use ShiftRightArithmetic instead of CompareEqual + And. Remove test case causing other unit tests to fail. * Use AddPairwise version of GetNotAsciiBytes * Add missing shims causing Linux build to fail * Simplify GetNonAsciiBytes to only one AddPairwise call, shorter bitmask * Respect data type returned by masking method * Address suggestions - assert trailingzerocount and bring back uint mask * Trailing zeroes in AdvSimd need to be divided by 4, and total number should not be larger than 16 * Avoid declaring static field which causes PNSE in Utf8String.Experimental (S.P.Corelib code is used for being NetStandard) * Prefer using nuint for BitConverter.TrailingZeroCount * Fix build failure in net472 debug AdvSimd Utf16Utility (#39652) Co-authored-by: Carlos Sanchez Lopez <1175054+carlossanlop@users.noreply.github.com>
Thanks @carlossanlop for writing these benchmarks. Few points looking at the benchmark code:
For SSE2, same observation as earlier. We see some improvement for few benchmarks but other remains stable (they don't regress). However, for AdvSimd, we don't see any benchmark getting faster but in contrast, we regress some benchmarks. With that I am not sure if intrinsifying those methods with ARM64 intrinsics is worth doing. @TamarChristinaArm , @BruceForstall - thoughts? |
@kunalspathak Part of it is that the code for Arm64 seems to be a literal translation of the SSE one. This introduces a couple of inefficiencies. A couple of examples (without taking a look at the algorithm itself but just the intrinsics usage):
This has several problems, runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs Line 243 in 6072e4d
SIMD side to do the popcount and then moves it again back to the GENREG after a very expensive ADDV .
So you hit a reduction in every iteration. As far as I can tell But looking at it, the
I don't believe the JIT does any hoisting does it?
The problem here is that you have done more operations than needed if you don't enter the |
I haven't looked at the |
Thanks for the great feedback @TamarChristinaArm !
That's a good observation. I didn't realize that
Part of "logical right shift of 7" is done inside
The latest code doesn't use static variable. Not because JIT doesn't optimize it but some other failures that we see in code that consumes the file for .netstandard. JIT does optimize I think overall that's a good feedback that @carlossanlop should consider looking into to eliminate the regressions. |
@kunalspathak well it does an arithmetic right shift so the values in each lane are either So what I mean is, using the helper function makes 2 of the 3 cases it's being used do more work than they have to. Only 1 case actually uses the positional information. |
I see what you mean. Thanks for the clarification. So with your earlier suggestion, we can just use
Yes, I also now see your argument about the case where we just check for |
Yup, you probably want to reduce the value so that you get a bigger range. If you do the add on the value after the shift, then it's a byte and you'd overflow after 255 non-ascii characters in the same position. You can use Now on architectures that support it you have a quick way to go from
|
@carlossanlop - any update on the perf regression after intrinsifying the methods? FYI - @JulieLeeMSFT , @jeffhandley |
@kunalspathak @carlossanlop Which reminds me.. Wilco (the one who does the work on Arm optimized's string routine) mentioned that the [MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
{
Debug.Assert(AdvSimd.Arm64.IsSupported);
Vector128<byte> mostSignificantBitIsSet = AdvSimd.ShiftRightArithmetic(value.AsSByte(), 7).AsByte();
Vector128<byte> extractedBits = AdvSimd.And(mostSignificantBitIsSet, bitMask128);
// self-pairwise add until all flags have moved to the first two bytes of the vector
extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
return extractedBits.AsUInt16().ToScalar();
} Can be made much more efficient by using a This mask has several advantages in that it can be created without needing a literal pool. it's endian agnostic and also only requires a single it works by essentially doing uint16_t elem = 0xF00F;
uint16x8_t mask = vdupq_n_u16 (elem);
uint8x16_t vmask = vreinterpretq_u8_u16 (mask); And you The mask gives you alternating sequences of after a single So I think it's best to separate the |
@carlossanlop - It will be good to submit a PR (after you address the feedback I have given here) to |
@kunalspathak here's a followup: I found that the @pgovind wrote this benchmark class in which he tests this method against several large files with different encodings. I decided to test the performance of my code, making sure to revert all of @pgovind 's ARM changes so there was no overlap/noise (his ARM improvements are closely related to mine). There were no slower tests, but not the same number of improvements as in Sse2. I also made sure to run all the rest of the benchmark methods in that class. These are the results: ToCharArraySse2 Off vs Sse2 On (x64 PC)
AdvSimd off vs AdvSimd on (Surface Pro X)
All benchmarksSse2 Off vs Sse2 On (x64 PC)
AdvSimd off vs AdvSimd on (Surface Pro X)
Some more details on how the new code is reached: The EnglishMostlyAscii.txt file gets to reach these breakpoints:
The Greek.txt file is very similar:
|
@carlossanlop thanks for sharing full results! the improvements are impressive! |
Thanks @carlossanlop for gathering the numbers. As we discussed offline, the benchmark code touches the methods you optimized in #39041 and #38653. Update: The numbers look great and are expected from the work in #39041 and #38653. It will be interesting to see the benchmark numbers for this PR because this is the one has longer implementation to |
Please take a look at this PR in the performance repo with a new benchmark I added to test |
Contributes to #35035
Adds AdvSimd support for
System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar()
inside the file
runtime\src\libraries\System.Private.CoreLib\src\System\Text\Unicode\Utf16Utility.Validation.cs
I've been having difficulties testing this in my ARM device so I want to analyze the CI results.
I manually executed an additional "Libraries Test Run" pipeline to ensure arm64 is run in all platforms.