Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64: Use VectorTableLookup/VectorTableLookupExtension APIs in libraries #84328

Closed
Tracked by #94464
kunalspathak opened this issue Apr 4, 2023 · 10 comments
Closed
Tracked by #94464
Labels
area-System.Runtime help wanted [up-for-grabs] Good issue for external contributors
Milestone

Comments

@kunalspathak
Copy link
Member

#80297 added support for Arm64's VectorTableLookup and VectorTableLookupExtension APIs. We should start using them in libraries methods.

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 4, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Apr 4, 2023
@kunalspathak
Copy link
Member Author

@a74nh @SwapnilGaikwad

@ghost
Copy link

ghost commented Apr 5, 2023

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

#80297 added support for Arm64's VectorTableLookup and VectorTableLookupExtension APIs. We should start using them in libraries methods.

Author: kunalspathak
Assignees: -
Labels:

area-System.Runtime, untriaged, needs-area-label

Milestone: -

@vcsjones vcsjones removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Apr 5, 2023
@a74nh
Copy link
Contributor

a74nh commented Apr 11, 2023

Looking through the libraries the current uses of shuffle/vectortablelookup can be improved:

  • System.Private.CoreLib/src/System/IndexOfAnyValues/ProbabilisticMap.cs

  • System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAnyAsciiSearcher.cs

  • System.Private.CoreLib/src/System/Buffers/Text/Utf8Formatter/Utf8Formatter.Guid.cs

  • System.Private.CoreLib/src/System/Buffers/Binary/BinaryPrimitives.ReverseEndianness.cs
    System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
    System.Private.CoreLib/src/System/SpanHelpers.cs
    Common/src/System/HexConverter.cs
    System.IO.Hashing/src/System/IO/Hashing/XxHashShared.cs

    • Technically out of scope of this issue, but these are all using hardcoded index values, so can switch from shuffle to shuffleunsafe.
  • EncodeToUtf8.cs
    DecodeFromUtf8.cs

    • Once LD4 and ST3 available, then they can be rewritten to use to vectortablelookup2.

There may be additional things, but these were the obvious ones.

@MihaZupan
Copy link
Member

System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAnyAsciiSearcher.cs

  • Can combine shuffles, but only the instances where there are two bitmaps in a single function.

Do you mean

Vector128<byte> row0 = Vector128.ShuffleUnsafe(bitmapLookup0, lowNibbles);
Vector128<byte> row1 = Vector128.ShuffleUnsafe(bitmapLookup1, lowNibbles);
Vector128<byte> bitmask = Vector128.ShuffleUnsafe(Vector128.Create(0x8040201008040201).AsByte(), highNibbles);
Vector128<byte> mask = Vector128.GreaterThan(highNibbles.AsSByte(), Vector128.Create((sbyte)0x7)).AsByte();
Vector128<byte> bitsets = Vector128.ConditionalSelect(mask, row1, row0);
Vector128<byte> result = Vector128.Equals(bitsets & bitmask, bitmask);

?
Which operations would you combine here?

@kunalspathak kunalspathak added this to the 8.0.0 milestone Apr 11, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label Apr 11, 2023
@kunalspathak
Copy link
Member Author

@JulieLeeMSFT - FYI.

@a74nh
Copy link
Contributor

a74nh commented Apr 12, 2023

Do you mean

Vector128<byte> row0 = Vector128.ShuffleUnsafe(bitmapLookup0, lowNibbles);
Vector128<byte> row1 = Vector128.ShuffleUnsafe(bitmapLookup1, lowNibbles);
Vector128<byte> bitmask = Vector128.ShuffleUnsafe(Vector128.Create(0x8040201008040201).AsByte(), highNibbles);
Vector128<byte> mask = Vector128.GreaterThan(highNibbles.AsSByte(), Vector128.Create((sbyte)0x7)).AsByte();
Vector128<byte> bitsets = Vector128.ConditionalSelect(mask, row1, row0);
Vector128<byte> result = Vector128.Equals(bitsets & bitmask, bitmask);

?
Which operations would you combine here?

Unless there's something I'm missing, bitsets can be created just via:
Vector128 bitsets = AdvSimd.VectorTableLookup2(bitmapLookup0, bitmapLookup1, lowNibbles);

@SwapnilGaikwad
Copy link
Contributor

System.Private.CoreLib/src/System/Buffers/Binary/BinaryPrimitives.ReverseEndianness.cs
System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
System.Private.CoreLib/src/System/SpanHelpers.cs
Common/src/System/HexConverter.cs
System.IO.Hashing/src/System/IO/Hashing/XxHashShared.cs

* Technically out of scope of this issue, but these are all using hardcoded index values, so can switch from shuffle to shuffleunsafe.

For these cases, changing from Shuffle to ShuffleUnsafe would not give any performance improvements.
When the indices are hardcoded, they get optimised during the importation phase and avoid the bounds check logic. Therefore, we get the same JIT-ed assembly sequence.

Even when the indices are out of the bounds of the vector, if the positions vector is hard-coded then the overhead of bounds-check is avoided with the shuffle.
In the following example, the index for the first position is 25 so with Vector128.Shuffle we expect value 0 in dest[0] but with Ssse3.Shuffle we expect value src[9] (25%16=9).

Vector128.Shuffle(sourceVec, Vector128.Create((byte)25, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));

Instead of getting a bounds check, after the importation phase, we see an updated index FF for the first position that would set value 0 in dest[0].
With Vector128.ShuffleUnsafe we get index 19 instead.

; Assembly listing for method System.Text.Tests.ShuffleUnsafeTest:ReverseWithShuffle(byref,ulong)
; Emitting BLENDED_CODE for X64 CPU with AVX512 - Unix
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )   byref  ->  rdi         single-def
;* V01 arg1         [V01    ] (  0,  0   )    long  ->  zero-ref    single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V03 tmp1         [V03,T01] (  2,  4   )  simd16  ->  mm0         "Spilling op1 side effects for HWIntrinsic"
;
; Lcl frame size = 0

G_M46485_IG01:  ;; offset=0000H
       vzeroupper
                        ;; size=3 bbWeight=1 PerfScore 1.00
G_M46485_IG02:  ;; offset=0003H
       vmovups  xmm0, xmmword ptr [rdi]
       vpshufb  xmm0, xmm0, xmmword ptr [reloc @RWD00]
       vmovups  xmmword ptr [rdi], xmm0
                        ;; size=17 bbWeight=1 PerfScore 8.00
G_M46485_IG03:  ;; offset=0014H
       ret
                        ;; size=1 bbWeight=1 PerfScore 1.00
RWD00   dq  08090A0B0C0D0EFFh, 0001020304050607h


Assembly for `Vector128.ShuffleUnsafe(sourceVec, Vector128.Create((byte)25, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0));`.
; Assembly listing for method System.Text.Tests.ShuffleUnsafeTest:ReverseWithShuffleUnsafe(byref,ulong)
; Emitting BLENDED_CODE for X64 CPU with AVX512 - Unix
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 0 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )   byref  ->  rdi         single-def
;* V01 arg1         [V01    ] (  0,  0   )    long  ->  zero-ref    single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V03 tmp1         [V03,T01] (  2,  4   )  simd16  ->  mm0         "Spilling op1 side effects for HWIntrinsic"
;* V04 tmp2         [V04    ] (  0,  0   )  simd16  ->  zero-ref    "Inlining Arg"
;
; Lcl frame size = 0

G_M34751_IG01:  ;; offset=0000H
       vzeroupper
                        ;; size=3 bbWeight=1 PerfScore 1.00
G_M34751_IG02:  ;; offset=0003H
       vmovups  xmm0, xmmword ptr [rdi]
       vpshufb  xmm0, xmm0, xmmword ptr [reloc @RWD00]
       vmovups  xmmword ptr [rdi], xmm0
                        ;; size=17 bbWeight=1 PerfScore 8.00
G_M34751_IG03:  ;; offset=0014H
       ret
                        ;; size=1 bbWeight=1 PerfScore 1.00
RWD00   dq  08090A0B0C0D0E19h, 0001020304050607h


; Total bytes of code 21, prolog size 3, PerfScore 12.10, instruction count 5, allocated bytes for code 21 (MethodHash=64be7840) for method System.Text.Tests.ShuffleUnsafeTest:ReverseWithShuffleUnsafe(byref,ulong)
; ============================================================

@a74nh
Copy link
Contributor

a74nh commented May 5, 2023

For these cases, changing from Shuffle to ShuffleUnsafe would not give any performance improvements.

Thanks for the analysis. Makes sense the checks would be optimised away.
There's a very minor case for still making the change this as it would make everything use the same API function, but that's just code churn.
I'll mark this as done on the list above.

@tannergooding
Copy link
Member

This is a "do anytime" issue.

@kunalspathak
Copy link
Member Author

This was completed in #87126. If there are more instances in BCL that can benefit from VectorTableLookup, we can have a separate issue for them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Runtime help wanted [up-for-grabs] Good issue for external contributors
Projects
None yet
Development

No branches or pull requests

7 participants