Review the multi-op instruction usage for Arm64 #68028

Open
20 of 29 tasks
Tracked by #94464
tannergooding opened this issue Apr 14, 2022 · 17 comments
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:3 Work that is nice to have

@tannergooding
Member

tannergooding commented Apr 14, 2022

After seeing the msub PR (#66621), which folds mul, sub into a single msub, I went and looked in the Arm64 manual for other interesting "combined operation" instructions.

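There is a set of (shifted register) instructions which can combine a shift, op (the variants ending with s also set flags):

  • add, adds - Add
  • sub, subs - Subtract
  • cmp - Compare
  • cmn - Compare Negative
  • neg, negs - Negate
  • and, ands - Bitwise AND
  • bic, bics - Bitwise bit clear
  • eon - Bitwise exclusive OR NOT
  • eor - Bitwise exclusive OR
  • orr - Bitwise inclusive OR
  • mvn - Bitwise NOT
  • orn - Bitwise inclusive OR NOT
  • tst - Test Bits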

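To illustrate, the hypothetical C functions below show source shapes that each map onto a single shifted-register instruction. The Arm64 sequences in the comments are the ideal single-instruction forms, not captured JIT or compiler output.

```c
#include <stdint.h>

/* a + (b << 3): ideally "add x0, x0, x1, lsl #3". */
uint64_t add_lsl3(uint64_t a, uint64_t b) {
    return a + (b << 3);
}

/* a - (b >> 5) with a logical shift: ideally "sub x0, x0, x1, lsr #5". */
uint64_t sub_lsr5(uint64_t a, uint64_t b) {
    return a - (b >> 5);
}

/* a & (b >> 4): ideally "and x0, x0, x1, lsr #4". */
uint64_t and_lsr4(uint64_t a, uint64_t b) {
    return a & (b >> 4);
}

/* Logical ops with an inverted second operand (these also accept a shift). */
uint64_t and_not(uint64_t a, uint64_t b) { return a & ~b; } /* bic x0, x0, x1 */
uint64_t or_not(uint64_t a, uint64_t b)  { return a | ~b; } /* orn x0, x0, x1 */
uint64_t xor_not(uint64_t a, uint64_t b) { return a ^ ~b; } /* eon x0, x0, x1 */
```

The shift amount is an immediate in these encodings, so only shifts by a constant are candidates for folding.
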
There is a set of (extended register) instructions which can combine a zero-extend, op or sign-extend, op:

  • add, adds - Add
  • sub, subs - Subtract
  • cmp - Compare
  • cmn - Compare Negative

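A similar hypothetical sketch for the extended-register forms: the zero- or sign-extension (plus an optional left shift of 0 to 4) is applied to the second operand inside the instruction. The Arm64 forms in the comments are ideal targets, not observed output.

```c
#include <stdint.h>

/* 64-bit add of a sign-extended 32-bit value: ideally "add x0, x0, w1, sxtw". */
int64_t add_sxtw(int64_t a, int32_t b) {
    return a + (int64_t)b;
}

/* 64-bit add of a zero-extended byte: ideally "add x0, x0, w1, uxtb". */
uint64_t add_uxtb(uint64_t a, uint8_t b) {
    return a + (uint64_t)b;
}

/* Compare against a sign-extended halfword: ideally "cmp x0, w1, sxth". */
int less_than_sxth(int64_t a, int16_t b) {
    return a < (int64_t)b;
}

/* Extend then shift: ideally "add x0, x0, w1, uxtw #2". */
uint64_t add_uxtw_lsl2(uint64_t a, uint32_t b) {
    return a + ((uint64_t)b << 2);
}
```

In C# terms these correspond to mixing long with int, short, or byte operands, where a separate extend instruction would otherwise be needed.
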
There is a set of (carry) instructions which can utilize the carry from a previous operation:

  • adc, adcs - Add with carry
  • sbc, sbcs - Subtract with carry
  • ngc, ngcs - Negate with carry

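A hypothetical C sketch of what the carry instructions compute: a 128-bit negation (a negs; ngc pair) and a multi-limb addition whose loop body corresponds to one adcs per limb in a fully unrolled sequence. NEGS sets C only when the low half is zero (no borrow), and NGC computes ~hi + C, which is what neg128 spells out. The struct and function names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } u128_parts; /* hypothetical two-limb value */

/* 128-bit two's-complement negation: ideally "negs x0, x0 ; ngc x1, x1". */
u128_parts neg128(u128_parts x) {
    u128_parts r;
    r.lo = 0 - x.lo;
    r.hi = ~x.hi + (x.lo == 0 ? 1u : 0u); /* carry is set only if lo had no borrow */
    return r;
}

/* Adds n limbs (least significant first) and returns the final carry.
   Each iteration is one "adcs" in an unrolled sequence (the first can be "adds"). */
uint64_t add_limbs(uint64_t *dst, const uint64_t *a, const uint64_t *b, size_t n) {
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;     /* carry out of a[i] + carry */
        dst[i] = s + b[i];
        uint64_t c2 = dst[i] < s;    /* carry out of s + b[i] */
        carry = c1 | c2;
    }
    return carry;
}
```
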
There are the multiply integer instructions which can combine an op, mul:

  • madd - Multiply-add
  • msub - Multiply-subtract
  • mneg - Multiply-negate

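The corresponding source shapes, sketched in C with unsigned types so that overflow wraps as it does in hardware (and in unchecked C#). The function names are hypothetical and the comments give the ideal Arm64 forms.

```c
#include <stdint.h>

/* a + b * c: ideally "madd x0, x1, x2, x0". */
uint64_t mul_add(uint64_t a, uint64_t b, uint64_t c) {
    return a + b * c;
}

/* a - b * c: ideally "msub x0, x1, x2, x0". */
uint64_t mul_sub(uint64_t a, uint64_t b, uint64_t c) {
    return a - b * c;
}

/* -(b * c): ideally "mneg x0, x0, x1". */
uint64_t mul_neg(uint64_t b, uint64_t c) {
    return -(b * c);
}
```
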
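There are then some long multiply instructions which can return a product twice the size of the inputs (i32 * i32 = i64 or similar; effectively covering a zero- or sign-extend followed by a multiply):

  • smull, umull - Multiply long
  • smaddl, umaddl - Multiply-add long
  • smsubl, umsubl - Multiply-subtract long
  • smnegl, umnegl - Multiply-negate long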

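A hypothetical sketch of the long-multiply shapes. The key pattern is that both 32-bit operands are widened before the multiply, so the extensions can be absorbed into a single instruction; the Arm64 forms in the comments are ideal targets.

```c
#include <stdint.h>

/* Both operands widened before the multiply: ideally "smull x0, w0, w1". */
int64_t mul_long_s(int32_t a, int32_t b) {
    return (int64_t)a * (int64_t)b;
}

/* Unsigned variant: ideally "umull x0, w0, w1". */
uint64_t mul_long_u(uint32_t a, uint32_t b) {
    return (uint64_t)a * (uint64_t)b;
}

/* Accumulate: ideally "smaddl x0, w1, w2, x0". */
int64_t mul_add_long_s(int64_t acc, int32_t a, int32_t b) {
    return acc + (int64_t)a * (int64_t)b;
}

/* Subtract from an accumulator: ideally "umsubl x0, w1, w2, x0". */
uint64_t mul_sub_long_u(uint64_t acc, uint32_t a, uint32_t b) {
    return acc - (uint64_t)a * (uint64_t)b;
}

/* Negated product: ideally "smnegl x0, w0, w1". */
int64_t mul_neg_long_s(int32_t a, int32_t b) {
    return -((int64_t)a * (int64_t)b);
}
```
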
Finally, there are the "multiply high" instructions, which return just the upper bits of a wide multiply:

  • smulh, umulh - Multiply High

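What umulh and smulh compute can be written portably using 32-bit halves, which also shows how much work the single instruction replaces. This is a sketch with hypothetical names; on GCC/Clang the unsigned case can also be written as (uint64_t)(((unsigned __int128)a * b) >> 64). The signed variant adjusts the unsigned high half and assumes the usual modular uint64_t-to-int64_t conversion (implementation-defined in C, but modular on mainstream compilers).

```c
#include <stdint.h>

/* Upper 64 bits of the unsigned 128-bit product (what umulh returns). */
uint64_t umulh_portable(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Middle column; its maximum is 2^64 - 1, so it cannot overflow. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Upper 64 bits of the signed 128-bit product (what smulh returns). */
int64_t smulh_portable(int64_t a, int64_t b) {
    uint64_t hi = umulh_portable((uint64_t)a, (uint64_t)b);
    if (a < 0) hi -= (uint64_t)b; /* correct for a's sign bit being weighted -2^63 */
    if (b < 0) hi -= (uint64_t)a; /* same correction for b */
    return (int64_t)hi;
}
```

A common source of this pattern is division by a constant, which is typically lowered to a multiply-high by a "magic" reciprocal.
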
There may be other interesting instructions as well, but these are the ones that may have broader usage/application and which would likely be good to validate we are covering.

category:implementation
theme:intrinsics

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Apr 14, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@tannergooding tannergooding added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 14, 2022
@ghost

ghost commented Apr 14, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

After seeing the msub PR (#66621) which folds mul, sub into a single msub I went and looked in the Arm64 manual for other interesting "combined operation" instructions.

There is a set of (shifted register) instructions which can combine a shift, op (the variants ending with s also set flags):

  • add, adds - Add
  • sub, subs - Subtract
  • cmp - Compare
  • cmn - Compare Negative
  • neg, negs - Negate
  • and, ands - Bitwise AND
  • bic, bics - Bitwise bit clear
  • eon - Bitwise exclusive OR NOT
  • eor - Bitwise exclusive OR
  • orr - Bitwise inclusive OR
  • mvn - Bitwise NOT
  • orn - Bitwise inclusive OR NOT
  • tst - Test Bits

There is a set of (extended register) instructions which can combine a zero-extend, op or sign-extend, op:

  • add, adds - Add
  • sub, subs - Subtract
  • cmp - Compare
  • cmn - Compare Negative

There is a set of (carry) instructions which can utilize the carry from a previous operation:

  • adc, adcs - Add with carry
  • sbc, sbcs - Subtract with carry
  • ngc, ngcs - Negate with carry

There are the multiply integer instructions which can combine an op, mul:

  • madd - Multiply-add
  • msub - Multiply-subtract
  • mneg - Multiply-negate

There are then some long multiply instructions which can return a product twice the size of the inputs (i32 * i32 = i64 or similar):

  • smull, umull - Multiply long
  • smaddl, umaddl - Multiply-add long
  • smsubl, umsubl - Multiply-subtract long
  • smnegl, umnegl - Multiply-negate long

Finally, there are the "multiply high" instructions, which return just the upper bits of a wide multiply:

  • smulh, umulh - Multiply High

There may be other interesting instructions as well, but these are the ones that may have broader usage/application and which would likely be good to validate we are covering.

Author: tannergooding
Assignees: -
Labels:

arch-arm64, area-CodeGen-coreclr, untriaged

Milestone: -

@TIHan
Contributor

TIHan commented Apr 14, 2022

These look like really good optimization opportunities.

@JulieLeeMSFT
Member

Thanks @tannergooding for compiling all these cases. Let me put this in as a future item.

@a74nh
Contributor

a74nh commented Nov 18, 2022

@tannergooding and @kunalspathak: is anyone looking at the remaining instructions? If not, then this could be a good set of items for @SwapnilGaikwad to work through.

While doing this, it's probably worth adding some DisasmCheck tests for them too.

@kunalspathak
Member

@TIHan - are you planning to work on any of these?

@kunalspathak
Member

> this could be a good set of items for @SwapnilGaikwad to work through.

Sounds good to me.

@a74nh
Contributor

a74nh commented Dec 19, 2022

MNEG support: #79550

@kunalspathak
Member

@a74nh - Are there more items that you or @SwapnilGaikwad will be working on?

@SwapnilGaikwad
Contributor

Hi @kunalspathak, we plan to look at the extended register instruction combinations next (add, adds, sub, subs, cmp and cmn).

@kunalspathak
Member

> Hi @kunalspathak, we plan to look at the extended register instruction combinations next (add, adds, sub, subs, cmp and cmn).

Sounds good.

@kunalspathak
Member

@TIHan - Could you please update the issue description with the instructions you are planning to work on? It will help @SwapnilGaikwad pick other instructions to optimize.

@TIHan TIHan modified the milestones: 8.0.0, 9.0.0 Jul 17, 2023
@TIHan TIHan added the Priority:3 Work that is nice to have label May 6, 2024
@TIHan TIHan modified the milestones: 9.0.0, Future Jul 25, 2024
@kunalspathak kunalspathak assigned a74nh and unassigned TIHan Nov 7, 2024
jonathandavies-arm added a commit to jonathandavies-arm/runtime that referenced this issue Jan 27, 2025
* Contributes towards dotnet#68028

Change-Id: Ifad031a92270668bb3970fb1af817fda05851116
jonathandavies-arm added a commit to jonathandavies-arm/runtime that referenced this issue Jan 28, 2025
@snickolls-arm
Contributor

We should be able to tick off ADD/ADDS/SUB/SUBS (extended register) as they were included in #76273. The JIT is generating these correctly for additions with byte and short.

Looking at the (carry) set, it's not obvious which patterns they might be useful for, as it's rare to need a carry beyond 64 bits. They would be useful for Int128 acceleration, but this may be better as its own issue, as it has a wider scope than just the JIT.

public static Int128 operator +(Int128 left, Int128 right)

jonathandavies-arm added a commit to jonathandavies-arm/runtime that referenced this issue Jan 28, 2025
kunalspathak pushed a commit that referenced this issue Jan 29, 2025
@a74nh
Contributor

a74nh commented Jan 31, 2025

> Looking at the (carry) set, it's not obvious which patterns they might be useful for, as it's rare to need a carry beyond 64 bits. They would be useful for Int128 acceleration, but this may be better as its own issue, as it has a wider scope than just the JIT.

Agreed. This issue was intended for general codegen opportunities from the IR. Intrinsifying Int128.cs feels like it falls outside the scope.

For this issue, if there are no opportunities to lower general IR to the carry instructions, then we should consider those complete. (We could lower the Int128 + method to an add with carry, but it would require matching 6+ nodes followed by a heavy graph rewrite, which isn't really practical.)

@tannergooding
Member Author

tannergooding commented Jan 31, 2025

There are several general patterns that can be applicable to generate carry/borrow instructions and while Int128 is a primary case, it also applies to cases like BigInteger or various user defined types.

One common general pattern for an add; add with carry is:

T lower = left._lower + right._lower;
T carry = (lower < left._lower) ? 1 : 0;
T upper = left._upper + right._upper + carry;

and for sub; sub with borrow:

T lower = left._lower - right._lower;
T borrow = (lower > left._lower) ? 1 : 0;
T upper = left._upper - right._upper - borrow;

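The add/add-with-carry and sub/sub-with-borrow patterns above, transcribed into C with a hypothetical two-limb struct standing in for Int128/UInt128. The comments show the ideal Arm64 pairs a JIT would want to recognize them as; they are not measured output.

```c
#include <stdint.h>

typedef struct { uint64_t lower, upper; } u128; /* hypothetical stand-in for Int128 */

/* add; add with carry: ideally "adds x0, x0, x2 ; adc x1, x1, x3". */
u128 add128(u128 left, u128 right) {
    u128 r;
    r.lower = left.lower + right.lower;
    uint64_t carry = (r.lower < left.lower) ? 1 : 0; /* wrapped => carry out */
    r.upper = left.upper + right.upper + carry;
    return r;
}

/* sub; sub with borrow: ideally "subs x0, x0, x2 ; sbc x1, x1, x3". */
u128 sub128(u128 left, u128 right) {
    u128 r;
    r.lower = left.lower - right.lower;
    uint64_t borrow = (r.lower > left.lower) ? 1 : 0; /* wrapped => borrow */
    r.upper = left.upper - right.upper - borrow;
    return r;
}
```
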
Note that this assumes an unsigned comparison when computing the carry/borrow, but it is the correct algorithm for both signed and unsigned integers. The same general pattern also applies to overflow checking, and so would be applicable to add; jump if unsigned overflow, or similar.

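For the overflow-checking case, the carry and borrow tests reduce to a flag-setting add or compare followed by a conditional. A hypothetical sketch, with the ideal Arm64 forms in the comments:

```c
#include <stdbool.h>
#include <stdint.h>

/* Carry out of a + b: ideally "cmn x0, x1 ; cset w0, hs" (or b.hs to jump). */
bool add_carries(uint64_t a, uint64_t b) {
    return a + b < a;
}

/* Borrow out of a - b: ideally "cmp x0, x1 ; cset w0, lo" (or b.lo to jump). */
bool sub_borrows(uint64_t a, uint64_t b) {
    return a - b > a;
}
```
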
The pattern for detecting signed overflow for addition (rather than simply computing whether a carry is needed; that is, add; jump if signed overflow) is a bit more complex, but roughly boils down to:

T result = left + right;

if ((left._upper ^ right._upper) >= 0) // signs of inputs match
{
    if ((result._upper ^ left._upper) < 0) // signs of inputs don't match the output
    {
        // overflow!
    }
}

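This can be simplified down to a single compare as: ((result._upper ^ left._upper) & ~(left._upper ^ right._upper)) < 0. This works because XOR-ing two values gives a sign bit of 0 when their signs match and 1 when they mismatch. Inverting left._upper ^ right._upper therefore sets the sign bit only when the input signs match, and bitwise and-ing that with result._upper ^ left._upper keeps it set only when the result's sign also differs from the inputs. This gives us "false" for anything with mismatched input signs, or with matching signs for the inputs and the result.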

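Signed overflow for subtraction is similar, but checks (leftSign != rightSign) && (leftSign != resultSign), which can likewise be written as a single compare: ((left._upper ^ right._upper) & (left._upper ^ result._upper)) < 0.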

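A hypothetical C sketch of both signed-overflow checks on a single 64-bit word (for Int128 the same logic applies to the _upper halves). Values are passed as raw uint64_t bit patterns so the wrapping add/sub is well-defined C. Testing only the result's sign is not enough for addition: with two negative inputs and a negative result, (result & ~(left ^ right)) < 0 would report a false overflow, so the single-compare form uses result ^ left instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Addition, branchy form mirroring the nested ifs above. */
bool add_overflows_branchy(uint64_t left, uint64_t right) {
    uint64_t result = left + right;              /* wraps, no UB */
    if (((left ^ right) >> 63) == 0) {           /* signs of inputs match */
        if (((result ^ left) >> 63) != 0) {      /* output sign differs */
            return true;
        }
    }
    return false;
}

/* Addition, single compare on the sign bit. */
bool add_overflows(uint64_t left, uint64_t right) {
    uint64_t result = left + right;
    return (((result ^ left) & ~(left ^ right)) >> 63) != 0;
}

/* Subtraction: (leftSign != rightSign) && (leftSign != resultSign). */
bool sub_overflows(uint64_t left, uint64_t right) {
    uint64_t result = left - right;
    return (((left ^ right) & (left ^ result)) >> 63) != 0;
}
```

On Arm64 both conditions are exactly what the V flag reports after adds/subs, so the ideal lowering is the flag-setting instruction followed by b.vs.
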
@a74nh
Contributor

a74nh commented Jan 31, 2025

> There are several general patterns that can be applicable to generate carry/borrow instructions

Agreed with the patterns shown. However, it's still a few nodes that need matching. Do you think it's suitable to do in the lowering phase?

@tannergooding
Member Author

Starting in lowering is probably a fine approach, but we should consider whether such patterns are likely to be broken by things like morph or CSE, given their shape and the higher apparent cost of the more complex IR. It might turn out to be better to recognize and transform them earlier instead, for example into some dedicated node, to help ensure it all flows smoothly.

kunalspathak pushed a commit that referenced this issue Feb 3, 2025
* arm64: Add support for Bitwise XOR NOT

* contributes towards #68028

* arm64: Add support for Bitwise OR NOT

* Update comments & add tests that take ints

* Add swap tests that take an int & uint
snickolls-arm added a commit to snickolls-arm/runtime that referenced this issue Feb 11, 2025
snickolls-arm added a commit to snickolls-arm/runtime that referenced this issue Feb 11, 2025

7 participants