Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bmi2 MultiplyNoFlags2 #44926

Closed
echesakov opened this issue Nov 19, 2020 · 15 comments
Closed

Bmi2 MultiplyNoFlags2 #44926

echesakov opened this issue Nov 19, 2020 · 15 comments
Labels
api-ready-for-review API is ready for review, it is NOT ready for implementation arch-x64 arch-x86 area-System.Runtime.Intrinsics

Comments

@echesakov
Copy link
Contributor

echesakov commented Nov 19, 2020

Background and Motivation

In .NET 3.1 we introduced the following intrinsics to expose mulx x86 instruction:

  • uint MultiplyNoFlags(uint left, uint right, uint* low)
  • ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)

where low is an out parameter that is used to return the lower 32-bit/64-bit part of 64-bit/128-bit result of left * right multiplication while the return value contains the upper part.

When the instrinsics are used the JIT produces sub-optimal code due to the fact that low has "address-taken" attribute.

For example, the following C# methods

static unsafe uint mulx(uint a, uint b)
{
    uint r;
    return Bmi2.MultiplyNoFlags(a, b, &r) + r;
}

static unsafe ulong mulx_64(ulong a, ulong b)
{
    ulong r;
    return Bmi2.X64.MultiplyNoFlags(a, b, &r) + r;
}

will be compiled down to the following code by the current implementation of the JIT
mulx

G_M48748_IG01:              ;; offset=0000H
       50                   push     rax
       33C0                 xor      rax, rax
       89442404             mov      dword ptr [rsp+04H], eax
       89542418             mov      dword ptr [rsp+18H], edx
                                                ;; bbWeight=1    PerfScore 3.25
G_M48748_IG02:              ;; offset=000BH
       488D442404           lea      rax, bword ptr [rsp+04H]
       448B442418           mov      r8d, dword ptr [rsp+18H]
       8BD1                 mov      edx, ecx
       C4C233F6D0           mulx     edx, r9d, r8d
       448908               mov      dword ptr [rax], r9d
       8BC2                 mov      eax, edx
       03442404             add      eax, dword ptr [rsp+04H]
                                                ;; bbWeight=1    PerfScore 7.00
G_M48748_IG03:              ;; offset=0025H
       4883C408             add      rsp, 8
       C3                   ret

mulx_64

G_M55976_IG01:              ;; offset=0000H
       50                   push     rax
       33C0                 xor      rax, rax
       48890424             mov      qword ptr [rsp], rax
       4889542418           mov      qword ptr [rsp+18H], rdx
                                                ;; bbWeight=1    PerfScore 3.25
G_M55976_IG02:              ;; offset=000CH
       488D0424             lea      rax, bword ptr [rsp]
       4C8B442418           mov      r8, qword ptr [rsp+18H]
       488BD1               mov      rdx, rcx
       C4C2B3F6D0           mulx     rdx, r9, r8
       4C8908               mov      qword ptr [rax], r9
       488BC2               mov      rax, rdx
       48030424             add      rax, qword ptr [rsp]
                                                ;; bbWeight=1    PerfScore 7.00
G_M55976_IG03:              ;; offset=0027H
       4883C408             add      rsp, 8
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.25

However, if the Bmi2.MultiplyNoFlags were implemented instead as

public static unsafe uint MultiplyNoFlags(uint left, uint right, uint* low)
{
    var result = MultiplyNoFlags2(left, right); *low = result.Item1; return result.Item2;
}

public static unsafe ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)
{
    var result = MultiplyNoFlags2(left, right); *low = result.Item1; return result.Item2;
}

the JIT as in #37928 would inline MultiplyNoFlags and be able to remove the address-taken attribute from a local corresponding to low:
mulx

G_M48748_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M48748_IG02:              ;; offset=0000H
       C4E27BF6D1           mulx     edx, eax, ecx
       03C2                 add      eax, edx
                                                ;; bbWeight=1    PerfScore 3.25
G_M48748_IG03:              ;; offset=0007H
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00

mulx_64

G_M55976_IG01:              ;; offset=0000H
                                                ;; bbWeight=1    PerfScore 0.00
G_M55976_IG02:              ;; offset=0000H
       C4E2FBF6D1           mulx     rdx, rax, rcx
       4803C2               add      rax, rdx
                                                ;; bbWeight=1    PerfScore 3.25
G_M55976_IG03:              ;; offset=0008H
       C3                   ret
                                                ;; bbWeight=1    PerfScore 1.00

Proposed API

namespace System.Runtime.Intrinsics.X86
{
    public abstract class Bmi2 : X86Base
    {
        public static (uint Lower, uint Upper) MultiplyNoFlags2(uint left, uint right);

        public abstract class X64: X86Base.X64
        {
            public static (ulong Lower, ulong Upper) MultiplyNoFlags2(ulong left, ulong right);
        }
    }
}

Based on work Carol did in #37928

cc @CarolEidt @tannergooding

@ghost
Copy link

ghost commented Nov 19, 2020

Tagging subscribers to this area: @tannergooding, @jeffhandley
See info in area-owners.md if you want to be subscribed.

Issue Details
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Bmi2 : X86Base
    {
        public static (uint, uint) MultiplyNoFlags2(uint left, uint right);

        public abstract class X64: X86Base.X64
        {
            public static (ulong, ulong) MultiplyNoFlags2(ulong left, ulong right);
        }
    }
}

Based on work Carol did in #37928

cc @CarolEidt @tannergooding

Author: echesakovMSFT
Assignees: -
Labels:

arch-x64, arch-x86, area-System.Runtime.Intrinsics

Milestone: -

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Nov 19, 2020
@echesakov echesakov added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels Nov 19, 2020
@jkotas
Copy link
Member

jkotas commented Nov 19, 2020

It this a workaround for JIT not being able to optimize ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)?

@jkotas
Copy link
Member

jkotas commented Nov 19, 2020

It would be useful to follow the API review template for the top post. Background and Motivation for this one is not obvious.

@CarolEidt
Copy link
Contributor

It this a workaround for JIT not being able to optimize ulong MultiplyNoFlags(ulong left, ulong right, ulong* low)?

Yes. Currently the only way the JIT can handle a node that produces multiple registers is if it produces a struct. An alternative to explicitly adding an API that returns a struct (ValueTuple) would be to synthesize such a struct in the JIT. However, that is not something the JIT currently can do.

Note that, by adding this API and having ulong MultiplyNoFlags(ulong left, ulong right, ulong* low) call it, we can generate identical code for it.

@GrabYourPitchforks
Copy link
Member

Should consider giving names to the returned tuple elements, e.g., (int Hi, int Lo).

@tannergooding
Copy link
Member

I agree, but prefer Upper and Lower, which we use in most of the other intrinsic APIs.

@echesakov
Copy link
Contributor Author

I updated the description to include Background and Motivation as Jan suggested and added the names to the tuple elements.

@jkotas
Copy link
Member

jkotas commented Nov 20, 2020

Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.

@echesakov
Copy link
Contributor Author

Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.

From my understanding, it would be quite tricky. The JIT doesn't have any mechanism at the moment to prove that an address is not leaking when it's passed to an intrinsic for the same reasons as when it's passed to a function that was not inlined.

@CarolEidt
Copy link
Contributor

Is it that hard to remove the address taken attribute in the example? I would think that it should be pretty easy for the JIT to prove that the address does not leak anywhere.

From my understanding, it would be quite tricky. The JIT doesn't have any mechanism at the moment to prove that an address is not leaking when it's passed to an intrinsic for the same reasons as when it's passed to a function that was not inlined.

Also, it is not sufficient to prove that the address doesn't leak. It's also problematic to transform the IR into something that allows the backend to put both results in a register, when the IR has it as a reference. Really, in order to do that, by the time it gets to the backend it needs to be a single instruction that defines two registers. And for now, the best way (really, the only way without major changes) is to have it produce a two-field struct.

@jkotas
Copy link
Member

jkotas commented Nov 20, 2020

Yeah, I figured that the real problem is the representation and not just the address taken that the description at the top highlights.

@CarolEidt
Copy link
Contributor

I figured that the real problem is the representation and not just the address taken

Well, they're both real problems. It would be somewhat easier to transform just the backend to accommodate this (though still requiring major changes), but by that point we would already forced the variable to live on the stack.

@pentp
Copy link
Contributor

pentp commented Nov 23, 2020

I propose naming the new methods Bmi2.Multiply instead of MultiplyNoFlags2 - the flags/no-flags effect isn't observable in any way (at least not until flags returning methods are added, but these would be new methods then).

@terrajobst
Copy link
Member

terrajobst commented Nov 24, 2020

Video

  • It seems there are two separate issues:
    1. Performance issue, which we can solve with a private intrinsic that uses tuples.
    2. Usability issue due to use of pointers
  • We should fix (1) which doesn't require new public APIs but we should also do a pass of all the APIs that we believe need out or tuple overloads
namespace System.Runtime.Intrinsics.X86
{
    public abstract class Bmi2 : X86Base
    {
        public static (uint Lower, uint Upper) MultiplyNoFlags2(uint left, uint right);

        public abstract class X64: X86Base.X64
        {
            public static (ulong Lower, ulong Upper) MultiplyNoFlags2(ulong left, ulong right);
        }
    }
}

@stephentoub
Copy link
Member

@terrajobst, did you mean to close this? Is there a separate issue tracking "we should also do a pass of all the APIs that we believe need out or tuple overloads"? Or for that matter a separate issue tracking doing the perf fix for "Performance issue, which we can solve with a private intrinsic that uses tuples."?

@ghost ghost locked as resolved and limited conversation to collaborators Dec 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-ready-for-review API is ready for review, it is NOT ready for implementation arch-x64 arch-x86 area-System.Runtime.Intrinsics
Projects
None yet
Development

No branches or pull requests

9 participants