Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[mono] MiniJIT OP_XEQUAL for floats #77770

Merged
merged 7 commits into from
Nov 11, 2022
Merged

[mono] MiniJIT OP_XEQUAL for floats #77770

merged 7 commits into from
Nov 11, 2022

Conversation

matouskozak
Copy link
Member

@matouskozak matouskozak commented Nov 2, 2022

Floating points NaN must be treated differently. Adding check to MiniJIT for OP_XEQUAL to use floating point comparison.

Before:
Emitting pcmpeqd for every input type.

Now:
Emitting cmpeqps and cmpeqpd for doubles and floats, respectively. For integer inputs emitting pcmpeqd as before.

Fixing : #74781

@ghost ghost assigned matouskozak Nov 2, 2022
@matouskozak matouskozak added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Nov 2, 2022
@matouskozak matouskozak removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Nov 2, 2022
@matouskozak
Copy link
Member Author

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@tannergooding
Copy link
Member

For the integer case pcmpeqdonly works for == because we’re effectively saying “all bits must be equal” and we return a single bool. However, for cases like Equals where we return a per element mask, we have to use the correct of pcmpeqb/w/d/q. The same is true for other comparisons like GreaterThan (must use the correct b/w/d/q instruction).

Keeping == and EqualsAll also using the most accurate instruction is then better as it is more consistent, makes the disassembly more readable/understandable, and allows more optimization opportunities for common patterns in SIMD algorithms. It also makes it conceptually simple as it’s logically Equals, MoveMask, cmp and the general pattern can play into any Sse41.TestZ optimizations

@filipnavara
Copy link
Member

filipnavara commented Nov 2, 2022

the integer case pcmpeqdonly works for == because we’re effectively saying “all bits must be equal” and we return a single bool. However, for cases like Equals where we return a per element mask, we have to use the correct of pcmpeqb/w/d/q. The same is true for other comparisons like GreaterThan (must use the correct b/w/d/q instruction).

That was originally introduced by me in 1eab0fe. It mimicked what Mono/LLVM was producing at that time. There were some instructions where the B/W/D/Q variants were not introduced in the same CPU extension set. In this specific case pcmpeqb/w/d are SSE2, while pcmpeqq is SSE4.1. The baseline is SSE2 and anything else is lowered to the baseline based on CPU capabilities. It's fine to emit pcmpeqb/w/d here but pcmpeqq would have to be special cased. Since the specific opcode is EQUALS only, I opted to just emit simple instruction that works on the CPU baseline. The implementation of OP_XCOMPARE uses correctly sized instructions and lowers quad-word comparisons too.

@tannergooding
Copy link
Member

The baseline is SSE2 and anything else is lowered to the baseline based on CPU capabilities. It's fine to emit pcmpeqb/w/d here but pcmpeqq would have to be special cased. Since the specific opcode is EQUALS only, I opted to just emit simple instruction that works on the CPU baseline. The implementation of OP_XCOMPARE uses correctly sized instructions and lowers quad-word comparisons too.

Yes, there are many cases of the xplat functionality where you have to special case and decide on an alternative for SSE3-SSE4.2 specific functionality. That is normal and expected of the compiler handling these APIs.

I called this out because while it "works" and is valid today, it leaves Mono as being suboptimal compared to RyuJIT and that will negatively impact several scenarios and provide an overall worse user experience for the .NET community when they need to use Mono, so it is something that should be eventually fixed and handled.

@filipnavara
Copy link
Member

filipnavara commented Nov 2, 2022

Yes, there are many cases of the xplat functionality where you have to special case and decide on an alternative for SSE3-SSE4.2 specific functionality. That is normal and expected of the compiler handling these APIs.

I am fine with emitting pcmpeqb/w/d here. Just mentioning that for U8 we either need to emit pcmpeqq conditionally and fallback to pcmpeqd, or simply use pcmpeqd every time. Either way is fine with me, there just needs to be extra care not to emit unsupported opcodes. Avoiding the extra complexity was the rationale for how it was originally implemented.

I called this out because while it "works" and is valid today, it leaves Mono as being suboptimal compared to RyuJIT and that will negatively impact several scenarios and provide an overall worse user experience for the .NET community when they need to use Mono, so it is something that should be eventually fixed and handled.

The Mini JIT is already a suboptimal case. It has notoriously bad register allocation for the vector use case (although it's still much better than a full software fallback). The main reason to maintain it is to keep the same feature set between the Mini JIT, LLVM JIT and LLVM AOT. Otherwise you run into cases where various .IsSupported properties behave inconsistently.

@matouskozak matouskozak added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Nov 3, 2022
@matouskozak
Copy link
Member Author

matouskozak commented Nov 3, 2022

Potential improvent for emit_not_xequal when using MiniJIT:
It would be better to use setne instead of two set instructions. Example:

Current behavior:

  pcmpeqd %xmm1,%xmm0
  pmovmskb %xmm0,%eax
  cmp    $0xffff,%rax
  sete   %al
  movzbq %al,%rax
  test   %rax,%rax
  sete   %al
  movzbq %al,%rax

Better behavior would be:

  pcmpeqd %xmm1,%xmm0
  pmovmskb %xmm0,%eax
  cmp    $0xffff,%rax
  setne  %al
  movzbq %al,%rax

@matouskozak
Copy link
Member Author

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@matouskozak matouskozak removed the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Nov 4, 2022
@fanyang-mono fanyang-mono marked this pull request as ready for review November 4, 2022 15:16
@fanyang-mono
Copy link
Member

fanyang-mono commented Nov 4, 2022

Failures on iOSSimulator, tvOS and tvOSSimulator are not related to the change in this PR. The same set of failures are seen for PR #77827. This PR is ready for review.

Part of the failures are being tracked by #77889

@matouskozak
Copy link
Member Author

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 77770 in repo dotnet/runtime

@fanyang-mono
Copy link
Member

/azp run runtime-extra-platforms

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@fanyang-mono
Copy link
Member

CI Test failures are not related to this PR.

@fanyang-mono fanyang-mono merged commit 803fd02 into main Nov 11, 2022
@jkotas jkotas deleted the minijit_xequal_floats branch November 20, 2022 16:03
@ghost ghost locked as resolved and limited conversation to collaborators Dec 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants