Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cranelift: extra movdqa when lowering icmp #3945

Closed
abrown opened this issue Mar 18, 2022 · 4 comments
Closed

Cranelift: extra movdqa when lowering icmp #3945

abrown opened this issue Mar 18, 2022 · 4 comments
Labels
cranelift:area:regalloc Issues related to register allocation. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator

Comments

@abrown
Copy link
Contributor

abrown commented Mar 18, 2022

#3886 ports CLIF icmp operations to ISLE but in the process added new, unnecessary moves to the lowering:

; Inst 2: movdqa %xmm1, %xmm2
; Inst 3: movdqa %xmm0, %xmm1
; Inst 4: pmaxsw %xmm2, %xmm1
; Inst 5: pcmpeqw %xmm1, %xmm0

The two initial movdqa instructions are not needed with the proper use of registers; my current thinking is that ISLE's move elision may not be working as it should but this could be due to re-ordering of operands in a lowering rule that ISLE cannot resolve.

@abrown
Copy link
Contributor Author

abrown commented Mar 18, 2022

This may be similar to #3744.

@alexcrichton alexcrichton added cranelift Issues related to the Cranelift code generator cranelift:area:regalloc Issues related to register allocation. cranelift:area:x64 Issues related to x64 codegen labels Mar 23, 2022
@alexcrichton
Copy link
Member

With the new regalloc the current test looks like:

; movdqa %xmm0, %xmm5
; pmaxsw %xmm5, %xmm5, %xmm1
; pcmpeqw %xmm0, %xmm0, %xmm5

I'm not familiar enough with these instructions to know whether this is fixed though (there's still a single movdqa)

@abrown
Copy link
Contributor Author

abrown commented May 3, 2022

I mean... this just looks wrong: I can never quite remember Cranelift's order of operands when using the x64 backend but our use of % makes this look like AT&T syntax (i.e., destination on the right). %xmm0 and %xmm1 are probably used for ABI arguments so it looks like we copy %xmm0 to %xmm5 in order to use the value twice--so far so good. But then the problem: I think what we need to do is to a) find the maximum between %xmm5 and %xmm1 and put that into %xmm5 and b) equality-compare %xmm0 and %xmm5 and put that into %xmm0 for the return. but the last two instructions there don't quite say that! Looks like it should be:

pmaxsw %xmm5, %xmm1, %xmm5
pcmpeqw %xmm0, %xmm5, %xmm0

But even this is a bit silly because those SIMD instructions aren't really being used in their three-operand form.


@alexcrichton, I think this issue is fixed though 😁. The extra move is gone but we should resolve this debug output question...

@cfallin
Copy link
Member

cfallin commented May 3, 2022

@abrown the output of clif-util compile on the function is:

Disassembly of 21 bytes:
   0:   55                      push    rbp
   1:   48 89 e5                mov     rbp, rsp
   4:   66 0f 6f e8             movdqa  xmm5, xmm0
   8:   66 0f ee e9             pmaxsw  xmm5, xmm1
   c:   66 0f 75 c5             pcmpeqw xmm0, xmm5
  10:   48 89 ec                mov     rsp, rbp
  13:   5d                      pop     rbp
  14:   c3                      ret

so this looks right in actuality (result is put in xmm0). I think it's a pretty-printing bug (argument swap) actually -- PR incoming to fix that.

The extra move is indeed elided now, so I think we can close this particular issue.

@cfallin cfallin closed this as completed May 3, 2022
cfallin added a commit to cfallin/wasmtime that referenced this issue May 3, 2022
The pretty-printing had swapped dst and src2; this was introduced when
we moved to RA2 (sorry about that! IMHO we should do something to
automate the mapping between regalloc arg collection and pretty
printing/emission).

`src2` comes at the end because it has a variable number of register
mentions; this is in line with how many of the other inst formats work.

Actual emitted code was never incorrect, just the pretty-printing.

Updated test golden outputs look correct to me now, including the one
that we saw was incorrect in bytecodealliance#3945.
cfallin added a commit that referenced this issue May 3, 2022
The pretty-printing had swapped dst and src2; this was introduced when
we moved to RA2 (sorry about that! IMHO we should do something to
automate the mapping between regalloc arg collection and pretty
printing/emission).

`src2` comes at the end because it has a variable number of register
mentions; this is in line with how many of the other inst formats work.

Actual emitted code was never incorrect, just the pretty-printing.

Updated test golden outputs look correct to me now, including the one
that we saw was incorrect in #3945.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:regalloc Issues related to register allocation. cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

No branches or pull requests

3 participants