-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add i32.popcnt and i64.popcnt to winch #6531
Conversation
Subscribe to Label Action
This issue or pull request has been labeled: "winch"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together! The current changes look good to me. I have one thought/comment though: are you planning on providing a fallback implementation for popcnt
if the has_popcnt
flag is not enabled, similar to what Cranelift provides? If it's too burdensome to do as part of this PR a TODO is also totally fine I think, but we might want to consider updating the fuzzing configuration so that it always enables the has_popcnt
flag to avoid failures when fuzzing winch. Even though we only enable the fuzzer locally, it's still useful for verifying our changes.
Also, you might also want to add <I32|I64>Popcnt
here https://github.com/bytecodealliance/wasmtime/blob/main/fuzz/fuzz_targets/differential.rs#L335
Yeah I can take a crack at that! I'll also add it to the fuzz targets |
@saulecabrera I added two |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that it'd be nice to save a couple of instructions if possible, so I'm leaning towards the second fallback. This is also how SpiderMonkey handles this case https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/MacroAssembler-x86-shared-inl.h#110
I left a couple of extra comments, but feel free to ignore if those are things that you were already thinking about!
winch/codegen/src/isa/x64/asm.rs
Outdated
}); | ||
|
||
let fives = regs::scratch(); | ||
self.load_constant(&0x5555555555555555, fives, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm not wrong, in the case of a fallback we'd need extra temporary register because the scratch is clobbered by all the calls to load_constant
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, perhaps we can pass in a mutable reference to the CodeGenContext
to MacroAssembler::popcnt
and do the dispatching at that level and if we need to emit the fallback, we can request an extra temporary register via CodeGenContext::any_gpr
? Similar to to the implementation of clz
https://github.com/bytecodealliance/wasmtime/blob/main/winch/codegen/src/isa/x64/masm.rs#L343
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scratch is clobbered by all the calls to load_constant right?
Oh this is good to know! I didn't know that. I am still learning the relationships between the different reg types and how their different abstraction layers are used, so if it seems like I'm doing something weird, please let me know :)
I'll take the approach in your second comment!
winch/codegen/src/isa/x64/asm.rs
Outdated
} | ||
} | ||
|
||
fn popcnt_fallback2(&mut self, size: OperandSize, reg: Reg) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we replace the emit
calls in this function with the emit functions in the assembler? That has the added benefit that it already handles constant loading based on the operand size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. So, for instance I could replace this:
self.emit(Inst::AluRmiR {
size: size.into(),
op: AluRmiROpcode::Sub,
src1: reg.into(),
src2: masked1.into(),
dst: reg.into(),
});
with something like this:
self.sub_rr(reg, masked, size);
right? I noticed some of the instructions don't have corresponding functions on the assembler (shiftr
, and
), should I add functions for those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah exactly.
For shift
we have shift_ir
and shift_rr
in the assembler and for and
we have and_rr
and and_ir
, I think that should cover all the cases for the lowering here. But if it doesn't feel free to make any additions to the assembler!
e42626d
to
103115d
Compare
Co-authored-by: Nick Fitzgerald <fitzgen@gmail.com> Co-authored-by: Chris Fallin <chris@cfallin.org>
Move popcnt fallback up into the macroassembler. Share code between 32-bit and 64-bit popcnt Add Popcnt to winch differential fuzzing
d65bd49
to
7e301f8
Compare
I moved the fallback logic up into the MacroAssembler. I was running into some tricky ownership issues when trying to pass the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems right to me! I agree that tests are a good idea before merging but assuming those go okay then I think this is good.
winch/codegen/src/isa/x64/masm.rs
Outdated
self.asm.sub(dst.into(), tmp.into(), size); | ||
|
||
// x = (x & m2) + ((x >> 2) & m2); | ||
self.asm.mov(tmp.into(), dst.into(), size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this and the other mov
below be written like this? I'm not sure it makes any difference but it seems a little more clear to me if we can avoid using .into()
so much. I'm also curious if there are other methods named perhaps sub_rr
or and_ir
for the other instructions in this sequence.
self.asm.mov(tmp.into(), dst.into(), size); | |
self.asm.mov_rr(tmp, dst, size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also curious if there are other methods named perhaps sub_rr or and_ir for the other instructions in this sequence.
there are! I'll change those
Subscribe to Label Actioncc @fitzgen
This issue or pull request has been labeled: "fuzzing"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
The scratch register was getting clobbered by the calls to `and`, so this is instead passing in a CodeGenContext to the masm's `popcnt` and letting it handle its own registers
@saulecabrera I moved the register management bit to the MacroAssembler and added a couple of filetests for the fallback. I manually tested the fallback by forcing that branch in the code (just added an
It seems to behave as intended! Let me know if there's anything else that needs cleaning up, otherwise I think this might be good to go! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one minor comment, but overall this looks great to me, thanks!
winch/codegen/src/isa/x64/asm.rs
Outdated
src: Gpr::new(src.into()).unwrap().into(), | ||
dst: Writable::from_reg(Gpr::new(src.into()).unwrap()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We currently have impl From<Reg> for Gpr
and impl From<Reg> for WritableGpr
so you could use those here if you wanted to reduce the boilerplate.
self.asm.popcnt(src, size); | ||
context.stack.push(Val::reg(src)); | ||
} else { | ||
let tmp = context.any_gpr(self); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you add a comment that the fallback is based on MacroAssembler::popcnt32
in https://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/MacroAssembler-x86-shared-inl.h?
let (masks, shift_amt) = match size { | ||
OperandSize::S64 => ( | ||
[ | ||
0x5555555555555555, // m1 | ||
0x3333333333333333, // m2 | ||
0x0f0f0f0f0f0f0f0f, // m4 | ||
0x0101010101010101, // h01 | ||
], | ||
56u8, | ||
), | ||
// 32-bit popcount is the same, except the masks are half as | ||
// wide and we shift by 24 at the end rather than 56 | ||
OperandSize::S32 => ( | ||
[0x55555555i64, 0x33333333i64, 0x0f0f0f0fi64, 0x01010101i64], | ||
24u8, | ||
), | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is perfectly reasonable as-is but it keeps bothering me that the constant masks are duplicated in this way. One alternative might be:
let (masks, shift_amt) = match size { | |
OperandSize::S64 => ( | |
[ | |
0x5555555555555555, // m1 | |
0x3333333333333333, // m2 | |
0x0f0f0f0f0f0f0f0f, // m4 | |
0x0101010101010101, // h01 | |
], | |
56u8, | |
), | |
// 32-bit popcount is the same, except the masks are half as | |
// wide and we shift by 24 at the end rather than 56 | |
OperandSize::S32 => ( | |
[0x55555555i64, 0x33333333i64, 0x0f0f0f0fi64, 0x01010101i64], | |
24u8, | |
), | |
}; | |
let masks = [ | |
0x5555555555555555, // m1 | |
0x3333333333333333, // m2 | |
0x0f0f0f0f0f0f0f0f, // m4 | |
0x0101010101010101, // h01 | |
]; | |
let (mask, shift_amt) = match size { | |
OperandSize::S64 => (u64::MAX as i64, 56u8), | |
OperandSize::S32 => (u32::MAX as i64, 24u8), | |
}; |
Then use e.g. masks[0] & mask
. (Maybe rename mask
though.)
Another approach is to generate the constants using bit-twiddling tricks. I don't think this is a good idea but I spent enough time figuring out how it would work that I'm going to write it down anyway. You can divide u64::MAX
or u32::MAX
by certain constants to get these repeating patterns: specifically, [0x3, 0x5, 0x11, 0xff]
to produce [0x55..., 0x33..., 0x0f..., 0x01...]
respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thiiiink I'm inclined to keep it as is, though I don't feel too strongly about that. I initially had the 32-bit and 64-bit code as completely separate branches (this is what spidermonkey and cranelift do) and wasn't sure if I should even combine them in the first place. I kind of like the different constants being explicitly in the code, but I also see that using one to generate the other is tempting. Happy to go either way here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I don't feel strongly about it either, so I'm good with leaving it as is. 👍
|
||
// x -= (x >> 1) & m1; | ||
self.asm.shift_ir(1u8, dst, ShiftKind::ShrU, size); | ||
self.asm.and(RegImm::imm(masks[0]).into(), dst.into(), size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was pretty confused about why you didn't use and_ir
, until I dug into the implementation enough to understand that x86 only supports 32-bit immediates for and
, so doing it this way lets the underlying assembler decide whether to use load_constant
into the scratch register or to emit the immediate operand inline.
I don't know what to suggest but maybe there's a comment that could go here. Or maybe we should just assume that people reading this are more familiar with Winch idioms than I am. 😆
Also, it's a little unfortunate that in the 64-bit case, masks[1]
gets loaded into the scratch register twice. It would be nice to avoid that, but it would clutter the implementation here and maybe that's not worth doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe we should just assume that people reading this are more familiar with Winch idioms than I am. 😆
The scratch register has caused confusion, I agree. This is something that I'd like to improve, so I'm considering adding guards, which will hopefully make its usage traceable and more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just haven't gotten to it 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I did have the thought about 0x333...
getting loaded in twice, so I had briefly replaced the second RegImm::imm(masks[0]).into()
with regs::scratch().into()
, but that wouldn't work in the 32-bit case and then trying to work around that seemed like added complexity for not a lot of gain
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it would be terrible to unconditionally call load_constant
here to put the 0x333...
mask in the scratch register, and use and_rr
explicitly. It's one extra instruction in the 32-bit case, but it's only a move-immediate, and the generated code is shorter due to only emitting the constant once.
No description provided.