Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x64: lower iabs.i64x2 using a single AVX512 instruction when possible #2819

Merged
merged 4 commits into from
Apr 15, 2021

Conversation

abrown
Copy link
Contributor

@abrown abrown commented Apr 8, 2021

This introduces the mechanism for encoding EVEX instructions to the new backend (ported with slight changes from the old) and then uses it to improve the lowering of iabs.i64x2 from 5 instructions to 1 instruction (i.e. VPABSQ).

@abrown
Copy link
Contributor Author

abrown commented Apr 8, 2021

I'm introducing this PR as a draft to solicit opinions about how best to integrate this:

  • any preference to switch to a builder pattern (e.g. Evex::new().v128().prefix(...).map(...).reg(...).rm(...).w(true).opcode(0x1F).encode() matching the manual's syntax, EVEX.128.66.0F38.W1 1F /r) over the current encode_evex(...)?
  • a better integration with Inst than XmmUnaryRmREvex { op: Avx512Opcode, ...}? I cannot just add a boolean evex field to XmmUnaryRmR because some AVX512 instructions (e.g. VPABSQ) don't exist so they should not be included in SseOpcode. But the XmmUnaryRmREvex approach is not going to scale to a bunch of other instructions well.

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:x64 Issues related to x64 codegen labels Apr 8, 2021
@bnjbvr
Copy link
Member

bnjbvr commented Apr 12, 2021

any preference to switch to a builder pattern (e.g. Evex::new().v128().prefix(...).map(...).reg(...).rm(...).w(true).opcode(0x1F).encode() matching the manual's syntax, EVEX.128.66.0F38.W1 1F /r) over the current encode_evex(...)?

Yeah, this could be nice! If the flags don't have any relationship with each other, it might be possible to pass a &sink to the new() function and encode it as we call the builder functions. Alternatively, deferring it all and building in encode(&sink) sounds good too! And if this works out well, we could refactor a bit the previous encode functions; their API really looks a bit too C-ish.

a better integration with Inst than XmmUnaryRmREvex { op: Avx512Opcode, ...}? I cannot just add a boolean evex field to XmmUnaryRmR because some AVX512 instructions (e.g. VPABSQ) don't exist so they should not be included in SseOpcode. But the XmmUnaryRmREvex approach is not going to scale to a bunch of other instructions well.

Why wouldn't this scale well, out of curiosity? In general I think we should tend to APIs that make it impossible to create invalid instructions. It might not be the case for existing APIs, which should be refactored ideally...

@abrown
Copy link
Contributor Author

abrown commented Apr 13, 2021

@bnjbvr, the latest commit moves to the builder pattern. It's a hybrid of your idea: the EVEX 4-byte prefix is built up as the builder methods are called but we still need to call .encode(sink) to emit later because some methods, e.g. .reg(...), modify both the prefix and the ModR/M byte, which are emitted at different times.

Being a bit paranoid about performance, I benchmarked the encode_evex function approach against the EvexInstruction builder approach by encoding the same instruction repeatedly inside bencher. For some reason, the builder approach turns out to be faster:

test isa::x64::inst::encoding::evex::tests::encode_with_function ... bench:          17 ns/iter (+/- 0
test isa::x64::inst::encoding::evex::tests::encode_with_builder  ... bench:           6 ns/iter (+/- 0)

I didn't dig too deep into this since @cfallin mentioned that emission is hardly the long pole in the tent. I was happy enough that this builder approach was not slower and left it at that. I think this is OK to review and merge understanding that there are still pieces coming (e.g. this only supports reg-reg addressing).

Why wouldn't this scale well, out of curiosity?

I mean mental scaling, not codegen performance or anything like that (some of us operate on limited brain RAM). Finding the right Inst variant when adding a new instruction can be tricky: "so this is unary, right, so I'm going to use that... no wait, I need to use the Xmm form... no, hold on, this is AVX512 so I need to find the Evex form of that." It's a developer experience thing more than anything. I agree that we should restrict the inputs somehow to only generate valid instructions but this "which Inst variant?" question seems a bit different.

@abrown abrown marked this pull request as ready for review April 13, 2021 19:00
@@ -4276,6 +4282,7 @@ fn test_x64_emit() {
let mut isa_flag_builder = x64::settings::builder();
isa_flag_builder.enable("has_ssse3").unwrap();
isa_flag_builder.enable("has_sse41").unwrap();
isa_flag_builder.enable("has_avx512f").unwrap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish there were a way to avoid this here and instead specify it up above when necessary--suggestions welcome because I can't think of a great way to do this without rewriting this entire file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a single array with all the testing tuples, we could have several ones, one for each Flag combination we'd like to test. Then we could refactor this function so that it takes flags from a parameter and an array of test tuples, or something like this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe not something to worry too much about, since there's no actual instruction selection test here. What was your concern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guess that's true. It just looked weird to lump the AVX512 flag in with the baseline flags. Maybe in a follow-on PR I will try to factor out the "check these instructions" code so that I can do as you suggest above.

@abrown
Copy link
Contributor Author

abrown commented Apr 14, 2021

@bnjbvr, can you take another look at this when you get a chance?

@bnjbvr
Copy link
Member

bnjbvr commented Apr 14, 2021

@bnjbvr, can you take another look at this when you get a chance?

Yep, happy to take a look in the next few days!

Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with the Builder approach. I'd like to note that while there's nothing preventing from writing the same bits twice in a row, that's actually not an issue because it'll just overwrite them instead.

It'd be nice if we could avoid an additional CodeSink trait, since we already have one. I only skimmed the encodings bit twiddling; as long as there are tests that validate that the right encoding is generated, this patch looks fine!

@@ -4276,6 +4282,7 @@ fn test_x64_emit() {
let mut isa_flag_builder = x64::settings::builder();
isa_flag_builder.enable("has_ssse3").unwrap();
isa_flag_builder.enable("has_sse41").unwrap();
isa_flag_builder.enable("has_avx512f").unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having a single array with all the testing tuples, we could have several ones, one for each Flag combination we'd like to test. Then we could refactor this function so that it takes flags from a parameter and an array of test tuples, or something like this?

cranelift/codegen/src/isa/x64/inst/encoding/evex.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/isa/x64/inst/encoding/evex.rs Outdated Show resolved Hide resolved

#[derive(Copy, Clone, Default)]
pub struct Register(u8);
impl From<u8> for Register {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the way it works right now is that it goes from the RealReg hardware encoding into an u8 then into this Register structure. Could instead the methods taking Register take a RealReg instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that Rust will pare away those abstractions so I didn't worry too much about overhead. I am trying to keep this encoding code dependency-free (especially of external dependencies) so, like with CodeSink, I am using slightly different types/traits.

pub mod rex;
pub mod vex;

pub trait CodeSink {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there already another CodeSink trait somewhere in this crate too? I wonder if we could avoid this one, to reduce the number of concepts, and just read out of the MachBuffer internal vector, for testing purposes?

Copy link
Contributor Author

@abrown abrown Apr 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, here, but as you can see it would force this module to implement things it has no idea about and I wasn't too sure the old backend trait would be around forever. Also, I am trying to keep this module as dependency-free as possible so that in the future it could be used elsewhere--if I want to use this in the future to just encode instructions I won't want or need the additional MachBuffer methods and fields.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could name the trait something else -- ByteSink or something?

It might even make sense to do the factoring in the other direction -- split out ByteSink from CodeSink and make the former a constraint on the latter (trait CodeSink : ByteSink { ... } with just the additional methods), but that's probably out-of-scope for this PR...

@@ -4276,6 +4282,7 @@ fn test_x64_emit() {
let mut isa_flag_builder = x64::settings::builder();
isa_flag_builder.enable("has_ssse3").unwrap();
isa_flag_builder.enable("has_sse41").unwrap();
isa_flag_builder.enable("has_avx512f").unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe not something to worry too much about, since there's no actual instruction selection test here. What was your concern?

Also, includes an empty stub module for the VEX encoding.
This change replaces the `encode_evex` function with a builder-style struct, `EvexInstruction`. This approach clarifies the code, adds documentation, and results in slight speedups when benchmarked.
@abrown abrown merged commit 0acc145 into bytecodealliance:main Apr 15, 2021
@abrown abrown deleted the inst-format-3 branch April 15, 2021 18:54
@abrown abrown restored the inst-format-3 branch April 15, 2021 18:54
@abrown abrown deleted the inst-format-3 branch May 17, 2021 17:20
mchesser pushed a commit to mchesser/wasmtime that referenced this pull request May 24, 2021
…bytecodealliance#2819)

* x64: add EVEX encoding mechanism

Also, includes an empty stub module for the VEX encoding.

* x64: lower abs.i64x2 to VPABSQ when available

* x64: refactor EVEX encodings to use `EvexInstruction`

This change replaces the `encode_evex` function with a builder-style struct, `EvexInstruction`. This approach clarifies the code, adds documentation, and results in slight speedups when benchmarked.

* x64: rename encoding CodeSink to ByteSink
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:x64 Issues related to x64 codegen cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants