
Result<u32, u32> uses less efficient ABI than Result<i32, u32> #100698

Open
Kobzol opened this issue Aug 17, 2022 · 6 comments
Labels: A-layout (Area: Memory layout of types), C-enhancement (Category: An issue proposing an enhancement or a PR with one), T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue)


@Kobzol
Contributor

Kobzol commented Aug 17, 2022

I noticed this while writing a codegen test for #37939 (comment).
It seems that Result<u32, u32> and Result<i32, i32> are passed as two 32-bit numbers, while Result<i32, u32> and Result<u32, i32> are passed as a single 64-bit number.

Compiler: rustc 1.65.0-nightly (86c6ebee8 2022-08-16)

Code
pub fn result_nop_match_i32(x: Result<i32, i32>) -> Result<i32, i32> {
    match x {
        Ok(x) => Ok(x),
        Err(x) => Err(x),
    }
}
pub fn result_nop_match_u32(x: Result<u32, u32>) -> Result<u32, u32> {
    match x {
        Ok(x) => Ok(x),
        Err(x) => Err(x),
    }
}
pub fn result_nop_match_i32_u32(x: Result<i32, u32>) -> Result<i32, u32> {
    match x {
        Ok(x) => Ok(x),
        Err(x) => Err(x),
    }
}
pub fn result_nop_match_u32_i32(x: Result<u32, i32>) -> Result<u32, i32> {
    match x {
        Ok(x) => Ok(x),
        Err(x) => Err(x),
    }
}
LLVM IR
define { i32, i32 } @_ZN7example20result_nop_match_i3217h90973337e0ea1f0bE(i32 noundef %0, i32 %1) unnamed_addr #0 !dbg !6 {
  %2 = insertvalue { i32, i32 } undef, i32 %0, 0, !dbg !11
  %3 = insertvalue { i32, i32 } %2, i32 %1, 1, !dbg !11
  ret { i32, i32 } %3, !dbg !11
}

define { i32, i32 } @_ZN7example20result_nop_match_u3217ha367bf54ac4241a2E(i32 noundef %0, i32 %1) unnamed_addr #0 !dbg !12 {
  %2 = insertvalue { i32, i32 } undef, i32 %0, 0, !dbg !13
  %3 = insertvalue { i32, i32 } %2, i32 %1, 1, !dbg !13
  ret { i32, i32 } %3, !dbg !13
}

define i64 @_ZN7example24result_nop_match_i32_u3217h0912c9953cfa7e17E(i64 returned %0) unnamed_addr #0 !dbg !14 {
  ret i64 %0, !dbg !15
}

define i64 @_ZN7example24result_nop_match_u32_i3217h8fc452067b26b8ddE(i64 returned %0) unnamed_addr #0 !dbg !16 {
  ret i64 %0, !dbg !17
}

https://rust.godbolt.org/z/n91oWdjMh

I'm not sure if something stops Result<u32, u32> from being combined into a single 64-bit number (maybe some heuristic for auto-vectorization?), but this seemed a bit odd to me. It's also a bit surprising to me that Result<u32, u32> is lowered as define { i32, i32 }, but I don't really know a lot about the Rust ABI.

@RalfJung
Member

Yeah, looks like Result<u16, u16> gets the ScalarPair layout but Result<u16, i16> does not. Scalar layout knows about the signedness of the type, which is probably the problem here, but I don't know if there is a good reason for it to be that way.

Cc @eddyb @oli-obk
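
For anyone who wants to poke at this, a minimal way to dump the computed layouts (a sketch assuming a recent nightly: rustc_layout is an unstable, rustc-internal attribute that reports the layout as a compiler diagnostic, so this is meant to be compiled, not run):

#![feature(rustc_attrs)]

// Per the observation above: both halves unsigned -> Abi::ScalarPair.
#[rustc_layout(debug)]
type Same = Result<u16, u16>;

// Mixed signedness -> falls back to an aggregate-style Abi instead.
#[rustc_layout(debug)]
type Mixed = Result<u16, i16>;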

RalfJung added the A-layout (Area: Memory layout of types) label Aug 28, 2022
@eddyb
Member

eddyb commented Aug 28, 2022

Scalar layout knows about the signedness of the type, which is probably the problem here, but I don't know if there is a good reason for it to be that way.

There's a very thorny ABI problem of sign-extension that we sadly had to respect: on many platforms, passing e.g. u8 vs i8 around will zero- vs sign-extend it to a whole register, allowing 2's complement arithmetic on that register without additional sign-extension.
This is a choice that call ABIs/"calling conventions" make - if we don't tell LLVM to do such sign-extension, Rust code will continue to work fine (albeit performance can suffer if too many redundant pairs of truncations and sign-extensions have to be done), but FFI will likely break:

// C
int8_t halve_i8(int8_t x) { return x / 2; }

// Rust
extern "C" { fn halve_i8(x: i8) -> i8; }
unsafe { halve_i8(-2 /* 0xfe */) }

That may still return -1 on some platforms, but without sign-extension 0xfe / 2 is 0x7f, i.e. 127.
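
To see those extension attributes concretely, a sketch (on x86-64 the emitted LLVM signatures should be roughly define signext i8 @id_i8(i8 signext %x) vs. zeroext for the unsigned case - expected shape, not verified output; check with rustc --emit=llvm-ir):

#[no_mangle]
pub extern "C" fn id_i8(x: i8) -> i8 { x } // argument/return sign-extended

#[no_mangle]
pub extern "C" fn id_u8(x: u8) -> u8 { x } // argument/return zero-extended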

However, ScalarPair is ignored for FFI (eventually maybe we can make it a flavor of Abi::Aggregate? last time I tried I ended up blocking myself on #98615), and while preserving signedness can be useful for performance, we could merge differently-signed Scalars when making the Abi::ScalarPair for enums.

@eddyb
Member

eddyb commented Aug 28, 2022

As for the issue itself: Result<u32, i32> being represented as a single i64 is actually caused by an (unnecessary, as described above) pessimization in the current system (Abi::Aggregate instead of Abi::ScalarPair), whereas this issue describes it the other way around (more on that below).

You'll also see this if you try to return [u32; 2] vs (u32, u32), so we can mostly ignore the enum interaction.


With Abi::ScalarPair(a, b) (and maybe some future generalization of "N immediate leaves") we know we can pass the value as two immediates - one a Scalar and one b Scalar - instead of as a blob of bytes.

AFAIK this is pretty much always good when dealing with a value inside a function: one SSA value per field is what LLVM tries to get anyway (e.g. via SROA), so we're just getting there ahead of time, saving both rustc and LLVM wasted work.
(the only exception I can think of is losing some annotations that you may have had on the whole type otherwise - but that's more of an LLVM issue)

For Rust's call ABI/"calling convention" (all the FFI ones ignore ScalarPair), we also use this information to pass a value in separate registers. And this is where it becomes a tradeoff.

If both Scalars are register-sized, it's pretty much perfect. If they're bigger, LLVM splits them up anyway.
If they're smaller, not getting "optimized" into a ScalarPair could've meant fewer registers used.

So do we want fewer registers, or less bitwise/memory juggling to encode/decode the components?
(i.e. tradeoff is: by "wasting" registers, we always have readily-usable fields, instead of "blobs")

I'm not sure, not even of how to test it - we could obviously add a quick check and see if rustc itself is affected (might be, lots of 32-bit newtyped indices on e.g. x64), but other than that we're in the dark.


For the record, this is a demonstration of the status quo on godbolt: https://godbolt.org/z/d4f9Pqexn.
(sorry for the weird layout but I couldn't find any way to disable mergefunc and it was just too confusing to have {encode_to,decode_from}_2x32 be merged together into one function, when they were in the same compilation)

1x64 is Abi::Aggregate getting fit into a register, 2x32 is the "optimized" Abi::ScalarPair
(and 2x32 doesn't need any of the shr/shl/or that 1x64 does, but uses one more register)
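
For reference, the bit juggling implied by 1x64 looks roughly like this (hand-written sketch with hypothetical names mirroring the demo's, not code copied from it):

pub fn encode_to_1x64(a: u32, b: u32) -> u64 {
    // the shl/or packing that the 2x32 form avoids
    (a as u64) | ((b as u64) << 32)
}

pub fn decode_from_1x64(x: u64) -> (u32, u32) {
    // and the shr/truncation on the way back out
    (x as u32, (x >> 32) as u32)
}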

cc @rust-lang/wg-llvm @workingjubilee

@bjorn3
Member

bjorn3 commented Aug 30, 2022

What would be the effect of keeping ScalarPair for fat pointers (a lot of codegen code depends on this), and maybe for structs with two register-sized fields, but using Aggregate for everything else?

@scottmcm
Member

and 2x32 doesn't need any of the shr/shl/or that 1x64 does, but uses one more register

Hmm, I suppose assumptions about whether something will inline are probably relevant too. I'd guess that more registers is probably the better way if it's inlined, since I presume it's easier for LLVM to understand what's happening if it doesn't need to undo the bit-packing first.


I couldn't find any way to disable mergefunc

-Z merge-functions=disabled ?

@eddyb
Member

eddyb commented Aug 31, 2022

What would be the effect of keeping ScalarPair for fat pointers (a lot of codegen code depends on this), and maybe for structs with two register-sized fields, but using Aggregate for everything else?

You also have to consider how pervasively pointers get newtyped: Box<[T]> itself is only a ScalarPair because it wraps something that is itself a ScalarPair (NonNull<[T]>, a newtype of *mut [T] - so that's two levels of "single-field aggregate" ScalarPairs!).

I would personally be inclined to ignore this issue until anyone actually shows a downside from "more registers" (I'm aware of the theoretical ones but is it bad in practice?).

However, we could try coalescing ScalarPair(Int(N), Int(M)) into a single Int(N+M) (i.e. "two integers", and not e.g. (u32, f32) / Option<f32> etc.).
If we do try it, we should make sure to only do it if N+M ≤ pointer_size (see the sketch after this list):

  • e.g. on x64 it would coalesce (i32, i32) (or (i32, i16) etc.) into an i64, but not (u32, u64)
  • the quick explanation is that an LLVM integer larger than the register size will use at least two registers, so by using such an integer we would only be obfuscating the actual structure of the value
  • also, this is continuing the practice of treating pointer_size as a surrogate for "register size" (it may be nice to have some information about "register files" available somewhere - though generally it's probably more useful for floats than integers)
  • it also avoids affecting wide pointers at all, or generally anything involving a pointer (among other things, it's nice to keep an integer/pointer distinction in LLVM, even if we're erasing the pointee type)
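
A sketch of that rule, with the condition from the list above made explicit (hypothetical helper, not rustc's actual layout code):

// Width of the coalesced integer, if the pair should be merged at all.
// e.g. with pointer_bits = 64: (32, 32) -> Some(64), but (32, 64) -> None.
fn coalesce_int_pair(a_bits: u64, b_bits: u64, pointer_bits: u64) -> Option<u64> {
    let total = a_bits + b_bits;
    if total <= pointer_bits { Some(total) } else { None }
}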

Hmm, I suppose assumptions about whether something will inline are probably relevant too. I'd guess that more registers is probably the better way if it's inlined, since I presume it's easier for LLVM to understand what's happening if it doesn't need to undo the bit-packing first.

Yeah, we should never do bit-packing intra-function for this reason; at most we can change the call ABI. But we want to move in the direction of more "exploding data types into their scalar leaves" (aka "SROA") intra-function - and inlining effectively "erases" the call ABI, turning "inter" into "intra".

This is where a proper "ABI mapping" system would be very nice, something where you'd describe a function in the form best for inlining (e.g. w/ arguments/returns split into individual SSA values), and then the export of that function would be handled just like reification to a fn pointer, as if you had to do:

pub static LINK_EXPORTS: MyLinkExports = MyLinkExports {
    // ...
    foo: foo as fn(_) -> _,
    // ...
};

And then the "reify function into function pointer" construct in the IR would contain the registers/stack mapping of the arguments/return passing, outside of the function.

If you have to call the exported function in a way that can handle late-binding/interposition (which is mandated in C whenever you don't label a function as static, AFAIK), you can basically do an "indirect" call of the export. (or even better, use a separate import with the same name, to force no inlining and guarantee interposition will work - I'm talking generally here though, in Rust we would want to avoid this if possible)

And for anything private, the calls would be in the inlining-friendly form, which would also allow the backend to choose an optimal ABI even if it doesn't inline based on some global analysis (LLVM does do some stuff like this today, but it's much clunkier, since it "edits" the call ABI the frontend chose instead of synthesizing a new one).

Also, in such a system, you could fully annotate arguments/return value fields with e.g. ranges and other invariants (even if the ABI mapping needs to introduce indirection, cram two smaller values into an integer etc.).


-Z merge-functions=disabled?

Ahh, thanks! (I was checking -C help and -C llvm-args=help but didn't think to look under -Z)

Enselic added the C-enhancement (Category: An issue proposing an enhancement or a PR with one) and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue) labels Mar 21, 2024