perf(codegen): Eliminate size_of_val == 0 for DSTs with Non-zero-sized Prefix via NUW and Assume #152843

Open
TKanX wants to merge 5 commits into rust-lang:main from TKanX:bugfix/152788-codegen-dst-size-nuw-assume

@TKanX
Contributor

@TKanX TKanX commented Feb 19, 2026

Summary:

Problem:

size_of_val(p) == 0 fails to optimize away for DST types that have a statically-known non-zero-sized prefix:

pub struct Foo<T: ?Sized>(pub [u32; 3], pub T);

pub fn demo(p: &Foo<dyn std::fmt::Debug>) -> bool {
    std::mem::size_of_val(p) == 0  // always false, but LLVM can't prove it
}

Foo has a 12-byte prefix, so its total size is always ≥ 12. Yet the comparison persists as a runtime computation in LLVM IR. This matters because Box<dyn T> drop emits this exact check to guard the deallocation call — for types with a guaranteed non-zero prefix, the branch should vanish but doesn't.

The slice tail variant Foo<[i32]> already optimized correctly; Foo<dyn Trait> and Foo<[u8]> did not.
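
To make the claim concrete, here is a small sketch that measures `size_of_val` for the three tail types mentioned above. `Foo` is the struct from the example; the helper names `size_dyn`, `size_bytes`, and `size_ints` are hypothetical, and the sizes in the comments assume the usual layout in which `[u32; 3]` occupies 12 bytes with 4-byte alignment.

```rust
use std::fmt::Debug;
use std::mem::size_of_val;

pub struct Foo<T: ?Sized>(pub [u32; 3], pub T);

/// Size of a `Foo` whose tail is a trait object.
pub fn size_dyn(p: &Foo<dyn Debug>) -> usize {
    size_of_val(p)
}

/// Size of a `Foo` whose tail is a byte slice (alignment 1, less than the prefix).
pub fn size_bytes(p: &Foo<[u8]>) -> usize {
    size_of_val(p)
}

/// Size of a `Foo` whose tail is an `i32` slice (alignment 4, same as the prefix).
pub fn size_ints(p: &Foo<[i32]>) -> usize {
    size_of_val(p)
}
```

Passing `&Foo([0; 3], ())` to `size_dyn` should give 12 (prefix only), while `&Foo([0; 3], 7u8)` should give 16 (13 bytes rounded up to the 4-byte alignment). In no case can the result drop below the 12-byte prefix.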

Root Cause:

In size_and_align_of_dst (the ADT/Tuple branch), the size computation is:

full_size = (offset + unsized_size + (align-1)) & -align

LLVM cannot prove full_size > 0 because:

  1. offset + unsized_size used plain add — no overflow flags, so LLVM cannot conclude the result is ≥ offset.
  2. (x + addend) & -align — LLVM has no fold to prove that alignment rounding never reduces the value below x.
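
Both obstacles can be reproduced with a small model of the formula. This is only a sketch: `full_size_wrapping` is a hypothetical name, and wrapping `usize` arithmetic stands in for an LLVM `add` that carries no overflow flags. Under those semantics the result really can be zero even when `offset` is 12, which is why LLVM needs to be told that such inputs never occur.

```rust
/// Models `full_size = (offset + unsized_size + (align - 1)) & -align`
/// using wrapping arithmetic, i.e. what a flag-less `add` permits.
/// `align` is assumed to be a power of two.
pub fn full_size_wrapping(offset: usize, unsized_size: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    // Obstacle 1: without `nuw`, this sum may wrap below `offset`.
    let unrounded = offset.wrapping_add(unsized_size);
    // Obstacle 2: the rounding step may wrap as well, so the rounded
    // value is not obviously >= `unrounded`.
    unrounded.wrapping_add(align - 1) & align.wrapping_neg()
}
```

For realistic inputs such as `(12, 1, 4)` the model returns 16. With `unsized_size = usize::MAX - 11` the sum wraps and the result is 0, and with `(0, usize::MAX, 4)` the rounding step alone wraps to 0.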

Solution:

Two changes:

  1. add nuw nsw on offset + unsized_size — the sum is bounded by the rounded size ≤ isize::MAX, so neither signed nor unsigned overflow is possible. Tells LLVM: unrounded_size ≥ offset.

  2. assume(full_size ≥ unrounded_size): round_up(x, a) ≥ x holds for any power-of-two a as long as the rounding does not overflow, which the size bound above guarantees. Tells LLVM: full_size ≥ unrounded_size ≥ offset. If offset > 0, the chain proves full_size > 0.
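
A source-level analogue of these two changes, written with stable standard-library hints, might look like the sketch below. It is not the compiler's code: the PR emits the flags and the assume through the codegen builder, and `full_size_with_hints` is a hypothetical name. Note that `usize::unchecked_add` only corresponds to the `nuw` half of the change.

```rust
use std::hint::assert_unchecked;

/// Sketch of the hinted size computation.
///
/// # Safety
/// `align` must be a power of two, and `offset + unsized_size` rounded up
/// to `align` must not exceed `isize::MAX`.
pub unsafe fn full_size_with_hints(offset: usize, unsized_size: usize, align: usize) -> usize {
    // Rough counterpart of `add nuw`: overflow is declared impossible,
    // so the optimizer may conclude `unrounded >= offset`.
    let unrounded = unsafe { offset.unchecked_add(unsized_size) };
    // Round up to the next multiple of `align`.
    let full_size = unrounded.wrapping_add(align - 1) & align.wrapping_neg();
    // Rough counterpart of `assume(full_size >= unrounded_size)`.
    unsafe { assert_unchecked(full_size >= unrounded) };
    full_size
}
```

With both hints in place, a caller that knows `offset > 0` gives the optimizer enough information to fold `full_size == 0` to `false`. Whether a particular LLVM version actually performs the fold from this source-level form is not something this sketch verifies.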

LLVM IR Comparison:

Foo<dyn Debug> — before (godbolt):

define noundef zeroext i1 @demo(ptr %p.0, ptr %p.1) {
start:
  %0 = getelementptr inbounds nuw i8, ptr %p.1, i64 8
  %1 = load i64, ptr %0, align 8, !range !3, !invariant.load !4
  %2 = getelementptr inbounds nuw i8, ptr %p.1, i64 16
  %3 = load i64, ptr %2, align 8, !range !5, !invariant.load !4
  %4 = tail call i64 @llvm.umax.i64(i64 %3, i64 4)
  %5 = add nuw i64 %1, 11
  %6 = add i64 %5, %4
  %7 = sub i64 0, %4
  %8 = and i64 %6, %7
  %_0 = icmp eq i64 %8, 0
  ret i1 %_0
}

Foo<dyn Debug> — after:

define noundef zeroext i1 @demo(ptr %p.0, ptr %p.1) {
start:
  ret i1 false
}

Foo<[u8]> — before:

define noundef zeroext i1 @demo_lessalignedslice(ptr %p.0, i64 %p.1) {
start:
  %0 = add i64 %p.1, 15
  %_0 = icmp ult i64 %0, 4
  ret i1 %_0
}

Foo<[u8]> — after:

define noundef zeroext i1 @demo_lessalignedslice(ptr %p.0, i64 %p.1) {
start:
  ret i1 false
}

Changes:

  • compiler/rustc_codegen_ssa/src/size_of_val.rs: use unchecked_suadd (NUW+NSW) for offset + unsized_size; add assume(full_size ≥ unrounded_size).
  • tests/codegen-llvm/dst-size-of-val-not-zst.rs: new codegen test verifying size_of_val == 0 folds to ret i1 false for Foo<dyn Debug>, Foo<[u8]>, and Foo<[i32]>.
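
The test file itself is not shown on this page. As a rough sketch of the shape such a codegen test could take, the following uses the function names `demo` and `demo_lessalignedslice` from the IR above; `demo_alignedslice`, the compile flags, and the exact CHECK lines are assumptions, not the contents of the real file.

```rust
//@ compile-flags: -Copt-level=3
// In the rustc tree the file would also declare `#![crate_type = "lib"]`
// at the very top; it is omitted here so the snippet can be embedded.

use std::fmt::Debug;
use std::mem::size_of_val;

pub struct Foo<T: ?Sized>(pub [u32; 3], pub T);

// CHECK-LABEL: @demo(
// CHECK: ret i1 false
#[unsafe(no_mangle)]
pub fn demo(p: &Foo<dyn Debug>) -> bool {
    size_of_val(p) == 0
}

// CHECK-LABEL: @demo_lessalignedslice(
// CHECK: ret i1 false
#[unsafe(no_mangle)]
pub fn demo_lessalignedslice(p: &Foo<[u8]>) -> bool {
    size_of_val(p) == 0
}

// CHECK-LABEL: @demo_alignedslice(
// CHECK: ret i1 false
#[unsafe(no_mangle)]
pub fn demo_alignedslice(p: &Foo<[i32]>) -> bool {
    size_of_val(p) == 0
}
```

FileCheck only inspects the emitted IR, but the functions are ordinary Rust and return `false` for every input at run time as well.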

Fixes #152788.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 19, 2026
@TKanX
Contributor Author

TKanX commented Feb 20, 2026

@rustbot label +A-LLVM +A-codegen +C-optimization +T-compiler

@rustbot rustbot added A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such labels Feb 20, 2026
@fmease
Member

fmease commented Feb 21, 2026

r? codegen

@rustbot rustbot assigned dianqk and unassigned fmease Feb 21, 2026
@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 22, 2026
@rustbot
Collaborator

rustbot commented Feb 22, 2026

Reminder, once the PR becomes ready for a review, use @rustbot ready.

@TKanX TKanX force-pushed the bugfix/152788-codegen-dst-size-nuw-assume branch from a9ec27f to 8339cfe Compare February 22, 2026 05:32
@rustbot
Collaborator

rustbot commented Feb 22, 2026

This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

@TKanX
Contributor Author

TKanX commented Feb 22, 2026

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 22, 2026
@TKanX TKanX requested a review from scottmcm February 22, 2026 05:34
Comment on lines +183 to +189
// Alignment rounding can only increase the size, never decrease it:
// `round_up(x, a) >= x` for power-of-two `a`. With the `nuw` on the
// addition above, LLVM can therefore deduce
// `full_size >= unrounded_size >= offset`, which proves `full_size > 0`
// for types with a non-zero-sized prefix (#152788).
let size_ge = bx.icmp(IntPredicate::IntUGE, full_size, unrounded_size);
bx.assume(size_ge);
Member

Can you elaborate on which things you tried and why this is the best one? Was it not enough to say that the alignment is a power-of-two? Or...

Member

I ask because most of the text in the OP is just useless LLM slop, and the updates to the tests make me suspicious.

Contributor Author

@scottmcm

Can you elaborate on which things you tried and why this is the best one? Was it not enough to say that the alignment is a power-of-two? Or...

Tried nuw-only (unchecked_uadd) first. That gives LLVM unrounded >= offset > 0 but it stops at the rounding — LLVM can't prove (x + a-1) & -a >= x. Also checked whether feeding ctpop(align) == 1 would help, but there's no fold for "round-up is monotonic when alignment is pow2" in InstCombine/ValueTracking. So the assume tells LLVM the conclusion directly.

nsw (making it unchecked_suadd) is because unrounded ≤ rounded ≤ isize::MAX. Same reasoning as your #152867.

I ask because most of the text in the OP is just useless LLM slop, and the updates to the tests make me suspicious.

Sorry about the OP — English isn't my native language, I overwrite when trying to be precise. Will clean it up.

For the tests: CHECK-NOT: icmp broke because assume itself emits an icmp. The !range checks on the first two functions were dropped because the assume keeps the size computation alive, so there's now a size load before the alignment load and FileCheck hits the wrong one. Range metadata is still verified in align_load_from_align_of_val below. RANGE_META was renamed to ALIGN_RANGE since it only covers alignment loads now. The range value changed from {1, 0} to {1, 0x20000001}, which is Align::max_for_target (same change as #152929).

Happy to close this if you'd rather land it as part of #152867.

Member

Landing this separately is great -- I opened the issue because this particular bit about what LLVM can prove is different enough from the point of layout_of_val that it's better to have the changes separated. (That's why I pulled out #152929 too 🙂 )

Member

Hmm, yeah, I experimented a bit https://llvm.godbolt.org/z/haGYz7aax and even getting lots of annotations on everything and assume it's still not able to understand what's happening properly.

(Also it's so annoying to see add nsw i64 %4, -1 since that used to be sub nuw nsw i64 %4, 1 but LLVM just insists on throwing that information away.)

@dianqk
Member

dianqk commented Feb 22, 2026

r? scottmcm

@rustbot rustbot assigned scottmcm and unassigned dianqk Feb 22, 2026
Comment on lines -33 to 36
// CHECK: load [[USIZE:i[0-9]+]], {{.+}} !range [[RANGE_META:![0-9]+]]
// CHECK: load [[USIZE:i[0-9]+]]
// CHECK-NOT: llvm.umax
// CHECK-NOT: icmp
// CHECK-NOT: select
// CHECK: ret
Member

So the problem here is that if this was testing for "not icmp", just removing that check means this test is (potentially) no longer testing what it was trying to test before.

If there's an icmp now, probably what you want instead is something like

    // CHECK-NOT: llvm.umax
    // CHECK-NOT: icmp
    // CHECK-NOT: select
    // CHECK: [[DOES_NOT_SHRINK:%.+]] = icmp ... something here ...
    // CHECK-NEXT: call void @llvm.assume(i1 [[DOES_NOT_SHRINK]])
    // CHECK-NOT: llvm.umax
    // CHECK-NOT: icmp
    // CHECK-NOT: select

so that the test is that the only icmp is the expected one that's used for the assume.


Similarly, why remove the !range check? It's not being optimized out, is it? (If it is, that's also interesting.)

Contributor Author

Checked the emitted IR — the assume (and the entire size computation) gets DCE'd in these two functions at -O3, since they only need alignment for the field projection. So there's no extra icmp at all, and the alignment load is still the first one with !range. Restored the original patterns as-is; the file is now unchanged from main.

@scottmcm scottmcm added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 22, 2026
TKanX and others added 2 commits February 22, 2026 13:42
…nzero

Co-authored-by: Scott McMurray <scottmcm@users.noreply.github.com>
…ign-nonzero

Co-authored-by: Scott McMurray <scottmcm@users.noreply.github.com>
@TKanX
Contributor Author

TKanX commented Feb 22, 2026

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 22, 2026
@TKanX TKanX requested a review from scottmcm February 22, 2026 21:55
@scottmcm
Member

Ah, great, that other file just not changing at all any more is excellent. Diffs that aren't there are my favourite things, as a reviewer 🙂

This probably isn't instantiated enough for an assume to be a perf problem, but checking just in case
@bors try @rust-timer queue

rust-bors bot pushed a commit that referenced this pull request Feb 22, 2026
…me, r=<try>

perf(codegen): Eliminate `size_of_val == 0` for DSTs with Non-zero-sized Prefix via NUW and Assume
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 22, 2026
@TKanX
Contributor Author

TKanX commented Feb 22, 2026

Ah, great, that other file just not changing at all any more is excellent. Diffs that aren't there are my favourite things, as a reviewer 🙂

This probably isn't instantiated enough for an assume to be a perf problem, but checking just in case @bors try @rust-timer queue

The assume path is cold enough I wasn't worried, but data's data.

@TKanX
Contributor Author

TKanX commented Feb 22, 2026

@scottmcm You were right about the sandwich — I only tested on x86_64, where LLVM DCE'd the assume entirely, but aarch64 keeps it alive. Working on it now.

Co-authored-by: Scott McMurray <scottmcm@users.noreply.github.com>
@scottmcm
Member

Hmm, why would aarch64 do anything different here? The codegen-llvm tests are running only the middle-end of llvm, not the backend, so it shouldn't matter...

@rust-bors
Contributor

rust-bors bot commented Feb 23, 2026

☀️ Try build successful (CI)
Build commit: 3cf407e (3cf407e8b04f1a796bf7b9360afd7972896f340d, parent: 1500f0f47f5fe8ddcd6528f6c6c031b210b4eac5)

@TKanX
Contributor Author

TKanX commented Feb 23, 2026

Hmm, why would aarch64 do anything different here? The codegen-llvm tests are running only the middle-end of llvm, not the backend, so it shouldn't matter...

You're right — the architecture has nothing to do with it. I was testing locally against LLVM 22, which DCEs the assume entirely. I verified this just now: same unoptimized IR through opt -O3 with both target triples produces identical output on LLVM 22. The x86_64-gnu-llvm-20 job was cancelled (not passed), so it would have failed the same way.

@scottmcm

@rust-timer
Collaborator

Finished benchmarking commit (3cf407e): comparison URL.

Overall result: no relevant changes - no action needed

Benchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This benchmark run did not return any relevant results for this metric.

Max RSS (memory usage)

Results (primary -0.8%)

A less reliable metric. May be of interest, but not used to determine the overall result above.

| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 2.1% | [2.1%, 2.1%] | 1 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -2.2% | [-2.5%, -1.9%] | 2 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.8% | [-2.5%, 2.1%] | 3 |

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 483.037s -> 479.541s (-0.72%)
Artifact size: 397.95 MiB -> 397.91 MiB (-0.01%)

@rustbot rustbot removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Feb 23, 2026

Labels

A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Development

Successfully merging this pull request may close these issues.

size_of_val(p) == 0 doesn't optimize out for clearly-not-ZST values

7 participants