Reduce the impact of Vec::reserve calls that do not cause any allocation #83357
r? @kennytm (rust-highfive has picked a reviewer for you, use r? to override)
[attached benchmark results: old, new, and benchcmp comparison]
Honestly I don't know what to make of all this. Overall it looks okay, but I'm going to spend some time maybe tomorrow tracking down the regressions. Most of them seem repeatable, for what that's worth. But I think some fraction of them are unrelated to the fact that this patch pessimizes calls to `reserve`.
// Therefore, we move all the resizing and error-handling logic from grow_amortized and
// handle_reserve behind a call, while making sure that this function is likely to be
// inlined as just a comparison and a call if the comparison fails.
#[cold]
Why mark this as `#[cold]` instead of `#[inline(never)]`? I would expect this to be called somewhat frequently in a normal program.
From what I have read in other comments, `cold` simply means LLVM's `coldcc`. So it minimizes the impact on the caller; isn't that the goal here? We assume the caller code to be hotter.
In some places inside `vec` (and some other modules) it's annotated with both `#[cold]` and `#[inline(never)]`; a sketch of that pattern follows.
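As a rough illustration of the split being discussed (hypothetical names, not the actual `RawVec` code): the capacity check stays inlinable, and the growth logic is kept out of line behind a `#[cold]` function.

```rust
struct Buf {
    len: usize,
    cap: usize,
}

impl Buf {
    /// Fast path: intended to inline down to a comparison plus a
    /// rarely-taken call.
    #[inline]
    fn reserve(&mut self, additional: usize) {
        if self.len + additional > self.cap {
            self.grow(additional);
        }
    }

    /// Slow path: kept out of line so it doesn't bloat callers.
    #[cold]
    #[inline(never)]
    fn grow(&mut self, additional: usize) {
        // Amortized doubling, clamped to at least the requested capacity.
        self.cap = (self.cap * 2).max(self.len + additional);
    }
}

fn main() {
    let mut b = Buf { len: 10, cap: 16 };
    b.reserve(8); // 10 + 8 > 16, so this takes the cold grow path
    assert!(b.cap >= 18);
}
```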
@jyn514 That's kind of right, and I'm taking a new look at this PR with the same goal of reducing the amount of time spent in `reserve` calls that don't cause any capacity change.

It looks to me like most of the `Vec::reserve` calls are guarded by a capacity check (such as the one in `push`), but some aren't. This manifests in the serde_json benchmarks because serializers do a lot of `Write::write_all` calls, which is implemented by `Vec::extend_from_slice`, which is implemented (eventually) by `append_elements`, which has an unguarded `self.reserve`. I'm assessing adding an `if self.len() + additional > self.capacity()` guard around the call in `append_elements` (sketched below), and I'm going to take a look at either guarding all the `reserve` calls or sinking the check into `Vec::reserve`, and deduplicating the check around a lot of the call sites in `Vec`.
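A sketch of what that guard could look like, using a hypothetical helper function for illustration (not the actual standard-library `append_elements`):

```rust
/// Illustrative only: check capacity up front so `reserve` (and its
/// out-of-line growth path) is reached only when the Vec must actually grow.
fn write_all_guarded(buf: &mut Vec<u8>, data: &[u8]) {
    if buf.len() + data.len() > buf.capacity() {
        buf.reserve(data.len());
    }
    buf.extend_from_slice(data); // in std this would copy into spare capacity
}

fn main() {
    let mut buf = Vec::with_capacity(64);
    write_all_guarded(&mut buf, b"hello"); // guard is false; no reserve call
    assert_eq!(buf, b"hello");
}
```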
> From what I have read in other comments, `cold` simply means LLVM's `coldcc`.

Note that `#[cold]` is the `cold` function attribute (https://llvm.org/docs/LangRef.html#function-attributes), not the `coldcc` calling convention. Changing the ABI (and thus having incompatible `fn` pointers) is `extern "rust-cold"` (#97544).
EDIT: Sorry, didn't realize this thread was so old 🤦 It was 5th on the "most recently updated" list I was looking at.
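A small example of the distinction (written for this writeup, not taken from the thread): `#[cold]` is only an optimization hint, so the function keeps the normal ABI and coerces to an ordinary `fn` pointer.

```rust
// `#[cold]` does not change the calling convention; an (unstable)
// `extern "rust-cold"` function would instead have an incompatible type.
#[cold]
fn unlikely_path(x: u32) -> u32 {
    x.wrapping_mul(31)
}

fn main() {
    let f: fn(u32) -> u32 = unlikely_path; // still a plain fn pointer
    assert_eq!(f(2), 62);
}
```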
I've looked over the impact this patch has on the standard library benchmarks. Of the ones I've looked into, my overall conclusion is that these benchmarks mostly need help, and the changes they report have relatively little relevance to any changes in this PR. I'll be looking into cleaning them up in the future.
Thanks, this is great!
@bors r+ rollup=never
📌 Commit 73d7734 has been approved by `dtolnay`
☀️ Test successful - checks-actions
Clean up Vec's benchmarks

The Vec benchmarks need a lot of love. I sort of noticed this in rust-lang#83357 but the overall situation is much less awesome than I thought at the time. The first commit just removes a lot of asserts and does a touch of other cleanup.

A number of these benchmarks are poorly named. For example, `bench_map_fast` is not in fact fast, `bench_rev_1` and `bench_rev_2` are vague, `bench_in_place_zip_iter_mut` doesn't call `zip`, and the `bench_in_place*` benchmarks don't do anything in place... Should I fix these, or is there tooling that depends on the names not changing?

I've also noticed that `bench_rev_1` and `bench_rev_2` are remarkably fragile. It looks like poking other code in `Vec` can cause the codegen of this benchmark to switch to a version that has almost exactly half its current throughput, and I have absolutely no idea why. Here's the fast version:

```asm
 0.69 │110: movdqu -0x20(%rbx,%rdx,4),%xmm0
 1.76 │     movdqu -0x10(%rbx,%rdx,4),%xmm1
 0.71 │     pshufd $0x1b,%xmm1,%xmm1
 0.60 │     pshufd $0x1b,%xmm0,%xmm0
 3.68 │     movdqu %xmm1,-0x30(%rcx)
14.36 │     movdqu %xmm0,-0x20(%rcx)
13.88 │     movdqu -0x40(%rbx,%rdx,4),%xmm0
 6.64 │     movdqu -0x30(%rbx,%rdx,4),%xmm1
 0.76 │     pshufd $0x1b,%xmm1,%xmm1
 0.77 │     pshufd $0x1b,%xmm0,%xmm0
 1.87 │     movdqu %xmm1,-0x10(%rcx)
13.01 │     movdqu %xmm0,(%rcx)
38.81 │     add    $0x40,%rcx
 0.92 │     add    $0xfffffffffffffff0,%rdx
 1.22 │   ↑ jne    110
```

And the slow one:

```asm
 0.42 │9a880: movdqa %xmm2,%xmm1
 4.03 │9a884: movq   -0x8(%rbx,%rsi,4),%xmm4
 8.49 │9a88a: pshufd $0xe1,%xmm4,%xmm4
 2.58 │9a88f: movq   -0x10(%rbx,%rsi,4),%xmm5
 7.02 │9a895: pshufd $0xe1,%xmm5,%xmm5
 4.79 │9a89a: punpcklqdq %xmm5,%xmm4
 5.77 │9a89e: movdqu %xmm4,-0x18(%rdx)
15.74 │9a8a3: movq   -0x18(%rbx,%rsi,4),%xmm4
 3.91 │9a8a9: pshufd $0xe1,%xmm4,%xmm4
 5.04 │9a8ae: movq   -0x20(%rbx,%rsi,4),%xmm5
 5.29 │9a8b4: pshufd $0xe1,%xmm5,%xmm5
 4.60 │9a8b9: punpcklqdq %xmm5,%xmm4
 9.81 │9a8bd: movdqu %xmm4,-0x8(%rdx)
11.05 │9a8c2: paddq  %xmm3,%xmm0
 0.86 │9a8c6: paddq  %xmm3,%xmm2
 5.89 │9a8ca: add    $0x20,%rdx
 0.12 │9a8ce: add    $0xfffffffffffffff8,%rsi
 1.16 │9a8d2: add    $0x2,%rdi
 2.96 │9a8d6: → jne 9a880 <<alloc::vec::Vec<T,A> as core::iter::traits::collect::Extend<&T>>::extend+0xd0>
```
@saethlin @dtolnay This seems to have caused some performance issues in compilation. Can we talk through the cost/benefit of this change? It seemed like the results of benchmarking were inconclusive. I'm wondering if we're sure that this change has an overall positive impact.
Yes, I think this change should probably be backed out in favor of the changes I was thinking of in this comment: #83357 (comment). I'm not quite sure why this was merged. Maybe I should have kept it as a draft?
I think a lot of callers expect `Vec::reserve` to be nearly free when no resizing is required, but unfortunately that isn't the case. LLVM makes remarkably poor inlining choices (along the path from `Vec::reserve` to `RawVec::grow_amortized`), so depending on the surrounding context you either get a huge blob of `RawVec`'s resizing logic inlined into some seemingly-unrelated function, or not enough inlining happens and/or the actual check in `needs_to_grow` ends up behind a function call. My goal is to make the codegen for `Vec::reserve` match the mental model that callers seem to have: it's reliably just a `sub cmp ja` if there is already sufficient capacity.

This patch has the following impact on the serde_json benchmarks (https://github.com/serde-rs/json-benchmark/tree/ca3efde8a5b75ff59271539b67452911860248c7), run with `cargo +stage1 run --release -- -n 1024`.
Before:
After:
That's approximately a one-third increase in throughput on two of the benchmarks, and no effect on one. (The benchmark suite has sufficient jitter that I could pick a run where there are no regressions, so I'm not convinced the apparent regressions are meaningful here.)
This also produces perf increases on the order of 3-5% in a few other microbenchmarks that I'm tracking. It might be useful to see if this has a cascading effect on inlining choices in some large codebases.
Compiling this simple program demonstrates the change in codegen that causes the perf impact:
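The original snippet isn't preserved in this copy of the thread; the following is a hypothetical stand-in of the same shape, a function whose only interesting codegen is the `reserve` fast path.

```rust
// Hypothetical stand-in for the original example program. After the patch,
// the capacity check in `reserve` should compile to roughly `sub cmp ja`
// plus a call to an out-of-line growth function on the unlikely branch.
pub fn reserve_one(v: &mut Vec<u8>) {
    v.reserve(1);
}

fn main() {
    let mut v = Vec::with_capacity(16);
    reserve_one(&mut v); // capacity is sufficient, so only the check runs
    assert!(v.capacity() > v.len());
}
```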
Before:
After: