Performance of helloworld5000 could be improved #50994
Comments
I've opened an issue for the LLVM part upstream: https://bugs.llvm.org/show_bug.cgi?id=37588
Thank you, @nikic! That's extremely helpful.
@dotdash Are you sure that's the right PR? It's a Cargo update.
oops, typo'd, it's #57351
Thinking about it, given that the test case originally only took 4.5s, I suspect that the const_eval part might have been a regression that was introduced in the meantime.
perf: Don't track specific live points for promoteds

We don't query this information out of the promoted (it's basically a single "unit" regardless of the complexity within it), and this saves on re-initializing the SparseIntervalMatrix's backing IndexVec with mostly empty rows for all of the leading regions in the function. Typical promoteds will only contain a few regions that need to be uplifted, while the parent function can have thousands.

For a simple function repeating println!("Hello world"); 50,000 times, this reduces compile times from 90 to 15 seconds in debug mode. The previous implementation's re-initialization led to an overall roughly n^2 runtime, as each promoted initialized slots for ~n regions; now we scale closer to linearly (5000 hello worlds takes 1.1 seconds).

cc rust-lang#50994, rust-lang#86244
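The scaling argument in that commit message can be checked with back-of-the-envelope arithmetic. The following is only a toy model, not compiler code: both function names are hypothetical, and it assumes one promoted and one region per println! plus three regions inside each promoted, numbers chosen purely for illustration.

```rust
/// Slots initialized when every promoted sets up a row for each of the
/// parent function's regions (the behavior described as roughly n^2).
fn slots_with_full_reinit(promoteds: u64, parent_regions: u64) -> u64 {
    promoteds * parent_regions
}

/// Slots initialized when each promoted only handles its own few regions.
fn slots_with_own_regions(promoteds: u64, regions_per_promoted: u64) -> u64 {
    promoteds * regions_per_promoted
}

fn main() {
    // Assumed for illustration only: 3 regions inside each promoted.
    let regions_per_promoted = 3;
    for n in [5_000u64, 50_000] {
        println!(
            "n = {n}: full re-init touches {} slots, own regions only {} slots",
            slots_with_full_reinit(n, n),
            slots_with_own_regions(n, regions_per_promoted)
        );
    }
}
```

Under these assumptions, a tenfold increase in n multiplies the first count by a hundred and the second by ten, which is the difference between the quadratic and near-linear behavior the commit describes.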
Either my machine is that much of a beast, or this was simply fixed.

$ time rustc helloworld5000.rs
real 0m0.423s
user 0m0.384s
sys 0m0.039s

$ time rustc -Copt-level=3 helloworld5000.rs
real 0m1.724s
user 0m1.579s
sys 0m0.144s
helloworld5000 is the name I've given to the benchmark that is helloworld with the println!("Hello world"); repeated 5,000 times. It's an interesting stress test for the compiler.

On my machine, a debug build takes 4.5 seconds and an opt build takes 62(!) seconds.
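For anyone wanting to reproduce the benchmark, a small generator along these lines should produce an equivalent file. This is a sketch, not the original benchmark source: the helper name generate_source is made up here, and the exact layout of the real helloworld5000.rs (indentation, surrounding code) is assumed.

```rust
/// Build the source text of a `helloworld`-style program whose `main`
/// contains `n` copies of `println!("Hello world");`.
fn generate_source(n: usize) -> String {
    let mut src = String::from("fn main() {\n");
    for _ in 0..n {
        src.push_str("    println!(\"Hello world\");\n");
    }
    src.push_str("}\n");
    src
}

fn main() -> std::io::Result<()> {
    // 5,000 repetitions, matching the benchmark's name.
    std::fs::write("helloworld5000.rs", generate_source(5000))
}
```

The resulting file can then be timed with `time rustc helloworld5000.rs` for the debug build and `time rustc -Copt-level=3 helloworld5000.rs` for the opt build.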
In the debug build, execution time is dominated by take_and_reset_data. Cachegrind measures these instruction counts:

The reset_unifications call within take_and_reset_data is the expensive part. It all boils down to set_all within the ena crate:

and iterator code (called from set_all):

I did some measurement and found that, in the vast majority of cases, reset_unifications is a no-op -- it overwrites the unification table with the same values that it already has. I wonder if we could do better somehow. It's a shame we have to keep the unbound variables around rather than just clearing them like we do with the other data in take_and_reset_data. I know that this is an extreme example, but profiles indicate that reset_unifications is somewhat hot on more normal programs too. @nikomatsakis, any ideas?

In the opt builds, these are the top functions according to Cachegrind:

That's a lot of time in PointerMayBeCaptured.
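The observation that reset_unifications mostly rewrites values that are already in place can be illustrated with a toy model. This is a sketch under loud assumptions: VarValue and set_all_counting are hypothetical stand-ins, not ena's actual types or API, and the table is reduced to a plain slice of parent/rank pairs.

```rust
/// Toy stand-in for one entry of a unification table (hypothetical; the
/// real `ena` types are more involved).
#[derive(Clone, Copy, PartialEq, Debug)]
struct VarValue {
    parent: u32,
    rank: u32,
}

/// Overwrite every entry with `new_value(index)` and report how many
/// entries actually held a different value beforehand.
fn set_all_counting(table: &mut [VarValue], new_value: impl Fn(u32) -> VarValue) -> usize {
    let mut changed = 0;
    for (i, slot) in table.iter_mut().enumerate() {
        let fresh = new_value(i as u32);
        if *slot != fresh {
            changed += 1;
        }
        *slot = fresh;
    }
    changed
}

fn main() {
    // Every variable starts as its own root, i.e. unbound.
    let root = |i: u32| VarValue { parent: i, rank: 0 };
    let mut table: Vec<VarValue> = (0..5000).map(root).collect();

    // Touch a handful of entries so they differ from the reset state.
    table[1].parent = 0;
    table[0].rank = 1;
    table[7].parent = 3;

    let changed = set_all_counting(&mut table, root);
    println!("{changed} of {} writes changed a value", table.len());
}
```

In this model the reset still visits all 5,000 entries even though only three of them differ, which is the kind of wasted work described above. Whether skipping or tracking dirty entries would pay off in the real table is exactly the open question.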