Compile time regression introduced in r322401 #37436
Comments
Thanks for reporting this issue. Is this performance degradation due to a slower compiler or due to the quality of the code generated by LLVM?
Ah yes, a good clarification to make! The site at perf.rust-lang.org is tracking the compile time of various benchmarks, i.e. it's testing how long it takes rustc to compile various crates. That execution time, however, reflects both LLVM's generated code and LLVM's own execution time: the Rust compiler itself is compiled by LLVM, so some of the runtime is code generated by LLVM's master branch (the Rust compiler pieces) and some of the runtime is LLVM itself (optimizations and whatnot). In this particular case, though, I believe the regression is specifically in LLVM's execution time, not in the generated code. In the runs I was looking at, the timings of the Rust-specific pieces were comparable both before and after (i.e. testing LLVM's code generation); what varied was the amount of time the optimization passes in LLVM took.
Hello Alex, thank you for the detailed report. The commit you referenced adds a feature where a type of analysis (Dominators) is now kept up-to-date across one of the larger passes within LLVM. Due to the design of LLVM, this analysis information is kept separate from the actual IR, which means that when changes happen to one, they also have to be made in the other. The initial application of this patch was much worse, and we were forced to batch up the Dominator updates to minimize the performance impact.

I am a bit surprised by your loss of around 6.5%, though. It may be that rustc emits IR that is highly branchy and requires many more iterations inside JumpThreading to stabilize than code emitted by clang. In my testing of the batched mode, I saw most compile-sensitive tests around the 1% mark, with 3 outliers:

There is work ongoing to revise this interface again: it is unclear if this patch will improve performance, or possibly degrade it further, since it attempts to preserve the analyses of two data structures instead of just one (PDT and DT).

With all of this said, have you tried the latest tip of the tree? There is a new patch in JumpThreading that improves its performance and may help offset some of the loss you're seeing from keeping this analysis up-to-date. The patch is the following:
Ok, thanks for the information! Unfortunately, though, the patch you mentioned was actually the source of an even bigger regression [1] in our testing, which I reported as bug 38047 and which was reverted in r336393. I'd definitely believe that rustc's LLVM IR generation is subpar in various areas, and this is also just one particular benchmark in rustc rather than a massive regression across everything, so it's certainly not the end of the world. Thanks for taking the time to review!
It's possible you're seeing the regression here due to the change in how the code is generated. It added a synchronization (flush()) to every call where JumpThreading rewires the edges of the IR, which is similar to the immediate update method I originally abandoned. See here: https://reviews.llvm.org/D44282?id=138097#inline-392430 I was hoping that, with the change, the time saved by updating the analysis alongside the IR directly would offset the cost and result in a net win.
JumpThreading's driver is roughly:

```
for Block in Function:
    Process(Block)   # repeated until nothing changes
```

Process(BB) is a complex function that attempts several changes and transforms to the IR, especially to incoming and outgoing edges. Every change to these needs to be tracked and synchronized. The only way I can see this becoming more efficient is to overhaul Process(BB) to be more intelligent about the opportunities it attacks, rather than just keep trying until everything stabilizes.

That foo.ll file is huuuuge at 60MB! Keep in mind that JumpThreading is a function-level scalar transform: the more basic blocks you have in a single function, the more work JumpThreading has to do. One way you can reduce this in the code generated from Rust is to either reduce the number of basic blocks or strike a balance between calling external functions and forcing all the code inline. This code reminds me quite a bit of the amalgamated SQLite source, and that too performed worse with the change adding Dominator preservation.
Ah yeah, and sorry, I don't consider this a "hair on fire" bug at all! I figured it would basically be good hygiene to report this and make sure there was no obvious mistake (which it sounds like there definitely isn't). In that sense I think we're likely to merge our LLVM upgrade in Rust with this hit to perf on a few benchmarks. FWIW, I didn't try to reduce the huge IR file at all; it was basically just the first one that showed a perceptible-by-me difference in performance that I could bisect on. AFAIK we haven't spent a ton of time in Rust improving the LLVM IR we hand to the backend, but that's definitely a possible optimization for us if it comes to it! Y'all can definitely feel free to do what you'd like with this bug; if you'd prefer to close it as "working as intended for now", that's totally fine!
Any attempt by Rust to reduce the number of blocks emitted in a single function will help reduce the JumpThreading processing time. There are also a few tuning parameters to JumpThreading you can play with to see if it helps strike a compromise between the quality of the emitted code and the time it takes to compile:

```cpp
static cl::opt<unsigned>
    BBDuplicateThreshold("jump-threading-threshold",
                         cl::desc("Max block size to duplicate for jump threading"),
                         cl::init(6), cl::Hidden);
```
Extended Description
We're currently attempting to upgrade LLVM in the Rust project [1], but we're unfortunately losing performance from this upgrade on a number of our benchmarks [2]. I've been doing some bisection, and it's somewhat difficult to narrow in on any one regression because each remaining regression is smaller and smaller.
In any case, I've tracked down one regression to r322401, or https://reviews.llvm.org/D40146. The IR file [3] takes a while to pass through the `opt` tool, but before this commit it took 16.8s locally and afterwards it takes 17.9s. I haven't dug too much further into the patch, but let me know if I can help out!