release: switch to Cargo's default #9031
Conversation
Looks like the benchmarks didn't run 🤔
Yeah, it seems like our change detection is too aggressive. Would you mind running our hyperfine benchmarks (linting the CPython code base) in addition to the micro benchmarks, to get a better understanding of how the performance of the CLI etc. is impacted?
Hm, looks like #8225 has a bug in it, since this should have been detected as a code change @Cjkjvfnby
While we're at it, could we rename `release-debug` to `profiling`? That makes it clearer why this profile exists.
This makes the intent of the profile a little clearer. Ref #9031 (review)
CodSpeed Performance Report: merging #9031 will degrade performance by 5.92%.
Given the regressions, FWIW I'm also fine with different settings for
OK, so I followed @charliermarsh's suggestion to run a bake-off. I defined these profiles:

```toml
[profile.fatcg1]
inherits = "release"
lto = "fat"
codegen-units = 1

[profile.fatcg16]
inherits = "release"
lto = "fat"
codegen-units = 16

[profile.thincg1]
inherits = "release"
lto = "thin"
codegen-units = 1

[profile.thincg16]
inherits = "release"
lto = "thin"
codegen-units = 16

[profile.noltocg1]
inherits = "release"
lto = false
codegen-units = 1

[profile.noltocg16]
inherits = "release"
lto = false
codegen-units = 16
```

Then I compiled a binary for each profile:

```sh
cargo clean
mkdir -p target/release
for p in fatcg1 fatcg16 thincg1 thincg16 noltocg1 noltocg16; do
    cargo build --profile $p -p ruff_cli
    cp target/$p/ruff target/release/ruff-$p
done
```

And baked them off:
So the microbenchmark regressions here do indeed look a little scary, but the more holistic/realistic benchmark looks okay to me?
In theory I'm fine with it too, but I do feel like it can be pretty tricky. What I'm thinking about is something like this:
In this case, you might be seeing something very different from what was benchmarked in the PR, and tracking down the regression could prove quite annoying. LTO can greatly impact function inlining. My suspicion is that, in most cases, if a regression exists with LTO enabled, then it probably also exists with LTO disabled (or in a different mode). But not necessarily. With that said, yeah, if we find we can't relax the LTO configuration, then given the difference in compile times here, I'd probably accept the above as a downside I'd be willing to pay, I think.
@BurntSushi - Would it be easy to re-run that comparison with
Just a note that we're moving ahead on #8835 with the regression. I'm not sure whether this would affect the code generation in that PR anyway, but we should take the combined regression into account.
@charliermarsh Yeah! Here you go:
So in this case, we end up with a 1.05x regression for both
Sorry to ask for more benchmarks, but it would be good to have some numbers on the formatter too:

```sh
hyperfine "./target/release/ruff format ./checkouts/zulip"
```

And you may want to benchmark another project than CPython (one that actually uses Ruff, like homeassistant or zulip).

I'm a bit surprised that the lexer microbenchmarks are affected that much... Could it be that Rust directly inlines too much of the lexer into the benchmark, removing multiple function calls? I otherwise wouldn't expect much change, because the lexer code is mostly self-contained in one crate (and called by the parser from the same crate).

It may be worth comparing the profiles between LTO1 and THIN16 to see if there are some obvious candidates where it makes sense to add an `#[inline]` attribute.

It would be nice if we could specify the codegen units per crate.
I think I'm otherwise okay with a 3-5% regression, if we have sufficient proof that it boosts our productivity significantly. (Note: this won't improve our CI times, other than for the benchmark runs.)
This sets `lto = "thin"` instead of using "fat" LTO, and sets `codegen-units = 16`. These are the defaults for Cargo's `release` profile, and I think it may give us faster iteration times, especially when benchmarking. The point of this PR is to see what kind of impact this has on benchmarks. It is expected that benchmarks may regress to some extent.

I did some quick ad hoc experiments to quantify this change in compile times. Namely, I ran:

```sh
cargo build --profile release -p ruff_cli
```

Then I ran:

```sh
touch crates/ruff_python_formatter/src/expression/string/docstring.rs
```

(because that's where I've been working lately) and re-ran:

```sh
cargo build --profile release -p ruff_cli
```

This last command is what I timed, since it reflects how much time one has to wait between making a change and getting a compiled artifact. Here are my results:

* With the status quo `release` profile, the build takes 77s
* With `release` but `lto = "thin"`, the build takes 41s
* With `release` but `lto = false`, the build takes 19s
* With `release` but `lto = false` **and** `codegen-units = 16`, the build takes 7s
* With `release` but `lto = "thin"` **and** `codegen-units = 16`, the build takes 16s (I believe this is the default `release` configuration)

This PR represents the last option. It's not the fastest to compile, but it's nearly a whole minute faster! The idea is that with `codegen-units = 16`, we still make use of parallelism, but keep _some_ level of LTO on to try and re-gain what we lose by increasing the number of codegen units.
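To put those timings in perspective, here is a quick back-of-the-envelope calculation using the two endpoints listed above (77s for the status quo profile, 16s for Cargo's default); the helper function is just for illustration:

```rust
// Back-of-the-envelope math for the rebuild timings above:
// fat LTO + codegen-units = 1 (status quo) vs. thin LTO +
// codegen-units = 16 (Cargo's default release profile).
fn savings(before_s: f64, after_s: f64) -> (f64, f64) {
    // Returns (seconds saved per rebuild, relative speedup).
    (before_s - after_s, before_s / after_s)
}

fn main() {
    let (saved, speedup) = savings(77.0, 16.0);
    println!("saved {saved}s per rebuild ({speedup:.1}x faster)");
}
```

That is roughly a minute saved on every edit-compile cycle, which is where the "nearly a whole minute faster" claim above comes from.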
This function seems to get inlined with fat LTO and codegen-units=1, but not with thin LTO and codegen-units=16. So we forcefully inline it to make the lexer microbenchmarks a bit faster.
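For illustration, a minimal sketch of the technique this commit message describes; the function below is hypothetical, loosely modeled on what a lexer helper might look like, and is not Ruff's actual code:

```rust
// With thin LTO and codegen-units = 16, a small hot helper can land in a
// different codegen unit than its caller and miss out on inlining that
// fat LTO + codegen-units = 1 performed automatically. An explicit
// #[inline] hint makes the function's body available for cross-unit
// (and cross-crate) inlining again.
#[inline]
fn is_identifier_continuation(c: char) -> bool {
    c == '_' || c.is_alphanumeric()
}

fn main() {
    assert!(is_identifier_continuation('x'));
    assert!(is_identifier_continuation('_'));
    assert!(!is_identifier_continuation('-'));
}
```

The trade-off is that `#[inline]` duplicates the function body into every consuming codegen unit, so it is best reserved for small, hot functions like the one profiling identified here.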
Some dependencies, like the Python parser, can benefit a bit more from higher optimization levels. But we don't want to set codegen-units=1 for everything, since it results in a substantial compilation time hit.
All righty, I re-ran the hyperfine benchmarks:
And I also checked the formatter:
It looks like the relative difference here is about the same as with linting. My target or hope here is to switch to Cargo's default `release` profile. I was indeed able to add an `#[inline]` attribute. I looked for other opportunities like that, but came up short. I did see some functions inlined with
Aye. It's hard to provide proof, but I do actually feel somewhat strongly that this kind of improvement in iteration time will eventually pay for itself. Ideally, we could get it even faster than 20 seconds, but I think disabling thin LTO is probably a bridge too far. Getting better iteration times after this, I think, will involve figuring out how to make
Oh hey! Actually this can be done! I had assumed it wasn't possible.

```toml
# Some crates don't change as much but benefit more from
# more expensive optimization passes, so we selectively
# decrease codegen-units in some cases.
[profile.release.package.ruff_python_parser]
codegen-units = 1

[profile.release.package.ruff_python_ast]
codegen-units = 1
```

And at least locally, this seems to shrink the regression for the lexer microbenchmark substantially. I just pushed up that change here, so let's see how it does with CodSpeed. I also re-ran the hyperfine benchmarks above with this new configuration, but it doesn't detect any difference between it and
The micro-benchmark regressions are now quite a bit smaller: https://github.com/astral-sh/ruff/pull/9031/checks?check_run_id=19612967302

Summary: this PR reverts the release profile to the default configuration, but does set `codegen-units = 1` for a few specific crates.
Thanks for pushing for this, and for the detailed profiling. This seems like a worthwhile trade-off to me, one that most users shouldn't even notice. Long compile times have been a real concern for many contributors, and can be very noticeable if you work on a somewhat older computer.
I'm all in for merging as-is. We can reconsider using `lto = "fat"` in the future if we manage to improve our crate structure, with future Rust versions, or when we have better dev tooling (Earthly with incremental builds?).
Does this regression also affect the final published releases? I mean
@T-256 I discussed that point here. The core issue is that if the thing you're benchmarking and the thing you release are different, then you risk measuring and tuning the wrong thing. I don't mean to say that this is an iron-clad argument against something like having a
This is very good and thorough work. I'm slightly less convinced than @MichaReiser that this change is worth the trade-offs, since it only helps with release builds, and we do release builds locally so infrequently compared to how often Ruff is run in the wild by users. But the size of the regression is not so great that I would block it if you two are in favor.
I agree that most of us don't do frequent release builds (at least for publishing), but they're frequently needed when doing performance work, where building the benchmarks and running them takes a significant amount of time (to be fair, running them probably takes longer than building). Having a faster feedback cycle there would certainly help.

An alternative is to use different profiles for releasing and profiling, but that comes with the downside that what we see in profiles might not match what we see in production. But maybe that's something we have to accept anyway if we, e.g., consider using data-driven optimizations (run Ruff with a typical workload and then feed that information into the optimizer, similar to what Rust does).

Yet another alternative is to have a "light" and a "full" profiling profile: you can do both a full and a light profiling run, compare whether you see the same outliers, and then use the "light" profile to optimize the code until the outlier is gone.
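As a sketch of what that "light"/"full" split could look like in `Cargo.toml` (the profile names and settings here are hypothetical, not something this PR adds):

```toml
# Hypothetical profiles for the "light"/"full" idea above.
# "full" mirrors the published-release configuration; "light" trades
# fidelity for much faster rebuilds. Both keep debug info so that
# profilers can resolve symbols.
[profile.profiling-full]
inherits = "release"
lto = "fat"
codegen-units = 1
debug = true

[profile.profiling-light]
inherits = "release"
lto = false
codegen-units = 16
debug = true
```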
Production builds are intended to always be faster at run-time(?). We can document the build profiles: "For final production builds, we use optimized compilation flags to get the most optimized run-time." For measuring, we can always rely on the default release profile; since we'd always be comparing against the same kind of build, I think we'd get a correct comparison for new changes. When we need to compare against foreign tools (e.g. promoting Ruff against competitors), we can use numbers measured on production builds. For now, I think we can merge the current PR; production builds need a separate discussion.
I do find myself more sympathetic to the idea of having different profile settings for typical benchmarking and the final release build. But let's see how far we can get with having them use the same configuration. We can always revisit this and at least change the