-
Notifications
You must be signed in to change notification settings - Fork 13.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize QueryArena allocation #118227
Optimize QueryArena allocation #118227
Conversation
This allows avoiding some if != 0 checks when allocating worker-local datasets.
This cuts librustc_driver.so code size by ~85 kilobytes.
r? @cjgillot (rustbot has picked a reviewer for you, use r? to override) |
@bors try @rust-timer queue |
This comment has been minimized.
This comment has been minimized.
…, r=<try> Optimize QueryArena allocation This shifts the WorkerLocal wrapper to be outside the QueryArena, meaning that instead of having each query allocate distinct arenas per-worker we allocate the full set of arenas per-worker. This is primarily a code size optimization (locally, ~85 kilobytes), saving a bunch of code in the initialization of the arenas which was previously duplicated lots of times (per arena type). Additionally this tells LLVM that the thread count can't be zero in this code (I believe this is true?) which shaves some small amount of bytes off as well since we eliminate checks for zero in the vec allocations.
☀️ Try build successful - checks-actions |
This comment has been minimized.
This comment has been minimized.
Finished benchmarking commit (884c95a): comparison URL. Overall result: ❌✅ regressions and improvements - no action neededBenchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 675.686s -> 675.115s (-0.08%) |
@bors r+ |
☀️ Test successful - checks-actions |
Finished benchmarking commit (34c5ab9): comparison URL. Overall result: ❌✅ regressions and improvements - no action needed@rustbot label: -perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 676.299s -> 673.904s (-0.35%) |
This shifts the WorkerLocal wrapper to be outside the QueryArena, meaning that instead of having each query allocate distinct arenas per-worker we allocate the full set of arenas per-worker. This is primarily a code size optimization (locally, ~85 kilobytes, perf is reporting >100 kilobytes), saving a bunch of code in the initialization of the arenas which was previously duplicated lots of times (per arena type).
Additionally this tells LLVM that the thread count can't be zero in this code (I believe this is true?) which shaves some small amount of bytes off as well since we eliminate checks for zero in the vec allocations.