proc_macro: use crossbeam channels for the proc_macro cross-thread bridge #99123

mystor · 2022-07-10T18:13:02Z

This is done by having the crossbeam dependency inserted into the proc_macro server code from the server side, to avoid adding a dependency to proc_macro.

In addition, this introduces a -Z command-line option which will switch rustc to run proc-macros using this cross-thread executor. With the changes to the bridge in #98186, #98187, #98188 and #98189, the performance of the executor should be much closer to same-thread execution.

In local testing, the crossbeam executor was substantially more performant than either of the two existing CrossThread strategies, so they have been removed to keep things simple.

r? @eddyb

mystor · 2022-07-10T18:15:45Z

I don't think it's worth doing a perf run with the cross-thread executor enabled until at least #98189 has landed, and this patch shouldn't impact perf at all for the existing same-thread executor. We can do a perf comparison with a PR changing the default after these patches have both landed to see how much more work is left before a cross-thread executor is a viable option.

eddyb

LGTM, and being able to test cross-thread proc macro isolation would be great, but I'm not sure if the new -Z flag requires an MCP (cc @rust-lang/compiler).

eddyb · 2022-07-27T15:30:51Z

MCP has completed (see also the Zulip thread, ^{_{which was mostly discussing the background of proc macro isolation}}):

Introduce -Z flag to control how proc-macros are run compiler-team#528

eddyb

r=me after rebase

compiler/rustc_expand/src/proc_macro.rs

rustbot · 2022-07-29T19:15:50Z

Some changes occurred in library/proc_macro/src/bridge

cc @rust-lang/wg-rls-2

…idge This is done by having the crossbeam dependency inserted into the proc_macro server code from the server side, to avoid adding a dependency to proc_macro. In addition, this introduces a -Z command-line option which will switch rustc to run proc-macros using this cross-thread executor. With the changes to the bridge in rust-lang#98186, rust-lang#98187, rust-lang#98188 and rust-lang#98189, the performance of the executor should be much closer to same-thread execution. In local testing, the crossbeam executor was substantially more performant than either of the two existing CrossThread strategies, so they have been removed to keep things simple.

eddyb · 2022-07-29T23:09:19Z

@bors r+ rollup=never

bors · 2022-07-29T23:09:21Z

📌 Commit 6d1650f has been approved by eddyb

It is now in the queue for this repository.

bors · 2022-07-30T04:05:31Z

⌛ Testing commit 6d1650f with merge bd84c73...

bors · 2022-07-30T06:51:23Z

☀️ Test successful - checks-actions
Approved by: eddyb
Pushing bd84c73 to master...

rust-timer · 2022-07-30T08:08:59Z

Finished benchmarking commit (bd84c73): comparison url.

Instruction count

Primary benchmarks: 😿 relevant regressions found
Secondary benchmarks: no relevant changes found

	mean¹	max	count²
Regressions 😿 (primary)	0.9%	1.4%	11
Regressions 😿 (secondary)	N/A	N/A	0
Improvements 🎉 (primary)	N/A	N/A	0
Improvements 🎉 (secondary)	N/A	N/A	0
All 😿🎉 (primary)	0.9%	1.4%	11

Max RSS (memory usage)

Results

Primary benchmarks: 😿 relevant regression found
Secondary benchmarks: 🎉 relevant improvements found

	mean¹	max	count²
Regressions 😿 (primary)	2.1%	2.1%	1
Regressions 😿 (secondary)	N/A	N/A	0
Improvements 🎉 (primary)	N/A	N/A	0
Improvements 🎉 (secondary)	-3.2%	-3.4%	3
All 😿🎉 (primary)	2.1%	2.1%	1

Cycles

Results

Primary benchmarks: mixed results
Secondary benchmarks: mixed results

	mean¹	max	count²
Regressions 😿 (primary)	3.5%	7.0%	4
Regressions 😿 (secondary)	1.6%	1.6%	1
Improvements 🎉 (primary)	-4.4%	-4.4%	1
Improvements 🎉 (secondary)	-3.1%	-3.1%	1
All 😿🎉 (primary)	1.9%	7.0%	5

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf.

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

the arithmetic mean of the percent change ↩ ↩² ↩³
number of relevant changes ↩ ↩² ↩³

nnethercote · 2022-07-30T09:44:20Z

this patch shouldn't impact perf at all for the existing same-thread executor.

html5ever-0.26.0 regressed uniformly. Any ideas why that might have happened?

Having said that, clicking on "show non-relevant results" shows some possible improvements. E.g. keccak had a bunch of ~2% improvements that for some reason didn't hit the significance threshold.

eddyb · 2022-07-30T23:23:17Z

cc @wesleywiser Just noticed something here - so html5ever-0.26.0-debug incr-unchanged has the worse % increase in expand_crate, and I'm wondering if it would make sense to track more fine-grained aspects of expansion?
It would be useful to know, for example, how many proc macro invocations happened, how much they account for the expand_crate time, etc.

Regarding the apparent regression: if this is LLVM doing something weird, we could mark the cross-thread executor method as #[cold] #[inline(never)] perhaps?

pnkfelix · 2022-08-12T15:29:47Z

I don't know if its helpful @eddyb , but I've recreated the regression locally, and I think the bulk of the increase in instruction count is coming from <rustc_expand::mbe::macro_parser::TtParser>::parse_tt for some reason:

Instruction Counts	8f68c43c	bd84c73f	difference
TOTAL	988,908,196 (100.0%)	1,002,719,736 (100.0%)	+13,811,540
parse_tt	134,903,906 (13.64%)	149,152,914 (14.87%)	+14,249,008

In any case, I'm going to take a shot now at marking the cross-thread executor method as #[cold] #[inline(never)] to see if that makes this issue go away here.

pnkfelix · 2022-08-12T18:03:22Z

Unfortunately I took my shot (see below) at marking the cross-thread executor as #[cold] #[inline(never)] and it didn't seem to help.

diff

diff --git a/library/proc_macro/src/bridge/server.rs b/library/proc_macro/src/bridge/server.rs
index e068ec60b6a..a29fae17292 100644
--- a/library/proc_macro/src/bridge/server.rs
+++ b/library/proc_macro/src/bridge/server.rs
@@ -213,6 +213,8 @@ impl<P> ExecutionStrategy for CrossThread<P>
 where
     P: MessagePipe<Buffer> + Send + 'static,
 {
+    #[cold]
+    #[inline(never)]
     fn run_bridge_and_client(
         &self,
         dispatcher: &mut impl DispatcherTrait,

eddyb · 2022-08-14T12:18:44Z

compiler/rustc_session/src/options.rs

+    proc_macro_execution_strategy: ProcMacroExecutionStrategy = (ProcMacroExecutionStrategy::SameThread,
+        parse_proc_macro_execution_strategy, [UNTRACKED],
+        "how to run proc-macro code (default: same-thread)"),


@pnkfelix I'm running out of ideas... what if you move this to the end? (as in, maybe the order of the fields or their offsets, in this huge struct, is causing weird things)

It could even be its presence - the flag could be removed and exec_strategy (the sole user of the flag, in compiler/rustc_expand/src/proc_macro.rs) made to, idk, use std::env::var instead? (or just hardcode the choice, but that might cause changes in optimizations)

Wacky! I moved the field to the end of the struct, and that made the two methods with the two highest instruction counts (parse_tt and MacroRulesMacroExpander::expand) each have lower instruction-counts (Ir) according to cachegrind, but the overall instruction-count increased:

Instruction Counts 8f68c43c bd84c73f field-at-end

TOTAL 988,908,196 (100.0%) 1,002,719,736 (100.0%) 1,138,908,984 (100.0%)

parse_tt 134,903,906 (13.64%) 149,152,914 (14.87%) 99,794,157 (8.76%)

MacroRules
MacroExpander
::expand 56,387,220 (5.70%) 56,379,944 (5.62%) 50,969,953 (4.48%)

SipHasher128::finish128 5,975,227 (0.60%) 5,975,227 (0.60%) 36,508,882 (3.21%)

token_name_eq (not in report) (not in report) 29,133,905 (2.56%)

try_mark_previous_green 53,862,272 (5.45%) 53,872,906 (5.37%) 25,344,946 (2.23%)

I don't really want to spend too much more trying to dissect this thing that is probably just an LLVM codegen oddity, but I will at least try eliminating the field entirely as you suggest.

(As an aside, the fact that this kind of field shuffling had such an enormous impact on parse_tt does make me wonder if there's some kind of low-hanging fruit embedded in the code there, e.g. maybe the code is accessing the session when it should be locally stashing an unchanging value? But isn't that the kind of thing you'd like to believe your compiler is doing for you automatically? 😆 )

I suspect that the code is hot, and also that register spilling is happening in a different way or inlining choices are a little different. I.e. the kind of thing that's hard to control, and even if you manage to get it in the faster state now a tiny change later on might make it go back to the slower state.

rust-highfive assigned eddyb Jul 10, 2022

rustbot added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Jul 10, 2022

This comment was marked as off-topic.

Sign in to view

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 10, 2022

eddyb approved these changes Jul 10, 2022

View reviewed changes

mystor mentioned this pull request Jul 14, 2022

Introduce -Z flag to control how proc-macros are run rust-lang/compiler-team#528

Closed

3 tasks

eddyb suggested changes Jul 27, 2022

View reviewed changes

compiler/rustc_expand/src/proc_macro.rs Outdated Show resolved Hide resolved

mystor force-pushed the crossbeam_bridge branch from af9579f to eecb355 Compare July 29, 2022 19:15

This comment has been minimized.

Sign in to view

mystor force-pushed the crossbeam_bridge branch from eecb355 to 6d1650f Compare July 29, 2022 21:38

eddyb approved these changes Jul 29, 2022

View reviewed changes

bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 29, 2022

bors added the merged-by-bors This PR was explicitly merged by bors. label Jul 30, 2022

bors merged commit bd84c73 into rust-lang:master Jul 30, 2022

rustbot added this to the 1.64.0 milestone Jul 30, 2022

rustbot added the perf-regression Performance regression. label Jul 30, 2022

nnethercote mentioned this pull request Jul 30, 2022

cc @rust-lang/wg-compiler-performance on master perf regressions. rust-lang/rustc-perf#1369

Merged

eddyb reviewed Aug 14, 2022

View reviewed changes

proc_macro: use crossbeam channels for the proc_macro cross-thread bridge #99123

proc_macro: use crossbeam channels for the proc_macro cross-thread bridge #99123

Uh oh!

Conversation

mystor commented Jul 10, 2022

Uh oh!

This comment was marked as off-topic.

mystor commented Jul 10, 2022

Uh oh!

eddyb left a comment

Choose a reason for hiding this comment

Uh oh!

eddyb commented Jul 27, 2022

Uh oh!

eddyb left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rustbot commented Jul 29, 2022

Uh oh!

This comment has been minimized.

eddyb commented Jul 29, 2022

Uh oh!

bors commented Jul 29, 2022

Uh oh!

bors commented Jul 30, 2022

Uh oh!

bors commented Jul 30, 2022

Uh oh!

rust-timer commented Jul 30, 2022

Instruction count

Max RSS (memory usage)

Cycles

Footnotes

Uh oh!

nnethercote commented Jul 30, 2022

Uh oh!

eddyb commented Jul 30, 2022

Uh oh!

pnkfelix commented Aug 12, 2022

Uh oh!

pnkfelix commented Aug 12, 2022

Uh oh!

eddyb Aug 14, 2022

Choose a reason for hiding this comment

Uh oh!

pnkfelix Aug 15, 2022

Choose a reason for hiding this comment

Uh oh!

nnethercote Aug 15, 2022

Choose a reason for hiding this comment

Uh oh!

Uh oh!