Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resume one waiter at once in deadlock handler #137731

Merged
merged 1 commit into from
Mar 5, 2025
Merged

Conversation

SparrowLii
Copy link
Member

@SparrowLii SparrowLii commented Feb 27, 2025

When multiple query loop errors occur in the code, only one waiter should be resumed at a time to avoid waking up multiple waiters at the same time and causing deadlock due to thread grabbing.

This fixes the UI failures in #132051

cc @Zoxc @cjgillot @nnethercote @bjorn3 @Kobzol

Zulip discussion here

Edit: We can't reproduce these bugs with the existing test suits, so we keep them until we merge #132051
UPDATES #129912
UPDATES #120757
UPDATES #129911

@rustbot
Copy link
Collaborator

rustbot commented Feb 27, 2025

r? @fmease

rustbot has assigned @fmease.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added A-query-system Area: The rustc query system (https://rustc-dev-guide.rust-lang.org/query.html) S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 27, 2025
@SparrowLii SparrowLii marked this pull request as ready for review February 27, 2025 14:40
@Zoxc
Copy link
Contributor

Zoxc commented Feb 28, 2025

Why would waking up multiple waiters result in the deadlock handler being triggered?

How do you know that this doesn't just avoid deadlocks by reducing parallelism?

Also using a global list for the waiters seems very questionable. You could just wake a single waiter without further changes?

@SparrowLii
Copy link
Member Author

SparrowLii commented Feb 28, 2025

Why would waking up multiple waiters result in the deadlock handler being triggered?

At present, I can only confirm that the deadlock is caused by thread scheduling after deadlock handler, but I can't determine the deeper reason because I don't know much about rayon. (Based on #129912, I observed the query map before entering the deadlock hanlder, where the deadlocks caused by multiple query circles are located in their own threads, which is in line with expectations).

How do you know that this doesn't just avoid deadlocks by reducing parallelism?

I think reducing parallelism here is a good way to solve the problem, because this part belongs to error handling and does not affect the compilation performance of normal code. (At least it doesn't have any errors now).

Also using a global list for the waiters seems very questionable. You could just wake a single waiter without further changes?

This is the solution with the least change I can think of, since remove_cycle will modify query_map in one call so that it can no longer find the next waiter when break_query_cycles is called again, and I don't know how to solve it for now. I added a FIXME here, which can be optimized in a followup PR.

let mut jobs: Vec<QueryJobId> = query_map.keys().cloned().collect();

let mut found_cycle = false;

while jobs.len() > 0 {
if remove_cycle(&query_map, &mut jobs, &mut wakelist) {
if remove_cycle(&query_map, &mut jobs, &WAKELIST) {
found_cycle = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd try to put a break below found_cycle = true;, with a FIXME to remove it later, once the real cause of the deadlocks is fixed. All other changes in the PR could be removed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it sounds good. So remove_cycle doesn't always return true when there are deadlocks

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Has done it

Comment on lines 492 to 494
// FIXME: Resume all the waiters at once may cause deadlocks,
// so we resume one waiter at a call for now.
// Remove this when we figure out the real reason.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cool for future readers and people investigating the deadlocks to expand the description here I think.

What do you think of something like the following comment @Zoxc and @SparrowLii?

We can encounter deadlocks for cycles we can't break here, but it's still unclear whether it's due to possible issues in rustc-rayon or instead in the handling of query cycles. We can avoid them by only waking up a single waiter instead of all of them. The deadlock issues seem to only appear when compilation errors are involved, so this reduction in parallelism, while suboptimal, is not universal and only the deadlock handler will encounter these cases. The workaround shows loss of potential gains, but there still are big improvements in the common case, and no regressions compared to the single-threaded case. More investigation is still needed, and once fixed, we can wake up all the waiters up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for helping to explain! Have updated the FIXME here and the func document :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can encounter deadlocks for cycles we can't break here doesn't make sense to me. The previous description seems better than that part at least.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I optimized the description, which should looks better now

@lqd
Copy link
Member

lqd commented Mar 4, 2025

The break_query_cycles documentation will also need a slight update as it still refers to waking up all the waiters at once.

@SparrowLii SparrowLii force-pushed the waiter branch 2 times, most recently from 572efbc to 3fc41d5 Compare March 5, 2025 01:04
@@ -506,7 +518,7 @@ pub fn break_query_cycles(query_map: QueryMap, registry: &rayon_core::Registry)
);
}

// FIXME: Ensure this won't cause a deadlock before we return
// FIXME: Ensure this won't cause a deadlock if we resume all waiters at once.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This FIXME is unrelated and shouldn't change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)

@nnethercote
Copy link
Contributor

@bors r+

@bors
Copy link
Contributor

bors commented Mar 5, 2025

📌 Commit cc1e4ed has been approved by nnethercote

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Mar 5, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Mar 5, 2025
Rollup of 20 pull requests

Successful merges:

 - rust-lang#134063 (dec2flt: Clean up float parsing modules)
 - rust-lang#136581 (Retire the legacy `Makefile`-based `run-make` test infra)
 - rust-lang#136662 (Count char width at most once in `Formatter::pad`)
 - rust-lang#136764 (Make `ptr_cast_add_auto_to_object` lint into hard error)
 - rust-lang#136798 (Added documentation for flushing per rust-lang#74348)
 - rust-lang#136865 (Perform deeper compiletest path normalization for `$TEST_BUILD_DIR` to account for compare-mode/debugger cases, and normalize long type file filename hashes)
 - rust-lang#136975 (Look for `python3` first on MacOS, not `py`)
 - rust-lang#136977 (Upload Datadog metrics with citool)
 - rust-lang#137240 (Slightly reformat `std::fs::remove_dir_all` error docs)
 - rust-lang#137298 (Check signature WF when lowering MIR body)
 - rust-lang#137463 ([illumos] attempt to use posix_spawn to spawn processes)
 - rust-lang#137477 (uefi: Add Service Binding Protocol abstraction)
 - rust-lang#137569 (Stabilize `string_extend_from_within`)
 - rust-lang#137633 (Only use implied bounds hack if bevy, and use deeply normalize in implied bounds hack)
 - rust-lang#137679 (Various coretests improvements)
 - rust-lang#137723 (Make `rust.description` more general-purpose and pass `CFG_VER_DESCRIPTION`)
 - rust-lang#137728 (Remove unsizing coercions for tuples)
 - rust-lang#137731 (Resume one waiter at once in deadlock handler)
 - rust-lang#137875 (mir_build: Integrate "simplification" steps into match-pair-tree creation)
 - rust-lang#138028 (compiler: add `ExternAbi::is_rustic_abi`)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 927c11f into rust-lang:master Mar 5, 2025
6 checks passed
@rustbot rustbot added this to the 1.87.0 milestone Mar 5, 2025
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Mar 5, 2025
Rollup merge of rust-lang#137731 - SparrowLii:waiter, r=nnethercote

Resume one waiter at once in deadlock handler

When multiple query loop errors occur in the code, only one waiter should be resumed at a time to avoid waking up multiple waiters at the same time and causing deadlock due to thread grabbing.

This fixes the UI failures in rust-lang#132051

cc `@Zoxc` `@cjgillot` `@nnethercote` `@bjorn3` `@Kobzol`

Zulip discussion [here](https://rust-lang.zulipchat.com/#narrow/channel/187679-t-compiler.2Fwg-parallel-rustc/topic/Deadlocks.20and.20Rayon)

Edit: We can't reproduce these bugs with the existing test suits, so we keep them until we merge rust-lang#132051
UPDATES rust-lang#129912
UPDATES rust-lang#120757
UPDATES rust-lang#129911
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-query-system Area: The rustc query system (https://rustc-dev-guide.rust-lang.org/query.html) S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants