Resume one waiter at once in deadlock handler #137731
Conversation
Why would waking up multiple waiters result in the deadlock handler being triggered? How do you know that this doesn't just avoid deadlocks by reducing parallelism? Also using a global list for the waiters seems very questionable. You could just wake a single waiter without further changes?
At present, I can only confirm that the deadlock is caused by thread scheduling after the deadlock handler runs, but I can't determine the deeper reason, because I don't know much about

I think reducing parallelism here is a reasonable way to solve the problem: this code path is only reached during error handling, so it doesn't affect the compilation performance of correct code. (At least, it doesn't produce any errors now.)

This is the solution with the least change I can think of, since
```diff
     let mut jobs: Vec<QueryJobId> = query_map.keys().cloned().collect();

     let mut found_cycle = false;

     while jobs.len() > 0 {
-        if remove_cycle(&query_map, &mut jobs, &mut wakelist) {
+        if remove_cycle(&query_map, &mut jobs, &WAKELIST) {
             found_cycle = true;
```
I'd try to put a `break` below `found_cycle = true;`, with a `FIXME` to remove it later, once the real cause of the deadlocks is fixed. All other changes in the PR could be removed.
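For readers following along, here is a minimal, self-contained sketch of the suggested shape; `remove_cycle`, the job type, and the job list are hypothetical stubs standing in for the real query-system internals:

```rust
// Hypothetical stub: detect at most one query cycle, remove the involved
// jobs from `jobs`, and report whether a cycle was found. The real
// function lives in the rustc query system and also collects the waiter
// to resume.
fn remove_cycle(jobs: &mut Vec<u32>) -> bool {
    jobs.pop().is_some()
}

fn main() {
    let mut jobs: Vec<u32> = vec![1, 2, 3];
    let mut found_cycle = false;
    while !jobs.is_empty() {
        if remove_cycle(&mut jobs) {
            found_cycle = true;
            // FIXME: break after the first cycle so that only one waiter
            // is resumed per deadlock-handler invocation; remove this
            // once the real cause of the deadlocks is fixed.
            break;
        }
    }
    assert!(found_cycle);
}
```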
Ah, that sounds good. So `remove_cycle` doesn't always return `true` when there are deadlocks.
Thanks! I've done it.
```rust
// FIXME: Resuming all the waiters at once may cause deadlocks,
// so we resume one waiter per call for now.
// Remove this when we figure out the real cause.
```
I think it would be useful for future readers and for people investigating the deadlocks to expand the description here.
What do you think of something like the following comment, @Zoxc and @SparrowLii?
> We can encounter deadlocks for cycles we can't break here, but it's still unclear whether it's due to possible issues in rustc-rayon or instead in the handling of query cycles. We can avoid them by only waking up a single waiter instead of all of them. The deadlock issues seem to only appear when compilation errors are involved, so this reduction in parallelism, while suboptimal, is limited to the deadlock handler rather than affecting every compilation. The workaround gives up some potential gains, but there are still big improvements in the common case and no regressions compared to the single-threaded case. More investigation is still needed; once the real cause is fixed, we can wake all the waiters up.
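As a concrete illustration of "waking up a single waiter instead of all of them": in `std` terms this is the difference between `Condvar::notify_one` and `Condvar::notify_all`. The toy below is not the query system's actual waiter machinery (which goes through rustc-rayon); each resumed waiter hands the wake-up on, so only one waiter is runnable at a time:

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

fn main() {
    let pair = Arc::new((Mutex::new(false), Condvar::new()));
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let pair = Arc::clone(&pair);
            thread::spawn(move || {
                let (lock, cvar) = &*pair;
                let mut ready = lock.lock().unwrap();
                while !*ready {
                    ready = cvar.wait(ready).unwrap();
                }
                drop(ready);
                println!("waiter {i} resumed");
                // Chain the wake-up: resume exactly one further waiter.
                // notify_all() here would make every waiter runnable at
                // once and let the scheduler race them.
                cvar.notify_one();
            })
        })
        .collect();

    let (lock, cvar) = &*pair;
    *lock.lock().unwrap() = true;
    cvar.notify_one(); // the "deadlock handler" resumes a single waiter
    for h in handles {
        h.join().unwrap();
    }
}
```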
Thanks for helping to explain! I've updated the FIXME here and the function docs :)
> We can encounter deadlocks for cycles we can't break here

This doesn't make sense to me. The previous description seems better than that part, at least.
I reworked the description; it should read better now.
The branch was force-pushed from 572efbc to 3fc41d5.
```diff
@@ -506,7 +518,7 @@ pub fn break_query_cycles(query_map: QueryMap, registry: &rayon_core::Registry)
         );
     }

-    // FIXME: Ensure this won't cause a deadlock before we return
+    // FIXME: Ensure this won't cause a deadlock if we resume all waiters at once.
```
This FIXME is unrelated and shouldn't change.
Done :)
@bors r+
Rollup of 20 pull requests

Successful merges:

- rust-lang#134063 (dec2flt: Clean up float parsing modules)
- rust-lang#136581 (Retire the legacy `Makefile`-based `run-make` test infra)
- rust-lang#136662 (Count char width at most once in `Formatter::pad`)
- rust-lang#136764 (Make `ptr_cast_add_auto_to_object` lint into hard error)
- rust-lang#136798 (Added documentation for flushing per rust-lang#74348)
- rust-lang#136865 (Perform deeper compiletest path normalization for `$TEST_BUILD_DIR` to account for compare-mode/debugger cases, and normalize long type file filename hashes)
- rust-lang#136975 (Look for `python3` first on MacOS, not `py`)
- rust-lang#136977 (Upload Datadog metrics with citool)
- rust-lang#137240 (Slightly reformat `std::fs::remove_dir_all` error docs)
- rust-lang#137298 (Check signature WF when lowering MIR body)
- rust-lang#137463 ([illumos] attempt to use posix_spawn to spawn processes)
- rust-lang#137477 (uefi: Add Service Binding Protocol abstraction)
- rust-lang#137569 (Stabilize `string_extend_from_within`)
- rust-lang#137633 (Only use implied bounds hack if bevy, and use deeply normalize in implied bounds hack)
- rust-lang#137679 (Various coretests improvements)
- rust-lang#137723 (Make `rust.description` more general-purpose and pass `CFG_VER_DESCRIPTION`)
- rust-lang#137728 (Remove unsizing coercions for tuples)
- rust-lang#137731 (Resume one waiter at once in deadlock handler)
- rust-lang#137875 (mir_build: Integrate "simplification" steps into match-pair-tree creation)
- rust-lang#138028 (compiler: add `ExternAbi::is_rustic_abi`)

r? `@ghost`
`@rustbot` modify labels: rollup
Rollup merge of rust-lang#137731 - SparrowLii:waiter, r=nnethercote: Resume one waiter at once in deadlock handler
When multiple query loop errors occur in the code, only one waiter should be resumed at a time, to avoid waking up several waiters simultaneously and causing a deadlock due to contention between the resumed threads.
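To make that concrete, here is a toy model, assuming a hypothetical wake list and using `std::thread::park`/`unpark` rather than rustc-rayon's actual registry: the handler resumes exactly one waiter, and each resumed waiter passes the wake-up along.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

const WAITERS: usize = 3;

fn main() {
    // Hypothetical wake list: handles of parked "waiter" threads.
    let waiters = Arc::new(Mutex::new(Vec::<thread::Thread>::new()));

    let handles: Vec<_> = (0..WAITERS)
        .map(|i| {
            let waiters = Arc::clone(&waiters);
            thread::spawn(move || {
                waiters.lock().unwrap().push(thread::current());
                thread::park(); // blocked, like a query waiting on a cycle
                println!("waiter {i} resumed");
                // Pass the wake-up along: resume at most one other waiter,
                // so only one of them is runnable at a time.
                if let Some(t) = waiters.lock().unwrap().pop() {
                    t.unpark();
                }
            })
        })
        .collect();

    // Toy-only barrier: wait until every waiter has registered.
    while waiters.lock().unwrap().len() < WAITERS {
        thread::yield_now();
    }
    // The "deadlock handler": resume exactly one waiter, not all of them.
    let one = waiters.lock().unwrap().pop().expect("a registered waiter");
    one.unpark();

    for h in handles {
        h.join().unwrap();
    }
}
```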
This fixes the UI failures in #132051
cc @Zoxc @cjgillot @nnethercote @bjorn3 @Kobzol
Zulip discussion [here](https://rust-lang.zulipchat.com/#narrow/channel/187679-t-compiler.2Fwg-parallel-rustc/topic/Deadlocks.20and.20Rayon)
Edit: We can't reproduce these bugs with the existing test suites, so we keep them until we merge #132051
UPDATES #129912
UPDATES #120757
UPDATES #129911