avoid deadlock when reporting ice #111352

Closed
wants to merge 1 commit into from

Conversation

SparrowLii
Member

@SparrowLii SparrowLii commented May 8, 2023

Fixes the deadlock issue in #110284

When using the parallel compiler, we have added a deadlock handler via rayon's thread pool, but outside the thread pool (when printing the query stack while reporting an ICE) it is still possible to get stuck in a deadlock. So I added a timeout to print_query_stack to escape such deadlocks.

The impl of print_query_stack is moved from rustc_query_system to rustc_query_impl because we need to use with_context to let the sub-thread access the TLV.
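
For readers unfamiliar with the pattern, here is a minimal sketch of "run something with a timeout" using only the standard library. It is not the code from this PR: `run_with_timeout` is a hypothetical helper, and in rustc the helper thread would additionally need the TLS context, which is what the with_context remark above is about.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `work` on a helper thread and waits at most `timeout` for its result.
/// Returns `None` if the helper did not finish in time; the helper thread is
/// then left detached and may still be running (or stuck).
fn run_with_timeout<T, F>(work: F, timeout: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out; ignore that error.
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout).ok()
}
```

Note that a timeout like this only lets the caller move on. It does not stop or unblock the stuck helper thread.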

cc @Zoxc

@rustbot
Collaborator

rustbot commented May 8, 2023

r? @TaKO8Ki

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added A-query-system Area: The rustc query system (https://rustc-dev-guide.rust-lang.org/query.html) S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels May 8, 2023

let mut i = 0;

#[cfg(not(parallel_compiler))]
let query_map = qcx.try_collect_active_jobs();
Contributor

Can try_collect_active_jobs be removed from the QueryContext trait, and be made an inherent impl?

Member Author

I think we need to keep it, since rustc_query_system needs it to handle cycle errors:
https://github.com/rust-lang/rust/blob/master/compiler/rustc_query_system/src/query/plumbing.rs#L269

let query_map = qcx.try_collect_active_jobs();

#[cfg(parallel_compiler)]
let query_map = rustc_middle::ty::tls::with_context(|context| {
Contributor

Could you add a comment explaining why we spawn a thread here?

Member Author

Done :)

@@ -788,3 +788,64 @@ macro_rules! define_queries {
}
}
}

pub fn print_query_stack<'tcx>(
Contributor

Should this be an inherent method on QueryCtxt?

Member Author

Sure, done : )

Member Author

@SparrowLii SparrowLii May 9, 2023

Edit: I switched to rayon_core::join, so there is no longer any need to use with/enter_context. I reverted the change that lifted up print_query_stack.


s.spawn(move || {
rustc_middle::ty::tls::enter_context(context, || {
let query_map = qcx.try_collect_active_jobs();
Contributor

Calling try_collect_active_jobs on an outside thread can run into issues with WorkerLocal.

Member Author

@SparrowLii SparrowLii May 9, 2023

I think reporting an ICE happens outside the thread pool, so this is not a regression?

Member Author

I used rayon_core::join instead of spawn so that there is no problem with WorkerLocal.

But I still have some doubts: if ICE reporting runs in rayon's thread pool, why isn't the deadlock handler invoked?

@SparrowLii SparrowLii force-pushed the ice_deadlock branch 3 times, most recently from 7e0aadd to ad67dbe Compare May 9, 2023 03:29

@Zoxc
Contributor

Zoxc commented May 9, 2023

How does the actual deadlock occur?

I think running the deadlock handler on a Rayon thread as a Rayon job may be an improvement, but I'm not sure that would help for the actual deadlock you ran into.

@SparrowLii
Member Author

SparrowLii commented May 10, 2023

How does the actual deadlock occur?

Not sure yet. In theory, try_collect_active_jobs should not deadlock, since it only calls try_lock(). So I guess it has something to do with the specific bug in the derive use case (#110284).
Anyway, I think adding a timeout here is the easiest way to let us continue with other, more important work.
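
To illustrate why a try_lock()-only path is not expected to block, here is a small sketch using `std::sync::Mutex` as a stand-in. This is an assumption for illustration: rustc uses its own lock type, and `try_len` is a hypothetical helper. When the lock is already held, the call returns immediately instead of waiting.

```rust
use std::sync::{Mutex, TryLockError};

/// Returns `Some(len)` if the lock could be taken immediately, `None` otherwise.
/// This mirrors the "bail out early" shape of a `try_lock()?` call.
fn try_len(shard: &Mutex<Vec<u32>>) -> Option<usize> {
    match shard.try_lock() {
        Ok(guard) => Some(guard.len()),
        // Someone else holds the lock: report failure instead of waiting.
        Err(TryLockError::WouldBlock) => None,
        // A previous holder panicked: also treated as failure in this sketch.
        Err(TryLockError::Poisoned(_)) => None,
    }
}
```

So if something hangs in this path, the lock acquisition itself is an unlikely culprit; whatever runs while the guard is held is the more plausible place to look.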

@pnkfelix
Member

pnkfelix commented Jun 1, 2023

I think @Zoxc's request for a description of the actual deadlock is important here. It's really important to understand the root causes of problems like this, if you can isolate them, in order to confirm that the proposed fix is a real fix and not just masking the problem.

retagging as waiting-on-author to account for the need for an answer to that question.

@rustbot label: +S-waiting-on-author -S-waiting-on-review

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 1, 2023
@SparrowLii
Member Author

SparrowLii commented Jun 2, 2023

OK, I'm going to figure out the real cause of the deadlock.

@SparrowLii
Member Author

It looks like the problem is with make_query called in try_collect_active_jobs.

I experimented like this in try_collect_active_jobs:

    pub fn try_collect_active_jobs<Qcx: Copy>(
        &self,
        qcx: Qcx,
        make_query: fn(Qcx, K) -> QueryStackFrame<D>,
        jobs: &mut QueryMap<D>,
    ) -> Option<()> {
        #[cfg(parallel_compiler)]
        {
            // We use try_lock_shards here since we are called from the
            // deadlock handler, and this shouldn't be locked.
            let locks = self.active.locks();
            for lock in locks.iter() {
                use std::sync::mpsc::channel;
                use std::time::Duration;

                let timeout = Duration::from_secs(5); // Set the timeout to 5 seconds

                let (tx, rx) = channel();
                let map = lock.try_lock()?;

                std::thread::spawn(move || {
                    match rx.recv_timeout(timeout) {
                        Ok(_) => (),
                        Err(_) => eprintln!("collect actives failed: time out"),
                    }
                });
                for (k, v) in map.iter() {
                    if let QueryResult::Started(ref job) = *v {
                        let query = make_query(qcx, *k);
                        jobs.insert(job.id, QueryJobInfo { query, job: job.clone() });
                    }
                }
                tx.send(()).unwrap();
            }
        }
        #[cfg(not(parallel_compiler))]
        {
            // We use try_lock here since we are called from the
            // deadlock handler, and this shouldn't be locked.
            // (FIXME: Is this relevant for non-parallel compilers? It doesn't
            // really hurt much.)
            for (k, v) in self.active.try_lock()?.iter() {
                if let QueryResult::Started(ref job) = *v {
                    let query = make_query(qcx, *k);
                    jobs.insert(job.id, QueryJobInfo { query, job: job.clone() });
                }
            }
        }

        Some(())
    }

And it printed:

query stack during panic:
collect actives failed: time out
print query stack failed: time out
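
The watchdog idea used in the experiment can be isolated into a standalone sketch with only the standard library. `run_with_watchdog` is a hypothetical name, not something from rustc. Unlike the experiment, this sketch ignores a failed `send`, because the watchdog may already have timed out and dropped its receiver.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Runs `work` on the current thread while a watchdog thread waits for a
/// completion signal. Returns `true` if the watchdog saw the signal in time.
fn run_with_watchdog<F: FnOnce()>(work: F, timeout: Duration) -> bool {
    let (tx, rx) = mpsc::channel::<()>();
    let watchdog = thread::spawn(move || match rx.recv_timeout(timeout) {
        Ok(()) => true,
        Err(RecvTimeoutError::Timeout) => {
            eprintln!("work did not finish within {:?}", timeout);
            false
        }
        // The sender was dropped without signalling, e.g. `work` panicked.
        Err(RecvTimeoutError::Disconnected) => false,
    });
    work();
    // The watchdog may already have timed out and dropped `rx`; ignore the error.
    let _ = tx.send(());
    watchdog.join().unwrap()
}
```

If `work` never returns, the watchdog still prints its message, but the caller stays blocked. That matches the output above: the time-out is reported while the loop body is still running.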

@bors
Contributor

bors commented Jul 20, 2023

☔ The latest upstream changes (presumably #108714) made this pull request unmergeable. Please resolve the merge conflicts.

@SparrowLii
Member Author

Closing this, as #112708 has solved the issue.

@SparrowLii SparrowLii closed this Aug 8, 2023
Labels
A-query-system Area: The rustc query system (https://rustc-dev-guide.rust-lang.org/query.html) S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.