pantsd-runner hang in graph acquisition #5169
I suspect that this is because on the python side,
After implementing #5178, I was still seeing this, but inside of a lock lower in the stack. I bisected with:
...and came to #4931.
Dagnabit... tried removing the comment from the PR, but it didn't work. As mentioned in the commit message, this one isn't fixed yet.
### Problem

#5169 describes a deadlock that occurs when any native-engine lock is acquired by a non-main thread during a fork. Prior to #4931 the `context_lock` was acquired for the entire length of a request, but that change removed the lock acquisition. Kris fixed this as part of #5156, but there is not yet a test confirming that the issue is fixed. (An illustrative sketch of the fork/lock hazard follows below.)

### Solution

Add a test to cover #5169.

### Result

Covers #5169 with an integration test.
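For illustration only, here is a minimal, self-contained sketch of the general hazard described above; it is not the pantsd or native-engine code, and the use of `std::sync::Mutex` plus the `libc` crate's `fork` is an assumption chosen to keep the example small. A lock held by a non-main thread at fork time is copied into the child in its locked state, with no thread left in the child to release it, so the child blocks forever.

```rust
// Illustrative sketch, not the actual engine code. Requires the `libc` crate for fork(2).
use std::sync::{Arc, Mutex};
use std::{thread, time::Duration};

fn main() {
    let lock = Arc::new(Mutex::new(()));

    // A non-main thread acquires the lock and holds it across the fork.
    let held = Arc::clone(&lock);
    thread::spawn(move || {
        let _guard = held.lock().unwrap();
        thread::sleep(Duration::from_secs(60));
    });
    thread::sleep(Duration::from_millis(100)); // crude: give the thread time to lock

    // fork(2) duplicates only the calling (main) thread. The child inherits the
    // mutex in its locked state, but the thread that would unlock it does not
    // exist there, so the lock() below never returns -- the shape of the hang.
    if unsafe { libc::fork() } == 0 {
        let _guard = lock.lock().unwrap(); // child deadlocks here
        println!("never reached");
    }
    // Parent exits without waiting; this is only meant to show the pattern the
    // integration test needs to guard against.
}
```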
### Problem

The scheduler lock existed for a few reasons:

1. To prevent `Nodes` from being invalidated out of the `Graph` while they were being consumed by other `Nodes`. Because `Context` holds an `EntryId` for the running `Node`, the `Node` would fail (...panic, actually) at the next attempt to resolve a dependency.
2. To protect access to the `roots` field of the `Scheduler`, which was mutated in order to statefully build up the set of roots via the `add_root_select` method, and then capture results later.
3. We used to walk the graph from the python side (but no longer), which meant that `Node`s needed their lifecycle to be tightly controlled. The API no longer allows for walking inner `Node`s, so only roots are relevant now.

### Solution

Introduce a new `Failure::Invalidated` type to act as a signal that a `Node`'s `Context`/`EntryId` was invalidated while it was running (due to a file change on disk). The `Scheduler` will retry creating a `Node` a few times to allow execution to recover in the common case of minor file edits (a sketch of this retry pattern follows below).

Move the `roots` field off of `Scheduler` and onto a disposable `ExecutionRequest` struct, and have all relevant native functions take a reference to the `ExecutionRequest`.

### Result

The scheduler lock is removed. This does _not_ fix pantsbuild#5169, but it should clear up the python-side locking situation to assist in debugging it.
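As a rough illustration of the retry behaviour described in the Solution, here is a minimal sketch; the names `execute_with_retry`, `run_node`, and `max_retries` are hypothetical placeholders rather than the engine's actual API, and only the `Invalidated` variant matters for the example. The idea is a bounded retry loop that recreates a root `Node` when its run was invalidated mid-flight.

```rust
// Hypothetical sketch of the retry-on-invalidation idea; not the real engine types.
#[derive(Debug)]
enum Failure {
    Invalidated,
    Throw(String), // other failure kinds elided; shown only to suggest the shape
}

fn execute_with_retry<T>(
    max_retries: usize,
    mut run_node: impl FnMut() -> Result<T, Failure>,
) -> Result<T, Failure> {
    let mut attempts = 0;
    loop {
        match run_node() {
            // A file edit invalidated the running Node: recreate it and retry.
            Err(Failure::Invalidated) if attempts < max_retries => attempts += 1,
            // Success, a real failure, or too many invalidations: return as-is.
            other => return other,
        }
    }
}

fn main() {
    // Hypothetical usage: the node is invalidated twice, then succeeds.
    let mut calls = 0;
    let result = execute_with_retry(4, || {
        calls += 1;
        if calls < 3 { Err(Failure::Invalidated) } else { Ok("product") }
    });
    println!("{:?}", result); // Ok("product")
}
```

Bounding the retries lets execution recover from the common case of a minor edit without spinning indefinitely when files keep changing.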
The current thread is hung trying to acquire the scheduler lock.
I'm going to try pushing the scheduler lock down to the rust side.