Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pantsd-runner hang in graph acquisition #5169

Closed
stuhood opened this issue Dec 5, 2017 · 4 comments
Closed

pantsd-runner hang in graph acquisition #5169

stuhood opened this issue Dec 5, 2017 · 4 comments
Assignees
Labels
Milestone

Comments

@stuhood
Copy link
Member

stuhood commented Dec 5, 2017

The current thread is hung trying to acquire the scheduler lock.

I'm going to try pushing the scheduler lock down to the rust side.

Thread 0x000070000b276000 (most recent call first):
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/reporting/report.py", line 34 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x000070000ae73000 (most recent call first):
  File "/opt/ee/python/2.7/lib/python2.7/multiprocessing/pool.py", line 389 in _handle_results
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 763 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x000070000aa70000 (most recent call first):
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 340 in wait
  File "/opt/ee/python/2.7/lib/python2.7/Queue.py", line 168 in get
  File "/opt/ee/python/2.7/lib/python2.7/multiprocessing/pool.py", line 336 in _handle_tasks
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 763 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x000070000a66d000 (most recent call first):
  File "/opt/ee/python/2.7/lib/python2.7/multiprocessing/pool.py", line 327 in _handle_workers
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 763 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x0000700002d95000 (most recent call first):
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 340 in wait
  File "/opt/ee/python/2.7/lib/python2.7/Queue.py", line 168 in get
  File "/Users/stuhood/.pex/install/futures-3.0.5-py2-none-any.whl.ab865f3c2a112dab372f8017be358c0c2f575561/futures-3.0.5-py2-none-any.whl/concurrent/futures/thread.py", line 65 in _worker
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 763 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x0000700002992000 (most recent call first):
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 340 in wait
  File "/opt/ee/python/2.7/lib/python2.7/Queue.py", line 168 in get
  File "/Users/stuhood/.pex/install/futures-3.0.5-py2-none-any.whl.ab865f3c2a112dab372f8017be358c0c2f575561/futures-3.0.5-py2-none-any.whl/concurrent/futures/thread.py", line 65 in _worker
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 763 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Current thread 0x000070000258f000 (most recent call first):
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 174 in acquire
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/engine_initializer.py", line 96 in create_build_graph
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/goal_runner.py", line 115 in _init_graph
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/goal_runner.py", line 180 in _setup_context
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/goal_runner.py", line 208 in setup
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/local_pants_runner.py", line 77 in _run
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/local_pants_runner.py", line 37 in run
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/daemon_pants_runner.py", line 171 in post_fork_child
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/process_manager.py", line 451 in daemonize
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/daemon_pants_runner.py", line 132 in run
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/pailgun_server.py", line 58 in _run_pants
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/pailgun_server.py", line 81 in handle
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/pailgun_server.py", line 41 in handle_request
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/pailgun_server.py", line 135 in process_request
  File "/opt/ee/python/2.7/lib/python2.7/SocketServer.py", line 295 in _handle_request_noblock
  File "/opt/ee/python/2.7/lib/python2.7/SocketServer.py", line 280 in handle_request
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/service/pailgun_service.py", line 94 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 763 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x000070000218c000 (most recent call first):
  File "/Users/stuhood/.pex/install/pywatchman-1.4.1-cp27-cp27m-macosx_10_4_x86_64.whl.176c2b1fae2f3b26d3f824138b0754ed86f60f9a/pywatchman-1.4.1-cp27-cp27m-macosx_10_4_x86_64.whl/pywatchman/__init__.py", line 339 in readBytes
  File "/Users/stuhood/.pex/install/pywatchman-1.4.1-cp27-cp27m-macosx_10_4_x86_64.whl.176c2b1fae2f3b26d3f824138b0754ed86f60f9a/pywatchman-1.4.1-cp27-cp27m-macosx_10_4_x86_64.whl/pywatchman/__init__.py", line 695 in receive
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/watchman_client.py", line 50 in stream_query
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/watchman.py", line 178 in subscribed
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/service/fs_event_service.py", line 116 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 763 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x0000700001d89000 (most recent call first):
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/service/scheduler_service.py", line 71 in _process_event_queue
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/service/scheduler_service.py", line 101 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 763 in run
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 810 in __bootstrap_inner
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 783 in __bootstrap

Thread 0x00007fffaa3cf340 (most recent call first):
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 359 in wait
  File "/opt/ee/python/2.7/lib/python2.7/threading.py", line 960 in join
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/pants_daemon.py", line 259 in _run_services
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/pants_daemon.py", line 283 in run_sync
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/pantsd/pants_daemon.py", line 329 in launch
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/pants_loader.py", line 58 in load_and_execute
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/pants_loader.py", line 65 in run
  File "/Users/stuhood/.pex/install/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl.46445b2c7acbf09afd084ac31d3da3e261ae5636/pantsbuild.pants-1.4.0.dev16+899381073-cp27-none-macosx_10_8_x86_64.whl/pants/bin/pants_loader.py", line 69 in main
  File ".bootstrap/_pex/pex.py", line 484 in execute_pkg_resources
  File ".bootstrap/_pex/pex.py", line 466 in execute_entry
  File ".bootstrap/_pex/pex.py", line 408 in _execute
  File ".bootstrap/_pex/pex.py", line 325 in _wrap_profiling
  File ".bootstrap/_pex/pex.py", line 293 in _wrap_coverage
  File ".bootstrap/_pex/pex.py", line 365 in execute
  File ".bootstrap/_pex/pex_bootstrapper.py", line 83 in bootstrap_pex
  File "__main__.py", line 25 in <module>
  File "/opt/ee/python/2.7/lib/python2.7/runpy.py", line 72 in _run_code
  File "/opt/ee/python/2.7/lib/python2.7/runpy.py", line 162 in _run_module_as_main
@stuhood stuhood added this to the Daemon Beta milestone Dec 5, 2017
@stuhood stuhood self-assigned this Dec 5, 2017
@stuhood
Copy link
Member Author

stuhood commented Dec 5, 2017

I suspect that this is because on the python side, Scheduler._lock ends up used all over the place because the PantsDaemon class is bootstrapped with a reference to it. It seems like that should be a separate lock.

@stuhood
Copy link
Member Author

stuhood commented Dec 8, 2017

After implementing #5178, I was still seeing this, but inside of a lock lower in the stack.

I bisected with:

git bisect start HEAD 7ac68c6b8fe473a1a218d7e54f78cf6c344d8b59 -- src/python/pants/engine/ build-support src/rust/engine/

while :; do ./pants list :: | wc -l ; done
while :; do touch 3rdparty/jvm/commons-io/BUILD ; sleep 1 | wc -l ; done

...and came to #4931.

@stuhood stuhood closed this as completed in b0ae261 Dec 9, 2017
@stuhood
Copy link
Member Author

stuhood commented Dec 9, 2017

Dagnabit... tried removing the comment from the PR, but it didn't work. As mentioned in the commit message, this one isn't fixed yet.

@stuhood stuhood reopened this Dec 9, 2017
stuhood added a commit to twitter/pants that referenced this issue Dec 12, 2017
stuhood added a commit to twitter/pants that referenced this issue Dec 12, 2017
stuhood pushed a commit that referenced this issue Dec 13, 2017
### Problem

#5169 describes a deadlock that occurs when any native-engine lock is acquired by a non-main thread during a fork. Pre #4931 the `context_lock` was acquired for the entire length of a request, but during that change that lock acquisition was removed.

Kris fixed this as part of #5156, but there is not yet a test confirming that the issue is fixed.

### Solution

Add a test to cover #5169.

### Result

Covers #5169 with an integration test.
@stuhood
Copy link
Member Author

stuhood commented Dec 13, 2017

Fixed in #5156, and covered in #5192.

@stuhood stuhood closed this as completed Dec 13, 2017
stuhood pushed a commit to twitter/pants that referenced this issue Dec 15, 2017
### Problem

The scheduler lock existed for a few reasons:
1. To prevent `Nodes` from being invalidated out of the `Graph` while they were being consumed by other `Nodes`. Because `Context` holds an `EntryId` for the running `Node`, the `Node` would fail (...panic, actually) at the next attempt to resolve a dependency.
2. To protect access to the `roots` field of the Scheduler, which was mutated in order to statefully build up the set of roots via the `add_root_select` method, and then capture results later.
3. We used to walk the graph from the python side (but no longer), which meant that `Node`s needed their lifecycle to be tightly controlled. The API no longer allows for walking inner `Node`s, so only roots are relevant now.

### Solution

Introduce a new `Failure::Invalidated` type to act as a signal that a `Node`'s `Context`/`EntryId` was invalidated while it was running (due to a file change on disk). The `Scheduler` will retry creating a `Node` a few times to allow execution to recover in the common case of minor file edits.

Move the `roots` field off of `Scheduler` and onto a disposable `ExecutionRequest` struct, and have all relevant native functions take a reference to the `ExecutionRequest`.

### Result

The scheduler lock is removed. This does _not_ fix pantsbuild#5169, but it should clear up the python-side locking situation to assist in debugging it.
stuhood pushed a commit to twitter/pants that referenced this issue Dec 15, 2017
### Problem

The scheduler lock existed for a few reasons:
1. To prevent `Nodes` from being invalidated out of the `Graph` while they were being consumed by other `Nodes`. Because `Context` holds an `EntryId` for the running `Node`, the `Node` would fail (...panic, actually) at the next attempt to resolve a dependency.
2. To protect access to the `roots` field of the Scheduler, which was mutated in order to statefully build up the set of roots via the `add_root_select` method, and then capture results later.
3. We used to walk the graph from the python side (but no longer), which meant that `Node`s needed their lifecycle to be tightly controlled. The API no longer allows for walking inner `Node`s, so only roots are relevant now.

### Solution

Introduce a new `Failure::Invalidated` type to act as a signal that a `Node`'s `Context`/`EntryId` was invalidated while it was running (due to a file change on disk). The `Scheduler` will retry creating a `Node` a few times to allow execution to recover in the common case of minor file edits.

Move the `roots` field off of `Scheduler` and onto a disposable `ExecutionRequest` struct, and have all relevant native functions take a reference to the `ExecutionRequest`.

### Result

The scheduler lock is removed. This does _not_ fix pantsbuild#5169, but it should clear up the python-side locking situation to assist in debugging it.
baroquebobcat pushed a commit to twitter/pants that referenced this issue Dec 28, 2017
### Problem

The scheduler lock existed for a few reasons:
1. To prevent `Nodes` from being invalidated out of the `Graph` while they were being consumed by other `Nodes`. Because `Context` holds an `EntryId` for the running `Node`, the `Node` would fail (...panic, actually) at the next attempt to resolve a dependency.
2. To protect access to the `roots` field of the Scheduler, which was mutated in order to statefully build up the set of roots via the `add_root_select` method, and then capture results later.
3. We used to walk the graph from the python side (but no longer), which meant that `Node`s needed their lifecycle to be tightly controlled. The API no longer allows for walking inner `Node`s, so only roots are relevant now.

### Solution

Introduce a new `Failure::Invalidated` type to act as a signal that a `Node`'s `Context`/`EntryId` was invalidated while it was running (due to a file change on disk). The `Scheduler` will retry creating a `Node` a few times to allow execution to recover in the common case of minor file edits.

Move the `roots` field off of `Scheduler` and onto a disposable `ExecutionRequest` struct, and have all relevant native functions take a reference to the `ExecutionRequest`.

### Result

The scheduler lock is removed. This does _not_ fix pantsbuild#5169, but it should clear up the python-side locking situation to assist in debugging it.
baroquebobcat pushed a commit to twitter/pants that referenced this issue Dec 28, 2017
…d#5192)

### Problem

pantsbuild#5169 describes a deadlock that occurs when any native-engine lock is acquired by a non-main thread during a fork. Pre pantsbuild#4931 the `context_lock` was acquired for the entire length of a request, but during that change that lock acquisition was removed.

Kris fixed this as part of pantsbuild#5156, but there is not yet a test confirming that the issue is fixed.

### Solution

Add a test to cover pantsbuild#5169.

### Result

Covers pantsbuild#5169 with an integration test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant