Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The client error when pantsd restarts is not self-explanatory #8200

Closed
TansyArron opened this issue Aug 22, 2019 · 6 comments · Fixed by #10052
Closed

The client error when pantsd restarts is not self-explanatory #8200

TansyArron opened this issue Aug 22, 2019 · 6 comments · Fixed by #10052
Assignees

Comments

@TansyArron
Copy link
Contributor

TansyArron commented Aug 22, 2019

When pantsd restarts (generally because its code or plugin code has been edited), the error is not self explanatory.

Under v2 this might look like:

pants.bin.remote_pants_runner.RemotePantsRunner.Terminated: abruptly lost active connection to pantsd runner: NailgunError('Problem talking to nailgun server (address: 127.0.0.1:61905): TruncatedHeaderError("Failed to read nailgun chunk header (TruncatedRead(\'Expected 5 bytes before socket shutdown, instead received 0\',)).",)', TruncatedHeaderError("Failed to read nailgun chunk header (TruncatedRead('Expected 5 bytes before socket shutdown, instead received 0',)).",))
Remote exception:
timestamp: 2020-05-29T16:03:45.528163
process title: pantsd [/Users/stuhood/src/pants]
sys.argv: ['/Users/stuhood/src/pants/src/python/pants/bin/pants_loader.py', '--no-v1', '--v2', 'binary', 'src/python/pants/bin:pants_local_binary']
pid: 25924
Exception caught: (pants.pantsd.pants_daemon.RuntimeFailure)
  File "/Users/stuhood/src/pants/src/python/pants/bin/pants_loader.py", line 98, in <module>
    main()
  File "/Users/stuhood/src/pants/src/python/pants/bin/pants_loader.py", line 94, in main
    PantsLoader.run()
  File "/Users/stuhood/src/pants/src/python/pants/bin/pants_loader.py", line 90, in run
    cls.load_and_execute(entrypoint)
  File "/Users/stuhood/src/pants/src/python/pants/bin/pants_loader.py", line 83, in load_and_execute
    entrypoint_main()
  File "/Users/stuhood/src/pants/src/python/pants/pantsd/pants_daemon.py", line 583, in launch
    PantsDaemon.Factory.create(OptionsBootstrapper.create()).run_sync()
  File "/Users/stuhood/src/pants/src/python/pants/pantsd/pants_daemon.py", line 503, in run_sync
    self._run_services(self._services)
  File "/Users/stuhood/src/pants/src/python/pants/pantsd/pants_daemon.py", line 415, in _run_services
    f"service failure for {service}, shutting down!"

Exception message: service failure for <pants.pantsd.service.scheduler_service.SchedulerService object at 0x10e603828>, shutting down!

whereas under v1 it might look like:

 ./pants test src/python/toolchain/satresolver::                                                                            

17:37:42 00:00 [main]
               See a report at: http://localhost:51055/run/pants_run_2019_08_22_17_37_42_471_0f9540ac8f01464f997a4a29df197279
17:37:42 00:00   [setup]
17:37:42 00:00     [parse]21:37:42 [WARN] /Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/option/options.py:460: DeprecationWarning: DEPRECATED: scope compile.zinc will be removed in version 1.20.0.dev2.
  Use scope compile.rsc instead (options: args, use_classpath_jars, incremental_caching)
  self._check_and_apply_deprecations(scope, values)


               Executing tasks in goals: bootstrap -> imports -> unpack-wheels -> unpack-jars -> deferred-sources -> native-compile -> link -> jvm-platform-validate -> gen -> pyprep -> resolve -> compile -> resources -> test
17:37:42 00:00   [bootstrap]
17:37:42 00:00     [substitute-aliased-targets]
17:37:42 00:00     [jar-dependency-management]timestamp: 2019-08-22T17:37:42.879907
Exception caught: (builtins.OSError)
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/bin/daemon_pants_runner.py", line 256, in run
    runner.run()
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/bin/local_pants_runner.py", line 231, in run
    self._run()
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/bin/local_pants_runner.py", line 303, in _run
    run_tracker_result = self._run_tracker.end()
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/goal/run_tracker.py", line 499, in end
    with self.new_workunit("complete"):
  File "/Users/tansypants/.pyenv/versions/3.7.3/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/goal/run_tracker.py", line 312, in new_workunit
    log_config=log_config) as workunit:
  File "/Users/tansypants/.pyenv/versions/3.7.3/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/goal/run_tracker.py", line 340, in new_workunit_under_parent
    self.report.start_workunit(workunit)
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/reporting/report.py", line 94, in start_workunit
    reporter.start_workunit(workunit)
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/reporting/plaintext_reporter.py", line 142, in start_workunit
    self.flush()
  File "/Users/tansypants/.cache/pants/setup/bootstrap-Darwin-x86_64/1.20.0.dev1_py37/lib/python3.7/site-packages/pants/reporting/plaintext_reporter.py", line 194, in flush
    self.settings.outfile.flush()

Exception message: [Errno 9] Bad file descriptor
@cosmicexplorer

This comment has been minimized.

@TansyArron

This comment has been minimized.

@stuhood stuhood added the pantsd label May 22, 2020
@stuhood stuhood changed the title Intermittent Pants failure - Bad File Descriptor The client error when pantsd restarts is not self-explanatory May 22, 2020
@asherf
Copy link
Member

asherf commented Jun 2, 2020

I am seeing this too.


⠈
15:34:54:815 [ERROR]
15:34:54:815 [ERROR] lost active connection to pantsd!
15:34:54:816 [ERROR] abruptly lost active connection to pantsd runner: NailgunError('Problem talking to nailgun server (address: 127.0.0.1:57502): TruncatedHeaderError("Failed to read nailgun chunk header (TruncatedRead(\'Expected 5 bytes before socket shutdown, instead received 0\')).")', TruncatedHeaderError("Failed to read nailgun chunk header (TruncatedRead('Expected 5 bytes before socket shutdown, instead received 0'))."))
Remote exception:
timestamp: 2020-06-02T08:34:53.717969
process title: pantsd [/Users/asher/projects/toolchain]
sys.argv: ['/Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/bin/pants', '--pants-config-files=+["/Users/asher/projects/toolchain/toolchain.toml"]', '--pants-bin-name=./tc', 'test', 'src/python/toolchain/buildsense/::', 'src/python/toolchain/django/site/::']
pid: 37190
Exception caught: (pants.pantsd.pants_daemon.RuntimeFailure)
  File "/Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/bin/pants", line 8, in <module>
    sys.exit(main())
  File "/Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 94, in main
    PantsLoader.run()
  File "/Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 90, in run
    cls.load_and_execute(entrypoint)
  File "/Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 83, in load_and_execute
    entrypoint_main()
  File "/Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/lib/python3.8/site-packages/pants/pantsd/pants_daemon.py", line 583, in launch
    PantsDaemon.Factory.create(OptionsBootstrapper.create()).run_sync()
  File "/Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/lib/python3.8/site-packages/pants/pantsd/pants_daemon.py", line 503, in run_sync
    self._run_services(self._services)
  File "/Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/lib/python3.8/site-packages/pants/pantsd/pants_daemon.py", line 414, in _run_services
    raise PantsDaemon.RuntimeFailure(

Exception message: service failure for <pants.pantsd.service.scheduler_service.SchedulerService object at 0x10eb697c0>, shutting down!


(Use --print-exception-stacktrace to see more error details.)

In the middle of a test session run.

@stuhood

@asherf
Copy link
Member

asherf commented Jun 2, 2020

pantsd log:

15:34:52:462 [INFO] notify invalidation: cleared 2 and dirtied 208 nodes for: {"src/python/toolchain/buildsense", "src/python/toolchain/buildsense/views_api_test.py"}
15:34:52:466 [INFO] notify invalidation: cleared 0 and dirtied 0 nodes for: {"src/python/toolchain/buildsense/views_api_test.py", "src/python/toolchain/buildsense"}
15:34:53:218 [INFO] terminating pantsd service: <pants.pantsd.service.scheduler_service.SchedulerService object at 0x10eb697c0>
15:34:53:218 [INFO] terminating pantsd service: <pants.pantsd.service.pailgun_service.PailgunService object at 0x10eb69fa0>
15:34:53:711 [INFO] pailgun service on shutting down
15:34:53:711 [INFO] terminating pantsd service: <pants.pantsd.service.store_gc_service.StoreGCService object at 0x10eb69dc0>
15:34:53:711 [INFO] terminating pantsd
15:34:53:720 [WARN] /Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/lib/python3.8/site-packages/pants/base/exception_sink.py:359: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  process_title=setproctitle.getproctitle(),

15:34:53:721 [ERROR] service failure for <pants.pantsd.service.scheduler_service.SchedulerService object at 0x10eb697c0>, shutting down!

(Use --print-exception-stacktrace to see more error details.)



15:38:27:475 [INFO] setting up service <pants.pantsd.service.scheduler_service.SchedulerService object at 0x1155c39a0>
15:38:27:484 [INFO] notify invalidation: cleared 0 and dirtied 0 nodes for: {".pids", ".pids/pantsd"}
15:38:27:485 [INFO] notify invalidation: cleared 0 and dirtied 0 nodes for: {".pids/pantsd", ".pids/pantsd/socket_pailgun"}
15:38:27:485 [INFO] notify invalidation: cleared 0 and dirtied 0 nodes for: {".pids/pantsd/socket_pailgun", ".pids/pantsd"}

@stuhood
Copy link
Member

stuhood commented Jun 2, 2020

@asherf : That isn't quite enough of the pantsd log, but I'm assuming that somewhere shortly above those lines there was a warning about the invalidation globs having been matched?

@asherf
Copy link
Member

asherf commented Jun 2, 2020

15:34:41:175 [INFO] notify invalidation: cleared 0 and dirtied 0 nodes for: {".pids/pantsd", ".pids/pantsd/fingerprint"}
15:34:41:233 [INFO] handling request: `--pants-config-files=+["/Users/asher/projects/toolchain/toolchain.toml"] --pants-bin-name=./tc test src/python/toolchain/buildsense/:: src/python/toolchain/django/site/::`
15:34:51:786 [INFO] notify invalidation: cleared 2 and dirtied 208 nodes for: {"src/python/toolchain/buildsense", "src/python/toolchain/buildsense/views_api_test.py"}
15:34:52:317 [ERROR] saw filesystem changes covered by invalidation globs: content changed (Digest(fingerprint='fdfee1ee8e0b7b42f9991d462b9d69fd80daa3b550e75ddb184a444969d7d74a', serialized_bytes_length=407) fs Digest(fingerprint='f35d669e0e15ea641c97a6bb5add4863bd77dae0fe97578ec3ba320ae48af969', serialized_bytes_length=407)). terminating the daemon.
15:34:52:462 [INFO] notify invalidation: cleared 2 and dirtied 208 nodes for: {"src/python/toolchain/buildsense", "src/python/toolchain/buildsense/views_api_test.py"}
15:34:52:466 [INFO] notify invalidation: cleared 0 and dirtied 0 nodes for: {"src/python/toolchain/buildsense/views_api_test.py", "src/python/toolchain/buildsense"}
15:34:53:218 [INFO] terminating pantsd service: <pants.pantsd.service.scheduler_service.SchedulerService object at 0x10eb697c0>
15:34:53:218 [INFO] terminating pantsd service: <pants.pantsd.service.pailgun_service.PailgunService object at 0x10eb69fa0>
15:34:53:711 [INFO] pailgun service on shutting down
15:34:53:711 [INFO] terminating pantsd service: <pants.pantsd.service.store_gc_service.StoreGCService object at 0x10eb69dc0>
15:34:53:711 [INFO] terminating pantsd
15:34:53:720 [WARN] /Users/asher/.cache/pants/setup/bootstrap-Darwin-x86_64/1.29.0.dev1_py38/lib/python3.8/site-packages/pants/base/exception_sink.py:359: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  process_title=setproctitle.getproctitle(),

15:34:53:721 [ERROR] service failure for <pants.pantsd.service.scheduler_service.SchedulerService object at 0x10eb697c0>, shutting down!

(Use --print-exception-stacktrace to see more error details.)
15:38:27:475 [INFO] setting up service <pants.pantsd.service.scheduler_service.SchedulerService object at 0x1155c39a0>

@stuhood stuhood self-assigned this Jun 2, 2020
stuhood added a commit that referenced this issue Jun 3, 2020
### Problem

The `PantsDaemon` class was playing double duty as both the entrypoint for the pantsd server, and as a launcher for the client. We will soon need to make significant changes there in order to support #8200 and #7654.

### Solution

Extract a `PantsDaemonProcessManager` base class for process metadata reads/writes and a `PantsDaemonClient` subclass to consume the metadata to decide whether to launch the server. This is 100% code moves... no logic changed.

### Result

The `PantsDaemon` class is now exclusively a server, and `PantsDaemonClient` is now exclusively a client.

[ci skip-rust-tests]
[ci skip-jvm-tests]
stuhood added a commit that referenced this issue Jun 9, 2020
### Problem

`pantsd` does not implement garbage collection of the `Graph` (see #7675), but additionally, there are likely a few Python-level reference cycles beyond those that we have already discovered.

### Solution

We will eventually implement #7675, and it will need a config value to control how much it collects. But in the meantime, having a configurable built-in cap on total memory usage is generally useful, and can consume the same flag that our eventual collection will.

### Result

`pantsd` will restart itself when it uses more than the configured amount of memory (defaulting to 4GB). As mentioned in the test comment, until #8200 is fixed, the message rendered when we restart will not be particularly friendly, so we should likely not cherry-pick this to 1.29.x, which will not receive #8200.

This is not a complete fix for #9999, but I'm going to resolve it in favor of tracking followup in #7675.

[ci skip-rust-tests]
[ci skip-jvm-tests]
stuhood added a commit that referenced this issue Jun 15, 2020
### Problem

Currently we restart pantsd for most configuration changes, and exclude a small set of bootstrap options (by marking them `daemon=False`) that should not trigger a restart. But in most cases, restarting is heavy-handed. We'd like to be able to keep more and more of our state alive over time as we continue to remove global mutable state (in order to allow us to tackle #7654, among other things).

Additionally, the pantsd client currently implements the fingerprinting that decides when the server should restart, which blocks moving the pantsd client to rust. We'd like the client to only need to interact with a small set of options to simplify its implementation.

### Solution

Move the nailgun server out of the `PantsService` model and directly into the `PantsDaemon` class. Share a `PantsDaemonCore` between the daemon and `DaemonPantsRunner` that encapsulates the current `Scheduler` and all live services. Finally, have the `PantsDaemonCore` implement fingerprinting to decide when to reinitialize/recreate the `Scheduler` (without restarting) and trim down the options that trigger a restart (`daemon=True`) to only those that are used to start the daemon itself (rather than to create the `Scheduler`).

### Result

`pantsd` will stay up through the vast majority of options changes (with the exception of a handful of "micro-bootstrap" options), and will instead reinitialize the `Scheduler` for bootstrap options changes with some useful output when it does so.

Example:
```
$ ./pants help
23:26:22 [INFO] initializing pantsd...
23:26:24 [INFO] pantsd initialized.
Pants 1.30.0.dev0 https://pypi.org/pypi/pantsbuild.pants/1.30.0.dev0

Usage:
<snip>

$ ./pants --no-v1 help
23:26:31 [INFO] initialization options changed: reinitializing pantsd...
23:26:32 [INFO] pantsd initialized.
Pants 1.30.0.dev0 https://pypi.org/pypi/pantsbuild.pants/1.30.0.dev0

Usage:
<snip>
```

This prepares to port the client to rust, and unblocks a fix for #8200, by having the `PantsDaemon` class tear down the nailgun server cleanly in the foreground if any services exit. Fixes #6114, fixes #7573, and fixes #10041.
stuhood added a commit that referenced this issue Jun 16, 2020
### Problem

As described in #8200, the errors when pantsd shuts down during a run are not particularly helpful.

### Solution

When pantsd has chosen to exit (due to either of the `PantsService` instances exiting: generally the `SchedulerService` which watches the pythonpath and other invalidation globs for changes), wait for any ongoing connections to complete before exiting.

### Result

The server will wait for ongoing runs to complete before exiting. Fixes #8200.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants