[pantsd] more graceful failure mode for watchman crash #4409

Closed
kwlzn opened this issue Apr 1, 2017 · 2 comments

Comments

@kwlzn
Member

kwlzn commented Apr 1, 2017

currently, when a synchronous pailgun request is mid-flight in the daemon, the daemon can tear itself down (e.g. if watchman crashes or gets killed). that terminates the daemon-side socket, which leads to the following failure mode:

17:32:38 Exception caught: (<class 'pants.java.nailgun_client.NailgunError'>)
17:32:38   File "/var/lib/jenkins/.cache/pants/pex/bin/pants.pex/1.3.0.dev14-engine1/pants.pex/.bootstrap/_pex/pex.py", line 360, in execute
17:32:38     self._wrap_coverage(self._wrap_profiling, self._execute)
17:32:38   File "/var/lib/jenkins/.cache/pants/pex/bin/pants.pex/1.3.0.dev14-engine1/pants.pex/.bootstrap/_pex/pex.py", line 288, in _wrap_coverage
17:32:38     runner(*args)
17:32:38   File "/var/lib/jenkins/.cache/pants/pex/bin/pants.pex/1.3.0.dev14-engine1/pants.pex/.bootstrap/_pex/pex.py", line 320, in _wrap_profiling
17:32:38     runner(*args)
17:32:38   File "/var/lib/jenkins/.cache/pants/pex/bin/pants.pex/1.3.0.dev14-engine1/pants.pex/.bootstrap/_pex/pex.py", line 403, in _execute
17:32:38     return self.execute_entry(self._pex_info.entry_point)
17:32:38   File "/var/lib/jenkins/.cache/pants/pex/bin/pants.pex/1.3.0.dev14-engine1/pants.pex/.bootstrap/_pex/pex.py", line 461, in execute_entry
17:32:38     return runner(entry_point)
17:32:38   File "/var/lib/jenkins/.cache/pants/pex/bin/pants.pex/1.3.0.dev14-engine1/pants.pex/.bootstrap/_pex/pex.py", line 479, in execute_pkg_resources
17:32:38     return runner()
17:32:38   File "/data/jenkins/workspace/source_4/.pex/install/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl.7c309bcf5d3426cdf22bef063b44a0aa0ea0f88d/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl/pants/bin/pants_exe.py", line 44, in main
17:32:38     PantsRunner(exiter).run()
17:32:38   File "/data/jenkins/workspace/source_4/.pex/install/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl.7c309bcf5d3426cdf22bef063b44a0aa0ea0f88d/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl/pants/bin/pants_runner.py", line 57, in run
17:32:38     options_bootstrapper=options_bootstrapper)
17:32:38   File "/data/jenkins/workspace/source_4/.pex/install/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl.7c309bcf5d3426cdf22bef063b44a0aa0ea0f88d/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl/pants/bin/pants_runner.py", line 35, in _run
17:32:38     return RemotePantsRunner(exiter, args, env, process_metadata_dir).run()
17:32:38   File "/data/jenkins/workspace/source_4/.pex/install/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl.7c309bcf5d3426cdf22bef063b44a0aa0ea0f88d/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl/pants/bin/remote_pants_runner.py", line 83, in run
17:32:38     result = client.execute(self.PANTS_COMMAND, *self._args, **modified_env)
17:32:38   File "/data/jenkins/workspace/source_4/.pex/install/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl.7c309bcf5d3426cdf22bef063b44a0aa0ea0f88d/pantsbuild.pants-1.3.0.dev14+394128272-py2-none-any.whl/pants/java/nailgun_client.py", line 169, in execute
17:32:38     .format(self._host, self._port, e))
17:32:38 
17:32:38 Exception message: Problem communicating with nailgun server at 127.0.0.1:37353: error(104, 'Connection reset by peer')

to fail more gracefully, we should add a locking mechanism that prevents the daemon from completely tearing down until any active pailgun runners have forked.
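
a rough sketch of what such a lock could look like, not taken from the pants codebase: `TeardownGuard`, `pre_fork` and `wait_for_teardown` are hypothetical names, and the sketch assumes Python 3 and a daemon that handles requests on threads. each request holds the guard for the window between being accepted and forking its runner; teardown stops admitting new requests and waits for that window to drain.

```python
import threading
from contextlib import contextmanager


class TeardownGuard:
    """Hypothetical helper: delays teardown while requests are still pre-fork."""

    def __init__(self):
        self._cond = threading.Condition()
        self._active = 0       # requests accepted but not yet forked
        self._closing = False  # set once teardown has been requested

    @contextmanager
    def pre_fork(self):
        """Hold for the window between accepting a request and forking its runner."""
        with self._cond:
            if self._closing:
                raise RuntimeError("daemon is shutting down")
            self._active += 1
        try:
            yield
        finally:
            with self._cond:
                self._active -= 1
                self._cond.notify_all()

    def wait_for_teardown(self, timeout=None):
        """Refuse new requests, then wait for in-flight ones to leave the window.

        Returns True if no request is left in the pre-fork window, or False
        if the timeout expired first.
        """
        with self._cond:
            self._closing = True
            return self._cond.wait_for(lambda: self._active == 0, timeout)
```

a request handler would wrap its accept-and-fork step in `with guard.pre_fork():`, and the teardown path (e.g. the reaction to a watchman crash) would call `guard.wait_for_teardown(timeout=...)` before closing the socket. the timeout is there so that a stuck request cannot block teardown forever.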

@kwlzn kwlzn added this to the 1.4.0 milestone Apr 8, 2017
@dotordogh
Contributor

dotordogh commented Aug 14, 2017

I haven't been able to reproduce this error...

@kwlzn kwlzn modified the milestones: 1.4.x, Daemon Beta Oct 3, 2017
@kwlzn
Member Author

kwlzn commented Oct 25, 2017

this should be vastly improved if not fully fixed with #4847 and #4931, so will close this for now. feel free to reopen if this is encountered again.

@kwlzn kwlzn closed this as completed Oct 25, 2017
@kwlzn kwlzn self-assigned this Dec 13, 2017