Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] split NailgunProtocol into PailgunProtocol versions of everything, add PGRP chunk #6579

Conversation

cosmicexplorer
Copy link
Contributor

@cosmicexplorer cosmicexplorer commented Oct 1, 2018

WIP because I wasn't able to get the java/ tests all the way passing last night, there's probably hella refactoring I could do

Problem

Our implementation of the Nailgun protocol has an extension in the form of a PID chunk, currently, which we will only worry about if we receive a chunk of that type (types are defined in ChunkType). The only difference in the way we use pailgun clients is that the nailgun client (e.g. in RemotePantsRunner) will secretly wait for a PID chunk (even if it's for a nailgun invocation that doesn't support that). This actually works pretty well, and I only ran into issues when I wanted to try adding the PGRP chunk or doing anything based on it, because we currently wait for the PID chunk in the execution loop in NailgunClientSession#process_session(), and that doesn't really make it clear when or whether we can expect that chunk or whether there should be more than one.

I realized that the step of receiving the PID/PGRP chunk can be separated into a distinct initialization phase, after the execution request is sent from the nailgun client to pantsd, after our pantsd-runner process self-daemonizes. Not doing that now means we don't reliably know the remote pid or pgrp, when we can instead require that these are sent before any execution occurs (this was already deterministically true). So I ran with that and here's the justification for the current changes:

  • We're currently using the PID chunk to pass the negated pgrp, not the pid anyway. We need the pid in that PID chunk in order to retrieve fatal error logs from a failed pantsd-runner invocation, the work for which was already done in pantsd client logs exceptions from server processes #6539 (that test was previously xfailed for this reason -- we were only getting the pgrp previously in that chunk, so we didn't know where to look for fatal error logs for the remote pantsd-runner). The pgrp is useful to send a signal to the whole family of pantsd-runner subprocesses at once.
  • We're currently letting the pantsd-runner process define when the remote session completes (when we catch control-c on the client, the signal is just caught and transmitted to the pantsd-runner process, then dropped), and that leads directly to interactivity issues later on.

There is (or rather, there's not now at all, but should be) some implicit sequencing required in negotiating a pailgun execution given a live (what is now called) PailgunClient:

  1. the thin client sends the execution request with args and cwd and whatever to pantsd over a nailgun connection
  2. pantsd double forks to create a self-daemonizing pantsd-runner process which sends the PID and PGRP chunks, then hooks up stdio to the nailgun socket shared with pantsd (pantsd closes its end of that socket)
  3. the thin client in RemotePantsRunner waits for the PID chunk and PGRP chunk to be sent from the pantsd-runner process over the nailgun connection which was transferred
  4. the client writes STDOUT and STDERR chunks as they come in, and exits on the first EXIT chunk (stdin is turned on and off over the course of the nailgun execution with other chunk types) (this part is unchanged).

One contribution of this PR is to make that sequencing (which is specific to pailgun) explicit by composing contextmanagers -- this makes ownership of the underlying resources and the IPC needed to obtain them much more clear, imo. The PID/PGRP dissonance at the top is a prime example of how trying to munge them together is hard (and it ends up being not too much work to un-munge them, even if the diff ends up being big).

If you're concerned that this sounds like it's introducing too much state into the session, I was actually able to remove most (all?) mutations by using contextmanagers, which also makes a pailgun execution simply a matter of a with statement inside a nailgun execution (which if you were to write it in pseudocode could maybe look like that as well). Please let me know if there are any minor to sweeping changes or misconceptions to fix on nailgun, python, or IPC in general.

TODO

  • Can/should we merge the PID and PGRP chunks into a tuple (or a datatype to be specific)? This sounds like the right thing to do.
    • We should probably state clearly what these are being used for -- I don't know if there are further uses of the negated pgrp that I'm missing above.
  • there are some more TODOs left in the diff (these will all become separate issues or PRs)

(explain the context of the problem and why you're making this change. include
references to all relevant github issues.
)

Solution

  • add PGRP chunk but in a new PailgunChunkType class
  • create PailgunProtocol, PailgunClient, etc (was trying to avoid this lengthy mirroring to whatever extent possible but not sure if that is a good idea or feasible at this point)
  • use the new process information to withdraw the correct logs / kill the correct process as per the xfailed test from pantsd client logs exceptions from server processes #6539

(describe the modifications you've made.)

Result

(describe how your changes affect the end-user behavior of the system. this section is
optional, and should generally be summarized in the title of the pull request.
)

TODO

@cosmicexplorer cosmicexplorer force-pushed the differentiate-nailgun-pailgun-pgrp branch from 81066cd to 81260fe Compare October 12, 2018 02:30
cosmicexplorer added a commit that referenced this pull request Jan 18, 2019
…classpaths (#7092)

*Resolves #7089.*

### Problem

`RscCompile` is the one task in pants which invokes multiple JVM tools over the course of its run, as a consequence of [using outlining to generate semantic information with rsc which zinc can compile against](https://github.com/twitter/rsc/blob/master/docs/compiler.md#typechecking-in-rsc-mid-2018). This doesn't play nice with our `NailgunExecutor`, and causes error messages like the following when `--worker-count` > 1 (see #7089):

```
metacp(jdk) failed: Problem launching via NailgunClient(host=u'127.0.0.1', port=55511, workdir=u'/Users/dmcclanahan/tools/pants') command scala.meta.cli.Metacp --verbose --out .pants.d/tmp/tmpX8iJid.pants.d/compile/rsc/a04416cba788/--jdk--/index /Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/lib/dt.jar:/Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/TwitterJDK/Contents/Home/lib/tools.jar: (u'Problem talking to nailgun server (address: 127.0.0.1:55511, remote_pid=<remote PID chunk not yet received!>, remote_pgrp=<remote PGRP chunk not yet received!>): TruncatedHeaderError(u"Failed to read nailgun chunk header (TruncatedRead(u\'Expected 5 bytes before socket shutdown, instead received 0\',)).",)', TruncatedHeaderError(u"Failed to read nailgun chunk header (TruncatedRead(u'Expected 5 bytes before socket shutdown, instead received 0',)).",))
                     E   	                   Traceback:
                     E   	                     File "/Users/dmcclanahan/tools/pants/src/python/pants/backend/jvm/tasks/jvm_compile/execution_graph.py", line 276, in worker
                     E   	                       work()
                     E   	                   
                     E   	                     File "/Users/dmcclanahan/tools/pants/src/python/pants/backend/jvm/tasks/jvm_compile/execution_graph.py", line 44, in __call__
                     E   	                       self.fn()
                     E   	                   
                     E   	                     File "/Users/dmcclanahan/tools/pants/src/python/pants/backend/jvm/tasks/jvm_compile/rsc/rsc_compile.py", line 339, in work_for_vts_rsc_jdk
                     E   	                       output_dir=rsc_index_dir)
                     E   	                   
                     E   	                     File "/Users/dmcclanahan/tools/pants/src/python/pants/backend/jvm/tasks/jvm_compile/rsc/rsc_compile.py", line 819, in _runtool
                     E   	                       dist=distribution
                     E   	                   
                     E   	                     File "/Users/dmcclanahan/tools/pants/src/python/pants/backend/jvm/tasks/nailgun_task.py", line 111, in runjava
                     E   	                       raise TaskError(e)
                     E   	                   [info] Compiling 1 Java source to /Users/dmcclanahan/tools/pants/.pants.d/tmp/tmpX8iJid.pants.d/compile/rsc/a04416cba788/examples.src.java.org.pantsbuild.example.hello.greet.greet/current/zinc/classes ...
                     E   	                       [info] Done compiling.
                     E   	                       [info] Compile success at Jan 16, 2019 6:36:30 PM [2.258s]
                     E   	                       
                     E   	FAILURE: Compilation failure: Failed jobs: metacp(jdk)
```

### Solution

- Introduce `JvmToolMixin.register_combined_jvm_tools()` as an API for accessing specific JVM tools as a combined classpath (a suggestion from @xeno-by) but different main classes, allowing the use of a single nailgun instance (resolving #7089).
- Introduce `NailgunTaskBase#do_for_execution_strategy_variant()` which allows specifying different actions to perform for different values of the `--execution-strategy` option in a structured way.
- Use the above to add a code path for a nailgun execution for `RscCompile`, and add testing for `--worker-count` > 1.
- Introduce `ZincCompile#get_zinc_compiler_classpath()` to allow `RscCompile` to override it with the combined classpath to persist the nailgun.
- Remove the mysterious `or self.cmd != self._distribution.java` in `nailgun_executor.py` -- this is necessary for this PR to work, but an option can be plumbed in if it breaks pantsd (this might motivate #6579).

### Result

`RscCompile` can be invoked with > 1 nailgun instance at a time by bundling all the tool jars into a single classpath, without affecting the performance for any of the other execution strategies.
@cosmicexplorer cosmicexplorer force-pushed the differentiate-nailgun-pailgun-pgrp branch 2 times, most recently from d6f6b17 to 2ddcffd Compare January 21, 2019 01:36
…erything

we were using something that kinda supported either without making it clear
@cosmicexplorer cosmicexplorer force-pushed the differentiate-nailgun-pailgun-pgrp branch from 2ddcffd to fa98448 Compare February 28, 2019 03:26
@stuhood
Copy link
Member

stuhood commented May 10, 2020

These chunks were removed in #9389.

@stuhood stuhood closed this May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants