
BinaryPaths should fingerprint found paths. #10769

Open
jsirois opened this issue Sep 13, 2020 · 7 comments

jsirois commented Sep 13, 2020

BinaryPaths will often find binaries in standard search paths like /bin and /usr/bin. These paths are subject to upgrades on user machines: a given path like /usr/bin/python or /usr/bin/bash stays constant over time, but its contents change in ways that can affect program output when the binary is upgraded or downgraded. Hashing the contents of the discovered binary paths will automatically invalidate downstream results in the large majority of cases, which is what we want.
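A minimal sketch of the content fingerprint this implies (a hypothetical helper for illustration, not the Pants API):

```python
import hashlib


def fingerprint_binary(path, chunk_size=8192):
    """Hash a discovered binary's contents so that upgrades or downgrades
    at a stable path like /usr/bin/python change the fingerprint and
    invalidate downstream cached results."""
    hasher = hashlib.sha256()
    with open(path, "rb") as fp:
        # Read in chunks so large binaries don't need to fit in memory.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Mixing this digest into a process's cache key would make cached results depend on the binary's actual contents rather than just its path.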

See #10768 (comment) for motivating discussion of this.

@jsirois jsirois added the bug label Sep 13, 2020
jsirois added a commit to jsirois/pants that referenced this issue Sep 13, 2020
A BinaryPathRequest can now include a test to run against each found
binary to validate it works and optionally fingerprint it. Eventually
fingerprinting the binary contents should be performed automatically
and the optional test fingerprint mixed in with it when present as
tracked by pantsbuild#10769.

We leverage testing found binaries in PexEnvironment to ensure we find
only interpreter paths compatible with Pex bootstrapping. This plugs an
existing hole not yet encountered in the wild where a Python 2.6 binary
(for example) could be chosen and then PEX file bootstrapping fail as a
result. We additionally fingerprint interpreters passing the version
range test to ensure we detect interpreter upgrades and pyenv shim
switches. Even with the automatic hashing of binaries tracked in pantsbuild#10769
working, we'd still need to do this in the pyenv shim case since the
same shim script can redirect to different interpreters depending on
configuration external to the shim script.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
jsirois added a commit that referenced this issue Sep 13, 2020
stuhood commented Sep 14, 2020

As discussed in slack, this should likely use an absolute file watching facility in the engine, possibly by exposing intrinsics like PathGlobsAndRoot->Snapshot (both existing types, but making PathGlobsAndRoot a Node in the Graph to memoize/watch it would take some refactoring).

I'm not sure whether it should be exposed to @rules though, or whether it should only be consumed by the CommandRunners as an implementation detail (as described in point two of #10526).

jsirois commented Sep 14, 2020

Point two of #10526 will need to support something like the tests shown here if we're to do this in CommandRunners:

  1. The name of the binary may not be enough to find a good one: Python needs the binary discovery process to support a test:

    @frozen_after_init
    @dataclass(unsafe_hash=True)
    class BinaryPathRequest:
        """Request to find a binary of a given name.

        If `test_args` are specified, all binaries that are found will be executed with these args
        and only those binaries whose test executions exit with return code 0 will be retained.
        Additionally, if test execution includes stdout content, that will be used to fingerprint
        the binary path so that upgrades and downgrades can be detected. A reasonable test for many
        programs might be to pass `["--version"]` for `test_args` since it will both ensure the
        program runs and also produce stdout text that changes upon upgrade or downgrade of the
        binary at the discovered path.
        """

        search_path: Tuple[str, ...]
        binary_name: str
        test_args: Optional[Tuple[str, ...]]

    This lets it find a valid bootstrap python even when pythonX.Y binaries are not installed to carry the needed version info in the binary name:

    # PEX files are compatible with bootstrapping via Python 2.7 or Python 3.5+. The bootstrap
    # code will then re-exec itself if the underlying PEX user code needs a more specific python
    # interpreter. As such, we look for many Pythons usable by the PEX bootstrap code here for
    # maximum flexibility.
    all_python_binary_paths = await MultiGet(
        [
            Get(
                BinaryPaths,
                BinaryPathRequest(
                    search_path=python_setup.interpreter_search_paths,
                    binary_name=binary_name,
                    test_args=[
                        "-c",
                        # N.B.: The following code snippet must be compatible with Python 2.7 and
                        # Python 3.5+.
                        dedent(
                            """\
                            import sys

                            major, minor = sys.version_info[:2]
                            if (major, minor) == (2, 7) or (major == 3 and minor >= 5):
                                # Here we hash the underlying python interpreter executable to
                                # ensure we detect changes in the real interpreter that might
                                # otherwise be masked by pyenv shim scripts found on the search
                                # path. Naively, just printing out the full version_info would be
                                # enough, but that does not account for supported abi changes
                                # (e.g.: a pyenv switch from a py27mu interpreter to a py27m
                                # interpreter).
                                import hashlib

                                hasher = hashlib.sha256()
                                with open(sys.executable, "rb") as fp:
                                    # We pick 8192 for efficiency of reads and fingerprint
                                    # updates (writes) since it's a common OS buffer size and an
                                    # even multiple of the hash block size.
                                    for chunk in iter(lambda: fp.read(8192), b""):
                                        hasher.update(chunk)
                                sys.stdout.write(hasher.hexdigest())
                                sys.exit(0)
                            else:
                                sys.exit(1)
                            """
                        ),
                    ],
                ),
            )
            for binary_name in pex_runtime_env.bootstrap_interpreter_names
        ]
    )
  2. More complex search logic may need to be triggered than can easily be expressed in a BinaryPathsRequest test. This example could probably re-use BinaryPathsRequest if it accepted an input digest:

    @rule
    async def find_interpreter(interpreter_constraints: PexInterpreterConstraints) -> PythonExecutable:
        process = await Get(
            Process,
            PexCliProcess(
                description=f"Find interpreter for constraints: {interpreter_constraints}",
                # Here we run the Pex CLI with no requirements, which just selects an interpreter
                # and normally starts an isolated repl. By passing `--` we force the repl to
                # instead act as an interpreter (the selected one) and tell us about itself. The
                # upshot is we run the Pex interpreter selection logic unperturbed but without
                # resolving any distributions.
                argv=(
                    *interpreter_constraints.generate_pex_arg_list(),
                    "--",
                    "-c",
                    # N.B.: The following code snippet must be compatible with Python 2.7 and
                    # Python 3.5+.
                    dedent(
                        """\
                        import hashlib
                        import os
                        import sys

                        python = os.path.realpath(sys.executable)
                        print(python)

                        hasher = hashlib.sha256()
                        with open(python, "rb") as fp:
                            # We pick 8192 for efficiency of reads and fingerprint updates
                            # (writes) since it's a common OS buffer size and an even
                            # multiple of the hash block size.
                            for chunk in iter(lambda: fp.read(8192), b""):
                                hasher.update(chunk)
                        print(hasher.hexdigest())
                        """
                    ),
                ),
            ),
        )
        result = await Get(ProcessResult, UncacheableProcess(process))
        path, fingerprint = result.stdout.decode().strip().splitlines()
        return PythonExecutable(path=path, fingerprint=fingerprint)

stuhood commented Sep 14, 2020

The name of the binary may not be enough to find a good one:

@jsirois : That makes sense, but it feels like it couples "discovering all entries on a PATH" with "filtering whether a particular discovered PATH entry is valid". The latter can easily be implemented in "user space" by @rules... the former is what I think maybe needs explicit support from the CommandRunner.

jsirois commented Sep 14, 2020

The latter thing can easily be implemented in "user space" by @rules...

Sure, but if the filtering is separate and this is not an intrinsic but is instead built into the CommandRunner as you suggested, then the sequence becomes:

  1. Use CommandRunner just to return binary paths.
  2. Filter those paths in a rule by running another Process(es).
  3. Finally execute a Process with the filtered binary as argv0.

The current userspace implementation does:

  1. Use CommandRunner to return binary paths by executing a script :/.
  2. Filter those paths in a rule by running a Process for each.
  3. Finally execute a Process with the filtered binary as argv0.

I think we need to replace 1 with a native binary for OSX and a native binary for Linux (to replace the script + hashing of the binaries found pre-filter) since we need to support doing all this over remoting. IOW, even though the local CommandRunner or a local intrinsic could search the search path and fingerprint all in-process, the remoting implementation cannot. For that we need the aforementioned platform-specific binaries to push into the CAS for remoting to execute. Since we need that anyway, we might as well head there first. Doing it in-process in the local CommandRunner becomes a performance optimization only.
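The pre-filter search-and-hash step that such a binary would perform can be sketched locally (an illustration only, with hypothetical names; the real implementation would have to be a platform-specific executable runnable under remoting):

```python
import hashlib
import os
import stat


def find_and_fingerprint(binary_name, search_path):
    """Walk the search path collecting (path, sha256) pairs for every
    regular, executable file with the given name, before any validity
    filtering is applied."""
    results = []
    for directory in search_path:
        candidate = os.path.join(directory, binary_name)
        try:
            mode = os.stat(candidate).st_mode
        except OSError:
            # Missing directory or file: just keep searching.
            continue
        if not (stat.S_ISREG(mode) and mode & stat.S_IXUSR):
            continue
        hasher = hashlib.sha256()
        with open(candidate, "rb") as fp:
            for chunk in iter(lambda: fp.read(8192), b""):
                hasher.update(chunk)
        results.append((candidate, hasher.hexdigest()))
    return results
```

For remoting, the output of such a tool (paths plus digests) is what would be returned to the engine in place of an in-process search.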

stuhood commented Sep 23, 2020

I think we need to replace 1 with a native binary for OSX and a native binary for Linux (to replace the script + hashing of the binaries found pre-filter) since we need to support doing all this over remoting. IOW, even though the local CommandRunner or a local intrinsic could search the search path and fingerprint all in-process, the remoting implementation cannot. For that we need the aforementioned platform-specific binaries to push into the CAS for remoting to execute. Since we need that anyway, we might as well head there first. Doing it in-process in the local CommandRunner becomes a performance optimization only.

I'm less sure about this. Yes, the uniformity between local and remote is good, but not being able to use our file watching facilities because the files that are accessed are hidden from the engine in a forked external binary will mean that we need to fork that binary on every run and have it re-fingerprint things repeatedly.

The reason I mention PathGlobsAndRoot (I have opened #10842 about this) is that I think it is important that the facility we use locally uses file watching to avoid re-fingerprinting on every run. As you mentioned, that isn't something we can avoid for remote execution. So rather than "always forking a binary", I think this might be a library implemented generically atop the VFS abstraction (which the engine implements in terms of file watching), which the local CommandRunner uses in-process with file-watching, and the remote CommandRunner runs as a binary "once per docker image digest" or something.
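A crude stand-in for that avoid-re-fingerprinting idea, using stat metadata as the invalidation signal rather than real engine watch events (all names here are hypothetical):

```python
import hashlib
import os

# path -> ((mtime_ns, size), digest); a real implementation would
# invalidate entries via engine file-watch events instead of stat polling.
_cache = {}


def cached_fingerprint(path):
    """Re-hash a binary only when its stat metadata changes."""
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size)
    cached = _cache.get(path)
    if cached and cached[0] == key:
        return cached[1]
    hasher = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(8192), b""):
            hasher.update(chunk)
    digest = hasher.hexdigest()
    _cache[path] = (key, digest)
    return digest
```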

jsirois commented Sep 29, 2020

I'm less sure about this. Yes, the uniformity between local and remote is good, but not being able to use our file watching facilities because the files that are accessed are hidden from the engine in a forked external binary will mean that we need to fork that binary on every run and have it re-fingerprint things repeatedly.

We appear to violently agree. As I said, a native binary is needed for remoting, so tactically it seemed to make sense to do that first. The file-watching work is then a performance optimization for the local use case.

@jsirois jsirois self-assigned this Sep 29, 2020
stuhood commented Sep 29, 2020

Got it, yes. I think that I misread the "native binary for OSX and a native binary for Linux " as "local binary and remote binary", which is clearly not what you meant on re-read. Thanks!
