
BinaryPaths should fingerprint found paths. #10769

Open
jsirois opened this issue Sep 13, 2020 · 7 comments

jsirois commented Sep 13, 2020

BinaryPaths will often find binaries in standard search paths like /bin and /usr/bin. These paths are subject to upgrades on user machines: a given path like /usr/bin/python or /usr/bin/bash stays constant over time, but its contents change in ways that can affect program output when the binary is upgraded or downgraded. Hashing the contents of the discovered binary paths will automatically invalidate downstream results in the large majority of cases, which is what we want.
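A minimal sketch of the content fingerprint this implies (a hypothetical helper for illustration, not the Pants API):

```python
import hashlib


def fingerprint_binary(path, chunk_size=8192):
    """Hash a discovered binary's contents so that upgrades or downgrades
    at a stable path like /usr/bin/python change the fingerprint and
    invalidate downstream cached results."""
    hasher = hashlib.sha256()
    with open(path, "rb") as fp:
        # Read in chunks so large binaries don't need to fit in memory.
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Mixing this digest into a process's cache key would make cached results depend on the binary's actual contents rather than just its path.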

See #10768 (comment) for motivating discussion of this.

@jsirois jsirois added the bug label Sep 13, 2020
jsirois added a commit to jsirois/pants that referenced this issue Sep 13, 2020
A BinaryPathRequest can now include a test to run against each found
binary to validate it works and optionally fingerprint it. Eventually
fingerprinting the binary contents should be performed automatically
and the optional test fingerprint mixed in with it when present as
tracked by pantsbuild#10769.

We leverage testing found binaries in PexEnvironment to ensure we find
only interpreter paths compatible with Pex bootstrapping. This plugs an
existing hole not yet encountered in the wild where a Python 2.6 binary
(for example) could be chosen and then PEX file bootstrapping fail as a
result. We additionally fingerprint interpreters passing the version
range test to ensure we detect interpreter upgrades and pyenv shim
switches. Even with the automatic hashing of binaries tracked in pantsbuild#10769
working, we'd still need to do this in the pyenv shim case since the
same shim script can redirect to different interpreters depending on
configuration external to the shim script.

# Rust tests and lints will be skipped. Delete if not intended.
[ci skip-rust]

# Building wheels and fs_util will be skipped. Delete if not intended.
[ci skip-build-wheels]
jsirois added a commit that referenced this issue Sep 13, 2020
stuhood commented Sep 14, 2020

As discussed in slack, this should likely use an absolute file watching facility in the engine, possibly by exposing intrinsics like PathGlobsAndRoot->Snapshot (both existing types, but making PathGlobsAndRoot a Node in the Graph to memoize/watch it would take some refactoring).

I'm not sure whether it should be exposed to @rules though, or whether it should only be consumed by the CommandRunners as an implementation detail (as described in point two of #10526).

jsirois commented Sep 14, 2020

Point two of #10526 will need to support something like the tests shown here if we're to do this in CommandRunners:

  1. The name of the binary may not be enough to find a good one: Python needs the binary discovery process to support a test:

    @frozen_after_init
    @dataclass(unsafe_hash=True)
    class BinaryPathRequest:
        """Request to find a binary of a given name.

        If `test_args` are specified, all binaries that are found will be executed with these args
        and only those binaries whose test executions exit with return code 0 will be retained.
        Additionally, if test execution includes stdout content, that will be used to fingerprint
        the binary path so that upgrades and downgrades can be detected. A reasonable test for many
        programs might be to pass `["--version"]` for `test_args` since it will both ensure the
        program runs and also produce stdout text that changes upon upgrade or downgrade of the
        binary at the discovered path.
        """

        search_path: Tuple[str, ...]
        binary_name: str
        test_args: Optional[Tuple[str, ...]]

    This lets it find a valid bootstrap python even when pythonX.Y binaries are not installed to carry the needed version info in the binary name:

    # PEX files are compatible with bootstrapping via Python 2.7 or Python 3.5+. The bootstrap
    # code will then re-exec itself if the underlying PEX user code needs a more specific python
    # interpreter. As such, we look for many Pythons usable by the PEX bootstrap code here for
    # maximum flexibility.
    all_python_binary_paths = await MultiGet(
        [
            Get(
                BinaryPaths,
                BinaryPathRequest(
                    search_path=python_setup.interpreter_search_paths,
                    binary_name=binary_name,
                    test_args=[
                        "-c",
                        # N.B.: The following code snippet must be compatible with Python 2.7 and
                        # Python 3.5+.
                        dedent(
                            """\
                            import sys

                            major, minor = sys.version_info[:2]
                            if (major, minor) == (2, 7) or (major == 3 and minor >= 5):
                                # Here we hash the underlying python interpreter executable to
                                # ensure we detect changes in the real interpreter that might
                                # otherwise be masked by pyenv shim scripts found on the search
                                # path. Naively, just printing out the full version_info would be
                                # enough, but that does not account for supported abi changes
                                # (e.g.: a pyenv switch from a py27mu interpreter to a py27m
                                # interpreter).
                                import hashlib

                                hasher = hashlib.sha256()
                                with open(sys.executable, "rb") as fp:
                                    # We pick 8192 for efficiency of reads and fingerprint
                                    # updates (writes) since it's a common OS buffer size and an
                                    # even multiple of the hash block size.
                                    for chunk in iter(lambda: fp.read(8192), b""):
                                        hasher.update(chunk)
                                sys.stdout.write(hasher.hexdigest())
                                sys.exit(0)
                            else:
                                sys.exit(1)
                            """
                        ),
                    ],
                ),
            )
            for binary_name in pex_runtime_env.bootstrap_interpreter_names
        ]
    )
  2. More complex search logic may need to be triggered than can easily be expressed in a BinaryPathsRequest test. This example could probably re-use BinaryPathsRequest if it accepted an input digest:

    @rule
    async def find_interpreter(interpreter_constraints: PexInterpreterConstraints) -> PythonExecutable:
        process = await Get(
            Process,
            PexCliProcess(
                description=f"Find interpreter for constraints: {interpreter_constraints}",
                # Here we run the Pex CLI with no requirements, which just selects an interpreter
                # and normally starts an isolated repl. By passing `--` we force the repl to
                # instead act as an interpreter (the selected one) and tell us about itself. The
                # upshot is we run the Pex interpreter selection logic unperturbed but without
                # resolving any distributions.
                argv=(
                    *interpreter_constraints.generate_pex_arg_list(),
                    "--",
                    "-c",
                    # N.B.: The following code snippet must be compatible with Python 2.7 and
                    # Python 3.5+.
                    dedent(
                        """\
                        import hashlib
                        import os
                        import sys

                        python = os.path.realpath(sys.executable)
                        print(python)

                        hasher = hashlib.sha256()
                        with open(python, "rb") as fp:
                            # We pick 8192 for efficiency of reads and fingerprint updates
                            # (writes) since it's a common OS buffer size and an even
                            # multiple of the hash block size.
                            for chunk in iter(lambda: fp.read(8192), b""):
                                hasher.update(chunk)
                        print(hasher.hexdigest())
                        """
                    ),
                ),
            ),
        )
        result = await Get(ProcessResult, UncacheableProcess(process))
        path, fingerprint = result.stdout.decode().strip().splitlines()
        return PythonExecutable(path=path, fingerprint=fingerprint)

stuhood commented Sep 14, 2020

The name of the binary may not be enough to find a good one:

@jsirois : That makes sense, but it feels like it couples "discovering all entries on a PATH" with "filtering whether a particular discovered PATH entry is valid". The latter can easily be implemented in "user space" by @rules... the former is what I think maybe needs explicit support from the CommandRunner.

jsirois commented Sep 14, 2020

The latter thing can easily be implemented in "user space" by @rules...

Sure, but if the filtering is separate and this is not an intrinsic but is instead built into the CommandRunner as you suggested, then the sequence becomes:

  1. Use CommandRunner just to return binary paths.
  2. Filter those paths in a rule by running another Process(es).
  3. Finally execute a Process with the filtered binary as argv0.

The current userspace implementation does:

  1. Use CommandRunner to return binary paths by executing a script :/.
  2. Filter those paths in a rule by running a Process for each.
  3. Finally execute a Process with the filtered binary as argv0.

I think we need to replace 1 with a native binary for OSX and a native binary for Linux (to replace the script + hashing of the binaries found pre-filter) since we need to support doing all this over remoting. IOW, even though the local CommandRunner or a local intrinsic could search the search path and fingerprint all in-process, the remoting implementation cannot. For that we need the aforementioned platform-specific binaries to push into the CAS for remoting to execute. Since we need that anyway, we might as well head there first. Doing it in-process in the local CommandRunner becomes a performance optimization only.
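The pre-filter search-and-hash step that such a binary would perform can be sketched locally (an illustration only, with hypothetical names; the real implementation would have to be a platform-specific executable runnable under remoting):

```python
import hashlib
import os
import stat


def find_and_fingerprint(binary_name, search_path):
    """Walk the search path collecting (path, sha256) pairs for every
    regular, executable file with the given name, before any validity
    filtering is applied."""
    results = []
    for directory in search_path:
        candidate = os.path.join(directory, binary_name)
        try:
            mode = os.stat(candidate).st_mode
        except OSError:
            # Missing directory or file: just keep searching.
            continue
        if not (stat.S_ISREG(mode) and mode & stat.S_IXUSR):
            continue
        hasher = hashlib.sha256()
        with open(candidate, "rb") as fp:
            for chunk in iter(lambda: fp.read(8192), b""):
                hasher.update(chunk)
        results.append((candidate, hasher.hexdigest()))
    return results
```

For remoting, the output of such a tool (paths plus digests) is what would be returned to the engine in place of an in-process search.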

stuhood commented Sep 23, 2020

I think we need to replace 1 with a native binary for OSX and a native binary for Linux (to replace the script + hashing of the binaries found pre-filter) since we need to support doing all this over remoting. IOW, even though the local CommandRunner or a local intrinsic could search the search path and fingerprint all in-process, the remoting implementation cannot. For that we need the aforementioned platform-specific binaries to push into the CAS for remoting to execute. Since we need that anyway, we might as well head there first. Doing it in-process in the local CommandRunner becomes a performance optimization only.

I'm less sure about this. Yes, the uniformity between local and remote is good, but not being able to use our file watching facilities because the files that are accessed are hidden from the engine in a forked external binary will mean that we need to fork that binary on every run and have it re-fingerprint things repeatedly.

The reason I mention PathGlobsAndRoot (I have opened #10842 about this) is that I think it is important that the facility we use locally uses file watching to avoid re-fingerprinting on every run. As you mentioned, that isn't something we can avoid for remote execution. So rather than "always forking a binary", I think this might be a library implemented generically atop the VFS abstraction (which the engine implements in terms of file watching), which the local CommandRunner uses in-process with file-watching, and the remote CommandRunner runs as a binary "once per docker image digest" or something.
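A crude stand-in for that avoid-re-fingerprinting idea, using stat metadata as the invalidation signal rather than real engine watch events (all names here are hypothetical):

```python
import hashlib
import os

# path -> ((mtime_ns, size), digest); a real implementation would
# invalidate entries via engine file-watch events instead of stat polling.
_cache = {}


def cached_fingerprint(path):
    """Re-hash a binary only when its stat metadata changes."""
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size)
    cached = _cache.get(path)
    if cached and cached[0] == key:
        return cached[1]
    hasher = hashlib.sha256()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(8192), b""):
            hasher.update(chunk)
    digest = hasher.hexdigest()
    _cache[path] = (key, digest)
    return digest
```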

jsirois commented Sep 29, 2020

I'm less sure about this. Yes, the uniformity between local and remote is good, but not being able to use our file watching facilities because the files that are accessed are hidden from the engine in a forked external binary will mean that we need to fork that binary on every run and have it re-fingerprint things repeatedly.

We appear to violently agree. As I said, a native binary is needed for remoting, so tactically it seemed to make sense to do that first. The file-watching work is then a performance optimization for the local use case.

@jsirois jsirois self-assigned this Sep 29, 2020
stuhood commented Sep 29, 2020

Got it, yes. I think that I misread the "native binary for OSX and a native binary for Linux " as "local binary and remote binary", which is clearly not what you meant on re-read. Thanks!
