Nondeterministic `ImportError: No module named __pants_df_parser` while computing changed targets #16778
I just saw a new error for the first time, still in the […]
Ok, I started re-reviewing this from the top instead of focusing in on Pex. That leads here: `pants/src/python/pants/backend/python/util_rules/pex.py` lines 750-757 (at 0bbf1fb).
Clearly the `rm -rf` can be interrupted, and since […] N.B. 1 can only happen when the PEX in question is not built locally, since we pre-[…]. @danxmoran can you check me on this logic / whether it fully describes / comports with your setup where you are seeing failures?
I think if the above is the explanation, then just switching to Pex's […]
@jsirois the explanation makes sense for our CI context - we use ephemeral runners so the Pants cache and PEX root are empty at the start of each run, and we use remote caching.
Alrighty, thanks for the confirmation. I think the issue is now fully understood. I want to stew a bit more on the […]
Ok, the […] I seem to recall during development of that PR that I initially didn't have it and ran into issues, but I'll give that path a thorough analysis before committing to the longer term […]
There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is atomic on its own; so just remove the `rm` which was only attempting to guard "corrupted" venvs in a slapdash way. Now the venv either exists or it doesn't from the script point of view. If the venv exists but has been tampered with, its execution will consistently fail until there is a manual intervention removing the venv dir offline. Fixes pantsbuild#16778
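For readers skimming the thread, here is a minimal sketch of the difference that commit message describes, using hypothetical helper names (`install_venv_racy`, `install_venv_atomic`) and only standard-library tools; it illustrates the pattern, not the actual Pex code:

```python
import os
import shutil
import subprocess
import sys
import tempfile


def install_venv_racy(venv_dir: str) -> None:
    # Racy: there is a window between the rm and the re-create in which another
    # process (or a process whose rm was interrupted) can observe a missing or
    # half-built venv.
    shutil.rmtree(venv_dir, ignore_errors=True)
    subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)


def install_venv_atomic(venv_dir: str) -> None:
    # Atomic: build the venv in a scratch directory beside the target, then
    # publish it with a single os.rename. Observers either see no venv at all
    # or a fully built one; there is no in-between state to race against.
    if os.path.isdir(venv_dir):
        return
    work_dir = tempfile.mkdtemp(dir=os.path.dirname(os.path.abspath(venv_dir)))
    try:
        staged = os.path.join(work_dir, "venv")
        subprocess.run([sys.executable, "-m", "venv", staged], check=True)
        try:
            os.rename(staged, venv_dir)
        except OSError:
            pass  # Another process won the race and already published the venv.
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)
```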
Ok, the […] `pants/src/python/pants/backend/codegen/protobuf/python/rules.py` lines 146-167 (at 7f98ead) […]
It looks like the […]
There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is atomic on its own; so just remove the `rm` which was only attempting to guard "corrupted" venvs in a slapdash way. Now the venv either exists or it doesn't from the script point of view. If the venv exists but has been tampered with, its execution will consistently fail until there is a manual intervention removing the venv dir offline. Fixes #14618 Fixes #16778
(The same fix was cherry-picked from commit cace851 to the release branches.)
@danxmoran I don't believe you / hrm. Can you get a sandbox on that version, even if just locally, and confirm the wrapper script has the […]
Confirmed that when I look at a sandbox locally the […]
Alrighty - thanks. I am totally out of ideas then. That […]
Same stacktrace as the OP?
I think yes, but here it is just in case: […]
@danxmoran I'm stepping away from work until the 28th; so I'll un-assign for now and pick back up then if no one else has dug in.
Setting that env var did not stop the issue from recurring today. Unfortunately the machine went offline before I could debug, but I think we're prepped for debug next time.
I've made the following changes to try and reproduce, with still no luck after roughly 10 goes at CI:
for i in __import__("builtins").range(1000):
    CACHE_BUSTER = __import__("random").randint(0, 1000000000)
    docker_image(
        name=f"{CACHE_BUSTER}-img",
        skip_push=True,
        image_tags=["whatever"],
        repository="idkman",
        registries=["myregistry"],
        instructions=[
            "FROM ubuntu:focal",
            f"# {CACHE_BUSTER}",
        ],
    )
In CI I can confirm that […] This is without […] My hope was that we'd hit the race condition by forcing many many processes to all want a stab at the venv, but no such luck.
Another data-point: We saw this error a few times yesterday. In every case: […]
Could there possibly be a race condition between populating the sandbox & invoking the dockerfile parser? Such that sometimes the parser is invoked before […]
Having this not be a pex concurrency issue would align with my current thinking, since banging on it all day didn't reproduce. Additionally, a common thread in every reproduction is remote cache beating local execution.
While a race between materializing files in a sandbox and running a process is possible, it would have nothing at all to do with whether a cache hit had occurred: files are materialized from the […]. Based on where this occurs, my intuition (without having investigated this in a long time) is that this is a PEX named cache entry not being cancellation safe, such that if a PEX process is interrupted (as by the cancellation), a cache entry is partially populated. When a PEX file is fetched from the remote cache, if it bakes in any assumptions about named cache entries being present, that can be problematic (see: #12727). To test that hypothesis, you'd need to: […]
But also: I have had relatively good luck with adding targeted debug output. Adding additional debug output to the df-parser run, like adding […]
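One low-tech way to capture that kind of evidence (a sketch of the general idea, not the actual Pants hook point; the script name is taken from the error message) is a debug preamble at the very top of the generated parser script that dumps what the process can actually see before any real imports run:

```python
# Hypothetical debug preamble for a generated script like __pants_df_parser.py:
# record the sandbox contents and the import path before doing any real work,
# so a failing run leaves evidence of what was actually on disk at the time.
import os
import sys

print(f"cwd={os.getcwd()}", file=sys.stderr)
print(f"sys.executable={sys.executable}", file=sys.stderr)
print(f"sys.path={sys.path}", file=sys.stderr)
for root, _dirs, files in os.walk(os.getcwd()):
    for name in files:
        print(os.path.join(root, name), file=sys.stderr)
```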
@stuhood it seems like we're stuck in a loop here. This might help: […]
In other words, this means a Pex process cancellation by Pants (which maps to what? a kill SIGQUIT or a kill SIGKILL?) leads to a Pex atomic directory operation - which is a block of Python code whose last step on the happy path, with no exceptions raised prior, is an […]. This sounds crazy, but from what you're saying this must be happening somehow. Put another way: how does one make a process run by Pants cancellation safe? The atomic_directory code thinks in terms of being multi-thread / multi-process safe only. But maybe I'm missing a 3rd thing that is cancellation safe.
I wasn't claiming to have reasoned this all out: only suggesting a course of action.
It will lead to SIGKILL: we don't do graceful shutdown for […]
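For readers unfamiliar with the distinction being drawn here: a process can register handlers for SIGTERM/SIGQUIT and clean up after itself, but SIGKILL can never be caught, so a killed child gets no chance to tidy a half-written directory. A small stand-alone demonstration (not Pants or Pex code):

```python
import signal
import sys
import time


def cleanup_and_exit(signum, frame):
    # Runs for SIGTERM/SIGQUIT: the process gets a chance to remove any
    # partially written work directories before exiting.
    sys.stderr.write(f"caught signal {signum}, cleaning up\n")
    sys.exit(1)


signal.signal(signal.SIGTERM, cleanup_and_exit)
signal.signal(signal.SIGQUIT, cleanup_and_exit)

# There is no handler for SIGKILL: the kernel stops the process immediately,
# no cleanup code runs, and whatever files were mid-write stay exactly as they
# were. Safety therefore has to come from the filesystem operations themselves
# (e.g. a final atomic rename), not from cleanup code in the child.
while True:
    time.sleep(1)  # `kill -TERM <pid>` triggers cleanup; `kill -KILL <pid>` cannot.
```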
Aha, Ok. I'm pretty sensitive at this point to wasting people's time. Perhaps reporters have not gone through the suggested steps, but I seem to recall they've been down this road before. Basically in a cycle of "maybe it's X, please collect ~data", repeat. I've probably spent ~1 full work week (at least) staring hard at Pex code, adding debug output and knobs to releases, etc. to get to the bottom of this from the Pex side re multi-thread or multi-process errors in atomic directory logic or usage, and I've come up blank. The only real fix so far has come from a much smaller effort on my part investigating Pants code, where I fixed the […]. I'll have a look again at the Pants side later today since I'm fully juiced for ideas on the Pex side. I've noted to @danxmoran before the more eyes the better, but the key Pex code is here: https://github.com/pantsbuild/pex/blob/main/pex/atomic_directory.py
The things I would keep fresh in mind, personally, are that this only occurs: […]
There's something we never put on the table (which I REALLY hope isn't the case), but I'm pretty sure these venvs have R+W perms. So technically speaking a rogue […]
I just noticed that we're now getting BuildSense links for these errors (vs. when I first reported the problem and the error was hitting before the links were logged). Here's the latest one: https://app.toolchain.com/organizations/color/repos/color/builds/pants_run_2023_03_29_11_21_12_96_a150c7d599f6461b818e7e7c8cc63ef3/. Looking at the […] Would it be useful for me to set […]
Some new notes from the most recent occurrence, with some hands-on time before our instance got terminated: […]
Then I remove the […] and re-run: […]
So ignoring Pants for a moment, I can reproduce the issue just from the sandbox by running the script 🎉 Here's the contents of the bad venv: […]
Running the PEX outside of the run script DOES NOT show the issue: […]
Additionally, I strongly suspect something is poisoned about the pex root and we've all been barking up the wrong tree, but John can help rule out any pex issues with the pex itself. I'll share privately in a DM on Slack.
I think I found a smoking gun today: […]
whereas if I point to a different […]
So it's a poisoned cache. Unclear when it got poisoned though. It could be in this run, if this is the first run with that […]
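For anyone triaging a similar report, the same check can be scripted from a preserved sandbox; a hedged sketch where the pex path and arguments are placeholders and `PEX_ROOT` is Pex's standard cache-location override:

```python
import os
import subprocess
import sys
import tempfile

PEX = "./dockerfile_parser.pex"  # placeholder: the pex left behind in the sandbox
ARGS = ["path/to/Dockerfile"]    # placeholder: whatever args the sandbox's run used

# Run against the default (possibly poisoned) PEX_ROOT.
default_run = subprocess.run([sys.executable, PEX, *ARGS], capture_output=True, text=True)

# Run against a brand-new PEX_ROOT, forcing every cached venv to be rebuilt.
with tempfile.TemporaryDirectory() as fresh_root:
    env = dict(os.environ, PEX_ROOT=fresh_root)
    fresh_run = subprocess.run(
        [sys.executable, PEX, *ARGS], capture_output=True, text=True, env=env
    )

# If the first run fails with the ImportError while the second succeeds, the
# problem lives in the cached venv under the original PEX_ROOT, not in the
# pex file itself.
print("default PEX_ROOT returncode:", default_run.returncode, file=sys.stderr)
print("fresh   PEX_ROOT returncode:", fresh_run.returncode, file=sys.stderr)
```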
Ok. The data @thejcannon provided was a second instance of data similar to what @danxmoran provided many months ago, and this was finally enough to switch the focus more sharply to remote-cache-hit cancellation of the VenvPex build process. Thank you both for being willing to collect data. My fundamental analysis error over these past 9 months was assuming the data loss was on the output side. It in fact looks like it was on the input side instead. I did not complete an analysis of the Pants code that sets up the sandbox the Pex CLI is run in to create the […]. Taking that to heart, but still focusing on the input side, the missing file is a user code file (the same file) in both reported cases. The Pex is a packed VenvPex. A packed VenvPex is a directory containing 4 classes of items: […]
When such a VenvPex is built, Pants passes […]. So, the […]. @stuhood has concurred this is a likely cause and has volunteered to take over providing a fix for this uncontrolled asynchrony on the Pants side. The idea being a Pants […]
…ed up (#18632) As @jsirois discovered and described in #16778 (comment), because the `local::CommandRunner` does not `wait` for its child process to have exited, it might be possible for a SIGKILL to not have taken effect on the child process before its sandbox has been torn down. And that could result in the process seeing a partial sandbox (and in the case of #16778, potentially cause named cache corruption). This change ensures that we block to call `wait` when spawning local processes by wrapping them in the `ManagedChild` helper that we were already using for interactive processes. It then attempts to clarify the contract of the `CapturedWorkdir` trait (but in order to improve cherry-pickability does not attempt to adjust it), to make it clear that child processes should fully exit before the `run_and_capture_workdir` method has returned. Fixes #16778 🤞. --------- Co-authored-by: John Sirois <john.sirois@gmail.com>
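A minimal stand-in for the invariant that change enforces, using the Python standard library rather than the actual Rust `local::CommandRunner`: after sending SIGKILL you still have to `wait()` before tearing the sandbox down, otherwise the teardown can race the not-yet-dead child.

```python
import shutil
import signal
import subprocess
import sys
import tempfile

sandbox = tempfile.mkdtemp(prefix="sandbox-")
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(3600)"],
    cwd=sandbox,
)

# Cancel the work, e.g. because a remote cache hit arrived first.
child.send_signal(signal.SIGKILL)

# Without this wait() the parent could go on to delete the sandbox while the
# kill has been sent but not yet taken effect, so the child could still be
# running inside a partially deleted directory.
child.wait()

# Only now is it safe to remove the sandbox.
shutil.rmtree(sandbox, ignore_errors=True)
```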
(The fix was cherry-picked from commit 1462bef to the release branches in #18640, #18641, and #18642.)
Sorry to revisit this can of worms, but we just got hit with this on […]
Any chance there was a regression between when this was fixed and that version?
@engnatha the fix is here: https://pypi.org/project/pantsbuild.pants/2.15.1rc1/
(NOTE: There's been a lot of discussion on this in Slack here, but I realized I never filed an issue for it)
Describe the bug
A few times per day (out of ~hundreds of jobs), our CI jobs will fail when trying to build `dockerfile_parser.pex`. The error always hits at the beginning of goal execution, and affects all our goals (`test`, `lint`, `check`). The error looks like: […]
Pants version
`PANTS_SHA=5d8a328d72209863986c8959b20305505bc068ba` (`v2.13.0a0`), and the error has continued appearing as we've upgraded along the `2.13.x` branch.
OS
Linux
Additional info
Through debugging on Slack, we've found:
- `__pants_df_parser.py` truly isn't in the venv when this error occurs
- […] `--no-process-cleanup` […]