
Nondeterministic ImportError: No module named __pants_df_parser while computing changed targets #16778

Closed
danxmoran opened this issue Sep 6, 2022 · 64 comments · Fixed by #16931 or #18632

@danxmoran
Contributor

(NOTE: There's been a lot of discussion on this in Slack here, but I realized I never filed an issue for it)

Describe the bug

A few times per day (out of ~hundreds of jobs), our CI jobs will fail when trying to build dockerfile_parser.pex. The error always hits at the beginning of goal execution, and affects all our goals (test, lint, check). The error looks like:

13:28:44.08 [INFO] Starting: Resolving plugins: toolchain.pants.plugin==0.20.0
13:28:56.61 [INFO] Completed: Resolving plugins: toolchain.pants.plugin==0.20.0
13:28:57.16 [INFO] [auth-plugin] Successfully loaded Toolchain token from env var 'TOOLCHAIN_AUTH_TOKEN', expiration: 2022-12-20T20:52:09+00:00.
13:29:17.25 [INFO] Starting: Building dockerfile_parser.pex from dockerfile-parser_default.lock
13:29:18.32 [INFO] Canceled: Building dockerfile_parser.pex from dockerfile-parser_default.lock
Exception caught: (pants.engine.internals.scheduler.ExecutionError)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/bin/pants", line 8, in <module>
    sys.exit(main())
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 115, in main
    PantsLoader.main()
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 111, in main
    cls.run_default_entrypoint()
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 93, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 99, in run
    runner = LocalPantsRunner.create(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/bin/local_pants_runner.py", line 159, in create
    specs = calculate_specs(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/init/specs_calculator.py", line 94, in calculate_specs
    (changed_addresses,) = session.product_request(
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 577, in product_request
    return self.execute(request)
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 521, in execute
    self._raise_on_error([t for _, t in throws])
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 498, in _raise_on_error
    raise ExecutionError(
Exception message: 1 Exception encountered:
Engine traceback:
  in select
  in pants.vcs.changed.find_changed_owners
  in pants.backend.project_info.dependees.find_dependees
  in pants.backend.project_info.dependees.map_addresses_to_dependees
  in pants.engine.internals.graph.resolve_dependencies (src/projects/rhp/operations:rhp-api)
  in pants.backend.docker.util_rules.dependencies.inject_docker_dependencies (src/projects/rhp/operations:rhp-api)
  in pants.backend.docker.subsystems.dockerfile_parser.parse_dockerfile
  in pants.engine.process.fallible_to_exec_result_or_raise
Traceback (most recent call last):
  File "/tmp/pants_setup/bootstrap-Linux-x86_64/2.13.0a1+gitb75932fb_py38/lib/python3.8/site-packages/pants/engine/process.py", line 272, in fallible_to_exec_result_or_raise
    raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Parse Dockerfile.' failed with exit code 1.
stdout:
stderr:
Traceback (most recent call last):
  File "/tmp/pants-sandbox-kVePcL/.cache/pex_root/venvs/e2b00a6620ed72ee63184ca7a64555f5551da8c5/f679563b742633f5edaba61ca7d405616ffc49b3/pex", line 234, in <module>
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 203, in run_module
    mod_name, mod_spec, code = _get_module_details(mod_name)
  File "/opt/python/3.8.12/lib/python3.8/runpy.py", line 138, in _get_module_details
    raise error("No module named %s" % mod_name)
ImportError: No module named __pants_df_parser

Pants version

  • The error started appearing on PANTS_SHA=5d8a328d72209863986c8959b20305505bc068ba, and has continued appearing as we've upgraded along the 2.13.x branch.
  • As far as I can tell, the error was not appearing on v2.13.0a0

OS

Linux

Additional info

Through debugging on Slack, we've found:

  1. __pants_df_parser.py truly isn't in the venv when this error occurs
  2. The error doesn't occur if we use --no-process-cleanup
  3. The sandbox PEX created for the dockerfile parser contains the file, even when the pex_root venv does not. Running the PEX creates a venv that contains the file.
@danxmoran danxmoran added the bug label Sep 6, 2022
@danxmoran
Contributor Author

I just saw a new error for the first time, still in the Parse Dockerfile stage 😮

/tmp/pants_named_caches/pex_root/venvs/f92f4a34a585f3429d24d32fb8349b7f900ea322/cf26e88f2412df9ce2b3f6b83a6b129bc3e3bb55/bin/python: can't open file '/tmp/pants_named_caches/pex_root/venvs/f92f4a34a585f3429d24d32fb8349b7f900ea322/cf26e88f2412df9ce2b3f6b83a6b129bc3e3bb55/pex': [Errno 2] No such file or directory

@jsirois jsirois self-assigned this Sep 16, 2022
@jsirois jsirois added the backend: Python Python backend-related issues label Sep 16, 2022
@jsirois
Contributor

jsirois commented Sep 20, 2022

Ok, I started re-reviewing this from the top instead of focusing in on Pex. That leads here:

# If the seeded venv has been removed from the PEX_ROOT, we re-seed from the original
# `--venv` mode PEX file.
if [ ! -e "${{target_venv_executable}}" ]; then
    rm -rf "${{venv_dir}}" || true
    PEX_INTERPRETER=1 ${{execute_pex_args}} -c ''
fi
exec "${{target_venv_executable}}" "$@"

Clearly, the rm -rf can be interrupted, and since rm -r is implemented as a depth-first delete, that can leave the root venv/pex file present while portions of its subtree have been removed. So, given that, imagine this:
*1. A process runs on a machine with no venv in the PEX_ROOT yet, and enters the outer guard.
2. Another process enters the guard and builds the venv and starts exec'uting it.
3. The original process starts processing the rm -rf on the venv the process from step 2 is using.

*N.B. 1 can only happen when the PEX in question is not built locally, since we pre --seed the PEX venv when building it. In CI, that means either the LMDB store is cached but the named_caches (PEX_ROOT) are not or that remote caching is enabled and that's where the pre-built PEX comes from.
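
A sequential Python sketch of that interleaving (hypothetical paths and helper names; it compresses the three steps into one script, but shows how process A's rm -rf destroys the venv that process B just seeded and is executing from):

import os
import shutil
import tempfile

pex_root = tempfile.mkdtemp()  # hypothetical stand-in for the shared PEX_ROOT
venv_dir = os.path.join(pex_root, "venvs", "deadbeef")
target_venv_executable = os.path.join(venv_dir, "pex")


def seed_venv():
    # Stand-in for `PEX_INTERPRETER=1 <pex> -c ''` seeding the venv.
    os.makedirs(venv_dir, exist_ok=True)
    with open(target_venv_executable, "w") as fp:
        fp.write("#!/bin/sh\n")


# Step 1: process A finds no venv and enters the outer guard.
process_a_entered_guard = not os.path.exists(target_venv_executable)

# Step 2: process B enters the same guard, seeds the venv, and starts using it.
seed_venv()
assert os.path.exists(target_venv_executable)

# Step 3: process A resumes and runs the old `rm -rf "$venv_dir"`. The delete is
# depth-first and non-atomic, so process B can observe a partially removed venv
# (or, as at the end of this sketch, no venv at all) while it is executing.
if process_a_entered_guard:
    shutil.rmtree(venv_dir, ignore_errors=True)

print("venv executable still present for B:", os.path.exists(target_venv_executable))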

@danxmoran can you check me on this logic, and whether it fully describes / comports with your setup where you are seeing failures?

@jsirois
Contributor

jsirois commented Sep 20, 2022

I think if the above is the explanation, then just switching to Pex's --sh-boot mode and eliminating the wrapper script will do the trick and allow Pants to keep all the hard-earned milliseconds saved by this old wrapper script pre-dating --sh-boot.

@danxmoran
Contributor Author

@jsirois the explanation makes sense for our CI context - we use ephemeral runners so the Pants cache and PEX root are empty at the start of each run, and we use remote caching.

@jsirois
Contributor

jsirois commented Sep 20, 2022

Alrighty, thanks for the confirmation. I think the issue is now fully understood. I want to stew a bit more on the --sh-boot fix vs. just removing the rm -rf as a smaller step. I need to review why / when I added the rm -rf in the first place.

@jsirois
Contributor

jsirois commented Sep 20, 2022

Ok, the rm -rf was in there from the get-go a year and a half ago: https://github.com/pantsbuild/pants/pull/11557/files#r575485648

I seem to recall during development of that PR that I initially didn't have it and ran into issues, but I'll give that path a thorough analysis before committing to the longer term --sh-boot fix.

jsirois added a commit to jsirois/pants that referenced this issue Sep 20, 2022
There was a race in venv re-population due to a non-atomic `rm`, create
sequence. There was no real need for the `rm` and the create is atomic
on its own; so just remove the `rm` which was only attempting to guard
"corrupted" venvs in a slapdash way. Now the venv either exists or it
doesn't from the script point of view. If the venv exists but has been
tampered with, its execution will consistently fail until there is a
manual intervention removing the venv dir offline.

Fixes pantsbuild#16778
@jsirois
Contributor

jsirois commented Sep 20, 2022

Ok, the --sh-boot will still require scripts or some new --sh-boot PEX_SCRIPT affordance or some sort of conscript-like system since code like this needs to point directly at various console scripts in a venv:

argv = [os.path.join(protoc_relpath, downloaded_protoc_binary.exe), "--python_out", output_dir]
if mypy_pex:
    argv.extend(
        [
            f"--plugin=protoc-gen-mypy={mypy_pex.bin[protoc_gen_mypy_script].argv0}",
            "--mypy_out",
            output_dir,
        ]
    )
if downloaded_grpc_plugin:
    argv.extend(
        [f"--plugin=protoc-gen-grpc={downloaded_grpc_plugin.exe}", "--grpc_out", output_dir]
    )
if mypy_pex and protoc_gen_mypy_grpc_script in mypy_pex.bin:
    argv.extend(
        [
            f"--plugin=protoc-gen-mypy_grpc={mypy_pex.bin[protoc_gen_mypy_grpc_script].argv0}",
            "--mypy_grpc_out",
            output_dir,
        ]
    )

It looks like the rm -rf ... can simply be dropped. I'll proceed down that path.

jsirois added a commit that referenced this issue Sep 20, 2022 (the fix above; Fixes #14618, Fixes #16778)
jsirois added commits to jsirois/pants that referenced this issue Sep 20, 2022 (cherry picked from commit cace851)
jsirois added commits that referenced this issue Sep 21, 2022 (cherry picked from commit cace851)
@danxmoran
Contributor Author

@jsirois this is still a problem even with your patch - I just hit it running on PANTS_SHA f06bf22

@danxmoran danxmoran reopened this Sep 21, 2022
@jsirois
Contributor

jsirois commented Sep 21, 2022

@danxmoran I don't believe you / hrm. Can you get a sandbox on that version, even if just locally, and confirm the wrapper script has the rm -rf ... removed?

@danxmoran
Contributor Author

Confirmed that when I look at a sandbox locally the rm -rf ... is gone 🤔

@jsirois
Contributor

jsirois commented Sep 21, 2022

Alrighty - thanks. I am totally out of ideas then. That rm -rf ... truly was a race that definitely should cause errors like you saw. With that gone, the focus moves back to Pex's atomicity of venv creation, which is what I had previously scrutinized, finding the only hole to be a POSIX leaky file lock leading to a uuid4 collision - and that is not a viable explanation. This may need new minds staring at code.

@Eric-Arellano
Contributor

this is still a problem even with your patch - I just hit it running on PANTS_SHA f06bf22

Same stack trace as the OP?

@danxmoran
Contributor Author

I think yes, but here it is just in case:

Exception caught: (pants.engine.internals.scheduler.ExecutionError)
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/bin/pants", line 8, in <module>
    sys.exit(main())
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 112, in main
    PantsLoader.main()
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 108, in main
    cls.run_default_entrypoint()
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/bin/pants_loader.py", line 90, in run_default_entrypoint
    exit_code = runner.run(start_time)
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/bin/pants_runner.py", line 99, in run
    runner = LocalPantsRunner.create(
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/bin/local_pants_runner.py", line 158, in create
    specs = calculate_specs(
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/init/specs_calculator.py", line 72, in calculate_specs
    (changed_addresses,) = session.product_request(
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 577, in product_request
    return self.execute(request)
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 521, in execute
    self._raise_on_error([t for _, t in throws])
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/engine/internals/scheduler.py", line 498, in _raise_on_error
    raise ExecutionError(

Exception message: 1 Exception encountered:

Engine traceback:
  in select
  in pants.vcs.changed.find_changed_owners
  in pants.backend.project_info.dependees.find_dependees
  in pants.backend.project_info.dependees.map_addresses_to_dependees
  in pants.engine.internals.graph.resolve_dependencies (src/scripts/solutions_architects/lambdas/xifin_formatter:xifin-report-formatter)
  in pants.backend.docker.util_rules.dependencies.infer_docker_dependencies
  in pants.backend.docker.subsystems.dockerfile_parser.parse_dockerfile
  in pants.engine.process.fallible_to_exec_result_or_raise
Traceback (most recent call last):
  File "/home/runner/.pants/setup/bootstrap-Linux-x86_64/2.14.0rc1+gitf06bf22b_py38/lib/python3.8/site-packages/pants/engine/process.py", line 272, in fallible_to_exec_result_or_raise
    raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Parse Dockerfile.' failed with exit code 1.
stdout:

stderr:
Traceback (most recent call last):
  File "/home/runner/.pants/execution/pants-sandbox-L4oqO4/.cache/pex_root/venvs/276ee4fbe15aa067f0a986eeb6845163f958680d/cf26e88f2412df9ce2b3f6b83a6b129bc3e3bb55/pex", line 235, in <module>
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)
  File "/opt/python/3.8.13/lib/python3.8/runpy.py", line 203, in run_module
    mod_name, mod_spec, code = _get_module_details(mod_name)
  File "/opt/python/3.8.13/lib/python3.8/runpy.py", line 138, in _get_module_details
    raise error("No module named %s" % mod_name)
ImportError: No module named __pants_df_parser

@jsirois
Contributor

jsirois commented Sep 21, 2022

@danxmoran I'm stepping away from work until the 28th; so I'll un-assign for now and pick back up then if no one else has dug in.

@thejcannon
Member

Setting that env var did not stop the issue from recurring today. Unfortunately the machine went offline before I could debug, but I think we're prepped to debug next time.

@thejcannon
Member

thejcannon commented Mar 22, 2023

I've made the following changes to try and reproduce, with still no luck after roughly 10 goes at CI:

  • In ./pants set export PANTS_NAMED_CACHES_DIR=$(mktemp -d) so that we get a new Named cache dir (and therefore Pex root) each run
  • Set [filter].target_type = ["docker_image"] in pants.toml so our work really only applies to dockerfiles (just a speedup, but putting it here for completeness)
  • Turned off both pantsd and the local cache, since this seemingly only reproduces on machines where dockerfile_parser.pex is being pulled remotely.
  • In the root BUILD.pants use the following:
for i in __import__("builtins").range(1000):
    CACHE_BUSTER = __import__("random").randint(0, 1000000000)

    docker_image(
        name=f"{CACHE_BUSTER}-img",
        skip_push=True,
        image_tags=["whatever"],
        repository="idkman",
        registries=["myregistry"],
        instructions=[
            "FROM ubuntu:focal",
            f"# {CACHE_BUSTER}",
        ],
    )

In CI I can confirm that dockerfile_parser.pex is being pulled remotely (seemingly a constant among Dan's and my reproductions?) and can confirm it is parsing the dockerfiles by looking at the --stats-log.

This is without _PEX_FILE_LOCK_STYLE=bsd set.

My hope was that we'd hit the race condition by forcing many many processes to all want a stab at the venv, but no such luck.

@danxmoran
Contributor Author

Another data-point:

We saw this error a few times yesterday. In every case:

  1. The error occurred in a CI job running pants publish on a single docker_image target, so there were no concurrent Pex processes
  2. The "building Pex" process for the Dockerfile parser was cancelled after a remote-cache hit
  3. Looking in the post-failure sandboxes, dockerfile_parser.pex/__pants_df_parser.py is present! But there is no .pyc for the file within dockerfile_parser.pex/__pycache__/

Could there possibly be a race condition between populating the sandbox and invoking the dockerfile parser, such that sometimes the parser is invoked before __pants_df_parser.py lands in the right place?

@thejcannon
Member

thejcannon commented Mar 29, 2023

Having this not be a pex concurrency issue would align with my current thinking, since banging on it all day didn't reproduce.

Additionally, a common thread in every reproduction is the remote cache beating local execution.

@stuhood
Member

stuhood commented Mar 29, 2023

While a race between materializing files in a sandbox and running a process is possible, it would have nothing at all to do with whether a cache hit had occurred: files are materialized from the Store, regardless of whether they came from the cache or another process in the same run.

Based on where this occurs, my intuition (without having investigated this in a long time) is that this is a PEX named cache entry not being cancellation safe, such that if a PEX process is interrupted (as by the cancellation), a cache is partially populated. When a PEX file is fetched from the remote cache, if it bakes in any assumptions about named cache entries being present, that can be (see: #12727) problematic.

To test that hypothesis, you'd need to:

  1. nail down precisely which mode PEX is running in in this case
  2. figure out which named cache entries it uses, and in which order
  3. then see whether some or all of those cache entries being missing could lead to this error

But also: I have had relatively good luck with adding targeted debug output. Adding additional debug output to the df-parser run, like adding PEX_VERBOSE=9 explicitly (or behind a temporary option), might help shake this out.

@jsirois
Contributor

jsirois commented Mar 29, 2023

@stuhood it seems like we're stuck in a loop here. This might help:

Based on where this occurs, my intuition (without having investigated this in a long time) is that this is a PEX named cache entry not being cancellation safe

In other language this means: a Pex process cancellation by Pants - which maps to what, a kill SIGQUIT or a kill SIGKILL? - leads to a Pex atomic directory operation - a block of Python code whose last step on the happy path, with no prior exceptions raised, is an os.rename to the final cache path Pants can see - somehow running part way yet never raising an exception (this is how you get half a dir populated), and finally doing the os.rename anyhow to expose the half-populated directory?

This sounds crazy, but from what you're saying this must be happening somehow.

Put another way - how does one make a process to be run by Pants cancellation safe? The atomic_directory code thinks in terms of being multi-thread / multi-process safe only. But maybe I'm missing a third thing that is cancellation safe.
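
For readers who haven't looked at that code: the pattern being described boils down to roughly the sketch below (hypothetical names, much simplified relative to pex/atomic_directory.py, which also takes file locks and arbitrates between concurrent writers). The point is that a SIGKILL at any moment before the final rename should leave only an orphaned work dir, never a half-populated cache path:

import os
import shutil
import tempfile
import uuid


def atomic_directory(final_dir, populate):
    # Simplified sketch of the create-then-rename pattern, not Pex's real code.
    if os.path.isdir(final_dir):
        return  # Another writer already published this directory.
    work_dir = f"{final_dir}.{uuid.uuid4().hex}.work"
    os.makedirs(work_dir)
    try:
        populate(work_dir)
        # The only step that exposes the directory to readers. A process killed
        # at any earlier point leaves only an orphaned work dir behind.
        os.rename(work_dir, final_dir)
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)


venvs = os.path.join(tempfile.mkdtemp(), "venvs")
os.makedirs(venvs)
dest = os.path.join(venvs, "example")
atomic_directory(dest, lambda d: open(os.path.join(d, "pex"), "w").close())
print(os.listdir(dest))  # ['pex']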

@stuhood
Member

stuhood commented Mar 29, 2023

This sounds crazy, but from what you're saying this must be happening somehow.

I wasn't claiming to have reasoned this all out: only suggesting a course of action.

In other language this means a Pex process cancellation by Pants - which maps to what? - a kill SIGQUIT or a kill SIGKILL? leads to a Pex atomic directory operation - which is a block of Python code whose last step on the happy path of no exceptions raised prior, is an os.rename to the final cache path Pants can see, somehow running part way but somehow never raising an exception (this is how you get 1/2 a dir populated), and finally doing the os.rename anyhow to expose the 1/2 directory?

It will lead to SIGKILL: we don't do graceful shutdown for Processes currently (only InteractiveProcess). I agree that the atomic rename should be sufficient though, regardless.

@jsirois
Contributor

jsirois commented Mar 29, 2023

I wasn't claiming to have reasoned this all out: only suggesting a course of action.

Aha, Ok. I'm pretty sensitive at this point to wasting people's time. Perhaps reporters have not gone through the suggested steps, but I seem to recall they've been down this road before. Basically in a cycle of maybe it's X, please collect ~data, repeat. I've probably spent ~1 full work week (at least) staring hard at Pex code, adding debug output and knobs to releases, etc. to get to the bottom of this from the Pex side re multi-thread or multi-process errors in atomic directory logic or usage, and I've come up blank. The only real fix so far has come from a much smaller effort on my part investigating Pants code, where I fixed the VenvPex wrapper script early on to not do an rm -rf which was on its face racy and wrong.

I'll have a look again at the Pants side later today since I'm fully juiced for ideas on the Pex side. I've noted to @danxmoran before the more eyes the better, but the key Pex code is here: https://github.com/pantsbuild/pex/blob/main/pex/atomic_directory.py

@thejcannon
Member

The things I would keep fresh in mind, personally, are that this only occurs:

  • For this one specific file in this one specific sandbox for this one specific process. It likely isn't as fingerprinted as anything in the file/sandbox/process. I'm guessing it's just the right size for
  • Remote cache is always at play and it always beats out local process execution for building the pex

There's something we never put on the table (which I REALLY hope isn't the case), but I'm pretty sure these venvs have R+W perms. So technically speaking a rogue remove relative to some __file__ could muck up the venv. Again, implausible but possible.

@danxmoran
Contributor Author

I just noticed that we're now getting BuildSense links for these errors (vs. when I first reported the problem and the error was hitting before the links were logged). Here's the latest one: https://app.toolchain.com/organizations/color/repos/color/builds/pants_run_2023_03_29_11_21_12_96_a150c7d599f6461b818e7e7c8cc63ef3/.

Looking at the workunits.json that was collected in that run, I realize I have to walk back one of my assumptions. Even though the CI job is running pants publish on a single docker_image target, we still end up running multiple "Parse Dockerfile" processes because of dependencies on in-repo base images. The total dependency stack is 3 images deep (docker_image -> docker_image -> docker_image). The lowest image in the chain is the one that failed.

Would it be useful for me to set --streaming-workunits-level=trace in this CI job? I don't mind eating the extra slowness if it helps us get to the bottom of things.

@thejcannon
Member

Some new notes from the most recent occurrence with some hands-on time before our instance got terminated:

  • All the same information as prior:
    • No module named __pants_df_parser
    • __pants_df_parser.py is in the PEX
$ ls -ln dockerfile_parser.pex
total 20
-rw-r--r-- 1 1000 1000  777 Mar 29 15:41 PEX-INFO
-rwxr-xr-x 1 1000 1000 3503 Mar 29 15:41 __main__.py
-rwxr-xr-x 1 1000 1000 6949 Mar 29 15:41 __pants_df_parser.py
drwxrwxr-x 2 1000 1000 4096 Mar 29 15:41 __pex__

Then I remove the venv:

$ rm -rf .cache/pex_root/venvs/35e4cfd30810a27a5e9797d48e3ae2214064fd4c/ddab8011daaee380698ac2fb9701af18c90c03f6/

and re-run:

$ ./__run.sh
...
Ignoring the following environment variables in Pex venv mode:
_PEX_FILE_LOCK_STYLE=bsd

Ignoring the following environment variables in Pex venv mode:
_PEX_FILE_LOCK_STYLE=bsd

Traceback (most recent call last):
  File "/tmp/pants-sandbox-Dghqof/.cache/pex_root/venvs/35e4cfd30810a27a5e9797d48e3ae2214064fd4c/ddab8011daaee380698ac2fb9701af18c90c03f6/pex", line 243, in <module>
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)
  File "/usr/lib/python3.8/runpy.py", line 203, in run_module
    mod_name, mod_spec, code = _get_module_details(mod_name)
  File "/usr/lib/python3.8/runpy.py", line 138, in _get_module_details
    raise error("No module named %s" % mod_name)
ImportError: No module named __pants_df_parser

So ignoring Pants for a moment, I can reproduce the issue just from the sandbox by running the script 🎉

Here's the contents of the bad venv:

$ ls -ln .cache/pex_root/venvs/35e4cfd30810a27a5e9797d48e3ae2214064fd4c/ddab8011daaee380698ac2fb9701af18c90c03f6/lib/python3.8/site-packages/
total 16
-rw-rw-r-- 1 1000 1000  230 Mar 29 16:08 PEX_EXTRA_SYS_PATH.pth
drwxrwxr-x 2 1000 1000 4096 Mar 29 16:08 dockerfile-3.2.0.dist-info
lrwxrwxrwx 1 1000 1000  187 Mar 29 16:08 dockerfile.abi3.h -> ../../../../../../installed_wheels/d69ac9c0d6d985f68e3b3d228605d457a9b688e60e27cdc426a6ce79ae214307/dockerfile-3.2.0-cp36-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.whl/dockerfile.abi3.h
lrwxrwxrwx 1 1000 1000  188 Mar 29 16:08 dockerfile.abi3.so -> ../../../../../../installed_wheels/d69ac9c0d6d985f68e3b3d228605d457a9b688e60e27cdc426a6ce79ae214307/dockerfile-3.2.0-cp36-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.whl/dockerfile.abi3.so

Running the PEX outside of the run script DOES NOT show the issue:

$ /usr/bin/python3.8 ./dockerfile_parser.pex
Not enough arguments.
Usage: /home/ubuntu/.pex/venvs/35e4cfd30810a27a5e9797d48e3ae2214064fd4c/ddab8011daaee380698ac2fb9701af18c90c03f6/lib/python3.8/site-packages/__pants_df_parser.py [DOCKERFILE ...]

Additionally, scping a tarball of the sandbox and running it on my machine also DOES NOT reproduce the issue.

I strongly suspect something is poisoned about the pex root and we've all been barking up the wrong tree, but John can help rule out any pex issues with the pex itself. I'll share privately in a DM on Slack.

@thejcannon
Member

thejcannon commented Mar 30, 2023

I think I found a smoking gun today:

$ cat ./dockerfile_parser.pex/PEX-INFO  | jq .code_hash
"fa968c16b8df77369826d9e37f1517ba5fbc8eb4"
$ ls -lna  .cache/pex_root/user_code/fa968c16b8df77369826d9e37f1517ba5fbc8eb4/
total 8
drwxrwxr-x 2 1000 1000 4096 Mar 30 12:58 .
drwxrwxr-x 5 1000 1000 4096 Mar 30 12:58 ..

whereas if I point to a different PEX_ROOT:

$ ls -lna ~/.pex/user_code/fa968c16b8df77369826d9e37f1517ba5fbc8eb4/
total 24
drwxrwxr-x  4 1000 1000 4096 Mar 30 13:05 .
drwxrwxr-x 10 1000 1000 4096 Mar 30 13:05 ..
-rwxr-xr-x  3 1000 1000 6949 Mar 30 11:38 __pants_df_parser.py
drwxrwxr-x  2 1000 1000 4096 Mar 30 13:05 __pex__
drwxrwxr-x  2 1000 1000 4096 Mar 30 13:05 __pycache__

So it's a poisoned cache. Unclear when it got poisoned, though. It could have been in this run, if this is the first run with that code_hash, or an earlier one.

@jsirois
Contributor

jsirois commented Mar 31, 2023

Ok. The data @thejcannon provided was a second instance of similar data @danxmoran provided many months ago, and this was finally enough to switch the focus more sharply to remote-cache-hit cancellation of the VenvPex build process. Thank you both for being willing to collect data.

My fundamental analysis error over these past 9 months was assuming the data loss was on the output side. It in fact looks like it was on the input side instead. I did not complete an analysis of the Pants code that sets up the sandbox the Pex CLI is run in to create the __df_parser.py VenvPex, but @stuhood provided expert input that the contents of that sandbox were all there if the Pex CLI process was launched.

Taking that to heart, but still focusing on the input side, the missing file is a user code file (the same file) in both reported cases. The Pex is a packed VenvPex. A packed VenvPex is a directory containing 4 classes of items:

  1. loose Pex-internal files: PEX-INFO, __main__.py, the __pex__/ import hook package.
  2. loose user-code
  3. a .bootstrap zip of the Pex bootstrap code
  4. 0 or more .deps/*.whl installed wheel zips.

When such a VenvPex is built, Pants passes --seed to ensure the PEX_ROOT caches are pre-seeded. In the course of seeding a VenvPex from a packed PEX, the packed PEX is first installed as an unzipped PEX in the PEX_ROOT before re-execing from that to do the venv install in the venvs cache under the PEX_ROOT. The unzipped PEX installation has 3 atomic inputs and 1 non-atomic input - the non-atomic one being the user code. That is gathered via an os.walk here:
https://github.com/pantsbuild/pex/blob/bfea50d28f875d74607d6cc5ef93271744d9dd7b/pex/layout.py#L274-L289

So, the dest_dir there is an atomic_directory work dir - which is as things should be. But the walk assumes the inputs will be stable. During Pants remote cache speculation, however, a remote cache hit beating a local VenvPex build and --seed will issue a SIGKILL to the Pex CLI process and then go through Rust teardown, which includes RAII finalization of the sandbox /tmp dir. The Pex CLI process can thus have the sandbox packed PEX dir open for reading while its contents are nuked by the Rust RAII teardown, leading to an os.walk over an empty sandbox packed PEX directory that populates the unzipped PEX user_code cache with 0 user code files. From that point on the cache is poisoned and the user code (__df_parser.py) will always be missing when a venv is created or re-created from that PEX_ROOT cache.
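
Sketched in Python with hypothetical paths (the real gather is the os.walk linked above in pex/layout.py), the failure mode is that walking a source tree torn down out from under the process raises no error - it simply yields nothing - and the happy-path atomic publish then caches the empty result:

import os
import shutil
import tempfile

# Hypothetical stand-ins for the sandbox's packed PEX and the PEX_ROOT
# user_code cache; this is an illustration, not Pex's actual copying code.
sandbox = tempfile.mkdtemp(prefix="pants-sandbox-")
user_code_src = os.path.join(sandbox, "dockerfile_parser.pex")
os.makedirs(user_code_src)
with open(os.path.join(user_code_src, "__pants_df_parser.py"), "w") as fp:
    fp.write("print('parser')\n")

pex_root = tempfile.mkdtemp(prefix="pex_root-")
final_dir = os.path.join(pex_root, "user_code", "fa968c16")  # hypothetical code_hash
work_dir = final_dir + ".work"
os.makedirs(work_dir)

# The remote cache hit wins speculation: Pants SIGKILLs the Pex CLI and the Rust
# RAII teardown removes the sandbox while the not-yet-dead process is still
# mid-gather of its user code input.
shutil.rmtree(sandbox)

# The os.walk-based gather does not fail on a missing source tree; it simply
# copies zero files into the atomic_directory work dir.
for root, _dirs, files in os.walk(user_code_src):
    for name in files:
        shutil.copy(os.path.join(root, name), work_dir)

# The happy-path rename still runs, publishing (and so poisoning) an empty
# user_code cache entry for this code_hash.
os.rename(work_dir, final_dir)
print("cached user code:", os.listdir(final_dir))  # []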

@stuhood has concurred this is a likely cause and has volunteered to take over providing a fix for this uncontrolled asynchrony on the Pants side. The idea being a Pants Process should not have to come equipped with its own bespoke code to guard against its inputs being mutated out from under it.

@jsirois jsirois assigned stuhood and unassigned jsirois Mar 31, 2023
jsirois added a commit that referenced this issue Mar 31, 2023
…ed up (#18632)

As @jsirois discovered and described in
#16778 (comment),
because the `local::CommandRunner` does not `wait` for its child process
to have exited, it might be possible for a SIGKILL to not have taken
effect on the child process before its sandbox has been torn down. And
that could result in the process seeing a partial sandbox (and in the
case of #16778, potentially cause named cache corruption).

This change ensures that we block to call `wait` when spawning local
processes by wrapping them in the `ManagedChild` helper that we were
already using for interactive processes. It then attempts to clarify the
contract of the `CapturedWorkdir` trait (but in order to improve
cherry-pickability does not attempt to adjust it), to make it clear that
child processes should fully exit before the `run_and_capture_workdir`
method has returned.

Fixes #16778 🤞.

---------

Co-authored-by: John Sirois <john.sirois@gmail.com>
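
In Python terms, the ordering this change enforces looks roughly like the sketch below (a simplified analogue with hypothetical names; the actual change is in the Rust local::CommandRunner / ManagedChild code): kill the child if its run is cancelled, wait for it to actually be reaped, and only then allow the sandbox to be torn down.

import shutil
import subprocess
import tempfile


def run_in_sandbox(argv, cancelled):
    sandbox = tempfile.mkdtemp(prefix="pants-sandbox-")
    child = subprocess.Popen(argv, cwd=sandbox)
    try:
        if cancelled:
            # e.g. a remote cache hit won speculation, so the local run is killed.
            child.kill()
        # The crux of the fix: block until the child has actually exited. Without
        # this wait, the SIGKILL may not have taken effect yet and the child can
        # still be reading its sandbox inputs while they are deleted below.
        child.wait()
    finally:
        shutil.rmtree(sandbox, ignore_errors=True)


run_in_sandbox(["sleep", "30"], cancelled=True)
print("child reaped before sandbox teardown")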
jsirois pushed commits to jsirois/pants that referenced this issue Mar 31, 2023 (cherry-picks of #18632, from commit 1462bef)
jsirois added commits that referenced this issue Mar 31, 2023: #18640, #18641, #18642 (cherry-picks of #18632, from commit 1462bef; co-authored by Stu Hood)
@engnatha
Copy link

Sorry to revisit this can of worms, but we just got hit with this on 2.15.0.

[BOOTSTRAP] [INFO] Sending command: /buildbot/astranis-monorepo-python/build/pants --print-stacktrace export //:reqs
[BOOTSTRAP] [WARNING] Received non-zero return code (1) from /buildbot/astranis-monorepo-python/build/pants --print-stacktrace export //:reqs
[BOOTSTRAP] [WARNING] STDOUT: 
[BOOTSTRAP] [WARNING] STDERR: 22:02:43.94 [INFO] Initializing scheduler...
22:02:44.47 [INFO] Scheduler initialized.
22:02:44.59 [WARN] DEPRECATED: exporting resolves without using the --resolve option is scheduled to be removed in version 2.23.0.dev0.
Use the --resolve flag one or more times to name the resolves you want to export,
and don't provide any target specs. E.g.,
`/buildbot/astranis-monorepo-python/build/pants export --resolve=python-default --resolve=pytest`
22:02:47.98 [ERROR] 1 Exception encountered:
Engine traceback:
  in select
    ..
  in pants.core.goals.export.export
    `export` goal
  in pants.backend.python.goals.export.export_virtualenvs
    ..
  in pants.backend.python.goals.export.export_virtualenv_for_targets
    ..
  in pants.backend.python.util_rules.pex_from_targets.create_pex_from_targets
    ..
  in pants.backend.python.util_rules.pex_from_targets.determine_requirement_strings_in_closure
    ..
  in pants.engine.internals.graph.transitive_targets
    Resolve transitive targets
  in pants.engine.internals.graph.transitive_dependency_mapping
    ..
  in pants.engine.internals.graph.resolve_targets
    ..
  in pants.engine.internals.graph.resolve_unexpanded_targets
    ..
  in pants.engine.internals.graph.resolve_dependencies
    Resolve direct dependencies of target - astranis-python/astranis/ground/service/ground_correspondence:ground_correspondence
  in pants.backend.docker.util_rules.dependencies.infer_docker_dependencies
    ..
  in pants.backend.docker.subsystems.dockerfile_parser.parse_dockerfile
    ..
  in pants.engine.process.fallible_to_exec_result_or_raise
    ..
Traceback (most recent call last):
  File "/home/buildbot/.cache/pants/setup/bootstrap-Linux-x86_64/2.15.0_py38/lib/python3.8/site-packages/pants/engine/process.py", line 289, in fallible_to_exec_result_or_raise
    raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Parse Dockerfile.' failed with exit code 1.
stdout:
stderr:
Traceback (most recent call last):
  File "/tmp/pants-sandbox-yhk9SH/.cache/pex_root/venvs/78566b08c5c7ade31237703b0f67d5496faeb045/ddab8011daaee380698ac2fb9701af18c90c03f6/pex", line 243, in <module>
    runpy.run_module(module_name, run_name="__main__", alter_sys=True)
  File "/usr/lib/python3.8/runpy.py", line 203, in run_module
    mod_name, mod_spec, code = _get_module_details(mod_name)
  File "/usr/lib/python3.8/runpy.py", line 138, in _get_module_details
    raise error("No module named %s" % mod_name)
ImportError: No module named __pants_df_parser
Use `--keep-sandboxes=on_failure` to preserve the process chroot for inspection.

Any chance there was a regression between when this was fixed and that version?

@jsirois
Contributor

jsirois commented May 12, 2023
