Nondeterministic `ImportError: No module named __pants_df_parser` while computing changed targets #16778
I just saw a new error for the first time, still in the […]
Ok, I started re-reviewing this from the top instead of focusing in on Pex. That leads here: `pants/src/python/pants/backend/python/util_rules/pex.py` lines 750-757 (at 0bbf1fb).
Clearly the `rm -rf` can be interrupted, and since […] N.B. 1 can only happen when the PEX in question is not built locally, since we pre-[…]. @danxmoran can you check me on this logic / whether it fully describes / comports with your setup where you are seeing failures?
I think if the above is the explanation, then just switching to Pex's […]
@jsirois the explanation makes sense for our CI context - we use ephemeral runners so the Pants cache and PEX root are empty at the start of each run, and we use remote caching.
Alrighty, thanks for the confirmation. I think the issue is now fully understood. I want to stew a bit more on the […]
Ok, the […] I seem to recall during development of that PR that I initially didn't have it and ran into issues, but I'll give that path a thorough analysis before committing to the longer term […]
There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is atomic on its own; so just remove the `rm` which was only attempting to guard "corrupted" venvs in a slapdash way. Now the venv either exists or it doesn't from the script point of view. If the venv exists but has been tampered with, its execution will consistently fail until there is a manual intervention removing the venv dir offline. Fixes pantsbuild#16778
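For readers skimming the thread, here is a minimal sketch of the difference that commit message describes, using hypothetical helper names (`install_venv_racy`, `install_venv_atomic`) and only standard-library tools; it illustrates the pattern, not the actual Pex code:

```python
import os
import shutil
import subprocess
import sys
import tempfile


def install_venv_racy(venv_dir: str) -> None:
    # Racy: there is a window between the rm and the re-create in which another
    # process (or a process whose rm was interrupted) can observe a missing or
    # half-built venv.
    shutil.rmtree(venv_dir, ignore_errors=True)
    subprocess.run([sys.executable, "-m", "venv", venv_dir], check=True)


def install_venv_atomic(venv_dir: str) -> None:
    # Atomic: build the venv in a scratch directory beside the target, then
    # publish it with a single os.rename. Observers either see no venv at all
    # or a fully built one; there is no in-between state to race against.
    if os.path.isdir(venv_dir):
        return
    work_dir = tempfile.mkdtemp(dir=os.path.dirname(os.path.abspath(venv_dir)))
    try:
        staged = os.path.join(work_dir, "venv")
        subprocess.run([sys.executable, "-m", "venv", staged], check=True)
        try:
            os.rename(staged, venv_dir)
        except OSError:
            pass  # Another process won the race and already published the venv.
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)
```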
Ok, the […] `pants/src/python/pants/backend/codegen/protobuf/python/rules.py` lines 146-167 (at 7f98ead) […]
It looks like the […]
There was a race in venv re-population due to a non-atomic `rm`, create sequence. There was no real need for the `rm` and the create is atomic on its own; so just remove the `rm` which was only attempting to guard "corrupted" venvs in a slapdash way. Now the venv either exists or it doesn't from the script point of view. If the venv exists but has been tampered with, its execution will consistently fail until there is a manual intervention removing the venv dir offline. Fixes #14618 Fixes #16778
(The same fix was cherry-picked from commit cace851 to the release branches.)
@danxmoran I don't believe you / hrm. Can you get a sandbox on that version, even if just locally, and confirm the wrapper script has the […]
Confirmed that when I look at a sandbox locally the […]
Alrighty - thanks. I am totally out of ideas then. That […]
Same stacktrace as the OP?
I think yes, but here it is just in case: […]
@danxmoran I'm stepping away from work until the 28th; so I'll un-assign for now and pick back up then if no one else has dug in.
Setting that env var did not stop the issue from recurring today. Unfortunately the machine went offline before I could debug, but I think we're prepped for debug next time.
I've made the following changes to try and reproduce, with still no luck after roughly 10 goes at CI:
for i in __import__("builtins").range(1000):
    CACHE_BUSTER = __import__("random").randint(0, 1000000000)
    docker_image(
        name=f"{CACHE_BUSTER}-img",
        skip_push=True,
        image_tags=["whatever"],
        repository="idkman",
        registries=["myregistry"],
        instructions=[
            "FROM ubuntu:focal",
            f"# {CACHE_BUSTER}",
        ],
    )
In CI I can confirm that […] This is without […] My hope was that we'd hit the race condition by forcing many many processes to all want a stab at the venv, but no such luck.
Another data-point: We saw this error a few times yesterday. In every case: […]
Could there possibly be a race condition between populating the sandbox & invoking the dockerfile parser? Such that sometimes the parser is invoked before […]
Having this not be a pex concurrency issue would align with my current thinking, since banging on it all day didn't reproduce. Additionally, a common thread in every reproduction is remote cache beating local execution.
While a race between materializing files in a sandbox and running a process is possible, it would have nothing at all to do with whether a cache hit had occurred: files are materialized from the […]. Based on where this occurs, my intuition (without having investigated this in a long time) is that this is a PEX named cache entry not being cancellation safe, such that if a PEX process is interrupted (as by the cancellation), a cache entry is partially populated. When a PEX file is fetched from the remote cache, if it bakes in any assumptions about named cache entries being present, that can be problematic (see: #12727). To test that hypothesis, you'd need to: […]
But also: I have had relatively good luck with adding targeted debug output. Adding additional debug output to the df-parser run, like adding […]
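One low-tech way to capture that kind of evidence (a sketch of the general idea, not the actual Pants hook point; the script name is taken from the error message) is a debug preamble at the very top of the generated parser script that dumps what the process can actually see before any real imports run:

```python
# Hypothetical debug preamble for a generated script like __pants_df_parser.py:
# record the sandbox contents and the import path before doing any real work,
# so a failing run leaves evidence of what was actually on disk at the time.
import os
import sys

print(f"cwd={os.getcwd()}", file=sys.stderr)
print(f"sys.executable={sys.executable}", file=sys.stderr)
print(f"sys.path={sys.path}", file=sys.stderr)
for root, _dirs, files in os.walk(os.getcwd()):
    for name in files:
        print(os.path.join(root, name), file=sys.stderr)
```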
@stuhood it seems like we're stuck in a loop here. This might help: […]
In other words, this means a Pex process cancellation by Pants (which maps to what? a kill SIGQUIT or a kill SIGKILL?) leads to a Pex atomic directory operation - which is a block of Python code whose last step on the happy path, with no exceptions raised prior, is an […]. This sounds crazy, but from what you're saying this must be happening somehow. Put another way: how does one make a process run by Pants cancellation safe? The atomic_directory code thinks in terms of being multi-thread / multi-process safe only. But maybe I'm missing a 3rd thing that is cancellation safe.
I wasn't claiming to have reasoned this all out: only suggesting a course of action.
It will lead to SIGKILL: we don't do graceful shutdown for […]
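For readers unfamiliar with the distinction being drawn here: a process can register handlers for SIGTERM/SIGQUIT and clean up after itself, but SIGKILL can never be caught, so a killed child gets no chance to tidy a half-written directory. A small stand-alone demonstration (not Pants or Pex code):

```python
import signal
import sys
import time


def cleanup_and_exit(signum, frame):
    # Runs for SIGTERM/SIGQUIT: the process gets a chance to remove any
    # partially written work directories before exiting.
    sys.stderr.write(f"caught signal {signum}, cleaning up\n")
    sys.exit(1)


signal.signal(signal.SIGTERM, cleanup_and_exit)
signal.signal(signal.SIGQUIT, cleanup_and_exit)

# There is no handler for SIGKILL: the kernel stops the process immediately,
# no cleanup code runs, and whatever files were mid-write stay exactly as they
# were. Safety therefore has to come from the filesystem operations themselves
# (e.g. a final atomic rename), not from cleanup code in the child.
while True:
    time.sleep(1)  # `kill -TERM <pid>` triggers cleanup; `kill -KILL <pid>` cannot.
```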
Aha, Ok. I'm pretty sensitive at this point to wasting people's time. Perhaps reporters have not gone through the suggested steps, but I seem to recall they've been down this road before. Basically in a cycle of "maybe it's X, please collect ~data", repeat. I've probably spent ~1 full work week (at least) staring hard at Pex code, adding debug output and knobs to releases, etc. to get to the bottom of this from the Pex side re multi-thread or multi-process errors in atomic directory logic or usage, and I've come up blank. The only real fix so far has come from a much smaller effort on my part investigating Pants code, where I fixed the […]. I'll have a look again at the Pants side later today since I'm fully juiced for ideas on the Pex side. I've noted to @danxmoran before the more eyes the better, but the key Pex code is here: https://github.com/pantsbuild/pex/blob/main/pex/atomic_directory.py
The things I would keep fresh in mind, personally, are that this only occurs: […]
There's something we never put on the table (which I REALLY hope isn't the case), but I'm pretty sure these venvs have R+W perms. So technically speaking a rogue […]
I just noticed that we're now getting BuildSense links for these errors (vs. when I first reported the problem and the error was hitting before the links were logged). Here's the latest one: https://app.toolchain.com/organizations/color/repos/color/builds/pants_run_2023_03_29_11_21_12_96_a150c7d599f6461b818e7e7c8cc63ef3/. Looking at the […] Would it be useful for me to set […]
Some new notes from the most recent occurrence, with some hands-on time before our instance got terminated: […]
Then I remove the […] and re-run: […]
So ignoring Pants for a moment, I can reproduce the issue just from the sandbox by running the script 🎉 Here's the contents of the bad venv: […]
Running the PEX outside of the run script DOES NOT show the issue: […]
Additionally, I strongly suspect something is poisoned about the pex root and we've all been barking up the wrong tree, but John can help rule out any pex issues with the pex itself. I'll share privately in a DM on Slack.
I think I found a smoking gun today: […]
whereas if I point to a different […]
So it's a poisoned cache. Unclear when it got poisoned though. It could be in this run, if this is the first run with that […]
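For anyone triaging a similar report, the same check can be scripted from a preserved sandbox; a hedged sketch where the pex path and arguments are placeholders and `PEX_ROOT` is Pex's standard cache-location override:

```python
import os
import subprocess
import sys
import tempfile

PEX = "./dockerfile_parser.pex"  # placeholder: the pex left behind in the sandbox
ARGS = ["path/to/Dockerfile"]    # placeholder: whatever args the sandbox's run used

# Run against the default (possibly poisoned) PEX_ROOT.
default_run = subprocess.run([sys.executable, PEX, *ARGS], capture_output=True, text=True)

# Run against a brand-new PEX_ROOT, forcing every cached venv to be rebuilt.
with tempfile.TemporaryDirectory() as fresh_root:
    env = dict(os.environ, PEX_ROOT=fresh_root)
    fresh_run = subprocess.run(
        [sys.executable, PEX, *ARGS], capture_output=True, text=True, env=env
    )

# If the first run fails with the ImportError while the second succeeds, the
# problem lives in the cached venv under the original PEX_ROOT, not in the
# pex file itself.
print("default PEX_ROOT returncode:", default_run.returncode, file=sys.stderr)
print("fresh   PEX_ROOT returncode:", fresh_run.returncode, file=sys.stderr)
```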
Ok. The data @thejcannon provided was a second instance of data similar to what @danxmoran provided many months ago, and this was finally enough to switch the focus more sharply to remote-cache-hit cancellation of the VenvPex build process. Thank you both for being willing to collect data. My fundamental analysis error over these past 9 months was assuming the data loss was on the output side. It in fact looks like it was on the input side instead. I did not complete an analysis of the Pants code that sets up the sandbox the Pex CLI is run in to create the […]. Taking that to heart, but still focusing on the input side, the missing file is a user code file (the same file) in both reported cases. The Pex is a packed VenvPex. A packed VenvPex is a directory containing 4 classes of items: […]
When such a VenvPex is built, Pants passes […]. So, the […]. @stuhood has concurred this is a likely cause and has volunteered to take over providing a fix for this uncontrolled asynchrony on the Pants side. The idea being a Pants […]
…ed up (#18632) As @jsirois discovered and described in #16778 (comment), because the `local::CommandRunner` does not `wait` for its child process to have exited, it might be possible for a SIGKILL to not have taken effect on the child process before its sandbox has been torn down. And that could result in the process seeing a partial sandbox (and in the case of #16778, potentially cause named cache corruption). This change ensures that we block to call `wait` when spawning local processes by wrapping them in the `ManagedChild` helper that we were already using for interactive processes. It then attempts to clarify the contract of the `CapturedWorkdir` trait (but in order to improve cherry-pickability does not attempt to adjust it), to make it clear that child processes should fully exit before the `run_and_capture_workdir` method has returned. Fixes #16778 🤞. --------- Co-authored-by: John Sirois <john.sirois@gmail.com>
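A minimal stand-in for the invariant that change enforces, using the Python standard library rather than the actual Rust `local::CommandRunner`: after sending SIGKILL you still have to `wait()` before tearing the sandbox down, otherwise the teardown can race the not-yet-dead child.

```python
import shutil
import signal
import subprocess
import sys
import tempfile

sandbox = tempfile.mkdtemp(prefix="sandbox-")
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(3600)"],
    cwd=sandbox,
)

# Cancel the work, e.g. because a remote cache hit arrived first.
child.send_signal(signal.SIGKILL)

# Without this wait() the parent could go on to delete the sandbox while the
# kill has been sent but not yet taken effect, so the child could still be
# running inside a partially deleted directory.
child.wait()

# Only now is it safe to remove the sandbox.
shutil.rmtree(sandbox, ignore_errors=True)
```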
(The fix was cherry-picked from commit 1462bef to the release branches in #18640, #18641, and #18642.)
Sorry to revisit this can of worms, but we just got hit with this on […]
Any chance there was a regression between when this was fixed and that version?
@engnatha the fix is here: https://pypi.org/project/pantsbuild.pants/2.15.1rc1/
(NOTE: There's been a lot of discussion on this in Slack here, but I realized I never filed an issue for it)
Describe the bug
A few times per day (out of ~hundreds of jobs), our CI jobs will fail when trying to build `dockerfile_parser.pex`. The error always hits at the beginning of goal execution, and affects all our goals (`test`, `lint`, `check`). The error looks like: […]
Pants version
`PANTS_SHA=5d8a328d72209863986c8959b20305505bc068ba` (`v2.13.0a0`), and the error has continued appearing as we've upgraded along the `2.13.x` branch.
OS
Linux
Additional info
Through debugging on Slack, we've found:
- `__pants_df_parser.py` truly isn't in the venv when this error occurs
- […] `--no-process-cleanup` […]