
Intermittent / nondeterministic failure building requirements.pex due to missing site-packages #2066

Closed
danxmoran opened this issue Feb 24, 2023 · 22 comments

@danxmoran (Contributor)

Our CI will occasionally fail with an error like:

Traceback (most recent call last):
  File "/home/color/.cache/nce/99421dee8fedc336d5a6bb8322fbfa602bc68fe63303ae702ec7b9c5672cd086/bindings/venvs/2.15.0rc6/lib/python3.9/site-packages/pants/engine/process.py", line 289, in fallible_to_exec_result_or_raise
    raise ProcessExecutionFailure(
pants.engine.process.ProcessExecutionFailure: Process 'Building 1 requirement for requirements.pex from the 3rdparty/lockfiles/resolves/pants-plugins.lockfile resolve: pantsbuild.pants<2.16,>=2.15.0a0' failed with exit code 1.
stdout:

stderr:
The virtualenv at /tmp/pants/named_caches/pex_root/venvs/4c85013f478e0393bbf8db8fcf02e1def7ff5031/ba7a55164c2afb363895254bbb1063124dd74d5b.lck.work is not valid. No site-packages directory was found in its sys.path:
/opt/python/3.9.16/lib/python39.zip
/opt/python/3.9.16/lib/python3.9
/opt/python/3.9.16/lib/python3.9/lib-dynload
/opt/python/3.9.16/lib/python3.9/site-packages

We've seen it hit across different resolves, Python interpreter versions, and Pex versions - the error above happened on Pex v2.1.122.

It happens very infrequently (once every few weeks).

I captured the Pants caches & execution dir for the error above. The archive is too big to attach via GitHub, but I can share it via Slack / GDrive upload.

danxmoran added the bug label Feb 24, 2023
@jsirois (Member) commented Feb 24, 2023

@danxmoran I assume you're still using the _PEX_FILE_LOCK_STYLE=bsd env var experiment hack? I.e., setting that env var and making sure it's plumbed through via [subprocess-environment] config.
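
For reference, a rough sketch of that plumbing in pants.toml, assuming the [subprocess-environment] scope's env_vars list option (exact option names and the .add syntax may vary by Pants version):

[subprocess-environment]
# Pass the experimental lock-style override through to Pex subprocesses.
env_vars.add = ["_PEX_FILE_LOCK_STYLE=bsd"]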

@danxmoran (Contributor, Author)

@jsirois correct. Here's the __run.sh extracted from the sandbox post-failure:

#!/bin/bash
# This command line should execute the same process as pants did internally.
export CPPFLAGS= LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LDFLAGS= PATH=$'/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tmp/buildevents/be/bin-linux:/tmp/be/bin-linux' PEX_IGNORE_RCFILES=true PEX_ROOT=.cache/pex_root _PEX_FILE_LOCK_STYLE=bsd
cd /tmp/pants/execution/pants-sandbox-rRJetG
/opt/python/3.9.16/bin/python3.9 ./pex --tmpdir .tmp --jobs 1 --python-path $'/opt/python/3.8.16/bin:/opt/python/3.9.16/bin' --output-file requirements.pex --no-emit-warnings --python /opt/python/3.9.16/bin/python3.9 $'--sources-directory=source_files' $'pantsbuild.pants<2.16,>=2.15.0a0' --lock 3rdparty/lockfiles/resolves/pants-plugins.lockfile --no-pypi $'--index=https://pypi.org/simple/' --manylinux manylinux2014 --layout packed

@danxmoran (Contributor, Author)

I mounted the failed sandbox into our CI image locally and re-executed __run.sh, and it passed. So sadly not a deterministic failure even with the captured state 😞

@danxmoran (Contributor, Author) commented Feb 24, 2023

Another thing I notice:

The error says the venv at /tmp/pants/named_caches/pex_root/venvs/4c85013f478e0393bbf8db8fcf02e1def7ff5031/ba7a55164c2afb363895254bbb1063124dd74d5b.lck.work is not valid, but that path doesn't actually exist in the cache dirs I captured from the failed node. I'm not sure if this is a clue or a red herring (i.e. whether the process cleans up after the failure):

$ ls /tmp/pants/named_caches/pex_root/venvs/4c85013f478e0393bbf8db8fcf02e1def7ff5031/
dfff76d25b9e7c347b251cb47472900c91ccd47c
fd12f23ec094a2fc40c08ba0cb22c53dd74b4917.lck.work
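
That transient .lck.work path is consistent with the general atomic-directory pattern: build into a scratch work dir while holding an exclusive lock, then rename the result into place and clean up. A rough Python sketch of that pattern (illustrative only - the file names and layout here are assumptions, not Pex's actual atomic_directory implementation):

import fcntl
import os
import shutil

def atomic_create(target_dir, populate):
    # Illustrative only: build under <target>.lck.work while holding an
    # exclusive flock on <target>.lck, then atomically publish via rename.
    lock_path = target_dir + ".lck"
    work_dir = lock_path + ".work"
    lock_fd = os.open(lock_path, os.O_CREAT | os.O_WRONLY)
    fcntl.flock(lock_fd, fcntl.LOCK_EX)
    try:
        if os.path.isdir(target_dir):
            return  # Another process finished the work while we waited on the lock.
        os.makedirs(work_dir, exist_ok=True)
        populate(work_dir)
        os.rename(work_dir, target_dir)  # After this, the .lck.work path is gone.
    finally:
        shutil.rmtree(work_dir, ignore_errors=True)  # No-op on success, cleanup on failure.
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)

Under that kind of scheme a .lck.work dir only exists while a create is in flight (or after a crash), so its absence in the captured caches may not say much either way.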

@jsirois (Member) commented Feb 24, 2023

Ok, thanks @danxmoran. I'm completely baffled as to how this happens with a BSD lock. I'll dig back in this weekend to check my understanding again and report back.

jsirois self-assigned this Feb 24, 2023
jsirois mentioned this issue Feb 25, 2023
@jsirois (Member) commented Feb 26, 2023

@danxmoran can you remind me of your CI setup? As I recall, Docker / k8s is involved. Is /tmp/pants/named_caches/pex_root/ shared between containers via volume mounts in some way, does each running kernel get its own non-shared /tmp/pants/named_caches/pex_root/, or am I just wrong here and there is only one running kernel?

@danxmoran (Contributor, Author)

We're in the middle of switching from self-hosted GHA on k8s to Docker CircleCI executors - we've seen the issue on both, but more often in Circle. AFAIK /tmp isn't anything special in that runtime; it's definitely not shared across concurrent jobs. In GHA I think it's a volume mount to a tmpdir on the underlying host, but still not shared between jobs.

@jsirois (Member) commented Feb 26, 2023

So, just to reiterate, and focusing on the k8s setup: /tmp/pants (or /tmp/pants/named_caches, or /tmp/pants/named_caches/pex_root) is a mount from, say, /tmp/<unique-for-container-mounted-into> on the host?

@danxmoran (Contributor, Author) commented Feb 26, 2023

Double-checked and I was wrong: /tmp is not mounted anywhere in our k8s setup. So in both CI environments /tmp/pants is just a dir in the container FS.

@jsirois (Member) commented Feb 26, 2023

Ok, thanks. I continue to be fully stumped. I have a few experiments left to try though.

@jsirois (Member) commented Feb 27, 2023

@danxmoran if you can run this in the CI container you use - substituting the full path of the Python interpreter your CI runs with - it will provide a sanity check that that Python thinks it has flock and is not falling back to fcntl emulation:

python -c 'import sysconfig; print(sysconfig.get_config_var("HAVE_FLOCK"))'

It's easy to get the same process owning a lock 2x under fcntl; I can find no way to get this to happen under flock, given each thread opens its own private fd to lock with in the atomic_directory code.
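
For the curious, a small self-contained demonstration of that difference using Python's fcntl module directly (Linux/macOS; this is not Pex's locking code):

import errno
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.lck")

# POSIX (fcntl/lockf) record locks are owned by the process: a second
# exclusive lock taken by the same process on a second fd just succeeds.
fd1 = os.open(path, os.O_CREAT | os.O_WRONLY)
fd2 = os.open(path, os.O_CREAT | os.O_WRONLY)
fcntl.lockf(fd1, fcntl.LOCK_EX | fcntl.LOCK_NB)
fcntl.lockf(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)  # No error: the process already "owns" the lock.
print("fcntl: second lock in the same process succeeded")
os.close(fd2)
os.close(fd1)

# BSD (flock) locks are owned by the open file description: each private fd
# is an independent lock holder, so the second attempt conflicts even within
# a single process.
fd3 = os.open(path, os.O_CREAT | os.O_WRONLY)
fd4 = os.open(path, os.O_CREAT | os.O_WRONLY)
fcntl.flock(fd3, fcntl.LOCK_EX | fcntl.LOCK_NB)
try:
    fcntl.flock(fd4, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError as e:
    assert e.errno in (errno.EAGAIN, errno.EACCES)
    print("flock: second lock in the same process was refused, as expected")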

The other sanity check is to triple-confirm that no part of /tmp/pants/named_caches/pex_root is an NFS mount. Linux will silently convert a flock lock to a fcntl lock for NFS files / file descriptors.
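
One quick way to check that from inside the container, for example with GNU coreutils stat (prints the filesystem type backing the cache; an nfs-flavored answer would be a red flag):

stat -f -c %T /tmp/pants/named_caches/pex_root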

jsirois added a commit to jsirois/pex that referenced this issue Feb 27, 2023
This is needed to have independent POSIX fcntl locks in the same process
by multiple threads and also needed whenever BSD flock locks silently
use fcntl emulation (exotic systems or under NFS mounts). Pex is
designed to avoid multi-threading when using POSIX locks; so this just
serves as a design-error backstop for those style locks. For silently
emulated BSD locks, this provides correctness.

Analysis of pex-tool#2066 and pex-tool#1969 does not point to this enhancement solving any existing problems, but this is an improvement for the cases mentioned should we hit them.

Work regarding pex-tool#2066.
jsirois added a commit that referenced this issue Feb 27, 2023
@danxmoran (Contributor, Author)

python -c 'import sysconfig; print(sysconfig.get_config_var("HAVE_FLOCK"))'

Prints 1.

Running mount prints:

overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/165536.165536/overlay2/l/NAYNYJRZ35IFCEGD3ZUZGMHI7V:/var/lib/docker/165536.165536/overlay2/l/ZBHAAOGV4A67JDSHUNCF24TLHT:/var/lib/docker/165536.165536/overlay2/l/G6HRGVQ2CDLIGZV77KROQLKGAY:/var/lib/docker/165536.165536/overlay2/l/NSFI6BA32XT3DMMA2TZ2XU4HWH:/var/lib/docker/165536.165536/overlay2/l/3PKEILSR25DY27IP2Y5QSMMA44:/var/lib/docker/165536.165536/overlay2/l/KCLKKZ6ISQHPNJTR62U66FUDDW:/var/lib/docker/165536.165536/overlay2/l/UH75C7O5PNFT4NWAL3CEYQ4GM6:/var/lib/docker/165536.165536/overlay2/l/23I6UVH7OCKDWBN2KEEOI7QGUL:/var/lib/docker/165536.165536/overlay2/l/WX67R73JQE3MMTYO2APSZ2PLGE:/var/lib/docker/165536.165536/overlay2/l/QEWJBRRK3IWQ6HX2BJ5RWY34KR:/var/lib/docker/165536.165536/overlay2/l/RG52JL7EOAHOWASOGQJX3Q3ZM5:/var/lib/docker/165536.165536/overlay2/l/X6RSP7NA4GFRDTWA66M2OBFC32:/var/lib/docker/165536.165536/overlay2/l/TLGJBHZ7DH24Q2XLL6BB3KAAK4:/var/lib/docker/165536.165536/overlay2/l/PMKQDJ7VJDCUVO7AOBXJHLPXID:/var/lib/docker/165536.165536/overlay2/l/LZY5LSWKB2VE4DMW3QSWDGOHOG:/var/lib/docker/165536.165536/overlay2/l/JKBUQBQ7DWE6UUQNCWDGF2AJ5J:/var/lib/docker/165536.165536/overlay2/l/B6EW62XJJXBEK3TS4LHQ5Y7J75:/var/lib/docker/165536.165536/overlay2/l/GC5A5HHWC74YGFLWY6XSNC2K4P:/var/lib/docker/165536.165536/overlay2/l/2CNETANW3XZYP6KFZNR3KLLV6P:/var/lib/docker/165536.165536/overlay2/l/56NYY5Q3QEWMOWIUYCYH3SPPDN:/var/lib/docker/165536.165536/overlay2/l/M7FEFT3KZQYPERECSFTE7NANGL:/var/lib/docker/165536.165536/overlay2/l/EZPB7U4FISDX3ON6XOI6VY55YQ:/var/lib/docker/165536.165536/overlay2/l/ALMJKPVMJV7WNTRGFY26JN6H4Y:/var/lib/docker/165536.165536/overlay2/l/MARQTMFH3HEF26CNVEKS2RM5GU:/var/lib/docker/165536.165536/overlay2/l/NYXRBOM2MEL3YXIRNCIXO37AZX:/var/lib/docker/165536.165536/overlay2/l/TKUERHZSIKOI2PPZTII7IWYCIL:/var/lib/docker/165536.165536/overlay2/l/ZKALR5XQUT7H56AEBJ62FHEMHJ:/var/lib/docker/165536.165536/overlay2/l/YZXORHJDAB3SDRDCS6VBSCGNWD:/var/lib/docker/165536.165536/overlay2/l/UJPT4ON6ZTC4I35XDY6IOS4PW3:/var/lib/docker/165536.165536/overlay2/l/TYD3EYNCKV6A2QCNATFIOIO37K:/var/lib/docker/165536.165536/overlay2/l/RL5WWY3ANEDFIZFGSW2TAC5VQH:/var/lib/docker/165536.165536/overlay2/l/HGJZXKQZE4TGXXCVHN4QOD7G3Z:/var/lib/docker/165536.165536/overlay2/l/REN45JMT2P2T2BZCKQ5HQRIKQZ:/var/lib/docker/165536.165536/overlay2/l/ZNO7EE54YAJURLEO5XN5BH7BWU:/var/lib/docker/165536.165536/overlay2/l/NFAUUH2QKY6OUOQBWBFPN7QYUS:/var/lib/docker/165536.165536/overlay2/l/YPU5ALQY7EVXZGZDCCT3JJB5SV:/var/lib/docker/165536.165536/overlay2/l/UBZDOVCHBHTE74PMM2VONDD6LU:/var/lib/docker/165536.165536/overlay2/l/P2GGL3TIXQGTQGRZCCESZKAYZF:/var/lib/docker/165536.165536/overlay2/l/MBD3OM3GEPEEE5S3MCWNUR3QQO,upperdir=/var/lib/docker/165536.165536/overlay2/cf5f8434feddcc4698a0922aa671acd877568168431f59bf17bf96380668208d/diff,workdir=/var/lib/docker/165536.165536/overlay2/cf5f8434feddcc4698a0922aa671acd877568168431f59bf17bf96380668208d/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755,uid=165536,gid=165536,inode64)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=165541,mode=620,ptmxmode=666)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755,uid=165536,gid=165536,inode64)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/misc type cgroup (rw,nosuid,nodev,noexec,relatime,misc)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
/dev/root on /usr/sbin/docker-init type ext4 (ro,relatime,discard)
/dev/nvme1n1 on /etc/resolv.conf type xfs (rw,relatime,attr2,discard,inode64,logbufs=8,logbsize=32k,prjquota)
/dev/nvme1n1 on /etc/hostname type xfs (rw,relatime,attr2,discard,inode64,logbufs=8,logbsize=32k,prjquota)
/dev/nvme1n1 on /etc/hosts type xfs (rw,relatime,attr2,discard,inode64,logbufs=8,logbsize=32k,prjquota)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=50331648k,inode64)
tmpfs on /mnt/ramdisk type tmpfs (rw,nodev,relatime,size=50331648k,uid=165536,gid=165536,inode64)
devtmpfs on /dev/null type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
devtmpfs on /dev/random type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
devtmpfs on /dev/full type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
devtmpfs on /dev/tty type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
devtmpfs on /dev/zero type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
devtmpfs on /dev/urandom type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
devpts on /dev/console type devpts (rw,nosuid,noexec,relatime,gid=165541,mode=620,ptmxmode=666)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime,uid=165536,gid=165536,inode64)
devtmpfs on /proc/kcore type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
devtmpfs on /proc/keys type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
devtmpfs on /proc/timer_list type devtmpfs (rw,relatime,size=35948728k,nr_inodes=8987182,mode=755,inode64)
tmpfs on /proc/scsi type tmpfs (ro,relatime,uid=165536,gid=165536,inode64)
tmpfs on /sys/firmware type tmpfs (ro,relatime,uid=165536,gid=165536,inode64)

I don't see any mention of /tmp (and I confirmed that /tmp isn't a symlink to anything else in this system), but all of / is an overlay mount. Would that have similar effects?

@jsirois (Member) commented Feb 27, 2023

Thanks @danxmoran. I'll be running some experiments using overlay2 today. The kernel notes indicate some POSIX non-compliance, the most interesting bit being how an inode can differ for a file if it starts in a lower layer and is copied up to the upper layer - this would allow the same path to be treated as two different paths and foil locking. That said, the lower layer is read-only and would not have a lock file in it IIUC; i.e., I think /tmp is fully in the upper layer in your setup. Also, the lock file is opened write-only, which would necessitate it being copied up to the upper layer during the open call of each and every attempt to lock it IIUC; so this bit of POSIX non-compliance seems to be ruled out as an issue here.
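
If it's useful, here is a small probe for that copy-up/inode behavior. Run it inside the container against any file that ships in a lower image layer (the path below is just an example). Whether the inode actually changes depends on kernel version and overlayfs mount options, so treat the output as a property of your environment rather than a general rule:

import os

path = "/usr/lib/os-release"  # Example: any file baked into the image, not a bind mount.

before = os.stat(path).st_ino
with open(path, "a"):  # Opening for write forces an overlayfs copy-up to the upper layer.
    pass
after = os.stat(path).st_ino

print("inode before copy-up:", before)
print("inode after copy-up: ", after)
print("inode changed" if before != after else "inode stable")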

Even though it should have no bearing on your issue, you might try upgrading to Pex 2.1.124 and reporting back in a few weeks if you're game:

[pex-cli]
version = "v2.1.124"
known_versions = [
  "v2.1.124|macos_arm64|5088d00bc89cfaac537846413d8456caa3b2b021d9a5ce6b423635dd1a57b84c|4077988",
  "v2.1.124|macos_x86_64|5088d00bc89cfaac537846413d8456caa3b2b021d9a5ce6b423635dd1a57b84c|4077988",
  "v2.1.124|linux_x86_64|5088d00bc89cfaac537846413d8456caa3b2b021d9a5ce6b423635dd1a57b84c|4077988",
  "v2.1.124|linux_arm64|5088d00bc89cfaac537846413d8456caa3b2b021d9a5ce6b423635dd1a57b84c|4077988"
]

@jsirois (Member) commented Feb 28, 2023

@danxmoran within the container image you experimented in above, there are 38 (!) layers. If you have a chance to check their contents for /tmp/* paths, that would be great. You'd need to use the container to hold the image open (the lowerdir entries like /var/lib/docker/165536.165536/overlay2/l/NAYNYJRZ35IFCEGD3ZUZGMHI7V are all ephemeral symlinks that only live while a container is using an image). Then, on the host, you'd need to sudo ls -l /var/lib/docker/165536.165536/overlay2/l/NAYNYJRZ35IFCEGD3ZUZGMHI7V/ for each lowerdir entry.

This should do the trick:

$ (docker run --rm ubuntu:22.04 sh -c 'mount | grep "overlay on /" | sed -r -e "s|^.*lowerdir=||" -e "s|,upperdir=.*||" | tr : "\n" && sleep 1000' &) | while read layer; do if sudo stat $layer/tmp &>/dev/null; then echo "Layer $layer has /tmp content" && sudo ls -l $layer/tmp/ && echo; fi; done
Layer /var/lib/docker/overlay2/l/SSXOI6N73TWHVGNDL5NCF2ZCUF has /tmp content
total 0

Here, the only layer with /tmp was /var/lib/docker/overlay2/l/SSXOI6N73TWHVGNDL5NCF2ZCUF, and it just had an empty /tmp dir. What I'm looking for is whether you get a /tmp hit with stuff in it, namely /tmp/pants.

Just replace the image name with yours.

@danxmoran (Contributor, Author)

I had some trouble running that exact command because of the extra indirection in Docker Desktop for Mac's setup, but I was able to walk through our image's layers using dive and I do see that one layer is adding some content to /tmp:

drwxrwxrwx         0:0     2.2 MB  ├── tmp
drwxr-xr-x         0:0     2.2 MB  │   └── v8-compile-cache-0
drwxr-xr-x         0:0     2.2 MB  │       └── 9.4.146.24-node.20
-rw-r--r--         0:0     2.2 MB  │           ├── zSoptzSyarn-v1.22.10zSbinzSyarn.js.BLOB
-rw-r--r--         0:0       88 B  │           └── zSoptzSyarn-v1.22.10zSbinzSyarn.js.MAP

@jsirois (Member) commented Feb 28, 2023

Ok, thanks. No /tmp/pants though, so that rules out lower-layer copy-up operations to the upper layer and the inode mismatch they can lead to.

@jsirois (Member) commented Feb 28, 2023

@danxmoran I'm once again out of ideas. I have not been able to repro in a container using overlay2 like this, and I have exhausted the logical possibilities I'm aware of. I'll leave this open, but I'm stopping work on it until new information becomes available.

jsirois removed their assignment Feb 28, 2023
@danxmoran (Contributor, Author)

@jsirois fair enough 😅 will let you know if we have any new insights

@jsirois (Member) commented Apr 13, 2023

@danxmoran does this style of error continue to occur? It's not immediately clear to me whether a recent Pants sandbox-cleanup fix could be related to this failure / backtrace.

@danxmoran (Contributor, Author)

We haven't seen it since pulling in the sandbox-cleanup fix.

jsirois self-assigned this Apr 13, 2023
@jsirois (Member) commented Apr 13, 2023

Ok, thanks. I'll do some analysis to see if that fix could conceivably be related here and close out with a comment if I can point out the link.

@jsirois (Member) commented Apr 14, 2023

Ok - it is the case that the OP error can be emitted in a racy sandbox scenario.

Right here a venv (and its site-packages dir) is created on line 199:
https://github.com/pantsbuild/pex/blob/7aa0a15e2332c57d1a3816d2f0b7322cd8fb8b7e/pex/venv/virtualenv.py#L191-L201

If the sandbox then gets nuked, the constructor code on line 201 will fail to find the just-created site-packages dir.

I'll close this as fixed / solved by the Pants racy-sandbox fix; in other words, this is a won't-fix here in Pex - if you externally mutate either the PEX_ROOT or a directory you tell Pex to create a venv in, it's on you.
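
A minimal illustration of that failure mode on Linux (this just simulates the external cleanup racing a fresh venv; it is not Pex's code):

import os
import shutil
import subprocess
import tempfile
import venv

venv_dir = os.path.join(tempfile.mkdtemp(), "venv")
venv.EnvBuilder(with_pip=False).create(venv_dir)
python = os.path.join(venv_dir, "bin", "python")

# Simulate an external actor (e.g. sandbox cleanup) nuking the venv's
# site-packages right after the venv was created.
site_packages = subprocess.check_output(
    [python, "-c", "import sysconfig; print(sysconfig.get_paths()['purelib'])"],
    text=True,
).strip()
shutil.rmtree(site_packages)

# A later check that expects site-packages on the venv's sys.path now comes up
# empty - the same symptom as the OP error.
sys_path = subprocess.check_output(
    [python, "-c", "import sys; print('\\n'.join(sys.path))"], text=True
)
print("site-packages on sys.path:", site_packages in sys_path)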

jsirois closed this as completed Apr 14, 2023