Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pip 19.x --upgrade breaks on nfs #6327

Closed
ntextreme3 opened this issue Mar 11, 2019 · 23 comments · Fixed by #11394
Closed

pip 19.x --upgrade breaks on nfs #6327

ntextreme3 opened this issue Mar 11, 2019 · 23 comments · Fixed by #11394
Labels
type: bug A confirmed bug or unintended behavior

Comments

@ntextreme3
Copy link

Environment

  • pip version: 19.x
  • Python version: 2.7, 3.6
  • OS: Linux

Description
pip19 uses AdjacentTempDirectory, where previous pip versions used tempfile. Because of nfs "silly rename", if working on an nfs mount and pip is called via a subprocess from a python script that happens to use one of the packages being upgraded, pip fails when trying to cleanup (since the C extensions are still mapped to interpreter -- can be seen in /proc/$PID/maps)

Expected behavior

Expect this either to not fail (as in previous versions of pip) or just raise more appropriate exception. Right now, this causes pip to enter this error handler (that seems to be designed for windows). chmod'ing the path to 200 works for Windows, but is not enough to delete on Linux. The chmod done here ultimately causes pip to raise Permission Error. Maybe that should be chmod to stat.S_IRWXU?

How to Reproduce

in some nfs location, install simplejson (or anything that uses a C extension)

python3.6 -m virtualenv env
source env/bin/activate
pip install simplejson==3.16.0

then import it, and call pip in a subprocess

import subprocess
import sys

import simplejson  # <-- causes this to break


INSTALL_COMMAND = [
    sys.executable,
    "-m",
    "pip",
    "install",
    "--disable-pip-version-check",
    "--upgrade",
    "--force-reinstall",
    "simplejson==3.16.0",
]
print("return code:", subprocess.Popen(INSTALL_COMMAND).wait())

You can also run the script interactively and see the mapped files. I saved above code as 'mini.py'

$ python -i mini.py &
# wait for the pip process to finish
$ cat /proc/$!/maps | grep nfs
# will show the .nfs* files

Output

Looking in indexes: https://...
Collecting simplejson==3.16.0
Installing collected packages: simplejson
  Found existing installation: simplejson 3.16.0
    Uninstalling simplejson-3.16.0:
      Successfully uninstalled simplejson-3.16.0
Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: 'env/lib/python3.6/site-packages/~implejson'
Consider using the `--user` option or check the permissions.

return code: 1

Taking a look at that location:

$ ls env/lib/python3.6/site-packages/~implejson
ls: cannot open directory 'env/lib/python3.6/site-packages/~implejson': Permission denied
$ chmod 700 env/lib/python3.6/site-packages/~implejson
$ ls -a env/lib/python3.6/site-packages/~implejson
.  ..  .nfsxxxxxxxxxxxxxxxxxxxxxxxx  # <-- some nfs silly-renamed file
Paste the output of the steps above, including the commands themselves and
pip's output/traceback etc.
@cjerdonek
Copy link
Member

This looks like the same as issue #6321 which was recently filed, but this issue has more information. Thanks a lot for the detailed report and analysis, @ntextreme3!

@cjerdonek cjerdonek added the type: bug A confirmed bug or unintended behavior label Mar 11, 2019
@zooba
Copy link
Contributor

zooba commented Mar 11, 2019

So... is this basically saying that Windows can rename an open DLL but can't delete it, and Linux+NFS can delete an open SO but can't rename it? That's... unfortunate.

In theory, the fix should be to swallow the failure to delete the folder (turn it into a warning rather than a crash). It's already been renamed and is out of the way - Python can't import it, and pip won't try to reuse the filename in future. All by design.

Preventing this scenario from happening in the first place is beyond my expertise. All pip has done is rename a directory, which has apparently succeeded. I have no idea why NFS behaves this way.

@cjerdonek
Copy link
Member

@zooba A couple questions: why wasn’t this issue happening before, and what were the changes meant to fix? Does this mean it would be better to be using the old code path in some situations if this particular case is a regression?

@zooba
Copy link
Contributor

zooba commented Mar 11, 2019

No idea why it wasn't happening before.

The changes fixed the performance impact of copying files multiple times to move them out of the way vs. renaming the directory, and inadvertently fixed (in most cases) the problem of hitting the maximum path length on Windows.

Given the files are already properly out of the way at this point, it's hard to choose a different code path. If you want to keep the old one on Linux because performance doesn't matter so much there, I'm not going to put up a fight, but copying files on Windows is far slower than renaming a directory and I'd rather keep that impact minimized there :)

@cjerdonek
Copy link
Member

No idea why it wasn't happening before.

Could it be because of what the original poster said? Namely--

pip19 uses AdjacentTempDirectory, where previous pip versions used tempfile. Because of nfs "silly rename", if working on an nfs mount ... pip fails when trying to cleanup

I'm thinking of a scenario where AdjacentTempDirectory would result in the files being renamed to an NSF share, whereas tempfile would result in the files being written to a local directory, thus avoiding the issues with NFS. Is that possible?

@zooba
Copy link
Contributor

zooba commented Mar 12, 2019

Apparently it is possible :) Hence my comment about NFS allowing you to delete files in use but not rename them (since the old method was deleting it rather than renaming its parent directory)

But I know nothing about NFS apart from what's been said in this thread. Maybe put out a call for help on Twitter to find someone who knows what it's doing?

@ntextreme3
Copy link
Author

It's not the delete vs rename, it's that the old method was deleting from temp directory (non-nfs in my case) which works. And the new one does the rename-then-delete from Adjacent (nfs, in my case).

When you delete the file in use on nfs, it gets renamed to .nfsXXX... (and then the nfs client should clean it up when no more processes are using it). The problem is that the directory containing that .nfs* (the AdjacentTempDir) is not fully empty while the process is running, so the directory can't be deleted.

@zooba
Copy link
Contributor

zooba commented Mar 12, 2019

It's not the delete vs rename, it's that the old method was deleting from temp directory (non-nfs in my case) which works.

It must still have been deleting it from the NFS as part of moving it to the temp directory. Otherwise the original file would be left behind.

@cjerdonek
Copy link
Member

Can the .nfsXXX... file be renamed (e.g. to a sibling of the AdjacentTempDir)? Or are renames also not permitted on those files?

@tgs
Copy link
Contributor

tgs commented Mar 20, 2019

Can the .nfsXXX... file be renamed (e.g. to a sibling of the AdjacentTempDir)? Or are renames also not permitted on those files?

Apparently no:

cat some-file-on-nfs | (sleep 100; cat > /dev/null)

In another terminal:

rm some-file-on-nfs
ls -a   # shows the file .nfs000000005bd10cc500000713
mv .nfs000000005bd10cc500000713 new-name

Result: mv: cannot move ‘.nfs000000005bd10cc500000713’ to ‘new-name’: Device or resource busy

@tgs
Copy link
Contributor

tgs commented Mar 21, 2019

Would it be helpful if I post an strace -e file from old and new versions of pip in this situation? I would use the demo script from the bug report and change INSTALL_COMMAND to include the strace call.

@ntextreme3
Copy link
Author

ntextreme3 commented Mar 21, 2019

That's what I used originally -- you can see here below (example with markupsafe instead of simplejson but it's the same -- anything that has a C extension basically)

The in-use directory is renamed. When it comes time to remove it, it's fails for Directory not empty then enters the error handler (I referenced in my first post), does chmod 200 (since the error handler is for Windows), does a bunch of retries, then raises the Permission Error. This is misleading since pip error handler is what caused the Permission Error (which causes people to start thinking they need --user or something).

Also, when pip was using /tmp, people were able to control the location with TMPDIR env var. I don't know how that was being used, but it seems there should be a way to still allow /tmp usage. Or, just always use old way when on nfs 😉 ... Orrr at least raise the Directory not empty error...

relevant strace section
rename("/env/lib/python3.6/site-packages/markupsafe/", "/env/lib/python3.6/site-packages/~arkupsafe") = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffd73bea8d0) = -1 EINVAL (Invalid argument)
write(1, "      Successfully uninstalled M"..., 48) = 48
... # bunch of lines
rmdir("/env/lib/python3.6/site-packages/~arkupsafe") = -1 ENOTEMPTY (Directory not empty)
stat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
chmod("/env/lib/python3.6/site-packages/~arkupsafe", 0200) = 0
rmdir("/env/lib/python3.6/site-packages/~arkupsafe") = -1 ENOTEMPTY (Directory not empty)
close(4)                                = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
lstat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
open("/env/lib/python3.6/site-packages/~arkupsafe", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
stat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
lstat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
open("/env/lib/python3.6/site-packages/~arkupsafe", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
stat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
lstat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
open("/env/lib/python3.6/site-packages/~arkupsafe", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
stat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
lstat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
open("/env/lib/python3.6/site-packages/~arkupsafe", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
stat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
lstat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
open("/env/lib/python3.6/site-packages/~arkupsafe", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
stat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
select(0, NULL, NULL, NULL, {0, 500000}) = 0 (Timeout)
lstat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
open("/env/lib/python3.6/site-packages/~arkupsafe", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)
stat("/env/lib/python3.6/site-packages/~arkupsafe", {st_mode=S_IFDIR|0200, st_size=4096, ...}) = 0
ioctl(2, SNDCTL_TMR_TIMEBASE or SNDRV_TIMER_IOCTL_NEXT_DEVICE or TCGETS, 0x7ffd73beb050) = -1 EINVAL (Invalid argument)
write(2, "Could not install packages due t"..., 233) = 233

@cjerdonek
Copy link
Member

cjerdonek commented Mar 21, 2019

Personally, I'm still weighing how to proceed. What options do we have? I know one option is to swallow those errors. For that option, would we be able to detect just that error (and not swallowing similar errors that we'd want to bubble up), and what consequences would that concretely have?

Alternatively, is there a way to detect this scenario and fall back to using normal tempfile instead of an AdjacentTempDirectory on NFS?

Any other options? [Edit: I didn't write this comment until after reading @ntextreme3's comment which does describe some more options.]

@zooba
Copy link
Contributor

zooba commented Mar 22, 2019

Also, when pip was using /tmp, people were able to control the location with TMPDIR env var. I don't know how that was being used, but it seems there should be a way to still allow /tmp usage.

To be clear, what pip was doing was a fairly expensive backup operation, which is made considerably cheaper without changing the semantics by doing an in-place rename. So by advocating for "configurability", you're also advocating for "much slower and less reliable uninstalls" (current issue notwithstanding - copying files across filesystems is in general less reliable than a single rename).

I'm not saying configurability is wrong, but both aspects have to be balanced. (Also note that the only way the rename can be done is by doing it in the directory you're uninstalling from - anything else introduces permission/ACL concerns or simply not being able to move files across drives.)

@cjerdonek
Copy link
Member

Maybe a good starting point on this issue would be to address the "raise a more appropriate exception" part, since that would help in any case and I think we can all agree.

@tgs
Copy link
Contributor

tgs commented Mar 28, 2019

Others may completely understand the situation already, but I didn't, so I'll write up a summary here (which I hope is accurate, please correct if not):

I looked at the strace for 18.x, and pip just gives up after the final rmdir('.../site-packages/simplejson') fails (with ENOTEMPTY because of the .nfs000 file). In 18.x, this is not a problem, because it just proceeds to copy all the files into that same .../site-packages/simplejson directory, and the .nfs000 file eventually gets removed by the filesystem driver, leaving everything as it should be.

But the idea with the Adjacent temp dir is that it wants to install to a temp directory that's right there, and then rename that entire temp directory to .../site-packages/simplejson. Renaming is a lot faster than copying. But renaming over the top of the existing directory doesn't work, of course. So that's why the existing directory gets moved out of the way. Somewhere in here, the incorrect Windows error handler happens, and does a chmod that is counterproductive on unixy systems (it's designed to fix .svn directories on windows).

@tgs
Copy link
Contributor

tgs commented Mar 28, 2019

A couple of ideas about getting past the situation...

Step one, it should stop triggering the windows error handler, or fix the chmod for unix compatibility or something. After that...

If pip renames the existing directory out of the way, could an atexit hook go back and try to delete it as pip is exiting? I don't have time today to test whether Python closes shared libraries early enough for that to work. I kind of suspect not, but who knows? It wouldn't work for the demo program, in any event, because the .so is opened by pip's parent process, not pip itself.

Rename the unwanted directory to a new directory called .../site-packages/.pip-trash/, and opportunistically rmtree that directoy?

Another option would be to rename the files into place piecewise. If a target directory already exists, try renaming the children into place inside it. If one of them already exists (e.g. a shared library is in a package two levels deep), recursively do the same thing. This would be annoying to code, but still faster than copying everything, I assume.

@zooba
Copy link
Contributor

zooba commented Mar 28, 2019

But the idea with the Adjacent temp dir is that it wants to install to a temp directory that's right there, and then rename that entire temp directory

That's the eventual hope, yeah, but that's not what happens right now. It's only implemented for uninstallation - so basically the rename of the directory fails because one of the files inside is locked open, and presumably NFS does a copy instead of that file and renames the locked one to .nfs… (but it could have deleted the file in use just fine, apparently). Then pip tries to finish removing the directory when uninstallation succeeds, and it can't because it's not empty, and it doesn't handle that error so it crashes.

The weird part is that the Windows file system behaves the opposite here, where it can rename the file in use to anywhere on the same drive, but can't delete it. And apparently most Linux file systems behave like this too (or at least work with this assumption). So you can't protect against both scenarios with one approach. I'd be fine with detecting Windows and using the slower approach on other platforms (though I do have Linux machines at work where we care about the slowness of installation, but those won't suffer from the "trying to uninstall something in use" problem).

The point of renaming the unwanted directory by replacing the leading characters in its name is to avoid making the path longer, as that's the cause of a whole other range of issues. Again, I'd be fine with detecting Windows to handle this case and using a different approach elsewhere.

(Just reread the post and realised you may have missed that it renames the directory first, and then tries to install the new version (if any), and then rmtrees the renamed directory. Similarly, on rollback, it should rmtree the new version (if any) and be able to rename the preserved directory without having to individually copy each file. Also, atexit is not going to work any better here - the best option is to start a new process that spins until the directory is deleted after the original one has exited, but if the file is locked by a totally different process then it may be a while before the directory fully goes away, and who knows what you've deleted in the meantime?)

@cjerdonek
Copy link
Member

cjerdonek commented Mar 28, 2019

I’ll repeat that I think the exception-masking issue should be fixed first as it shouldn’t be happening in any case.

@berrisfordjohn
Copy link

Did this issue get addressed?

@pradyunsg
Copy link
Member

No.

lpsinger added a commit to lpsinger/gwcelery that referenced this issue Aug 14, 2019
Pip leaves behind directories with names starting with `~` in the
site-packages directory due to a bug that occurs when uninstalling
packages from an NFS filesystem.

See pypa/pip#6327.

Fixes #206.
@kottmast
Copy link

I think I'm hitting this issue, but I'm confused about the affected version and/or I'm hitting something related. I'm on an ubuntu machine, with nfs mounted /home and trying to create a conda environment from a environment.yml file. The conda part of the installation works, but pip fails on version 19.3.1 with the .nfsxxx device or resource is busy error. I tried downgrading to several previous versions of pip, but it then produces the "directory is not empty" error. Has this ever worked? Is there some workaround available?

@4j4y
Copy link

4j4y commented Apr 20, 2022

@kottmast were you able to find a fix for it? we are using pip 20.x and facing this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
type: bug A confirmed bug or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants