Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Symlink (and other) handling of archives #5919

Closed
waveform80 opened this issue Oct 25, 2018 · 14 comments
Closed

Symlink (and other) handling of archives #5919

waveform80 opened this issue Oct 25, 2018 · 14 comments
Labels
state: needs discussion This needs some more discussion

Comments

@waveform80
Copy link
Contributor

Following on from #5848 I've been looking into pip's handling of symlinks in zips (and tars). I've got some experiments to PR, but I think it might be more useful to have a quick conversation about what pip actually wants to do with symlinks in archives it handles, if anything.

The current state

At present, pip handles symlinks present in tar archives (largely because Python's TarFile handles them). By contrast, pip doesn't handle symlinks in zips properly (again, largely because Python's ZipFile doesn't); it extracts them as a regular file containing the name of the target. However, there is enough information in the file-attributes in a zip to accurately reconstruct symlinks (demonstrably as the infozip tools do this).

Then there's the OS differences. On UNIX-like platforms, symlinks (in tars) will be re-constructed normally. However, Windows presents certain challenges:

  1. Symlink support only appeared in Vista (is XP still a concern?)
  2. More importantly, it's a privileged operation (there's a capability for it, but it's only granted to administrators by default)

Hence, if pip is running as Administrator on a recent (Vista onwards) version of Windows, symlinks (in tars) will be re-constructed as symlinks. Otherwise, they'll be extracted as a regular file containing the target's contents (i.e. the target file will be duplicated under the symlink's name).

Questions

  1. Do we want pip to treat symlinks in all archives equally? I'm assuming that the current disparity between tars and zips should be corrected. Which leads onto...
  2. Do we want pip to handle symlinks at all? Given that support on Windows is unlikely to work (without administrative privileges, and I don't think it's normal to run pip with administrative privileges on Windows?), there's several options:
  • The simplest option: treat the presence of symlinks as an error and fail if we find one
  • The "only on supported platforms" option: attempt to create the symlink and if it fails, throw an error (or log a warning?)
  • The "always work" option: attempt to create the symlink; if it fails, extract the target file again under the symlink's name (as mentioned above, this is what pip's untar_file currently does via TarFile)

Other stuff

Basically, I think unzip_file and untar_file need a bit of a re-work to make them both consistent with each other. While I'm at it there's a few other things that I'd like to fix:

  1. I'm not particularly happy that untar_file handles symlinks by calling an "internal" method of TarFile (_extract_member 1); ideally I'd like to re-work that to avoid such calls. I admit it's unlikely to change (it's been there almost unchanged since 2.7) but still, I don't like relying on undocumented methods
  2. ZipFile.extract 2 fixes illegal characters in filenames when extracting in Windows (oddly TarFile.extract doesn't). Might be useful to add this functionality too - especially if the "always work" option is selected (as presumably the expectation there is that all archives should extract successfully on all platforms)
  3. What about hard-links, FIFOs, and devices (all perfectly valid in a tar, and in some cases, zips)? My gut instinct is: throw an error, or possibly just a warning?
  4. Finally, I'd like to throw in some protection against absolute paths in archives (like ZipFile.extract 3). Currently pip doesn't guard against this, and given it's occasionally run as root on UNIX-like systems that's a bit dangerous
@pfmoore
Copy link
Member

pfmoore commented Oct 25, 2018

Personal opinion (from a Windows background, so I tend to think of symlinks as a bad thing): we should treat symlinks as errors. I don't honestly see why there's a need for symlinks in a Python sdist at all (I assume that's when this code gets used?) and if we are to support them I'd like to see a compelling use case for having them that can't be handled any other way.

Bonus - with a good use case, we'd pretty immediately have a better understanding of "how to handle filesystems that don't support symlinks"[1] for free.

[1] Don't forget, that not all filesystems even on Unix support symlinks - consider smbfs, or fat32 on a USB drive - so it's a filesystem issue, not a Windows vs Unix one...

@dstufft
Copy link
Member

dstufft commented Oct 25, 2018

My only concern with erroring on symlinks, is that it might take packages that are already working and "break" them. it would be useful to try to audit PyPI to see how widespread the breakage would be. If it's bad then maybe we can emulate symlinks by copying the file instead of using a symlink across all platforms. If the breakage is small, then just start erroring.

@waveform80
Copy link
Contributor Author

Worth checking #5856 for an example of someone with a package trying to get symlinks to work (looks like (some?) package creation tools won't currently handle them, although pip's untar_file will).

I might try auditing piwheels in a bit to see if there's any wheels we've built with would-be symlinks in them - shouldn't take too much effort to whip up something that'll scan for 'em.

@pfmoore
Copy link
Member

pfmoore commented Oct 25, 2018

@waveform80 Wheels can't contain symlinks, because they are zipfiles and zipfiles don't support symlinks (at least the Python zipfile module doesn't, and the zipfile spec says nothing about them AIUI).

I'm a very strong -1 on allowing symlinks in wheels, certainly not without the PEP being updated to explain precisely how they should be handled on all platforms. Source distributions is another matter - while I remain -1, I'm fine with what @dstufft says, that we should review to determine if we risk breaking existing packages.

@waveform80
Copy link
Contributor Author

@pfmoore I'm rather on the fence about symlinks. Your arguments against them are good (and lead to a consistent treatment on all platforms) but at the same time I can see use-cases for them and there are plenty of packages on PyPI which are only intended for UNIX-like platforms (I'm not so concerned about file-systems that don't support symlinks on such UNIX-like; e.g. I don't think pip's generally concerned about supporting users attempting to install stuff on FAT32 under Linux!).

Basically, I'm happy to go with whatever the consensus turns out to be, but I do think the current situation isn't ideal and some work needs to be done on unzip_file (and untar_file).

As to whether zips can support symlinks ... the infozip tools on Linux handle them quite happily (storing and re-constructing them). On UNIX-like systems, the file-mode (including whether or not the member is a symlink, device, FIFO, etc.) is stored in the high 16-bits of the external-file-attributes which the zip spec simply defines as "...host-system dependent" (which can in turn be interpreted according to the version-made-by field, which includes a value for UNIX (3)). Now, as you note the Python ZipFile module doesn't handle them, but we could support them if we wanted to and there's nothing in the zip spec that forbids this.

Still, as noted above, I'm happy to go with the consensus and I certainly have some sympathy with the "just don't allow symlinks" position.

@cjerdonek
Copy link
Member

it would be useful to try to audit PyPI to see how widespread the breakage would be.

Is this easy for someone to do? What is involved?

@waveform80
Copy link
Contributor Author

waveform80 commented Oct 25, 2018

Something like this? https://gist.github.com/waveform80/7d0a74e59ab1d073908d99b7a4523c01

I'm running that on piwheels at the moment; we've only got wheels (obviously :) so I'm guessing it'll come up empty, but I've added in a branch for checking tars too (untested) in case anybody might find it useful for checking PyPI.

Update: left that running overnight on piwheels; as expected, no symlinks found (which most likely means that all the current methods of producing wheels won't currently attempt to store symlinks within them, or if they do, that they're stored as something other than a symlink, e.g. as a regular file). Anyway, I'd say with some high degree of confidence that means throwing an error in the event of finding a symlink in a wheel won't break any extant packages. However, that doesn't necessarily extend to sdists in zip format.

@cjerdonek cjerdonek added the state: needs discussion This needs some more discussion label Oct 27, 2018
@scopatz
Copy link

scopatz commented Apr 11, 2019

If a wheel is platform-specific, and that platform supports symlinks, then I think it makes a lot of sense for pip to support symlinks as well. Not supporting symlinks can lead to huge increases in filesize when dealing with shared object libraries.

Also there is the issue that you can't just duplicate the contents of a symlink if the symlink points to a file present in a dependency. This is because the dependency may not be around when constructing the wheel.

@cjerdonek
Copy link
Member

FYI, there is some recent discussion of this here: https://discuss.python.org/t/symbolic-links-in-wheels/1945

@grv87
Copy link

grv87 commented Nov 18, 2019

Do we want pip to handle symlinks at all? Given that support on Windows is unlikely to work (without administrative privileges, and I don't think it's normal to run pip with administrative privileges on Windows?)

I vote for supporting symlinks on Windows.
More dev tools and packages support and use (and require) them, more arguments to convince security to give developers this privilege.
It is already required by Google's repo.

@a-hurst
Copy link

a-hurst commented Dec 26, 2019

Just to chime in here: this would be immensely useful on macOS for bundling .frameworks, since they make heavy use of symlinks and thus balloon in size when getting zipped/unzipped by pip + wheel since this replaces all symlinks with copies of the files they point to (SDL2_mixer.framework goes from 4.6 MB to 40.2 MB after being run through bdist_wheel).

@virtuald
Copy link

If a wheel is platform-specific, and that platform supports symlinks, then I think it makes a lot of sense for pip to support symlinks as well. Not supporting symlinks can lead to huge increases in filesize when dealing with shared object libraries.

My sentiment exactly.

Note that #8368 will remove calls to setup.py install and force wheel creation in pip 21.0, thus breaking any packages that might depend on symlinks.

@pfmoore
Copy link
Member

pfmoore commented Oct 20, 2020

To support symlinks in wheels needs someone to go through the process of proposing a change to the standard. @cjerdonek linked here to the relevant discussion at the time, but no-one seems to have followed up on that, so there's been no progress since then.

@pradyunsg
Copy link
Member

I'm gonna close this off, since this needs a change to the wheel file format and that isn't a trackable change for pip's issue tracker.

For anyone interested in this, please see the discussion at https://discuss.python.org/t/symbolic-links-in-wheels/1945. Feel free to start a new thread there for pushing this forward if you're interested. If wheels get updated to support this, then there'd be a trackable task for pip here which would make sense for this issue tracker.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 26, 2021
@pradyunsg pradyunsg removed the S: needs triage Issues/PRs that need to be triaged label Mar 17, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
state: needs discussion This needs some more discussion
Projects
None yet
Development

No branches or pull requests

9 participants