Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pip installs a wheel even if --no-binary is given #12954

Open
1 task done
paugier opened this issue Sep 4, 2024 · 23 comments
Open
1 task done

pip installs a wheel even if --no-binary is given #12954

paugier opened this issue Sep 4, 2024 · 23 comments
Labels
state: awaiting PR Feature discussed, PR is needed type: feature request Request for a new feature

Comments

@paugier
Copy link

paugier commented Sep 4, 2024

Description

I like to install mpi4py with a new build (without using a wheel already locally built). I try to use --no-binary mpi4py but surprisingly a wheel is used.

(venv-mpi4py) [egi2153] augier@login2:~$ CC=mpicc pip install mpi4py --no-binary mpi4py
Looking in indexes: https://gorgone.cines.fr//root/pypi/+simple/
Collecting mpi4py
  Using cached mpi4py-4.0.0-cp311-cp311-linux_x86_64.whl
Installing collected packages: mpi4py
Successfully installed mpi4py-4.0.0

Note that there is a pip config file:

$ cat /etc/pip.conf
[global]
index-url=https://gorgone.cines.fr//root/pypi/+simple/
trusted-host=gorgone.cines.fr

Expected behavior

mpi4py should be compiled from source when --no-binary mpi4py is given.

See the help message:

  --no-binary <format_control>
                              Do not use binary packages. [...]

pip version

pip 24.2

Python version

3.11

OS

Linux

How to Reproduce

python -m venv venv-mpi4py
.  ~/venv-mpi4py/bin/activate
pip install --upgrade pip
pip install mpi4py --no-binary mpi4py

Output

$ pip install mpi4py --no-binary mpi4py
Looking in indexes: https://gorgone.cines.fr//root/pypi/+simple/
Collecting mpi4py
  Using cached mpi4py-4.0.0-cp311-cp311-linux_x86_64.whl
Installing collected packages: mpi4py
Successfully installed mpi4py-4.0.0

Code of Conduct

@paugier paugier added S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior labels Sep 4, 2024
@pfmoore
Copy link
Member

pfmoore commented Sep 4, 2024

If we built the wheel previously, and the source is identical, we'll re-use the cached wheel rather than spending the time doing a build which will give the same result. Even with --no-binary, as we're in principle still using the sdist. Could that be the case here?

@paugier
Copy link
Author

paugier commented Sep 4, 2024

Yes, it is exactly the case and the issue.

IMHO, it is for most packages with extensions the wrong behavior. There can be different reasons so that the same source would give different wheels, for example different compilers, different library implementations (like for MPI) or different build options.

If the user specifically states --no-binary mpi4py, pip should trust the user and rebuild from source.

@eli-schwartz
Copy link
Contributor

I personally would find it fairly surprising to pass an option that says "do not install binary packages" and have it be interpreted as "Do not use manylinux wheels from PyPI, but do use binary packages from elsewhere".

That being said, there's overall two reasons for no-binary:

  • as a form of cache-busting, because you want to reconfigure the mechanism the sdist uses for building and try again
  • distrust of "other people's binaries" / wanting to bootstrap the ecosystem from source for purity reasons

My gut feeling is that the latter group doesn't mind ignoring caches. They've already committed to the possibility of needing to perform potentially lengthy builds by passing no-binary at all, so trying to spare them the extra work of recompiling a wheel they previously successfully created isn't something they needed to happen -- and if it was successfully built once, it will most likely successfully build again. :)

On the other hand, the former group needs to have some way of achieving their goals, which appear to currently be entirely impossible? It seems like the safe bet is to make both cases work, even if that means making one case work less optimally. (Or I suppose add a new command-line option, but this doesn't seem appetizing.)

@notatallshaw
Copy link
Member

notatallshaw commented Sep 17, 2024

IMO this is a bug, the sdist should be rebuilt unless there is an explicit option pip could provide to build or not rebuild sdists.

Do maintainers disagree? Would a PR be accepted by someone willing to submit a sufficently high quality PR? (e.g. include explicit tests). I'm not volunteering (at least yet), just trying to triage this issue.

@pfmoore
Copy link
Member

pfmoore commented Sep 17, 2024

I'm uncertain. I don't buy the argument that it's wrong to use a cached build of a sdist. Maybe we should just improve the message - rather than saying "Using cached mpi4py-4.0.0-cp311-cp311-linux_x86_64.whl" we could say "Using mpi4py-4.0.0.tar.gz (skipping rebuild as we can use the previously cached result saved as mpi4py-4.0.0-cp311-cp311-linux_x86_64.whl)".

If the only use case is cache busting, maybe what we need is finer grained options than --no-cache-dir? Maybe all we need is --no-wheel-cache?

But having said this, I'm not going to block a sufficiently high quality PR. I just don't think it's necessarily the right direction to take.

@notatallshaw notatallshaw added state: awaiting PR Feature discussed, PR is needed type: feature request Request for a new feature and removed S: needs triage Issues/PRs that need to be triaged labels Sep 18, 2024
@notatallshaw
Copy link
Member

Maybe all we need is --no-wheel-cache?

Or maybe --no-built-cache <package>? Using the same syntax as --no-binary and --only-binary, as this is just about sdists being built, a package which is a wheel candidate should still use the wheel cache right?

Anyway, seem like this can be solved multiple ways, I've updated this to "awaiting PR".

@sbidoul
Copy link
Member

sbidoul commented Sep 19, 2024

I'm -1 on changing the behaviour of --no-binary with regards to using cached locally built wheels. The current approach is useful and, while it does not fit all use cases, I think it is a good default.

We can refine the caching keys, though (see for instance #11164).

I'm also open to some sort of per-distribution-name cache control. If this becomes the tracking issue for that, and that is what is expected in the awaited PR, we should consider changing the issue title.

As workaround in the current state of affairs, would pip cache purge <package> ; pip install <package> --no-binary <package> work for you?

@sbidoul sbidoul removed the type: bug A confirmed bug or unintended behavior label Sep 19, 2024
@sbidoul
Copy link
Member

sbidoul commented Sep 19, 2024

I'm removing the bug label, as things are working as designed.

@paugier
Copy link
Author

paugier commented Sep 19, 2024

Note that the name --no-binary is quite misleading if it actually means "Do not use manylinux wheels from PyPI, but do use binary packages previously compiled locally".

If we cannot change the current behavior, there should be an easy way to "(re)install a package by recompiling it from source".

Note that pip cache purge <package> ; pip install <package> --no-binary <package> (i) is very verbose and (ii) leads to a new download of the source which most of the time is not necessary.

@sbidoul
Copy link
Member

sbidoul commented Sep 19, 2024

Note that the name --no-binary is quite misleading if it actually means "Do not use manylinux wheels from PyPI, but do use binary packages previously compiled locally".

That is indeed what it means. I'd be happy to review a PR improving the documentation in that area.

is very verbose

Agreed, that is why I consider it a work around until a better solution is proposed and merged.

leads to a new download of the source which most of the time is not necessary.

Actually, it should not download because, if I remember correctly, pip cache purge only affects the cache of locally built wheels and not the http cache.

@eli-schwartz
Copy link
Contributor

Like I said above as well:

I personally would find it fairly surprising to pass an option that says "do not install binary packages" and have it be interpreted as "Do not use manylinux wheels from PyPI, but do use binary packages from elsewhere".

As far as "improving the documentation" goes, that documentation should also be updated in the output of pip install --help but overall it seems like a half measure to change the documentation when the flag itself has a misleading name... perhaps if it were renamed from --no-binary to --no-download-binary.

I'm -1 on changing the behaviour of --no-binary with regards to using cached locally built wheels. The current approach is useful and, while it does not fit all use cases, I think it is a good default.

I still don't really understand this. Do you use --no-binary and want cached locally built wheels? If so, what are your reasons for using the flag, and what are your reasons for considering cached locally built wheels to uphold the intent of that reason?

The behavior that you claim is useful and a good default doesn't seem very useful to me! If I was okay with locally built wheels I would also be okay with PyPI wheels...

@sbidoul
Copy link
Member

sbidoul commented Sep 19, 2024

perhaps if it were renamed from --no-binary to --no-download-binary

We could also alias the option, indeed, to give it a more precise name, I agree.

default doesn't seem very useful to me

I see at least these scenarios where this is a good default:

  • when there are no published wheels on PyPI (I still meet these daily)
  • when building from a VCS URL referring to an immutable commit id, even pure python packages
  • and the user repeatedly build with the same config settings

Caching by default benefits these kind of scenarios and it is my impression that these are the more common. Hard to prove, of course, and as I said, I absolutely agree this does not suit all use cases.

We changed how --no-binary work fairly recently as this was needed to remove the setup.py install code path, and the change satisfied some, and all in all, not that many people complained? So I don't think we should change it again, but rather clarify the current behaviour and refine cache control.

@eli-schwartz
Copy link
Contributor

I see at least these scenarios where this is a good default:

  • when there are no published wheels on PyPI (I still meet these daily)

Then why would you be passing --no-binary for this use case?

  • when building from a VCS URL referring to an immutable commit id, even pure python packages

Then why would you be passing --no-binary for this use case?

  • and the user repeatedly build with the same config settings

This isn't in and of itself a reason to pass --no-binary. If you passed --no-binary because you simultaneously wanted to "don't trust someone else's binaries", but also repeatedly build with the same config settings, then that scenario makes sense.

I explicitly mentioned this case above -- and I made the claim that I believe it is a generally unlikely case, and that this group of people also don't mind rebuilding each time, so even if it is useful to reuse built binaries, I don't think it's sufficiently useful to justify the current behavior.

@sbidoul
Copy link
Member

sbidoul commented Sep 19, 2024

Then why would you be passing --no-binary for this use case?

Hm, right, I might be confusing things right now. I'll try to set aside some time to recall the reasons we did it this way.

In any case won't we still end up with subjective views about the prevalence of one group of users or scenarios over another? So

clarify the current behaviour and refine cache control

is probably easier compared to making a breaking change with all the deprecation cycle etc.

@notatallshaw
Copy link
Member

notatallshaw commented Sep 19, 2024

One scenario where the current behavior is preferable is where you are trying to only "allow list" sdists from an index by doing the following: --no-binary :all: --only-binary <package> ... --only-binary :all: --no-binary <package> .... The intent here isn't to rebuild the sdist on every install, but to specify you only want to accept wheels from the index, but in the cases where that's not possible you explicitly allow it via --only-binary.

I general I would reccomend --prefer-binary over this approach, but there are cases where that isn't good enough, either because you want to manually review anything you allow to be an sdist, or a package previously published wheels and now only publishes sdists (there are real world cases of this).

@pfmoore
Copy link
Member

pfmoore commented Sep 19, 2024

In general, my impression is that people using --no-binary want to ensure that they build everything themselves, never relying on artefacts that might have been built by someone else (and which therefore might not be trustworthy). Re-using builds that have been done locally, on the same build machine, using the same tools as are being used for the current build, ius completely safe under that model, and so the default behaviour offers an acceptable way of getting significant performance improvements when doing a large build.

From the discussion here, it's clear that this isn't true for all users of --no-binary, but as @sbidoul stated, that simply means that we're debating which group should be prioritised over the other. In the absence of any objective way of comparing the importance of the different use cases, the "status quo wins" principle applies - we avoid breaking existing users who are happy with things as they are and focus on improving the documentation and refining the options for controlling the behaviour of caching (while leaving the default unchanged), so that users who prefer not to re-use cached builds (a) know that they have to select a non-default behaviour, and (b) can do so as easily as possible.

But in the absence of compelling evidence that the users relying on the existing behaviour no longer want things to work the way they currently do, changing how --no-binary works is probably a non-starter.

@eli-schwartz
Copy link
Contributor

@notatallshaw,

I don't really understand what you're trying to describe:

  • trying to only "allow list" sdists
  • pass the option to reject wheels, except for a specific package allow-listed to use wheels
  • recommend --prefer-binary which is about using the default behavior of preferring binaries, but "even harder because ignoring version updates"

If your goal is to only allow wheels in general but permit sdists for selected packages where wheels don't currently exist, then I would humbly posit that the correct solution to this is --only-binary ":all:" --only-binary "-<package>", a subtraction operation that currently doesn't exist but would provide dramatically better UX than an IMO slightly tortured interpretation of the --no-binary option.

It would also provide forward-compatibility if a project starts to upload wheels, because you'd like to start using them.

All that being said, I don't actually understand the goal of this in the first place. Since:

  • pip already uses only binaries when it can
  • you want to use only binaries when you can, but allow all the cases where you know you can't

This would work by default, so the "allowlist" is presumably a security mechanism that allows you to vet that decision in advance... which implies you trust a project to provide executable code that you're going to execute without verifying, but do not trust the project to provide an executable build system in an sdist. The notion of "supply chain security by not trusting build systems" is not really a workable end result IMO. What you actually want is pinned hashes so that you can, well, install the things you trust.

or a package previously published wheels and now only publishes sdists (there are real world cases of this).

... but my understanding was that this is the literal only case where --prefer-binary makes a semantic difference to the resolver, so it seems strange to call this out as an exceptional case where --prefer-binary doesn't work?

Overall I still just do not understand what people mean when they say that the currently implemented pip functionality has an advantageous circumstance over the proposed change. The only case where it makes a difference as far as I can tell is when:

  1. for security reasons you do not trust wheels
  2. you are unwilling to use hashes
  3. you'd like builds to be a bit faster because they reuse caches

I don't get how not trusting sdists fits in here at all.

@eli-schwartz
Copy link
Contributor

In general, my impression is that people using --no-binary want to ensure that they build everything themselves, never relying on artefacts that might have been built by someone else (and which therefore might not be trustworthy).

Sdists are not trustworthy either if you don't pin the hashes, in which case --no-binary is a moot point. But the people using hashes will also tend to be hashing sdists so they can guarantee the results are trustworthy.

But the other reasons for no-binary are also significant, and matter even without hashing:

  • to force compiled C extensions to respect system options, such as compiling with different compilers or CFLAGS such as -march=native, or build against system versions of a large C++ dependency instead of bundling multiple ABI-incompatible versions into wheels for different packages that are commonly installed and used together
  • to pass --config-settings and have them be respected.

Re-using builds that have been done locally, on the same build machine, using the same tools as are being used for the current build, ius completely safe under that model, and so the default behaviour offers an acceptable way of getting significant performance improvements when doing a large build.

People concerned about "significant performance improvements" would be using a wheelhouse, surely. Like the PyPI one. I don't believe that there's significant overlap between people concerned about building everything themselves with different options but not hashes, and people who are concerned about the amount of time it takes to perform a build.

And it is not safe to reuse a cached artifact when the cache key doesn't correctly reflect the inputs, such as --config-settings at minimum but also $CC, $CXX, and arguably "any environment variable that setup.py looks up via os.environ.get()". It's a correctness bug, not a performance loss. Slow but correct is better than getting a wrong answer fast.

@eli-schwartz
Copy link
Contributor

As mentioned above, renaming --no-binary to --no-download-binary and perhaps adding a new option --never-binary which bypasses all caching, could work and make everyone happy other than the people who have to update a deprecated command line option in their scripts. :)

@pfmoore
Copy link
Member

pfmoore commented Sep 19, 2024

I don't think this discussion is particularly productive. Backward compatibility means that we won't change the existing behaviour without a clear transition plan for how people who rely on the existing behaviour will be warned of the upcoming change, and either given a way to retain the current behaviour, or given a viable transition to the new behaviour. Discussing changes without covering those points is speculation at best, pointless at worst.

And claiming that "nobody could want the current behaviour" without evidence is both unhelpful and naive. From experience I can state with some certainty that any behaviour, no matter how weird we might think it is, is almost certainly being relied on by someone who has no budget to change anything, and a business that relies on upgrading pip not breaking their workflow 🙁

renaming --no-binary to --no-download-binary and perhaps adding a new option --never-binary which bypasses all caching, could work and make everyone happy other than the people who have to update a deprecated command line option in their scripts. :)

Doing exactly this but not renaming --no-binary would achieve the same, except that nobody's code would be broken, and the only people who would be unhappy are people who are bothered purely by the name of the option.

@notatallshaw
Copy link
Member

notatallshaw commented Sep 19, 2024

I don't really understand what you're trying to describe:

Sorry, I put the commands the wrong way around, it should have been --only-binary :all: --no-binary <package> ..., I've edited my comment.

a subtraction operation that currently doesn't exist but would provide dramatically better UX than an IMO slightly tortured interpretation of the --no-binary option.

Agreed, but that doesn't exist, so it's a real use case of pip right now.

@paugier
Copy link
Author

paugier commented Oct 11, 2024

For the record, this works:

pip cache remove fluidfft_fftwmpi*; pip install fluidfft-fftwmpi --no-binary fluidfft-fftwmpi

Note the _ for the first command (remove). This is a pattern for the wheels filename and not a package name.

@sbidoul
Copy link
Member

sbidoul commented Oct 11, 2024

Note the _ for the first command (remove). This is a pattern for the wheels filename and not a package name.

Thanks for the feedback @paugier ! This would warrant a dedicated issue, which might be easy to solve (although not entirely trivially, due to the wildcard).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state: awaiting PR Feature discussed, PR is needed type: feature request Request for a new feature
Projects
None yet
Development

No branches or pull requests

5 participants