Cache VCS repositories locally when installing #11126

Closed
1 task done
sp-ricard-valverde opened this issue May 18, 2022 · 13 comments

@sp-ricard-valverde

sp-ricard-valverde commented May 18, 2022

What's the problem this feature will solve?

When installing a package from a VCS source for which there is no locally cached wheel yet, pip has to download the repository every time to get its metadata.
For big repositories and/or many VCS dependencies, this can increase installation time considerably.

Describe the solution you'd like

I would like pip to also cache VCS repositories whenever they are downloaded, and work on that local copy (fetch latest changes, etc.).

Alternative Solutions

Changing all VCS package URLs in direct requirements to local file repositories would be a partial workaround.
It could be used for local development to speed up solving dependency issues, but would not work for transitive VCS dependencies.
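
A minimal sketch of that workaround, assuming the local clones already exist and are refreshed by hand. The repository URL, the mirror path, and the helper name `to_local_requirement` are all hypothetical; the only pip feature relied on is its `git+file://` URL scheme. As noted above, this only touches direct requirements.

```python
# Hypothetical mapping from a remote repository URL to the absolute path
# of a local clone that the user keeps up to date manually.
LOCAL_CLONES = {
    "https://example.com/org/big-repo.git": "/srv/mirrors/big-repo.git",
}


def to_local_requirement(line: str, clones: dict[str, str]) -> str:
    """Rewrite a 'git+<remote-url>...' requirement to use a local clone.

    Anything after the URL (such as '@ref') is kept unchanged.
    Lines that do not mention a known remote are returned as-is.
    """
    for remote, local_path in clones.items():
        prefix = "git+" + remote
        if prefix in line:
            return line.replace(prefix, "git+file://" + local_path, 1)
    return line


if __name__ == "__main__":
    print(to_local_requirement(
        "mypkg @ git+https://example.com/org/big-repo.git@v1.2.0",
        LOCAL_CLONES,
    ))
```

The rewritten lines would have to be reverted before the requirements file is shared, since the local paths only exist on one machine.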

Additional context

I came to think of this feature while iterating on solving a dependency hell in a project's requirements.txt file. Because the packages in the requirements are not built until all dependency issues are solved (one at a time), the packages from VCS are never actually cached, and every time I try to fix a dependency, pip has to download all repositories for the VCS packages yet again.

@sp-ricard-valverde added the "S: needs triage" and "type: feature request" labels May 18, 2022
@uranusjr
Member

Personally I don’t particularly like this. A VCS repository cache is too easy to corrupt, and compared to other kinds of caches (which are validated by hash), it’s not as simple to detect such corruption. Personally I would prefer to keep things simple, and leave such optimisation to the user.

@pfmoore
Member

pfmoore commented May 19, 2022

I agree. Doing the fetch could fail for all sorts of reasons - what if there was a rebase in the source repo, which couldn't be merged without manual intervention? It would take a lot of care in pip to handle those cases cleanly, and IMO the gain isn't worth the complexity.

As the OP noted, changing all VCS urls to local directories, and manually managing the refreshes is a workaround here. I don't think we should try to make pip handle all possible workflows "out of the box", and in this case there's an alternative workflow that doesn't need special support from pip.

@sp-ricard-valverde
Author

It's a cache, so it could be a best-effort, opt-in solution. A fetch with -f (for Git) will solve that particular problem of history rewrites (you don't care about local changes at all).
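
To illustrate what such a best-effort refresh could look like outside of pip, here is a rough sketch that keeps a bare mirror and force-fetches into it. The function names and the cache layout are assumptions for illustration, not anything pip provides; the git options used (`clone --mirror`, `fetch --force --prune`, `-C`) are standard.

```python
import subprocess
from pathlib import Path


def refresh_commands(url: str, cache_dir: Path) -> list[list[str]]:
    """Return the git commands that create or refresh a bare mirror."""
    if not cache_dir.exists():
        # First use: create a bare mirror of the remote repository.
        return [["git", "clone", "--mirror", url, str(cache_dir)]]
    # --force lets rewritten (non-fast-forward) refs overwrite cached ones;
    # --prune drops refs that were deleted upstream.
    return [["git", "-C", str(cache_dir), "fetch", "--force", "--prune", "origin"]]


def refresh(url: str, cache_dir: Path) -> bool:
    """Best effort: report failure with False instead of raising."""
    for cmd in refresh_commands(url, cache_dir):
        try:
            subprocess.run(cmd, check=True, capture_output=True)
        except (OSError, subprocess.CalledProcessError):
            return False  # caller can fall back to a normal download
    return True
```

Because the mirror has no working tree, there are no local changes to merge, which is why a history rewrite upstream does not need manual intervention here. Detecting a corrupted mirror is not addressed by this sketch.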

As I said, changing the VCS URLs will only be useful for local development (you will need to remember to change the URLs in requirements back again!) and will not work on transitive dependencies.

@RonnyPfannschmidt
Contributor

For all content-hash-based VCSs, an optimization can be done (like checking whether the hash of the remote is still the same,
or, if it's a tag, even erroring out if it has changed).

I believe this applies to Mercurial and Git.
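
For Git, the check described above could be sketched roughly as follows, assuming the remote refs are obtained with `git ls-remote <url>`, which prints one `<hash><TAB><ref>` pair per line. The function and exception names are hypothetical, and the sketch compares whatever hash is listed for the ref name (for annotated tags that is the tag object, not the peeled commit).

```python
class TagMovedError(Exception):
    """Raised when a tag points somewhere else than it did when cached."""


def parse_ls_remote(output: str) -> dict[str, str]:
    """Map ref name -> hash from the text printed by `git ls-remote`."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split("\t", 1)
        refs[ref] = sha
    return refs


def cache_is_fresh(cached_sha: str, ref: str, remote_refs: dict[str, str]) -> bool:
    """Return True if the cached hash still matches the remote ref."""
    remote_sha = remote_refs.get(ref)
    if remote_sha is None:
        return False  # ref no longer exists upstream
    if remote_sha == cached_sha:
        return True  # nothing changed, the download can be skipped
    if ref.startswith("refs/tags/"):
        # Tags are expected to be immutable, so a change is treated as an error.
        raise TagMovedError(f"{ref} moved from {cached_sha} to {remote_sha}")
    return False  # a branch moved: the cache needs a refresh
```

Only the ref listing crosses the network in this scheme, so an unchanged branch or tag costs one small request rather than a full clone.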

@sbidoul
Member

sbidoul commented Jun 4, 2022

I also think pip is not the place to implement VCS-level caching. One approach that helped my group with large repos is a caching git wrapper such as git-autoshare.

pip already caches wheels built from VCS references to git commits (it probably also works for Mercurial and Bazaar).

But as the OP mentions, wheels are not built until resolution has succeeded, so the current wheel cache does not help in that case, and I agree the described scenario can be painful.

There are two optimizations I have in the back of my mind that could help:

  • caching prepared metadata (which would help with sdists too)
  • also caching VCS references to branches and tags (based on the resolved commit)

These are not very high on my priority list, though.

@pfmoore
Member

pfmoore commented Jun 5, 2022

> caching prepared metadata (which would help with sdists too)

This is potentially dangerous in any case. There's no guarantee that a source tree will generate the same prepared metadata when rebuilt at a later time. Consider a hatch plugin similar to hatch-vcs, which generates a calver version based on the date...

We can make some assumptions about metadata for a sdist (name and version are static), but not for a source tree. (Of course, we could declare such edge cases as unsupported, but we can't even know to explicitly reject them without doing the build, so we'd risk silently using incorrect data).
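
A toy illustration of the calver concern, assuming a made-up version scheme. This is not the behaviour of hatch-vcs or of any real plugin; `calver_version` and `prepared_metadata` are hypothetical names.

```python
from datetime import date


def calver_version(build_date: date) -> str:
    """Hypothetical date-based version, as a build plugin might compute it."""
    return f"{build_date.year}.{build_date.month}.{build_date.day}"


def prepared_metadata(name: str, build_date: date) -> dict[str, str]:
    """Metadata for an unchanged source tree, prepared on a given day."""
    return {"Name": name, "Version": calver_version(build_date)}


if __name__ == "__main__":
    first = prepared_metadata("mypkg", date(2022, 6, 5))
    later = prepared_metadata("mypkg", date(2022, 6, 6))
    # Same source tree, different metadata: a cached copy of `first`
    # would be stale a day later.
    print(first["Version"], later["Version"])
```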

@sbidoul
Member

sbidoul commented Jun 5, 2022

> There's no guarantee that a source tree will generate the same prepared metadata when rebuilt at a later time.

This applies to our current wheel cache too. There are indeed many ways two builds of the same source tree can yield different results, due to environmental parameters that our wheel cache can't possibly know about.

If we use the same criteria as the wheel cache to decide whether to cache or not, I think the situation would be exactly the same with a metadata cache?

And in such situations, users can use different caches or disable caching entirely, as they already need to do today with the wheel cache.

@pfmoore
Member

pfmoore commented Jun 5, 2022

Good point. But I thought the wheel cache was keyed by project name (and used as essentially an extra source of potential wheels). We can’t do the same with metadata for a source tree as we can’t know the name without getting the metadata. So what would be the key for the metadata cache?

You did say this wasn’t a high priority for you though, so I’m fine if you want to park the discussion for now.

@sbidoul
Member

sbidoul commented Jun 5, 2022

Wheel cache entries are keyed by source artifact URL, and then the name and supported tags are necessary to look up a wheel in a cache entry. So to benefit from the wheel cache with direct urls, one has to use the "name @ url" syntax. For sdists obtained via an index, the name is known before looking up the url.
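
The two-step lookup described above can be pictured with a toy in-memory model. This is only a sketch of the idea, not pip's actual data structures or on-disk layout; `ToyWheelCache` and `CachedWheel` are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CachedWheel:
    name: str
    tag: str  # e.g. "py3-none-any"
    filename: str


@dataclass
class ToyWheelCache:
    # Step 1: a cache entry is found by the source artifact URL.
    entries: dict[str, list[CachedWheel]] = field(default_factory=dict)

    def add(self, url: str, wheel: CachedWheel) -> None:
        self.entries.setdefault(url, []).append(wheel)

    def lookup(
        self, url: str, name: Optional[str], supported_tags: set[str]
    ) -> Optional[CachedWheel]:
        # Step 2: inside the entry, the name and supported tags select a wheel.
        if name is None:
            return None  # bare URL without "name @": nothing to select by
        for wheel in self.entries.get(url, []):
            if wheel.name == name and wheel.tag in supported_tags:
                return wheel
        return None
```

In this model a requirement written as a bare URL never hits the cache, which mirrors the point about needing the "name @ url" syntax.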

My current intuition is that these mechanisms would work for a metadata cache too, but being sure of that will require investigation and that will be for other times indeed.

@sbidoul
Member

sbidoul commented Jun 5, 2022

As a side note, this question of environmental parameters makes me think that we could consider taking into account --config-settings in addition to the URL in wheel cache keys. That may also mean we should encourage users to use config settings over environment variables to pass options to build backends.
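
One way such a key could be derived is sketched below, under the assumption that the key is a hash over the URL plus the config settings. This is not pip's current key format, and `cache_key` is a hypothetical helper.

```python
import hashlib
import json
from typing import Optional


def cache_key(url: str, config_settings: Optional[dict[str, str]] = None) -> str:
    """Derive a cache key from the source URL and the build config settings."""
    payload = {"url": url, "config_settings": config_settings or {}}
    # sort_keys makes the key independent of the order the settings were given in.
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Options passed through environment variables would remain invisible to a key like this, which is the reason for preferring config settings in the comment above.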

@pfmoore
Member

pfmoore commented Jun 5, 2022

That might be worthwhile but I’d be reluctant to include the extra complexity until there’s more sign that backends will actually use the config settings.

@sbidoul
Member

sbidoul commented Jun 5, 2022

I've created 3 separate issues to discuss and track possible optimizations that could help in the OP scenario.

@sbidoul added the "S: awaiting response" and "resolution: out of scope" labels and removed the "S: needs triage" label Jun 5, 2022
@sbidoul
Member

sbidoul commented Sep 25, 2022

Closing this as the conclusions are tracked in separate issues.

@sbidoul closed this as not planned Sep 25, 2022
@github-actions bot locked as resolved and limited conversation to collaborators Oct 26, 2022
@pradyunsg removed the "S: awaiting response" label Mar 17, 2023