
Avoid downloading large package metadata #3431

Open
adrinjalali opened this issue Sep 4, 2024 · 12 comments
Labels
type::feature-request New feature proposal type::question Further information is requested

Comments

@adrinjalali

In an IRL conversation with @wolfv and others, he mentioned that the infrastructure for avoiding the download of the massive package metadata when running micromamba commands already exists, but I can't find a way to use it. Is there already a way to do that with a micromamba command?

To be specific, this is what I want to avoid:

$ mmamba install jupyter
conda-forge/noarch                                  16.3MB @   7.9MB/s  2.1s
conda-forge/linux-64                                37.6MB @   9.8MB/s  3.9s

...
@wolfv
Member

wolfv commented Sep 4, 2024

Sadly, no. It's only implemented in rattler / pixi for now. There you can point to https://fast.prefix.dev/conda-forge to get a speedy version.
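For illustration, pointing a pixi project at that mirror only takes a channel URL in pixi.toml (a minimal sketch; the project name and platform list here are placeholders):

[project]
name = "example"
channels = ["https://fast.prefix.dev/conda-forge"]
platforms = ["linux-64"]

With that, pixi resolves against the mirror instead of pulling the full repodata.json from the default channel URL.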

@Hind-M Hind-M added the type::question Further information is requested label Sep 19, 2024
@Hind-M Hind-M closed this as completed Sep 19, 2024
@adrinjalali
Author

Is this completed? Or is it a "wontfix" kinda thing?

@Hind-M
Member

Hind-M commented Sep 19, 2024

So at the moment it's not implemented in mamba/micromamba.
It could be at some point, but not in the short/medium term unfortunately.
We can also mark it as a feature request.

@Hind-M Hind-M reopened this Sep 19, 2024
@Hind-M Hind-M added the type::feature-request New feature proposal label Sep 19, 2024
@jjerphan
Member

jjerphan commented Oct 8, 2024

For some context, conda/ceps#75 formalizes sharded repodata, an improved indexing solution for conda channels.

It has been accepted, but it needs to be implemented by forges (including conda-forge) and by package managers (such as conda and mamba/micromamba).
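Roughly, the CEP replaces one monolithic repodata.json with a small per-subdir index that maps package names to content-addressed shards, so a client only downloads the shards it needs. A minimal sketch of fetching a single shard, assuming the file names and index fields from my reading of the CEP (illustrative, not authoritative):

import io
import msgpack            # third-party: pip install msgpack zstandard
import urllib.request
import zstandard

SUBDIR = "https://fast.prefix.dev/conda-forge/linux-64"  # example sharded mirror

def get(url: str) -> dict:
    # Both the index and the shards are zstd-compressed msgpack.
    raw = urllib.request.urlopen(url).read()
    data = zstandard.ZstdDecompressor().stream_reader(io.BytesIO(raw)).read()
    return msgpack.unpackb(data)

# Small index: maps each package name to the sha256 digest of its shard.
index = get(f"{SUBDIR}/repodata_shards.msgpack.zst")
digest = index["shards"]["python"].hex()  # field layout assumed from the CEP

# Fetch only the shard for "python" instead of tens of MB of repodata.json.
shard = get(f"{SUBDIR}/shards/{digest}.msgpack.zst")  # path assumed from the CEP
# The shard holds just this package's records, keyed like repodata.json entries.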

@jjerphan
Member

jjerphan commented Oct 9, 2024

conda/conda#14060 is the main requirement, which is being implemented by conda/conda-index#161.

@jjerphan
Member

I would prioritize it, since UX-wise this is what causes latency and fatigue for users.

@baszalmstra
Contributor

Good to mention that the conda-forge and bioconda mirrors (but any channel, really) at https://prefix.dev/ fully support sharded repodata. So I think an implementation is not blocked on server support.

@jjerphan
Member

Yes, mamba can implement support for sharded repodata now without waiting for conda-lock to implement it.

@jjerphan
Member

I am not sure libsolv is usable with sharded repodata, since AFAIK libsolv has to load all package metadata upfront.

@baszalmstra
Contributor

I'm not entirely sure how you "feed" the data to libsolv, but we also have this working in rattler with our libsolv backend.

We preprocess the input specs and recursively fetch the packages whose names we encounter as dependencies. This is still an eager process, but most of the time it fetches only a fraction of the total number of packages in the channel. It does include all the records needed for a solve.

As an example, given just the spec "python", we start by fetching all records for python. We then iterate over all dependencies of all python records and see which package names we encounter, for instance libsqlite. We fetch all records of libsqlite and do the same there. This crawls the entire search space that libsolv needs to solve the input spec.

You can also implement some smart tricks here to shrink the total search space. Rattler also dispatches requests for different packages in parallel (make sure to use HTTP/2), and we can aggressively cache the individual shards. This is what makes sharded repodata so fast.
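A minimal sketch of that crawl in Python (fetch_records here is a hypothetical stand-in for downloading and parsing one package's shard; the real rattler code is Rust, fetches in parallel, and caches shards):

def crawl(roots, fetch_records):
    """Eagerly collect every record reachable from the root package names."""
    seen = {}                             # package name -> list of record dicts
    todo = set(roots)
    while todo:
        name = todo.pop()
        records = fetch_records(name)     # all records for this name, every version
        seen[name] = records
        for record in records:
            for dep in record.get("depends", []):
                dep_name = dep.split()[0] # "libsqlite >=3.46" -> "libsqlite"
                if dep_name not in seen and dep_name not in todo:
                    todo.add(dep_name)    # newly encountered name: crawl it too
    return seen                           # superset of what libsolv needs

# e.g. crawl({"python"}, fetch_records) touches python, libsqlite, openssl, ...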

You could also create .solv files for the individual shards for possibly even more speed! (We didn't implement this.)

Let me know if you want more details, I can point you to where this happens in the code.

@jjerphan
Member

As far as I remember, as of now in mamba, entire channels' repodata.json files are loaded into libsolv's database, then a problem is constructed and solved.

@baszalmstra
Contributor

I see. Yeah, that's a bit of a shortcoming, as that won't scale very well. It means reading all the data ever created even when it is not needed at all. I also assume that uses much more memory than necessary.
