Add endpoints to obtain files from within registry tarballs #210

Open
bauglir opened this issue Jun 26, 2024 · 4 comments


@bauglir
Contributor

bauglir commented Jun 26, 2024

I have been working on adding package servers as a "datasource" to the dependency manager Renovate (renovatebot/renovate#29623). This seems like a nice approach as it ensures the dependency manager uses the same mechanism as (most) Julia clients and is therefore more likely to have accurate data (i.e. instead of directly interacting with the underlying Git repository).

One of the things I have run into is that, as far as I can tell, I can only obtain the full tarball of a registry (at a particular "state") and need to handle all the extraction and parsing locally. Given that these tarballs are relatively big and we often only need one or two files from them, it'd be nice if package servers had endpoints for "indexing" into the tarballs, handling the necessary extraction and caching in a centralized location. I'm not sure how well this fits the infrastructure and architecture, but having this information available through specific endpoints would make it a lot more straightforward to write tools like Renovate against package servers.
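
For reference, the local extraction described here can be kept fairly small. The following is a minimal sketch, assuming the registry tarball has already been downloaded into memory; `read_member` and `make_demo_tarball` are hypothetical helper names, and the demo archive contents are made-up placeholders rather than real registry data.

```python
import io
import tarfile


def read_member(tarball: bytes, member_path: str) -> bytes:
    """Return the contents of one file from an in-memory tarball."""
    # "r:*" lets tarfile auto-detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:*") as archive:
        for member in archive:
            # Normalize a leading "./" so "Registry.toml" matches "./Registry.toml".
            name = member.name[2:] if member.name.startswith("./") else member.name
            if name == member_path and member.isfile():
                return archive.extractfile(member).read()
    raise KeyError(f"{member_path!r} not found in tarball")


def make_demo_tarball() -> bytes:
    """Build a tiny gzip-compressed tarball standing in for a registry (placeholder data)."""
    files = {
        "Registry.toml": b'name = "Demo"\n',
        "E/Example/Versions.toml": b'["1.0.0"]\ngit-tree-sha1 = "0000"\n',
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


demo = make_demo_tarball()
print(read_member(demo, "Registry.toml").decode())
```

Nothing is written to disk in this sketch; the whole archive is still downloaded and decompressed, which is the cost the request is trying to avoid.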

Some additional notes as to why I consider relying on package servers a nicer approach over directly interacting with the underlying Git repositories:

  • The Git repositories can be hosted on arbitrary hosting platforms, so support would have to be added for all those hosting platforms (i.e. interacting with their individual APIs) in Renovate.
  • Package servers can be used to provide access to "private" packages for which access to the source repositories needs to be restricted (including the registry's repository). It'd be nice if a dependency management system could support these.
@StefanKarpinski
Collaborator

This would be doable, of course, but it does add complexity, work and risk to the package server. We want to minimize that, especially on the public package servers, which are exposed to the entire world. Anyone can put up an arbitrary tarball via automated package registration, so adding this would allow them to then trigger extracting that tarball on our servers, which is a lot riskier than just serving arbitrary tarballs as opaque blobs. Of course, I think we're already rewriting tarballs on storage servers, so there's some risk already, but that doesn't actually touch disk at all—it's implemented entirely in-memory in Tar.rewrite. So we may want to limit this to private package servers rather than the public ones. Along those lines, this could be an optional protocol extension. A natural syntax would be to use a $url#$path fragment specifier to get a single file out of a tarball. E.g. something like this:

https://pkg.julialang.org/registry/23338594-aafe-5451-b93e-139f81909106/778a038f4f7f7a4c161ed9bbcdf07a0b0e0e8efb#Registry.toml
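
A rough sketch of how a tool might split such a `$url#$path` specifier into the tarball URL and the path inside it is given below. `split_specifier` is a hypothetical name, not part of any existing protocol or library. One caveat worth keeping in mind: HTTP clients normally do not send the fragment to the server, so whichever component interprets the specifier would have to handle the fragment itself.

```python
from urllib.parse import unquote, urldefrag


def split_specifier(spec: str):
    """Split "$url#$path" into (tarball_url, member_path or None)."""
    url, fragment = urldefrag(spec)
    # Fragments may be percent-encoded; decode so the path matches tar member names.
    return url, (unquote(fragment) or None)


spec = (
    "https://pkg.julialang.org/registry/"
    "23338594-aafe-5451-b93e-139f81909106/"
    "778a038f4f7f7a4c161ed9bbcdf07a0b0e0e8efb#Registry.toml"
)
tarball_url, member_path = split_specifier(spec)
print(tarball_url)
print(member_path)
```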

@staticfloat, any thoughts?

@staticfloat
Member

I'm a little against doing this kind of processing on the server-side. We would need to decompress, extract, recompress and transmit the piece of the data that you're asking for, and while that's not that much work to do, we should assume that any feature we add here could potentially become extremely popular. I'm imagining people using this to extract files from large artifacts when they only want a subset of what the artifact contains. Of course we could then cache these results separately from the full artifact, but it's a decent amount of work to save not a lot of work on the client side. In general, I'd really rather burn client-side CPU time than server CPU time, because if we offer it up via the server I'm pretty sure we're going to run out of server CPU eventually.
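
To make the caching idea concrete, here is a minimal sketch of a cache for extracted files that is kept separate from the full tarballs. It assumes entries can be keyed by the hash in the tarball URL plus the path inside the tarball; `ExtractedFileCache` and `fake_extract` are hypothetical names, and the expensive decompress-and-extract step is passed in as a callable rather than implemented.

```python
from typing import Callable, Dict, Tuple


class ExtractedFileCache:
    """Cache single extracted files by (tree hash, path inside the tarball)."""

    def __init__(self, extract: Callable[[str, str], bytes]) -> None:
        # Hypothetical callable that performs the expensive decompress + extract.
        self._extract = extract
        self._entries: Dict[Tuple[str, str], bytes] = {}
        self.misses = 0

    def get(self, tree_hash: str, path: str) -> bytes:
        key = (tree_hash, path)
        if key not in self._entries:
            # Only pay the extraction cost the first time a key is requested.
            self.misses += 1
            self._entries[key] = self._extract(tree_hash, path)
        return self._entries[key]


def fake_extract(tree_hash: str, path: str) -> bytes:
    """Stand-in for real extraction so the sketch runs without a tarball."""
    return f"{tree_hash}:{path}".encode()


cache = ExtractedFileCache(fake_extract)
cache.get("778a038f", "Registry.toml")
cache.get("778a038f", "Registry.toml")
print(cache.misses)
```

This sketch never evicts entries, so a real deployment would need a size bound; the server still pays the CPU cost on every cache miss, which is the concern raised here.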

@StefanKarpinski
Collaborator

Right, but what if this was only on private package servers like the ones JuliaHub offers?

@staticfloat
Member

If it's just a private server, I think I would instead just create a second service that gets registries from an upstream package server, and does whatever this application requires. I don't think the motivation behind this (the Renovate app) is meant to work with private package servers, is it? It looks to me like it's supposed to work with the open-source package server, although maybe that's just an example?
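
A second service along these lines could be sketched roughly as follows. This is only an illustration under stated assumptions: `registry_url`, `fetch_registry_file` and `fake_upstream` are hypothetical names, the URL shape is taken from the example URL earlier in this thread, and the upstream download is injected as a callable so the sketch runs without network access.

```python
import io
import tarfile
from typing import Callable


def registry_url(server: str, uuid: str, tree_hash: str) -> str:
    """Build the tarball URL in the /registry/$uuid/$hash shape of the example URL."""
    return f"{server.rstrip('/')}/registry/{uuid}/{tree_hash}"


def fetch_registry_file(
    fetch: Callable[[str], bytes], server: str, uuid: str, tree_hash: str, path: str
) -> bytes:
    """Download one registry tarball via `fetch` and return a single file from it."""
    tarball = fetch(registry_url(server, uuid, tree_hash))
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:*") as archive:
        member = archive.getmember(path)  # raises KeyError if the path is absent
        reader = archive.extractfile(member)
        if reader is None:  # e.g. a directory entry, which has no contents
            raise KeyError(f"{path!r} is not a regular file")
        return reader.read()


def fake_upstream(url: str) -> bytes:
    """Stand-in for an HTTP GET against the upstream server (placeholder data)."""
    data = b'name = "Demo"\n'
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="Registry.toml")
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


content = fetch_registry_file(
    fake_upstream,
    "https://pkg.julialang.org",
    "23338594-aafe-5451-b93e-139f81909106",
    "778a038f4f7f7a4c161ed9bbcdf07a0b0e0e8efb",
    "Registry.toml",
)
print(content.decode())
```

In real use, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`, and the extraction cost would fall on this separate service instead of the package server.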
