Distributing large binary payloads as separate downloads #7852

@achimnol

Description

What's the problem this feature will solve?

Many people, including me, have been requesting increases to the size limit for wheel packages uploaded to PyPI. In particular, distributing whole OS-specific prebuilt binaries and GPU code binaries often takes hundreds of megabytes.

I think that having a way to distribute large binary payloads separately, similar to Git LFS, would help reduce both network traffic and PyPI's maintenance burden.
#474 is also related to this idea.

Describe the solution you'd like

This is a rough idea; there are probably many rough edges to iron out.

  • Extend the wheel package format so that:
    • MANIFEST.in can associate specific files and file patterns with paths prefixed with external resource identifiers.
      e.g., assets/mydata.bin -> mybinary/mydata.bin
    • setup.py or setup.cfg can define external resource identifiers as a mapping from slug names to URL prefixes.
      e.g., mybinary -> https://mys3bucket.s3.amazonaws.com/mypackage
  • Extend setuptools so that:
    • it uses the external resource identifiers to compose the actual resource URLs and downloads the payloads during installation;
    • users can override the external resource identifiers via environment variables, pointing at arbitrary URLs or local paths for offline distributions (see the sketch after this list).
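
To make this concrete, here is a minimal sketch of what the packaging-side declaration might look like. The `external_resources` and `external_data_files` keywords are hypothetical names invented only for illustration; they do not exist in setuptools today.

```python
# setup.py -- hypothetical sketch; "external_resources" and
# "external_data_files" are invented keywords, not real setuptools options.
from setuptools import setup

setup(
    name="mypackage",
    version="1.0.0",
    # Slug names mapped to URL prefixes where the large payloads are hosted.
    external_resources={
        "mybinary": "https://mys3bucket.s3.amazonaws.com/mypackage",
    },
    # Files that are not bundled into the wheel but fetched from
    # "<prefix>/<remote path>" at installation time.
    external_data_files=[
        ("assets/mydata.bin", "mybinary/mydata.bin"),
    ],
)
```

With such a mapping, the installer would fetch assets/mydata.bin from https://mys3bucket.s3.amazonaws.com/mypackage/mydata.bin, and an environment variable could redirect the mybinary prefix to a mirror or a local directory for offline installs.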

Additional context
I'd like to avoid overloading PyPI and network resources, but still want a seamless way to distribute large binaries through Python's standard packaging mechanism.

The disadvantage of this approach is that wheels are no longer self-contained, and the versioning of external resources may be broken by package maintainers' mistakes. (Maybe PyPI could provide a fallback repository for external resources, since it already hosts large packages today via size-increase requests.)
We could mitigate human errors by enforcing specific rules for naming the external resource directories, such as requiring them to match the wheel file names, and by using checksums. Moreover, we could extend wheel and twine to automatically split-package files that exceed a certain size limit and upload them with user-provided credentials to specific locations (e.g., S3), with PyPI as a fallback.
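
As a rough illustration of the checksum idea, the installer-side download could verify each payload against a hash recorded in the wheel, much like RECORD already stores hashes for bundled files. Everything below, including the PKG_EXTERNAL_<SLUG> override variable, is an assumed name for this sketch only.

```python
import hashlib
import os
import urllib.request

def fetch_external_payload(slug, remote_path, url_prefix, dest_path, expected_sha256):
    """Sketch: download one external payload and verify its checksum.

    The URL prefix can be overridden via a hypothetical environment
    variable (PKG_EXTERNAL_<SLUG>) to support mirrors or offline installs.
    """
    prefix = os.environ.get(f"PKG_EXTERNAL_{slug.upper()}", url_prefix)
    url = f"{prefix}/{remote_path}"
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(f"Checksum mismatch for {url}: got {digest}")
    with open(dest_path, "wb") as f:
        f.write(data)
```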

I just want to give an idea and see what people think.
For example, considering the significant technical effort required to implement and maintain the above idea, it might be more feasible to just allow larger uploads to PyPI.
There may already be a past discussion about this topic; if this is a duplicate, please forgive me and point me to the thread.
