
Add validation on wheel file characteristics in CI #110

Open
jameslamb opened this issue Oct 22, 2024 · 2 comments
Labels: improvement (Improves an existing functionality)

@jameslamb (Member)

Description

As we start to publish more RAPIDS wheels directly to https://pypi.org/, it'll be important to validate characteristics like the following:

  • the wheels pass twine check --strict
  • the wheels' compressed size is not too large for PyPI

These and any other non-functional characteristics of wheels that can be validated easily, quickly, and programmatically should be validated in CI.

Benefits of this work

  • improves release confidence ... reduces the risk of "oops these wheels are too big for PyPI" surprises near the end of a release
  • reduces the effort required to identify which changes led to increases in wheel sizes

Acceptance Criteria

  • CI for RAPIDS libraries fails on pull requests when wheel builds produce wheels that:
    • do not pass twine check --strict
    • are too large (compressed)

Approach

Such checks should run in the wheel-building CI jobs, on pull requests... they don't require GPUs, and they run quickly enough that a dedicated wheel-validation CI job wouldn't be worth its spin-up time and cost.

This could be accomplished with something like the following in wheel-build scripts, after the wheel file is finalized (e.g. has gone through auditwheel repair).

# check README formatting
twine check --strict dist/*

# assert size, and print a summary of contents
pydistcheck \
  --inspect \
  --select 'distro-too-large-compressed' \
  --max-allowed-size-compressed '100M' \
  dist/*

Example of this using 24.12 libcudf-cu12 nightlies:

On an x86_64 linux system:

mkdir -p ./delete-me
pip download \
    --extra-index-url https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/ \
    --no-deps \
    -d ./delete-me \
    'libcudf-cu12==24.12.*,>=0.0.0a0'

Checking the formatting:

twine check --strict ./delete-me/*
Checking ./delete-me/libcudf_cu12-24.12.0a223-py3-none-manylinux_2_28_x86_64.whl: PASSED

And checking the size (using 5M just to show how this would fail in CI if the wheels got too big):

pydistcheck \
  --inspect \
  --select 'distro-too-large-compressed' \
  --max-allowed-size-compressed '5M' \
  ./delete-me/*
==================== running pydistcheck ====================

checking './delete-me/libcudf_cu12-24.12.0a223-py3-none-manylinux_2_28_x86_64.whl'
----- package inspection summary -----
file size
  * compressed size: 0.4G
  * uncompressed size: 0.7G
  * compression space saving: 35.2%
contents
  * directories: 163
  * files: 1974 (2 compiled)
size by extension
  * .so - 0.6G (97.0%)
  * .h - 6.7M (1.0%)
  * no-extension - 5.3M (0.8%)
  * .cuh - 3.8M (0.6%)
  * .hpp - 2.2M (0.3%)
  * .a - 1.2M (0.2%)
  * .inl - 0.8M (0.1%)
  * .cmake - 0.1M (0.0%)
  * .md - 8.3K (0.0%)
  * .py - 3.6K (0.0%)
  * .pc - 0.2K (0.0%)
  * .txt - 34.0B (0.0%)
largest files
  * (0.6G) libcudf/lib64/libcudf.so
  * (3.8M) libcudf/bin/flatc
  * (1.1M) libcudf/lib64/libflatbuffers.a
  * (0.5M) libcudf/include/libcudf/rapids/libcudacxx/cuda/std/__atomic/functions/cuda_ptx_generated.h
  * (0.2M) libcudf_cu12-24.12.0a223.dist-info/RECORD
------------ check results -----------
1. [distro-too-large-compressed] Compressed size 0.4G is larger than the allowed size (5.0M).
errors found while checking: 1

==================== done running pydistcheck ===============
echo $?
# 1

This should be rolled out to all RAPIDS projects producing wheels, starting with those targeted for publishing to https://pypi.org/.
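
To keep that rollout consistent, the two checks could be wrapped in a small shared helper that each project's wheel-build script calls after auditwheel repair. This is only a sketch: the check_wheel name, its arguments, and the dist/ location are placeholders, not an existing RAPIDS or gha-tools convention.

# hypothetical helper; 'check_wheel' and its defaults are placeholders
check_wheel() {
    local wheel_dir="${1:-dist}"
    local max_size="${2:-100M}"

    # check README / metadata formatting
    twine check --strict "${wheel_dir}"/*

    # assert compressed size, and print a summary of contents
    pydistcheck \
        --inspect \
        --select 'distro-too-large-compressed' \
        --max-allowed-size-compressed "${max_size}" \
        "${wheel_dir}"/*
}

# e.g. in a project's wheel-build script, after `auditwheel repair`
check_wheel "dist" "100M"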

Notes

Disclaimer

pydistcheck is something I (@jameslamb) created and am currently the sole contributor to, so I'm biased about it being the best tool for this purpose.

The narrow goal here, "assert files are not too big", could be accomplished other ways... like unzipping the wheel and using some mix of stat / du / find -size, or with a Python script.
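
For example, a size-only check could look roughly like the following (a sketch that assumes the finalized wheels sit in dist/, a 100M limit, and GNU find, whose -size +100M matches files larger than 100 MiB):

# fail if any finalized wheel is larger than 100 MiB (compressed)
too_big=$(find dist/ -name '*.whl' -size +100M)
if [[ -n "${too_big}" ]]; then
    echo "wheel(s) exceed the size limit:"
    echo "${too_big}"
    exit 1
fi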

I think the summaries pydistcheck generates (see "Approach" section) and all the other things we could ask it to check in the future (docs link) make it preferable to those approaches, but I shouldn't be the one to decide that.

jameslamb added the improvement (Improves an existing functionality) label on Oct 22, 2024
@bdice (Contributor) commented on Oct 22, 2024

I took a brief look at pydistcheck and would support adopting it. It seems like a great tool for this job. It's also good to lean into your expertise in this area. 👍

@vyasr (Contributor) commented on Oct 22, 2024

Yup I'm happy with trying out pydistcheck. Maybe we should do a group review of it in a future build-infra team meeting to make sure we're all aligned. Adding a size check to CI is definitely critical so that we know when someone tries to add a feature that suddenly bloats binary size.

jameslamb self-assigned this on Nov 1, 2024