Investigate wheel filesize #3238

Closed
mpenkov opened this issue Sep 17, 2021 · 3 comments
Labels
housekeeping internal tasks and processes

Comments

@mpenkov
Collaborator

mpenkov commented Sep 17, 2021

We build wheels for 4 platforms × 4 Python versions = 16 wheels per release. Each wheel is about 25 MB (why so large?), so that's roughly 400 MB per release! I can see 10 GB disappearing quickly.
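To make the arithmetic explicit, here is a back-of-the-envelope sketch. It assumes the 10 GB is a fixed storage budget counted in decimal units and that every wheel weighs the quoted 25 MB; neither assumption is confirmed in this thread.

```python
# Figures quoted in the comment above.
platforms = 4
python_versions = 4
wheel_mb = 25

# Assumption: 10 GB budget, 1 GB = 1000 MB.
budget_mb = 10 * 1000

wheels_per_release = platforms * python_versions   # 16 wheels
release_mb = wheels_per_release * wheel_mb          # 400 MB per release
releases_until_full = budget_mb // release_mb       # 25 releases

print(wheels_per_release, release_mb, releases_until_full)
```

Under those assumptions the budget would be exhausted after 25 releases.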

How do other projects with extensive wheel support (scikit-learn?) solve this?

Originally posted by @piskvorky in #3237 (comment)

@mpenkov mpenkov added the housekeeping internal tasks and processes label Sep 17, 2021
@piskvorky
Owner

piskvorky commented Sep 17, 2021

I checked and the overwhelming portion of the distribution size is due to tests.

In particular, we bundle a lot of test data, from https://github.com/RaRe-Technologies/gensim/tree/develop/gensim/test/test_data

The tests are critical for us and for CI, but IMO useless for user installations. I doubt anyone runs the tests locally after installing gensim from PyPI.

I propose not bundling tests in wheels at all. This should cut the wheel size down to ~1 MB, i.e. almost nothing. @mpenkov @gojomo WDYT?
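One way to check a claim like this is to open a built wheel (which is a zip archive) and sum the compressed bytes per subpackage. The sketch below is illustrative only: `size_by_subpackage` and `size_without` are hypothetical helper names, and `gensim/test` is assumed to be the path of the test package inside the wheel.

```python
import zipfile
from collections import Counter


def size_by_subpackage(wheel, depth=2):
    """Sum compressed bytes per directory prefix (at most `depth` levels deep)."""
    sizes = Counter()
    with zipfile.ZipFile(wheel) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Drop the file name, keep the first `depth` directory components.
            directories = info.filename.split("/")[:-1]
            prefix = "/".join(directories[:depth]) or "."
            sizes[prefix] += info.compress_size
    return sizes


def size_without(sizes, excluded_prefix):
    """Compressed bytes left after dropping one prefix and everything below it."""
    return sum(
        size
        for prefix, size in sizes.items()
        if prefix != excluded_prefix
        and not prefix.startswith(excluded_prefix + "/")
    )
```

Calling `size_by_subpackage("path/to/some.whl")` and then `size_without(sizes, "gensim/test")` gives an estimate of what the wheel would weigh without the tests. The numbers are compressed sizes, so they approximate the download size rather than the installed size.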

@gojomo
Collaborator

gojomo commented Sep 17, 2021

Somewhat of a duplicate of #1783.

Much of the test_data may not even be currently used. In #2967 I suggested:

  • running the full test suite against a test_data directory where we're logging file-touches - so that obsolete, untouched files can be finally discarded

This could be as simple as running it once on a developer machine whose filesystem tracks atime (last-accessed time), then deleting every file not accessed during the full run. If the CI build/test machines' filesystems support atime (or allow some other hook to be added), the check could even become part of the build process: log, or possibly even fail, whenever a full test run leaves some test_data files untouched.
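A minimal sketch of the atime idea: record a timestamp, run the full test suite, then list every file whose last-access time is still older than that timestamp. `untouched_files` is a hypothetical helper, and the approach only works where the filesystem actually updates atime on reads; mounts using `noatime`, or `relatime` in some situations, will not record every access.

```python
import os


def untouched_files(root, since):
    """Return paths (relative to root) whose atime is older than `since`.

    `since` is a POSIX timestamp, e.g. time.time() taken just before
    the test run started.
    """
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # st_atime is the last-access time reported by the filesystem.
            if os.stat(path).st_atime < since:
                stale.append(os.path.relpath(path, root))
    return sorted(stale)
```

A typical use would be `start = time.time()`, then the full test run, then `untouched_files("gensim/test/test_data", start)`. A CI job could log the returned list, or fail if it is non-empty.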

Most of test_data could also be left out of user distributions such as wheels entirely. There might be some smaller subset we want there for ease of use in example code: perhaps a demo_data directory in the project source alongside test_data, with only demo_data going into user builds?

I'd still want to provide some robust one-liner that lets end users who want to run the full test suite get all of test_data. Perhaps something that pulls the test_data dir for the corresponding build tag/commit directly from GitHub? I'm unsure of best practice here, but we'd want to be careful that, for example, CI test/build runs on develop HEAD always use test files exactly in sync with the commit under test, not a snapshot that might be even a moment newer (say, one fetched while the run was still warming up).
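As a rough sketch of what such a helper could look like, the code below builds the source-archive URL for a given tag and extracts only one subdirectory from the downloaded tarball. Everything here is an assumption for illustration: the helper names are hypothetical, the tag passed in must be one that actually exists in the repository, and the URL follows GitHub's `archive/refs/tags/<tag>.tar.gz` pattern. Pinning to a tag (or a commit) rather than a branch is what keeps the data in sync with the code under test.

```python
import os
import shutil
import tarfile


def archive_url(tag, repo="RaRe-Technologies/gensim"):
    """URL of the source tarball for one tag (assumed GitHub URL pattern)."""
    return f"https://github.com/{repo}/archive/refs/tags/{tag}.tar.gz"


def extract_subdir(tar_path, subdir, dest):
    """Extract only the files under `subdir` from a source tarball into `dest`.

    The archive is expected to wrap everything in one top-level folder,
    which is dropped from the extracted paths.
    """
    prefix = subdir.strip("/") + "/"
    extracted = []
    with tarfile.open(tar_path) as archive:
        for member in archive:
            _wrapper, _, rel = member.name.partition("/")
            if not member.isfile() or not rel.startswith(prefix):
                continue
            if ".." in rel.split("/"):
                continue  # refuse paths that could escape `dest`
            target = os.path.join(dest, *rel.split("/"))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.extractfile(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(rel)
    return sorted(extracted)
```

The download itself could be done with `urllib.request.urlretrieve(archive_url(tag), "src.tar.gz")`, followed by `extract_subdir("src.tar.gz", "gensim/test/test_data", ".")`.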

@piskvorky
Owner

piskvorky commented Sep 18, 2021

Oh wow, I clean forgot about that thread. Thanks for the link.

I'll revisit #2967 (comment) and clean up / drop the test data as part of the bigger solution there.

Closing this ticket – nothing more to investigate here.
