Investigate wheel filesize #3238

mpenkov · 2021-09-17T09:23:11Z

We build wheels for 4 platforms x 4 Python versions = 16 wheels per release. Each wheel is 25 MB (why so large?), so that's about 400 MB per release! I can see 10 GB disappearing quickly.
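For a sense of scale, here is the back-of-the-envelope arithmetic behind those numbers. It treats the 10 GB as a fixed storage budget and uses decimal units (1 GB = 1000 MB); both are assumptions made for illustration, not something stated in the thread.

```python
# Figures quoted in the comment above.
PLATFORMS = 4
PYTHON_VERSIONS = 4
WHEEL_MB = 25

# Assumption: 10 GB is a hard budget, counted in decimal megabytes.
BUDGET_MB = 10 * 1000

wheels_per_release = PLATFORMS * PYTHON_VERSIONS
mb_per_release = wheels_per_release * WHEEL_MB
releases_until_full = BUDGET_MB // mb_per_release

print(f"{wheels_per_release} wheels, {mb_per_release} MB per release, "
      f"budget exhausted after {releases_until_full} releases")
```

Under those assumptions the budget lasts 25 releases, not counting source distributions.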

How do other projects with extensive wheel support (scikit-learn?) solve this?

Originally posted by @piskvorky in #3237 (comment)

piskvorky · 2021-09-17T13:54:23Z

I checked, and the overwhelming majority of the distribution size is due to tests.

In particular, we bundle a lot of test data, from https://github.com/RaRe-Technologies/gensim/tree/develop/gensim/test/test_data
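One way to reproduce that kind of check is to sum the compressed size of a wheel's contents per directory, since a wheel is an ordinary zip archive. This is a minimal sketch, not the method actually used in this thread; the function name `size_by_prefix` and the `depth` parameter are hypothetical.

```python
import zipfile
from collections import Counter


def size_by_prefix(wheel_path, depth=2):
    """Sum compressed bytes per leading path prefix inside a wheel (zip) file.

    `depth` is the number of leading path components to group by,
    e.g. depth=2 groups everything under "gensim/test" together.
    """
    totals = Counter()
    with zipfile.ZipFile(wheel_path) as wheel:
        for info in wheel.infolist():
            if info.is_dir():
                continue
            prefix = "/".join(info.filename.split("/")[:depth])
            # compress_size is what actually contributes to the wheel's file size.
            totals[prefix] += info.compress_size
    return totals


# Usage (the wheel path is a placeholder):
#   for prefix, size in size_by_prefix("some.whl").most_common(5):
#       print(f"{size / 1e6:8.2f} MB  {prefix}")
```

Listing the largest prefixes first makes it obvious whether one directory, such as the test data, dominates the archive.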

The tests are critical for us / CI, but IMO useless for user installations. I doubt anyone runs the tests locally after installing gensim from PyPI.

I propose not bundling tests in wheels at all. This should cut the wheel size down to ~1 MB, i.e. almost nothing. @mpenkov @gojomo WDYT?

gojomo · 2021-09-17T21:37:20Z

Somewhat of a duplicate of #1783.

Much of the test_data may not even be currently used. In #2967 I suggested:

running the full test suite against a test_data directory where we're logging file-touches - so that obsolete, untouched files can be finally discarded

This could be as simple as doing it once on a developer machine where the filesystem tracks atime (last-accessed time), then deleting all files not accessed during a full run. If the CI build/test machine filesystems support atime (or would allow other hooks to be added somehow), it could even be added to the build process: log, or possibly even fail, if a 'full test run' doesn't touch some of the test_data files.
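A minimal sketch of the atime idea, assuming the filesystem really does record access times (a mount with noatime will not). The helper name `untouched_files` is hypothetical; the caller records a timestamp before the test run and passes it in afterwards.

```python
from pathlib import Path


def untouched_files(root, run_started_at):
    """Return files under `root` whose last-access time predates `run_started_at`.

    `run_started_at` is a POSIX timestamp (e.g. from time.time()) taken
    just before the full test run began.
    """
    stale = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.stat().st_atime < run_started_at:
            stale.append(path)
    return stale


# Usage (paths are placeholders):
#   started = time.time()
#   ... run the full test suite ...
#   for path in untouched_files("gensim/test/test_data", started):
#       print("never read during the run:", path)
```

In CI, a non-empty result could be logged or turned into a failure, as suggested above. Note that stat() itself does not update atime, so running the check does not disturb the result.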

Most of test_data could also be left out of user distributions like wheels entirely. There might be some smaller subset we want there for ease of use in example code: maybe a demo_data directory in the project source alongside test_data, with only demo_data going into user builds?

I'd still want to provide some robust one-liner that lets even end users who want to run the full test suite get all of test_data. Perhaps something that pulls the test_data dir for the corresponding build tag/commit directly from GitHub? I'm unsure of best practice here, but we'd want to be careful that, for example, CI test/build runs on develop HEAD always use test files synced to exactly that commit, not a snapshot that might be even a moment newer.
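A rough sketch of what such a helper might look like, pinned to an exact commit or tag so the data cannot drift from the code under test. Everything here is an assumption for illustration: the helper names are hypothetical, and the URL pattern relies on GitHub serving repository snapshots at `/archive/<ref>.tar.gz` with a single top-level directory inside.

```python
import tarfile
import urllib.request
from pathlib import Path

REPO = "RaRe-Technologies/gensim"
DATA_DIR = "gensim/test/test_data"


def archive_url(ref):
    """Build the snapshot URL for a commit SHA or tag (assumed GitHub URL layout)."""
    return f"https://github.com/{REPO}/archive/{ref}.tar.gz"


def extract_test_data(archive_path, dest):
    """Extract only the files under DATA_DIR from a repository tarball into `dest`."""
    dest = Path(dest)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            # Drop the archive's single top-level directory component.
            _, _, relative = member.name.partition("/")
            if not member.isfile() or not relative.startswith(DATA_DIR + "/"):
                continue
            if ".." in Path(relative).parts:
                continue  # refuse paths that try to escape `dest`
            target = dest / relative[len(DATA_DIR) + 1:]
            target.parent.mkdir(parents=True, exist_ok=True)
            with tar.extractfile(member) as src:
                target.write_bytes(src.read())
            extracted.append(target)
    return extracted


def fetch_test_data(ref, dest):
    """Download the snapshot for `ref` and unpack its test data (needs network)."""
    archive_path, _ = urllib.request.urlretrieve(archive_url(ref))
    return extract_test_data(archive_path, dest)
```

Passing a full commit SHA as `ref`, rather than a branch name, is what keeps a CI run on develop HEAD from picking up files committed a moment later. The drawback is that the whole repository snapshot is downloaded just to keep one directory.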

piskvorky · 2021-09-18T09:10:29Z

Oh wow, I clean forgot about that thread. Thanks for the link.

I'll revisit #2967 (comment) and clean up / drop the test data as part of the bigger solution there.

Closing this ticket – nothing more to investigate here.

mpenkov added the "housekeeping" label (internal tasks and processes) Sep 17, 2021

piskvorky closed this as completed Sep 18, 2021

mpenkov mentioned this issue Dec 6, 2021 in pypi/support#1315, "Limit Request: Gensim - 100 GB" (closed)
