Exclude unnecssary data from pvlib packaging #2271

AdamRJensen · 2024-10-20T14:01:39Z

Most of the data in the data folder are only used for testing and not useful for any examples. However, they add up a considerable amount of space (MBs) relative to the pvlib package itself. For applications such as WASM (code running in the browser), all this clutter slows down loading significantly. The data folder is currently 57 MB out of a total of 63 MB.

Of course, there are data files that we want to keep such as the Linke turbidity and CEC database. However, a rough estimation shows that removing unused data files can reduce the pvlib package but half!

Excluding/including data files can be specified in the pyproject.toml file, for example:

[tool.setuptools.packages.find]
where = ["src"]  # list of folders that contain the packages (["."] by default)
include = ["my_package*"]  # package names should match these glob patterns (["*"] by default)
exclude = ["my_package.tests*"]  # exclude packages matching these glob patterns (empty by default)

This can also be done in the MANIFEST.in file, which is pvlib's current approach, and seems to have finer control.

The pvlib/tests folder can also be excluded, and save 2.5 MB.

The text was updated successfully, but these errors were encountered:

cwhanse · 2024-10-20T16:39:15Z

#1056 this has been needed for some time.

adriesse · 2024-10-20T20:39:34Z

Good initiative. Is it common to exclude the tests from a package?

kandersolar · 2024-10-20T21:48:49Z

scipy and pandas both include their tests in their distribution files (although some pandas maintainers have agreed that it would be better to exclude them, pandas-dev/pandas#30741).

I see little value in including the tests in the wheels. If we exclude most data files, then the test functions become even less useful to include. It may make sense to continue including the tests (and data files?) in the sdist distributions, I am not sure.

adriesse · 2024-10-21T18:40:49Z

Few more thoughts:

@AdamRJensen is it really the volume in MB that is slowing down WASM? 63 MB isn't much nowadays.
Some (many?) files could be reduced in size by truncation (e.g. trailing zeros) and/or compression (e.g., zip)
Perhaps some files could serve their purpose with fewer records.

echedey-ls · 2024-10-21T23:40:50Z

I've finished classifying the files in #1056's spreadsheet and the result shows the pvlib/data folder would reduce to 24MB (41% of the original) if only necessary files for the API code are shipped in that folder.

Although I understand facilitating the tests can be a source of easing and validating the working environments, it is a driving bottleneck in the installation process. I would just remove them from wheels due to that.

Some (many?) files could be reduced in size by truncation (e.g. trailing zeros) and/or compression (e.g., zip)

That would add extra time at running tests and adding/maintaining them. Last alternative does not sound too bad, but it's not a small effort considering the amount of files used in the tests (58) and that an unknown amount of expected results are hardcoded, I can't quantify how many.

cwhanse · 2024-10-21T23:49:40Z

If we remove data files for tests from the distribution, does that mean that pytest test_xxxx.py will return a bunch of errors unrelated to the contributor's new tests?

echedey-ls · 2024-10-21T23:59:49Z

If we remove data files for tests from the distribution, does that mean that pytest test_xxxx.py will return a bunch of errors unrelated to the contributor's new tests?

No since pytest.yml workflow already checks out the repo, so all files are available at run time, it's just a matter of telling the distribution builder whether to include them or not before posting it to PyPI.

https://github.com/pvlib/pvlib-python/blob/c52e600b996733a4a3371623a3784ccda5667e68/.github/workflows/pytest.yml#L33C1-L36C27

I just wonder if the conda recipe will need to be changed, IDK how conda-forge recipes work in the background:

https://github.com/conda-forge/pvlib-python-feedstock/blob/f37d40d81becdb025578d0e09ca44ebe3cbeb466/recipe/meta.yaml#L60C1-L62C16

cwhanse · 2024-10-22T00:14:41Z

If we remove data files for tests from the distribution, does that mean that pytest test_xxxx.py will return a bunch of errors unrelated to the contributor's new tests?

No since pytest.yml workflow already checks out the repo, so all files are available at run time, it's just a matter of telling the distribution builder whether to include them or not before posting it to PyPI.

But wouldn't the errors raise if the test is run locally instead of using a GH action via a pull request?

echedey-ls · 2024-10-22T00:20:41Z

Ah I see, you refer to that if I install pvlib (non editable mode, either from the cloned repo or PyPI), then go to the package folder in Python's site-packages/pvlib/tests and run that. Yeah, you are right, it would fail. For that [edge-]case Adam's statement:

The pvlib/tests folder can also be excluded, and save 2.5 MB.

would solve it by not allowing any tests to be run from the distribution installation. I personally support that.

Edit: but still on the cloned repo everything should work (it's essentially the same as the workflow), so a contributor shouldn't find any problem running them.

cwhanse · 2024-10-22T00:36:30Z

I understand my own question better. Excluding the test files from the distribution wouldn't affect running tests locally on a fork, since the fork clones the repository rather than installs from the distribution. I don't have a concern here.

AdamRJensen · 2024-10-22T02:29:55Z

@AdamRJensen is it really the volume in MB that is slowing down WASM? 63 MB isn't much nowadays.

I'd like to develop widgets that can run using WASM, and these need to be relatively fast so as not to annoy the user with a wait time. As long as there is no downsides to the package, I think it's preferred to have it as small as possible.

Some (many?) files could be reduced in size by truncation (e.g. trailing zeros) and/or compression (e.g., zip)

I don't think the required files have significant trailing zeros (at least not the files I inspected). The big culprit is the LinkeTurbidities.h5 file(15.3 MB), so if we were to do something, it should start with this file.

Perhaps some files could serve their purpose with fewer records.

This predominantly applies to test files, which can just be excluded.

adriesse · 2024-10-23T12:18:43Z

My strong impression is that the suggestion to remove all tests is being promoted and implemented much too rapidly. Why the furious flurry of activity?

kandersolar · 2024-10-23T13:12:38Z

Order of operations here should be something like this:

Merge Show distribution file sizes in publish.yml GH Actions workflow #2278, and update [MAINT]: Remove tests from wheel distro #2277 accordingly, so the CI can check that the "official" wheel file sizes are reduced as expected and that the resulting wheel is still installable
Manual inspection/verification of the resulting wheel & sdist, similar to what we have done in other PRs that change our packaging (Support PEP621 #1910, Remove setup.cfg in favor of pyproject.toml twoaxistracking#58)
If automated and manual checks look good, merge [MAINT]: Remove tests from wheel distro #2277
Sometime before the next release (mid December), make a pre-release to verify that the new wheels don't run into any issues downstream of us with PyPI/conda.

the suggestion to remove all tests

Just to clarify for anyone reading this issue: the proposed changes to pvlib's tests is to exclude them only from distribution bundles, i.e. the python files delivered when running pip install pvlib. Tests will not be removed from the repository itself, and they will continue to be a central aspect of pvlib's development.

kandersolar · 2024-10-23T13:25:41Z

By the way, a fun estimate: reducing wheel size by 1/3 (as suggested by #2277), multiplied by 14k pvlib downloads per day, eliminates >100GB/day data transfer from PyPI. Maybe not world-changing, but if there is no downside...

kandersolar · 2024-11-05T17:00:23Z

A problem: how can we automatically verify that all the "necessary" data files are included in the wheels? For example, if somehow someone introduced a bug into pyproject.toml that caused the CEC tables to be excluded from the wheel, is there any way we can have a CI failure to alert us?

We cannot really use the tests for this, since if we exclude the test data files from the wheels, then we cannot run the pytests on an installation that came from a wheel. The only thing I can think of is to have an optional environment variable that sets the TESTS_DATA_DIR to point to the source tree location, which I guess could work, but is a bit messy.

mikofski · 2024-11-06T00:18:36Z

You can create a source distribution or sdist which is a *.tar.gz or zip file that has everything in the manifest, but doesn’t have the same package data as in the wheel or bdist. Check out PVmismatch for an example, the CI builds and deploys both a wheel with package data and a source distribution with everything in the manifest. You can check it on PyPI or GitHub.

echedey-ls · 2024-11-06T19:51:51Z

A problem: how can we automatically verify that all the "necessary" data files are included in the wheels? For example, if somehow someone introduced a bug into pyproject.toml that caused the CEC tables to be excluded from the wheel, is there any way we can have a CI failure to alert us?

@kandersolar , I think it would be very rare to overlook changes that:

Affect MANIFEST.in or pyproject.toml, or
Use files in the test data folder both in the examples and in API code

But I understand that any overhead in maintenance seems as a problem, and if I were you I would be concerned about that too. Ya know, I'm just a contributor 🎀

Some brainstorming:

(1.) We can rely on the users as the beta testers and yank versions if needed. (Edit: this is an euphemism for not doing anything)
(2.) Some script that compares what gets into site-packages/pvlib against the local clone of the repo could be made, but I doubt if it really adds something to the rare case somebody messes these files or uses incorrect paths.

I think just cloning the tests folder and running them, given pvlib is already installed would work.
(2.1.) This could be a script bundled in all distros, maybe just for linux-based OS's
(2.2.) Or a new job / step.

(3.) if making a heavier distinction between the package folder and the excluded folders (including tests/) works for you, we can change to a flat repo layout:

...

pvlib/

data/

...

tests/

data/

...

...

From all these options, I would pick (3.); (2.) may over-engineer things too much and be a burden to maintain in the future. Personally, I find (3.) easy enough, makes the code more intuitive for people not used to the new feature, and is no different from all these years that if someone used a file outside of pvlib/ in API code then tests would pass but an installation would be wrong. The difference now is that if a file under pvlib/tests/ is used or the rare case a file for the distro is not included, then that would fail on use.

(1.) is not much different from (3.), as neither of them detect these errors prior to publishing.

We cannot really use the tests for this, since if we exclude the test data files from the wheels, then we cannot run the pytests on an installation that came from a wheel. The only thing I can think of is to have an optional environment variable that sets the TESTS_DATA_DIR to point to the source tree location, which I guess could work, but is a bit messy.

(4.) It's also an option I wouldn't have thought of.

markcampanelli · 2024-11-07T00:26:17Z

Echedey, Did you consider using https://pypi.org/project/check-manifest/? It’s unclear to me if this would catch the situations that you’re concerned about. Mark Campanelli LinkedIn <https://www.linkedin.com/in/markcampanelli/> Intelligent Measurement Systems LLC <https://intelligentmeasurementsystems.com> Try PVfit <https://pvfit.app> today!

…

On Wed, Nov 6, 2024 at 12:52 Echedey Luis ***@***.***> wrote: A problem: how can we automatically verify that all the "necessary" data files are included in the wheels? For example, if somehow someone introduced a bug into pyproject.toml that caused the CEC tables to be excluded from the wheel, is there any way we can have a CI failure to alert us? @kandersolar <https://github.com/kandersolar> , I think it would be very rare to overlook changes that: 1. Affect MANIFEST.in or pyproject.toml, or 2. Use files in the test data folder both in the examples and in API code But I understand that any overhead in maintenance seems as a problem, and if I were you I would be concerned about that too. Ya know, I'm just a contributor 🎀 Some brainstorming: (1.) We can rely on the users as the beta testers and yank versions if needed. (2.) Some script that compares what gets into site-packages/pvlib against the local clone of the repo could be made, but I doubt if it really adds something to the rare case somebody messes these files or uses incorrect paths. - I think just cloning the tests folder <https://stackoverflow.com/questions/600079/how-do-i-clone-a-subdirectory-only-of-a-git-repository> and running them, given pvlib is already installed would work. - (2.1.) This could be a script bundled in all distros, maybe just for linux-based OS's - (2.2.) Or a new job / step. (3.) if making a heavier distinction between the package folder and the excluded folders (including tests/) works for you, we can change to a flat repo layout: - ... - pvlib/ - data/ - ... - tests/ - data/ - ... - ... From all these options, I would pick (3.); (2.) may over-engineer things too much and be a burden to maintain in the future. Personally, I find (3.) easy enough, makes the code more intuitive for people not used to the new feature, and is no different from all these years that if someone used a file outside of pvlib/ in API code then tests would pass but an installation would be wrong. The difference now is that if a file under pvlib/tests/ is used or the rare case a file for the distro is not included, then that would fail on use. (1.) is not much different from (3.), as neither of them detect these errors prior to publishing. We cannot really use the tests for this, since if we exclude the test data files from the wheels, then we cannot run the pytests on an installation that came from a wheel. The only thing I can think of is to have an optional environment variable that sets the TESTS_DATA_DIR to point to the source tree location, which I guess could work, but is a bit messy. (4.) It's also an option I wouldn't have thought of. — Reply to this email directly, view it on GitHub <#2271 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAISX46N27R4SA6MARS3WOLZ7JXO3AVCNFSM6AAAAABQIROZS6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDINRQGY2DMMJSG4> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

echedey-ls · 2024-11-09T12:53:18Z

That's an interesting package @markcampanelli , but I doubt it's useful in this case, since that checks the sdist and not the bdist. The latter excludes the tests since it's the one that gets installed via a normal pip install pvlib/. Tests are still being included in sdist.

Source: https://pypi.org/project/check-manifest/ , only sdist is mentioned, bdist is nowhere in that page:

$ check-manifest -u -v
listing source files under version control: 6 files and directories
building an sdist: check-manifest-0.7.tar.gz: 4 files and directories
lists of files in version control and sdist do not match!
missing from sdist:
  tests.py
  tox.ini
suggested MANIFEST.in rules:
  include *.py
  include tox.ini
updating MANIFEST.in

$ cat MANIFEST.in
include *.rst

# added by check_manifest.py
include *.py
include tox.ini

kandersolar · 2024-12-04T19:54:51Z

Another consideration: pvlib tutorial/example code often assumes that the 723170TYA.CSV TMY3 file will be available via something like:

DATA_DIR = pathlib.Path(pvlib.__file__).parent / 'data'
df_tmy, meta_dict = pvlib.iotools.read_tmy3(DATA_DIR / '723170TYA.CSV', coerce_year=1990)

I count ten pages in our example gallery that expect it to be available: https://pvlib-python.readthedocs.io/en/latest/reference/generated/pvlib.iotools.read_tmy3.html#examples-using-pvlib-iotools-read-tmy3

This pattern also shows up in other tutorial material, e.g.: https://pv-tutorials.github.io/PVSC50/Tutorial%201%20-%20TMY%20Weather%20Data.html#reading-a-tmy-dataset-with-pvlib

In some sense, we've elevated that file (originally included for testing) to be a de-facto member of the pvlib API. Removing it from distribution bundles would be disruptive to users, and we would need to rework all of those tutorials somehow.

Would it make sense to treat 723170TYA.CSV as a special case and retain it in the distribution? I lean towards yes, but perhaps there is some creative alternative I'm not considering.

Edit to add: I suppose the train of thought here really applies not just to the 723170TYA.CSV file, but its specific location as well: pvlib/data/723170TYA.CSV.

adriesse · 2024-12-04T22:29:18Z

Ironically, I (re)used that file in order to avoid making pvlib bigger!

AdamRJensen · 2024-12-09T23:56:37Z

Would it make sense to treat 723170TYA.CSV as a special case and retain it in the distribution? I lean towards yes, but perhaps there is some creative alternative I'm not considering.

The file is small enough (1.7 MB) relative to the pvlib package, that I am in favor of including it as a special file for now.

AdamRJensen · 2024-12-09T23:57:41Z

@wholmgren I'm curious as to your opinion on @echedey-ls's proposal for a flat structure. I recall that you had an strong opinion on something similar a long time ago.

wholmgren · 2024-12-10T16:59:20Z

I did a quick search and didn't find the strong opinion that you're referring to. In any case, Python tooling, packaging, testing, and layout recommendations have evolved a lot even in the last few years and I don't feel that I've kept up well enough to add much to what's already discussed above. So I don't object to the flat structure. What's the latest community thinking on using src/package instead of just package?

cwhanse added enhancement help wanted performance labels Oct 20, 2024

echedey-ls linked a pull request Oct 22, 2024 that will close this issue

[MAINT]: Remove tests from wheel distro #2277

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exclude unnecssary data from pvlib packaging #2271

Exclude unnecssary data from pvlib packaging #2271

AdamRJensen commented Oct 20, 2024 •

edited

Loading

cwhanse commented Oct 20, 2024

adriesse commented Oct 20, 2024

kandersolar commented Oct 20, 2024 •

edited

Loading

adriesse commented Oct 21, 2024

echedey-ls commented Oct 21, 2024

cwhanse commented Oct 21, 2024

echedey-ls commented Oct 21, 2024

cwhanse commented Oct 22, 2024

echedey-ls commented Oct 22, 2024 •

edited

Loading

cwhanse commented Oct 22, 2024

AdamRJensen commented Oct 22, 2024 •

edited by kandersolar

Loading

adriesse commented Oct 23, 2024 •

edited

Loading

kandersolar commented Oct 23, 2024

kandersolar commented Oct 23, 2024

kandersolar commented Nov 5, 2024

mikofski commented Nov 6, 2024

echedey-ls commented Nov 6, 2024 •

edited

Loading

markcampanelli commented Nov 7, 2024 via email

echedey-ls commented Nov 9, 2024

kandersolar commented Dec 4, 2024 •

edited

Loading

adriesse commented Dec 4, 2024

AdamRJensen commented Dec 9, 2024

AdamRJensen commented Dec 9, 2024

wholmgren commented Dec 10, 2024

Exclude unnecssary data from pvlib packaging #2271

Exclude unnecssary data from pvlib packaging #2271

Comments

AdamRJensen commented Oct 20, 2024 • edited Loading

cwhanse commented Oct 20, 2024

adriesse commented Oct 20, 2024

kandersolar commented Oct 20, 2024 • edited Loading

adriesse commented Oct 21, 2024

echedey-ls commented Oct 21, 2024

cwhanse commented Oct 21, 2024

echedey-ls commented Oct 21, 2024

cwhanse commented Oct 22, 2024

echedey-ls commented Oct 22, 2024 • edited Loading

cwhanse commented Oct 22, 2024

AdamRJensen commented Oct 22, 2024 • edited by kandersolar Loading

adriesse commented Oct 23, 2024 • edited Loading

kandersolar commented Oct 23, 2024

kandersolar commented Oct 23, 2024

kandersolar commented Nov 5, 2024

mikofski commented Nov 6, 2024

echedey-ls commented Nov 6, 2024 • edited Loading

markcampanelli commented Nov 7, 2024 via email

echedey-ls commented Nov 9, 2024

kandersolar commented Dec 4, 2024 • edited Loading

adriesse commented Dec 4, 2024

AdamRJensen commented Dec 9, 2024

AdamRJensen commented Dec 9, 2024

wholmgren commented Dec 10, 2024

AdamRJensen commented Oct 20, 2024 •

edited

Loading

kandersolar commented Oct 20, 2024 •

edited

Loading

echedey-ls commented Oct 22, 2024 •

edited

Loading

AdamRJensen commented Oct 22, 2024 •

edited by kandersolar

Loading

adriesse commented Oct 23, 2024 •

edited

Loading

echedey-ls commented Nov 6, 2024 •

edited

Loading

kandersolar commented Dec 4, 2024 •

edited

Loading