Adapt to pooch and cleanup testing data #29

Merged
merged 5 commits into from
Aug 23, 2024
@Zeitsperre Zeitsperre commented Aug 16, 2024

  • added a .md5 checksum file generated with make_check_sums.py

This will be paired with changes to be made in:

Changes

  • Moves all entries into a common data folder.
  • Replaces the md5 file signatures with a top-level registry.txt file containing paired filepaths and sha256sums for each file under data.
  • Moves more data information into the README.md file and updates the examples to show the new API.
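
A registry of this kind could be generated along the following lines. This is only a sketch: the function names are hypothetical, it is not the repository's own checksum script, and it assumes each line holds a path relative to the data folder followed by the SHA-256 digest, separated by a space (the layout that pooch registry files use).

```python
import hashlib
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_registry(data_dir: Path) -> list[str]:
    """Build sorted 'relative/path checksum' lines for every file under data_dir."""
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # POSIX-style separators keep the registry identical across platforms.
        relative = path.relative_to(data_dir).as_posix()
        lines.append(f"{relative} {sha256sum(path)}")
    return lines


def write_registry(data_dir: Path, output: Path) -> None:
    """Write the registry to output, one entry per line."""
    output.write_text("\n".join(build_registry(data_dir)) + "\n", encoding="utf-8")
```

Calling `write_registry(Path("data"), Path("registry.txt"))` from the repository root would then produce the top-level file, assuming the folder layout described above.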

More Information

This new system will leverage the pooch library to handle checksum verification and file caching, removing the need for many helper functions currently copied in xclim and clisops (and RavenPy). The pooch approach should speed up the fetching of testing data, as checksums will no longer be verified on a file-by-file basis using remote calls to GitHub. Instead, locally cached files will be compared to a registry.txt file that is downloaded from here when tests are spun up, and new files will be downloaded only as needed (that is, when a local file is missing or its checksum does not match the one in the registry).

The goal here is to reduce the number of lines of code in xclim, clisops, and possibly other projects.
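
The local comparison described above can be pictured with the standard library alone. This is a simplified sketch of the idea, not pooch's implementation; `parse_registry` and `needs_download` are hypothetical names, and the registry is assumed to hold one `path checksum` pair per line.

```python
import hashlib
from pathlib import Path


def parse_registry(text: str) -> dict[str, str]:
    """Map each file path to its expected checksum, skipping blanks and comments."""
    registry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, checksum = line.split()[:2]
        registry[name] = checksum.lower()
    return registry


def needs_download(cache_dir: Path, registry: dict[str, str], name: str) -> bool:
    """Return True when the cached file is missing or its SHA-256 differs."""
    cached = cache_dir / name
    if not cached.is_file():
        return True
    actual = hashlib.sha256(cached.read_bytes()).hexdigest()
    return actual != registry[name]
```

Only files for which `needs_download` returns `True` would require a remote call; everything else is served from the local cache.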

@Zeitsperre

FYI, this PR requires changes in existing projects first before merging.

@Zeitsperre Zeitsperre merged commit d206f33 into main Aug 23, 2024
@Zeitsperre Zeitsperre deleted the cleanup branch August 23, 2024 15:01
Zeitsperre added a commit to bird-house/birdhouse-deploy that referenced this pull request Aug 23, 2024
## Overview

This PR updates the cloning of the xclim-testdata repo to reflect
structural changes.

## Changes

**Non-breaking changes**
- Adjusts the location of the xclim-testdata data folder

## Related Issue / Discussion

- Ouranosinc/xclim-testdata#29
- Ouranosinc/xclim#1889

## CI Operations

<!--
The test suite can be run using a different DACCS config with
``birdhouse_daccs_configs_branch: branch_name`` in the PR description.
To globally skip the test suite regardless of the commit message use
``birdhouse_skip_ci`` set to ``true`` in the PR description.
Note that using ``[skip ci]``, ``[ci skip]`` or ``[no ci]`` in the
commit message will override ``birdhouse_skip_ci`` from the PR
description.
-->

birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false
Zeitsperre added a commit to Ouranosinc/xclim that referenced this pull request Aug 28, 2024
<!--Please ensure the PR fulfills the following requirements! -->
<!-- If this is your first PR, make sure to add your details to the
AUTHORS.rst! -->
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
- This PR relies on changes to be merged in
Ouranosinc/xclim-testdata#29
- [x] Tests for the changes have been added (for bug fixes / features)
- [x] (If applicable) Documentation has been added / updated (for bug
fixes / features)
- [x] CHANGELOG.rst has been updated (with summary of main changes)
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added

### What kind of change does this PR introduce?

* Replaces the logic for file gathering and caching from the in-house
developed version to instead use `pooch`.
  * In order to fetch testing data, one can now use the following:
  ```python
  from xclim.testing.utils import nimbus

  # Default: fetch from the upstream xclim-testdata repository.
  n = nimbus()
  # Alternatively, fetch from a fork of xclim-testdata:
  n = nimbus(repo="https://github.com/Me/My_Repo", branch="my_test_branch")
  file = n.fetch("some_folder/some_data.nc")
  ```
* Removes the remote GitHub calls for every file request (which was
performed by `_get()`).
* Exports most of the file request and cache handling to `pooch`, while
maintaining a relatively unchanged API for users.
* (To be confirmed) Speeds up the delivery of test data to tests by
reducing the number of redundant calls to fixtures and relying on a
single `pooch` instance to prevent multiple setup stages.
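
For readers unfamiliar with `pooch`, a fetcher of this kind might be assembled roughly as follows. This is a sketch under assumptions, not the actual `nimbus` implementation: `raw_base_url` and `make_fetcher` are hypothetical names, and the URL layout (`<repo>/raw/<branch>/data/`) is assumed from the data folder described in Ouranosinc/xclim-testdata#29.

```python
from pathlib import Path


def raw_base_url(repo: str, branch: str, folder: str = "data") -> str:
    """Build the URL prefix under which raw files of a GitHub branch are served."""
    return f"{repo.rstrip('/')}/raw/{branch}/{folder}/"


def make_fetcher(repo: str, branch: str, cache_dir: Path, registry_file: Path):
    """Create a pooch fetcher whose known files come from a registry file."""
    import pooch  # imported lazily so the URL helper works without pooch installed

    fetcher = pooch.create(
        path=cache_dir,
        base_url=raw_base_url(repo, branch),
        registry=None,  # start empty; entries are loaded from the file below
    )
    fetcher.load_registry(registry_file)
    return fetcher
```

With such a fetcher, `fetcher.fetch("some_folder/some_data.nc")` returns the absolute path of the cached file, downloading it first if it is missing or its checksum does not match the registry.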

### Does this PR introduce a breaking change?

Absolutely. `get_file` and `open_dataset` no longer fetch remote files
from GitHub. Instead, a locally stored `registry.txt` file lists the
checksums of all files needed to run the tests, and the appropriate file
is returned from a locally held cache. If a cached file's checksum does
not match the expected value, an attempt is made to replace the file from
the remote storage.

Additionally, the `md5` files that accompanied all testing data files
are now obsolete thanks to the use of the registry. The testing data is
now versioned according to the `xclim-testdata` version/tag.

All the `prefetch` logic baked into the `pytest` calls has been removed,
making the setup code much easier to follow. There is no longer a need
to run `$ xclim prefetch_testing_data` unless users are running on
Windows (for the very first run of `pytest` only).

There are now three environment variables to help developers:
- `XCLIM_TESTDATA_BRANCH`
    - Controls the branch name of `xclim-testdata`.
- `XCLIM_TESTDATA_CACHE_DIR`
    - Controls the local folder to be used when fetching the test data.
- `XCLIM_TESTDATA_REPO_URL`
    - Controls the repository URL for `xclim-testdata` (for forks).
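
As an illustration of how these variables could be resolved, the sketch below reads them with fallbacks. The function name and all three default values are hypothetical placeholders chosen for the example; the defaults actually used by xclim may differ.

```python
import os
from pathlib import Path

# Hypothetical fallbacks for illustration only; the real defaults may differ.
DEFAULT_REPO_URL = "https://github.com/Ouranosinc/xclim-testdata"
DEFAULT_BRANCH = "main"
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "xclim-testdata"


def load_testdata_settings(environ=None) -> dict:
    """Resolve repository, branch and cache directory from the environment."""
    env = os.environ if environ is None else environ
    return {
        "repo": env.get("XCLIM_TESTDATA_REPO_URL", DEFAULT_REPO_URL),
        "branch": env.get("XCLIM_TESTDATA_BRANCH", DEFAULT_BRANCH),
        "cache_dir": Path(env.get("XCLIM_TESTDATA_CACHE_DIR", DEFAULT_CACHE_DIR)),
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to test; in normal use it would be called with no arguments after exporting, for example, `XCLIM_TESTDATA_BRANCH=my_test_branch`.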

`platformdirs` is no longer a hard dependency. The default cache
directory will only be determined if `pooch` is installed.

### Other information:

There is still a lot of potential here to tighten this up; I'd like to
land on a design that is clean and easily portable to other projects.

What is unchanged is that the following still happens on every `pytest`
run:
1. `pytest` checks that a locally stored copy of the test data exists in a
platform-dependent default location and, if none is found, fetches a
copy.
2. Each `pytest` worker creates its own copy of the test data, which
is delivered by its own `pooch` instance and written to a thread-safe
temporary directory.
3. The equivalent of the `get_file()` fixture is now `nimbus.fetch()`,
which provides absolute file paths specific to the platform and
worker.
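
The per-worker copy in step 2 could look something like the sketch below. It is an assumption-laden illustration rather than xclim's fixture code: `worker_data_dir` is a hypothetical helper, and it relies on the `PYTEST_XDIST_WORKER` environment variable that pytest-xdist sets for each worker (for example `gw0`).

```python
import os
import shutil
from pathlib import Path


def worker_data_dir(cache_dir: Path, tmp_root: Path, environ=None) -> Path:
    """Copy the shared cache into a directory private to this pytest worker."""
    env = os.environ if environ is None else environ
    # pytest-xdist sets PYTEST_XDIST_WORKER; fall back to "master" for serial runs.
    worker = env.get("PYTEST_XDIST_WORKER", "master")
    target = tmp_root / worker / "testdata"
    if not target.exists():
        # copytree creates any missing parent directories of the target.
        shutil.copytree(cache_dir, target)
    return target
```

Because each worker reads from and writes to its own directory, tests that modify or delete data files cannot interfere with tests running in another worker.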

Many tests related to testing the file accessors have also been removed
(as these are now out of scope).