Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve reproducibility dd #667

Merged
merged 21 commits into from
Jan 2, 2024
Merged

Improve reproducibility dd #667

merged 21 commits into from
Jan 2, 2024

Conversation

Tjalling-dejong
Copy link
Contributor

@Tjalling-dejong Tjalling-dejong commented Nov 30, 2023

Issue addressed

Fixes #356 #521 and partially #537

Explanation

This PR updates the deltares data catalog to include new features such as versions/variants of data sources. Also if a data source has been pre-processed the path to the pre-processing script or the notes describing the pre-processing is included.
In addition I added the temporal extent of the datasets with a temporal dimension.

Checklist

  • Updated tests or added new tests
  • Branch is up to date with main
  • Tests & pre-commit hooks pass
  • Updated documentation if needed
  • Updated changelog.rst if needed

Additional Notes (optional)

Add any additional notes or information that may be helpful.

@Tjalling-dejong Tjalling-dejong marked this pull request as ready for review December 1, 2023 09:40
Copy link
Contributor

@savente93 savente93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work Tjalling. It would be nice if we can also add the spatial extent and also host the pre-processing scripts we now have locally (I assume on the P drive) somewhere but these can be separate issues I think, so we can merge this for now.

@DirkEilander DirkEilander restored the improve_reproducibility_dd branch January 5, 2024 12:11
@DirkEilander
Copy link
Contributor

DirkEilander commented Jan 5, 2024

Thanks for this work @Tjalling-dejong! Unfortunately this PR got merged too soon I think seeing some comments on Teams. This change directly affects all users and we need to think about how to properly test it. I've reverted the merge (see #703) and restored the branch. You can create a new PR from the branch to continue the review process.

DirkEilander added a commit that referenced this pull request Jan 5, 2024
@Tjalling-dejong
Copy link
Contributor Author

I replicated the error with the soilgrids dataset that people have raised in the HydroMT chat. The error persists in earlier versions of the data catalog as well. What has changed with this PR is that I added versions to data catalog items. With this change the users are served the latest version of a dataset by default. Thus, the soilgrids dataset given to the users is the latest, namely soilgrids_v2.0. This version of the dataset also breaks in earlier versions of the data catalog.

Error can be replicated with datacatalog.get_rasterdataset("soilgrids_v2.0").

@Tjalling-dejong
Copy link
Contributor Author

I have made an issue regarding the error with soilgrids_v2.0 dataset (#706).

@DirkEilander
Copy link
Contributor

DirkEilander commented Jan 8, 2024

Thanks for checking this issue @Tjalling-dejong 👍

The issue was indeed not caused by this PR, but it did change the default dataset as you already mention. I think it would be good to:

  1. have a small test that only runs locally to actually reads (a random slice of) all the datasets in the catalog
  2. keep track and communicate for which datasets the default version changes when we publish. Perhaps we can also add a note in the meta data of the items like updated_in_version: xx
  3. make it common practice to add the version of your dataset in your hydromt scripts / model builder yml files (for discussion in the next plugin meeting @alimeshgi)

Would it be possible to add point 1 to this PR? And let's discuss about point 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: better document in the data catalogs if datasets were pre-processed
3 participants