feat: adds Resource.hash in datapackage.json #665

Merged 5 commits on Jan 12, 2025

@dangotbanned dangotbanned commented Jan 12, 2025

Related

Description

This PR adds a hash for each dataset, piggybacking directly from the GitHub trees API.

More info

"""
Use `Get a tree`_ to retrieve a hash for each dataset.

Parameters
----------
ref
    The SHA1 value or ref (`branch`_ or `tag`_) name of the tree.
api_version
    The `GitHub REST API version`_.

Returns
-------
Mapping from `Resource.path`_ to `Resource.hash`_.

.. _Get a tree:
    https://docs.github.com/en/rest/git/trees?apiVersion=2022-11-28#get-a-tree
.. _branch:
    https://github.com/vega/vega-datasets/branches
.. _tag:
    https://github.com/vega/vega-datasets/tags
.. _GitHub REST API version:
    https://docs.github.com/en/rest/about-the-rest-api/api-versions?apiVersion=2022-11-28
.. _Resource.path:
    https://datapackage.org/standard/data-resource/#path-or-data
.. _Resource.hash:
    https://datapackage.org/standard/data-resource/#hash
"""
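
For readers unfamiliar with the trees endpoint, here is a rough sketch of the idea behind the docstring above. It is not the code in this PR: the names `fetch_tree` and `blob_hashes` are hypothetical, it uses the standard library rather than `niquests`, and it ignores details such as which directory the datasets live in or how the hash is formatted in `datapackage.json`.

```python
import json
import urllib.request
from typing import Any, Mapping

# "Get a tree" endpoint for this repository.
TREES_URL = "https://api.github.com/repos/vega/vega-datasets/git/trees/{ref}"


def fetch_tree(ref: str, api_version: str = "2022-11-28") -> dict[str, Any]:
    """Request a single (non-recursive) tree from the GitHub REST API."""
    request = urllib.request.Request(
        TREES_URL.format(ref=ref),
        headers={
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": api_version,
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def blob_hashes(tree_response: Mapping[str, Any]) -> dict[str, str]:
    """Map each file path in a tree response to its git blob SHA-1."""
    return {
        entry["path"]: entry["sha"]
        for entry in tree_response["tree"]
        if entry["type"] == "blob"  # skip sub-trees and submodules
    }


# Made-up payload in the shape the endpoint returns (hashes are placeholders).
sample = {
    "tree": [
        {"path": "cars.json", "type": "blob", "sha": "a" * 40},
        {"path": "subdir", "type": "tree", "sha": "b" * 40},
    ]
}
hashes = blob_hashes(sample)
```

Keeping the network call separate from the parsing step means the mapping logic can be exercised offline; a real run would pass the result of `fetch_tree(...)` to `blob_hashes`.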

Mostly just upstreaming existing code from https://github.com/vega/altair/blob/fdffed0a15be3967c6b9513787fd40feb59c9cdc/tools/datasets/github.py#L145-L159 with some tweaks.

I think this is a broadly useful inclusion to datapackage.json, since we have 74 datasets that have very few revisions.
I've been using this in vega/altair#3631 for caching datasets across versions - and found that as of vega-datasets@v2.11.0 there have only been 115 unique hash values.
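
To illustrate why a per-file hash helps with caching across versions, here is a small sketch under stated assumptions. The `sha` reported for a blob in a git tree is the SHA-1 of a `blob <size>\0` header followed by the file content, so an unchanged file keeps the same hash in every tag. The helper names and the tag/path data below are hypothetical and are not taken from vega/altair#3631.

```python
import hashlib


def git_blob_sha1(content: bytes) -> str:
    """Compute the git object id of a blob: sha1(b"blob <size>\\0" + content)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def unique_blobs(hashes_by_tag: dict[str, dict[str, str]]) -> dict[str, tuple[str, str]]:
    """Keep the first (tag, path) seen for each hash, so each blob is cached once."""
    seen: dict[str, tuple[str, str]] = {}
    for tag, hashes in hashes_by_tag.items():
        for path, sha in hashes.items():
            seen.setdefault(sha, (tag, path))
    return seen


# Hypothetical data: one file is unchanged between two tags, one is revised.
old_cars = git_blob_sha1(b"old cars")
new_cars = git_blob_sha1(b"new cars")
iris = git_blob_sha1(b"iris")
hashes_by_tag = {
    "v1": {"cars.json": old_cars, "iris.json": iris},
    "v2": {"cars.json": new_cars, "iris.json": iris},
}
cache_keys = unique_blobs(hashes_by_tag)  # 3 entries for 4 (tag, path) pairs
```

A cache keyed this way stores `iris.json` once even though it appears under both tags, and `git_blob_sha1` can double as an integrity check on a downloaded file.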

Moving this logic here will greatly simplify vega/altair#3631, as the hash is the last bit of metadata I'm currently not able to get from datapackage.json.
I've been planning out some revisions to get that PR over the line. This will let me remove most of https://github.com/vega/altair/tree/fdffed0a15be3967c6b9513787fd40feb59c9cdc/tools/datasets, since I no longer need to collect any metadata from multiple endpoints.

No longer needed since we have `pyproject.toml`
Resolves an error I got while trying to run the script:

```cmd
>>> uv run scripts/build_datapackage.py
```

```py
Traceback (most recent call last):
  File "../vega-datasets/scripts/build_datapackage.py", line 60, in <module>
    import niquests
ModuleNotFoundError: No module named 'niquests'
```
@dsmedia dsmedia marked this pull request as ready for review January 12, 2025 14:31
@dsmedia dsmedia marked this pull request as draft January 12, 2025 14:33
@dangotbanned
Member Author

@dsmedia ah you saw this early 😉

I've still got some docs to write, and then I'll fill out the description (this PR simplifies altair.datasets significantly)

@dangotbanned dangotbanned marked this pull request as ready for review January 12, 2025 16:06
@domoritz domoritz changed the title docs: adds Resource.hash in datapackage.json feat: adds Resource.hash in datapackage.json Jan 12, 2025
@dangotbanned dangotbanned merged commit 9176bda into main Jan 12, 2025
4 checks passed
@dangotbanned dangotbanned deleted the dpkg-resource-hash branch January 12, 2025 19:45