Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load GitHub datasets from Hub #4059

Merged
merged 6 commits into from
Sep 16, 2022
Merged

Conversation

albertvillanova
Copy link
Member

@albertvillanova albertvillanova requested a review from lhoestq March 30, 2022 09:21
@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Mar 30, 2022

The documentation is not available anymore as the PR was closed or merged.

@lhoestq
Copy link
Member

lhoestq commented Mar 31, 2022

Currently the github datasets versioning is synced with the datasets lib versioning: when you load a github dataset using datasets==x.y.z, then the version of the dataset will be the one at the git tag x.y.z. This is for reproducibility reasons.

We could stop having this behavior and always use the latest version of the dataset, but when we do a breaking change it will break github datasets for previous versions of the library. It could be nice to think about tools that will allow backward compatibility if we ever need to to a breaking change in some datasets. Maybe a way to specify which revision of the dataset to use based on the datasets major version.

If we keep this behavior, then maybe add a note in setup.py to push to PyPI only after the Update Hub repositories CI job is done. It can take a few minutes to add the version tag to all the dataset repositories on the Hub. If we push to PyPI before the tags are pushed, then some users might get some 404 if at the same time they installed datasets and run load_dataset.

@albertvillanova
Copy link
Member Author

albertvillanova commented Mar 31, 2022

@lhoestq I was going to increase the max_retries as done for metrics:

But then I realized that loading from the Hub would work as well. That is why I opened this PR.

Definitely, we should decide which behavior we want:

  • We have been working in the direction of eliminating the distinctions between canonical/community datasets
  • If we continue to go in that direction, then passing (or not passing) revision should have the same behavior for canonical/community
  • If we want to continue to tight the library version with the canonical datasets version, that is definitely a difference between canonical and community datasets

Not sure what could be better in the long term...

@albertvillanova
Copy link
Member Author

albertvillanova commented Mar 31, 2022

We could stop having this behavior and always use the latest version of the dataset, but when we do a breaking change it will break github datasets for previous versions of the library.

Not sure of understanding this. Previous versions of the datasets library will continue to download GitHub datasets from GitHub, syncing library/dataset versions... Where is the problem?

@lhoestq
Copy link
Member

lhoestq commented Mar 31, 2022

Yes you're right, previous versions of datasets will still continue to download from github, but not future versions.
If we release datasets 2.1 by removing this behavior and if one day we release datasets 3.0 with a breaking change in the dataset scripts, then all version >=2.1 will break.

@lhoestq
Copy link
Member

lhoestq commented Mar 31, 2022

Ideally we should drop the differences between github datasets and community datasets, and maybe provide a way to fallback on an older version of a dataset repository if the user's datasets version is too old and incompatible with it.

@lhoestq
Copy link
Member

lhoestq commented Sep 13, 2022

I just noticed I literally opened the same PR lol

I'm still convinced that we should do a better version compatibility check but we can see that later IMO

@albertvillanova
Copy link
Member Author

albertvillanova commented Sep 15, 2022

Normally in open source projects, when there is a duplicate PR, the latter is tagged as "duplicate" and closed. 😜

Let me make things clear in my mind: so you say that the blocking point that was preventing this PR from merging, now is no longer a blocking point and could be addresses in a subsequent PR?

@lhoestq
Copy link
Member

lhoestq commented Sep 15, 2022

Let me close the duplicate one, sorry

Let me make things clear my mind: so you say that the blocking point that was preventing this PR from merging now is no longer a blocking point and could be addresses in a subsequent PR?

Yes 🙈

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool ! LGTM :)

Finally we'll remove the differences between Hub datasets and GitHub datasets ^^

(Note that after this PR, all the changes made to a dataset will affect all the datasets version from now on)

@albertvillanova
Copy link
Member Author

albertvillanova commented Sep 16, 2022

Note that after this PR, all the changes made to a dataset will affect all the datasets version from now on

Yes, we have aligned this behavior with Hub datasets, as this is already the case for Hub datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

github is not always available - probably need a back up
3 participants