[BE week] - huggingface repo support for datasets_download.py #2130
… ReplicaCAD example
Awesome! Thank you!
subprocess.check_call(shlex.split(clone_command))

# prune the repo to reduced wasted memory consumption
prune_command = "git lfs prune -f --recent"
👍
LGTM
…andlig for repo datasources: no version in dir name, fetch checkout instead of delete and re-download.
Awesome work! I left some minor comments.
LGTM!
…h lfs after checkout. Polish messages and flow.
Motivation and Context
To support migration of datasets to Hugging Face, this PR adds datasets_download.py support for datasources with Hugging Face repo origins.
The tool clones the repo, supports authentication for private repos, and prunes unneeded asset tracking.
Uses the "version" metadata field to check out a particular tag on the dataset repo. Only that tag is cloned, without other history. Updating an existing repo executes a checkout instead of re-cloning.
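The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `plan_repo_commands` and `run_plan`, their parameters, and the example URL are hypothetical, and the sketch shells out to the git CLI rather than using gitpython. It assumes the "version" value is a tag name on the remote and that the remote is named origin. Only the prune command string is taken from the reviewed diff.

```python
import os
import subprocess
from typing import List, Optional


def plan_repo_commands(
    repo_url: str,
    version: str,
    dest_dir: str,
    prune: bool = True,
    dest_exists: Optional[bool] = None,
) -> List[List[str]]:
    """Build the git commands for a repo-backed data source (hypothetical helper).

    A fresh download clones only the requested tag with no other history.
    An existing checkout is updated with fetch + checkout instead of re-cloning.
    """
    if dest_exists is None:
        dest_exists = os.path.isdir(dest_dir)

    commands: List[List[str]] = []
    if not dest_exists:
        # Shallow clone of a single tag: --branch accepts tag names.
        commands.append(
            ["git", "clone", "--depth", "1", "--branch", version, repo_url, dest_dir]
        )
    else:
        # Fetch just the requested tag, then switch the working tree to it.
        commands.append(
            ["git", "-C", dest_dir, "fetch", "--depth", "1", "origin", "tag", version]
        )
        commands.append(["git", "-C", dest_dir, "checkout", version])

    if prune:
        # Same prune invocation as in the reviewed diff, run inside the repo.
        commands.append(["git", "-C", dest_dir, "lfs", "prune", "-f", "--recent"])
    return commands


def run_plan(commands: List[List[str]]) -> None:
    """Execute each planned command, raising on the first failure."""
    for command in commands:
        subprocess.check_call(command)


if __name__ == "__main__":
    # Print the plan only; nothing is cloned here.
    for cmd in plan_repo_commands(
        "https://example.com/org/dataset.git", "v1.0", "data/dataset", dest_exists=False
    ):
        print(" ".join(cmd))
```

Separating planning from execution keeps the command construction easy to inspect; passing `prune=False` corresponds to skipping the prune step on systems where it is unsupported.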
NOTE: git-lfs and gitpython are required to use this tool. Example install on Ubuntu:
NOTE: older versions of git-lfs (< 3.0) don't support the prune -f option (e.g. Ubuntu 20.04). Use --no-prune for these systems. To check your version:
How Has This Been Tested
CI testing with new public repo sources. Local testing with private repo sources.
TODO:
Types of changes
Checklist