[BE week] - huggingface repo support for datasets_download.py #2130

Merged: 23 commits merged into main from datasets_download-huggingface on Jul 17, 2023

Conversation

@aclegg3 aclegg3 commented Jun 13, 2023

Motivation and Context

To support migrating datasets to Hugging Face, this PR adds support in datasets_download.py for data sources that originate from Hugging Face repos.

The tool clones the repo, supports authentication for private repos, and prunes unneeded git-lfs objects.

The tool uses the "version" metadata field to check out a particular tag of the dataset repo, and it clones only that tag without other history.
Updating an existing repo executes a checkout instead of re-cloning.
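
The clone-or-checkout behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `build_git_commands` and `run_git_commands` and their parameters are hypothetical, and the sketch uses plain `git` command lines rather than gitpython. It assumes `git clone --depth 1 --branch <tag>` for the initial single-tag clone and a shallow fetch of the tag followed by `git checkout` for updates.

```python
import os
import subprocess


def build_git_commands(repo_url, version_tag, dest_dir):
    """Return argv lists that bring `dest_dir` to `version_tag` (hypothetical helper)."""
    if os.path.isdir(os.path.join(dest_dir, ".git")):
        # Existing clone: fetch only the requested tag, then check it out.
        return [
            ["git", "-C", dest_dir, "fetch", "--depth", "1", "origin", "tag", version_tag],
            ["git", "-C", dest_dir, "checkout", version_tag],
        ]
    # Fresh clone: --branch accepts a tag name, --depth 1 skips other history.
    return [["git", "clone", "--depth", "1", "--branch", version_tag, repo_url, dest_dir]]


def run_git_commands(commands):
    """Execute each command in order, raising CalledProcessError on the first failure."""
    for argv in commands:
        subprocess.check_call(argv)
```

Building the commands separately from running them keeps the decision (clone or update) easy to inspect without touching the network.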

NOTE: git-lfs and gitpython are required to use this tool. Example install on Ubuntu:

sudo apt install git-lfs
git lfs install
conda install -y gitpython

NOTE: Older versions of git-lfs (< 3.0), such as the one shipped with Ubuntu 20.04, do not support the prune -f option. Use --no-prune on these systems. To check your version:

git --version
git-lfs --version
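
For scripted setups, the version check above could be automated along these lines. This is a sketch under assumptions, not part of this PR: the helper names are hypothetical, the 3.0 threshold is taken from the note above, the prune command is the one quoted in the diff, and the code assumes `git-lfs --version` prints a string beginning with `git-lfs/<major>.<minor>.<patch>`.

```python
import re
import subprocess

# Command quoted from this PR's diff; the 3.0 threshold comes from the note above.
PRUNE_COMMAND = ["git", "lfs", "prune", "-f", "--recent"]
MIN_FORCE_PRUNE_VERSION = (3, 0, 0)


def parse_git_lfs_version(version_output):
    """Extract (major, minor, patch) from `git-lfs --version` output."""
    match = re.search(r"git-lfs/(\d+)\.(\d+)(?:\.(\d+))?", version_output)
    if match is None:
        raise ValueError(f"unrecognized git-lfs version string: {version_output!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def installed_git_lfs_version():
    """Query the local git-lfs binary; raises if git-lfs is missing or fails."""
    result = subprocess.run(
        ["git-lfs", "--version"], capture_output=True, text=True, check=True
    )
    return parse_git_lfs_version(result.stdout)


def prune_command_or_none(version, no_prune=False):
    """Return the prune argv, or None when pruning should be skipped."""
    if no_prune or version < MIN_FORCE_PRUNE_VERSION:
        return None
    return list(PRUNE_COMMAND)
```

A caller would pass the result of `installed_git_lfs_version()` to `prune_command_or_none` and skip pruning when it returns `None`, which mirrors passing --no-prune by hand.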

How Has This Been Tested

CI testing with new public repo sources. Local testing with private repo sources.

TODO:

  • test without lfs installed.

Types of changes

  • Docs change / refactoring / dependency upgrade
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have completed my CLA (see CONTRIBUTING)
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@facebook-github-bot facebook-github-bot added the "CLA Signed" label (description: Do not delete this pull request or issue due to inactivity.) Jun 13, 2023

@0mdc 0mdc left a comment

Awesome! Thank you!

subprocess.check_call(shlex.split(clone_command))

# prune the repo to reduce wasted memory consumption
prune_command = "git lfs prune -f --recent"

👍

@jturner65 jturner65 left a comment

LGTM

@aclegg3 aclegg3 marked this pull request as ready for review July 12, 2023 21:58

@0mdc 0mdc left a comment

Awesome work! I left some minor comments.

src_python/habitat_sim/utils/datasets_download.py (two review comments, both outdated and resolved)

@jimmytyyang jimmytyyang left a comment

LGTM!

@aclegg3 aclegg3 merged commit 2419cf6 into main Jul 17, 2023
@aclegg3 aclegg3 deleted the datasets_download-huggingface branch July 17, 2023 21:23
Labels
CLA Signed Do not delete this pull request or issue due to inactivity.
5 participants