-
Notifications
You must be signed in to change notification settings - Fork 659
List hf hub datasets #5226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
List hf hub datasets #5226
Conversation
WalkthroughA new function named Changes
Possibly related PRs
Suggested reviewers
📜 Recent review detailsConfiguration used: .coderabbit.yaml 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
Documentation and Community
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
fiftyone/utils/huggingface.py (2)
67-72
: Documentation could be more comprehensive.The docstring should include:
- Return value description
- Possible exceptions
- Example usage with expected output
def list_hub_datasets(): """Lists all FiftyOne datasets available on the Hugging Face Hub. This function prints instructions for loading datasets and displays a list of all datasets published by Voxel51 on the Hugging Face Hub. + + Returns: + None + + Raises: + huggingface_hub.utils.HfHubHTTPError: If there are connectivity issues + or authentication errors with the Hugging Face Hub. + + Example: + >>> import fiftyone.utils.huggingface as fhub + >>> fhub.list_hub_datasets() + You can load any of the following datasets as follows: + + from fiftyone.utils.huggingface import load_from_hub + dataset = load_from_hub("Voxel51/dataset-name") + + Datasets on the Hugging Face Hub: + + quickstart + open-images-v7 + ... """
67-85
: Consider performance optimization for large dataset listings.The current implementation might be inefficient for organizations with many datasets.
Consider implementing pagination or limiting the number of datasets retrieved at once:
def list_hub_datasets(limit=None): """Lists FiftyOne datasets available on the Hugging Face Hub. Args: limit (None): Optional maximum number of datasets to list """ try: datasets = list_datasets(author="Voxel51") if limit: datasets = itertools.islice(datasets, limit) print('You can load any of the following datasets as follows:\n') print('from fiftyone.utils.huggingface import load_from_hub') print('dataset = load_from_hub("Voxel51/dataset-name")\n') print('Datasets on the Hugging Face Hub:\n') for dataset in datasets: dataset_name = dataset.id.split('Voxel51/')[-1] print(dataset_name) except Exception as e: logger.error(f"Failed to fetch datasets from Hugging Face Hub: {e}") raise
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
📒 Files selected for processing (1)
fiftyone/utils/huggingface.py
(2 hunks)
🔇 Additional comments (1)
fiftyone/utils/huggingface.py (1)
Line range hint 34-42
: Verify huggingface_hub dependency version compatibility.
The new list_hub_datasets()
function relies on the list_datasets
function from huggingface_hub. Ensure it's compatible with the minimum version requirement.
✅ Verification successful
Dependency version requirement is compatible with the implementation
The implementation of list_hub_datasets()
uses list_datasets
from huggingface_hub, and the minimum version requirement is set to >=0.20.0
. The list_datasets
function has been a core feature of huggingface_hub since well before version 0.20.0, making this implementation safe and compatible with the specified version constraint.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if list_datasets is available in the minimum required version
# Get the minimum required version from the lazy import
min_version=$(rg -A 1 'huggingface_hub.*callback.*ensure_package' | grep -o '[0-9]\+\.[0-9]\+\.[0-9]\+')
# Check when list_datasets was introduced in huggingface_hub
gh api graphql -f query='
{
repository(owner: "huggingface", name: "huggingface_hub") {
object(expression: "main") {
... on Commit {
history(first: 1, path: "src/huggingface_hub/hf_api.py", search: "def list_datasets") {
nodes {
associatedPullRequests(first: 1) {
nodes {
title
mergedAt
}
}
}
}
}
}
}
}'
Length of output: 901
Script:
#!/bin/bash
# Let's try a different approach to verify the dependency compatibility
# Search for list_datasets usage in the codebase
rg "list_datasets" -B 2 -A 2
# Check the actual implementation to understand the dependency
ast-grep --pattern 'def list_hub_datasets($$$)'
Length of output: 15882
Script:
#!/bin/bash
# Let's check the actual implementation of list_hub_datasets and its usage of huggingface_hub
# Find the implementation of list_hub_datasets
rg -B 5 -A 10 "def list_hub_datasets"
# Check the huggingface_hub version requirement in other files
rg -g "*.py" "huggingface.*>=[0-9]+\.[0-9]+\.[0-9]+"
Length of output: 1324
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Returns: | ||
a list of dataset names or objects | ||
""" | ||
datasets = list(hfh.list_datasets(filter="library:fiftyone")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This used to be author="Voxel51"
but I switched it to filter="library:fiftyone"
so that we include all FiftyOne-compatible datasets
What changes are proposed in this pull request?
Added proper documentation and improved implementation of the
list_hub_datasets()
function infiftyone/utils/huggingface.py
. The changes include:How is this patch tested? If it is not, please explain why.
The changes can be tested by:
Release Notes
Is this a user-facing change that should be mentioned in the release notes?
Added documentation and improved implementation of
list_hub_datasets()
function, making it easier for users to discover and load FiftyOne datasets from the Hugging Face Hub.What areas of FiftyOne does this PR affect?
fiftyone
Python library changesSummary by CodeRabbit