@robertnishihara (Collaborator) commented Nov 10, 2025

This pull request updates the documentation for reading Hugging Face datasets, recommending the use of ray.data.read_parquet with HfFileSystem for better performance and scalability.
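A minimal sketch of the pattern this description recommends, assuming `ray` and `huggingface_hub` are installed. The helper name `read_hf_parquet` and the `hf://datasets/<org>/<name>` path shape are illustrative placeholders, not taken from the PR diff.

```python
import os


def read_hf_parquet(path, token=None):
    """Read Parquet files from the Hugging Face Hub into a Ray Dataset.

    `path` is assumed to look like "hf://datasets/<org>/<name>" (placeholder).
    """
    # Imported lazily so this module can be imported without ray installed.
    import ray
    from huggingface_hub import HfFileSystem

    # Fall back to the HF_TOKEN environment variable; None means anonymous access.
    token = token or os.environ.get("HF_TOKEN")
    return ray.data.read_parquet(
        path,
        file_extensions=["parquet"],  # ignore non-Parquet files in the repo
        filesystem=HfFileSystem(token=token),
    )
```

Calling `read_hf_parquet("hf://datasets/<org>/<name>")` needs network access and, for gated datasets, a valid token.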

@robertnishihara requested a review from a team as a code owner November 10, 2025 05:54
@robertnishihara changed the title from "[WIP] Improve instructions for reading Hugging Face datasets" to "[WIP] Improve instructions for reading Hugging Face datasets with Ray Data" Nov 10, 2025
@robertnishihara changed the title from "[WIP] Improve instructions for reading Hugging Face datasets with Ray Data" to "Improve instructions for reading Hugging Face datasets with Ray Data" Nov 10, 2025
@robertnishihara added the "go" label (add ONLY when ready to merge, run all tests) Nov 10, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request updates the documentation for reading Hugging Face datasets, recommending the use of ray.data.read_parquet with HfFileSystem for better performance and scalability. The changes are a good improvement. I've provided a few suggestions to make the code examples more robust and clearer for users. Specifically, I've recommended using os.environ.get() to avoid KeyError when the Hugging Face token is not set, and suggested using a simpler dataset for the examples. For the second example, I've also proposed using a public API from the datasets library instead of internal ones to make the example more stable across library versions.
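The `os.environ.get()` suggestion can be illustrated with a small sketch. The helper name `get_hf_token` and the token value `hf_example` are hypothetical, used only to show the difference from indexing `os.environ["HF_TOKEN"]` directly.

```python
import os


def get_hf_token(env=os.environ):
    """Return the Hugging Face token, or None if HF_TOKEN is unset."""
    # env["HF_TOKEN"] would raise KeyError when the variable is missing;
    # .get() returns None instead, which callers can treat as anonymous access.
    return env.get("HF_TOKEN")


# Explicit mappings make the behavior visible without touching the real environment.
assert get_hf_token({}) is None
assert get_hf_token({"HF_TOKEN": "hf_example"}) == "hf_example"
```

Whether passing `None` is acceptable depends on the dataset: public datasets can typically be read anonymously, while gated ones still need a token.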

@ray-gardener bot added the "docs" (An issue or change related to documentation) and "data" (Ray Data-related issues) labels Nov 10, 2025
@richardliaw (Contributor) commented

tests failing

Signed-off-by: Robert Nishihara <rkn@anyscale.com>
@richardliaw merged commit 978b86c into ray-project:master Nov 19, 2025
6 checks passed
@robertnishihara deleted the hfdocs branch November 19, 2025 22:38
400Ping pushed a commit to 400Ping/ray that referenced this pull request Nov 21, 2025
…th Ray Data (ray-project#58492)

This pull request updates the documentation for reading Hugging Face
datasets, recommending the use of ray.data.read_parquet with
HfFileSystem for better performance and scalability.

---------

Signed-off-by: Robert Nishihara <rkn@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…th Ray Data (ray-project#58492)

Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…th Ray Data (ray-project#58492)

Labels

data: Ray Data-related issues
docs: An issue or change related to documentation
go: add ONLY when ready to merge, run all tests

3 participants