Commit af8eec8

Removing code for testing purposes.
Signed-off-by: Robert Nishihara <rkn@anyscale.com>
1 parent 4ee1b52 commit af8eec8

1 file changed: +0 -53 lines changed

doc/source/data/loading-data.rst

Lines changed: 0 additions & 53 deletions
@@ -652,59 +652,6 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
 {'col1': 1, 'col2': '1'}
 {'col1': 2, 'col2': '2'}
 
-.. _loading_huggingface_datasets:
-
-Loading Hugging Face datasets
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To read datasets from the Hugging Face Hub, use :func:`~ray.data.read_parquet` (or other
-read functions) with the ``HfFileSystem`` filesystem. This approach provides better
-performance and scalability than loading datasets into memory first.
-
-First, install the required dependencies:
-
-.. code-block:: console
-
-    pip install huggingface_hub
-
-Set your Hugging Face token to authenticate. While public datasets can be read without
-a token, Hugging Face rate limits are more aggressive without a token. To read Hugging
-Face datasets without a token, simply set the filesystem argument to ``HfFileSystem()``.
-
-.. code-block:: console
-
-    export HF_TOKEN=<YOUR HUGGING FACE TOKEN>
-
-For most Hugging Face datasets, the data is stored in Parquet files. You can directly
-read from the dataset path:
-
-.. testcode::
-    :skipif: True
-
-    import os
-    import ray
-    from huggingface_hub import HfFileSystem
-
-    ds = ray.data.read_parquet(
-        "hf://datasets/wikimedia/wikipedia",
-        file_extensions=["parquet"],
-        filesystem=HfFileSystem(token=os.environ["HF_TOKEN"]),
-    )
-
-    print(f"Dataset count: {ds.count()}")
-    print(ds.schema())
-
-.. testoutput::
-
-    Dataset count: 61614907
-    Column  Type
-    ------  ----
-    id      string
-    url     string
-    title   string
-    text    string
-
-
 .. _loading_datasets_from_ml_libraries:
 
 Loading data from ML libraries
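
The removed example indexes ``os.environ["HF_TOKEN"]`` directly, which raises ``KeyError`` when the variable is unset, while the removed prose says public datasets can also be read with a bare ``HfFileSystem()``. The following is a minimal sketch of one way to cover both cases, not code from this commit. The names ``hf_filesystem_kwargs`` and ``read_hf_parquet`` are hypothetical and do not exist in Ray or ``huggingface_hub``. The sketch assumes that ``HfFileSystem`` accepts an optional ``token`` keyword, as in the removed example.

```python
import os


def hf_filesystem_kwargs(env=None):
    """Build keyword arguments for HfFileSystem from an environment mapping.

    Hypothetical helper for illustration; not part of Ray or huggingface_hub.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    # With no token (unset or empty), pass nothing, which corresponds to the
    # bare HfFileSystem() call described in the removed text.
    return {"token": token} if token else {}


def read_hf_parquet(path, env=None):
    """Read a Hugging Face dataset path with Ray Data (hypothetical wrapper)."""
    # Imported lazily so hf_filesystem_kwargs works without these packages.
    import ray
    from huggingface_hub import HfFileSystem

    filesystem = HfFileSystem(**hf_filesystem_kwargs(env))
    return ray.data.read_parquet(
        path,
        file_extensions=["parquet"],
        filesystem=filesystem,
    )
```

Under these assumptions, ``read_hf_parquet("hf://datasets/wikimedia/wikipedia")`` would behave like the removed example when ``HF_TOKEN`` is set and fall back to a token-less filesystem otherwise, subject to the stricter rate limits the removed text mentions.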
