@@ -652,59 +652,6 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
     {'col1': 1, 'col2': '1'}
     {'col1': 2, 'col2': '2'}
 
-.. _loading_huggingface_datasets:
-
-Loading Hugging Face datasets
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To read datasets from the Hugging Face Hub, use :func:`~ray.data.read_parquet` (or
-other read functions) with the ``HfFileSystem`` filesystem. This approach provides
-better performance and scalability than loading datasets into memory first.
-
-First, install the required dependencies:
-
-.. code-block:: console
-
-    pip install huggingface_hub
-
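-``HfFileSystem`` implements the standard ``fsspec`` interface, so you can browse a
-dataset repo before reading it. A minimal sketch (the repo name is just an example):
-
-.. testcode::
-    :skipif: True
-
-    from huggingface_hub import HfFileSystem
-
-    # List the files in a public dataset repo through the fsspec interface.
-    fs = HfFileSystem()
-    print(fs.ls("datasets/wikimedia/wikipedia", detail=False))
-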
-Set your Hugging Face token to authenticate. Public datasets are readable without a
-token, but Hugging Face enforces stricter rate limits on unauthenticated requests. To
-read without a token, pass ``filesystem=HfFileSystem()`` with no token argument, as in
-the sketch below.
-
-.. code-block:: console
-
-    export HF_TOKEN=<YOUR HUGGING FACE TOKEN>
-
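-For example, a minimal sketch of anonymous access, assuming you accept the lower
-unauthenticated rate limits:
-
-.. testcode::
-    :skipif: True
-
-    import ray
-    from huggingface_hub import HfFileSystem
-
-    # No token: fine for public datasets, but subject to stricter rate limits.
-    ds = ray.data.read_parquet(
-        "hf://datasets/wikimedia/wikipedia",
-        file_extensions=["parquet"],
-        filesystem=HfFileSystem(),
-    )
-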
-Most Hugging Face datasets store their data as Parquet files, so you can read directly
-from the dataset path:
-
-.. testcode::
-    :skipif: True
-
-    import os
-    import ray
-    from huggingface_hub import HfFileSystem
-
-    # Authenticate with the token exported above.
-    ds = ray.data.read_parquet(
-        "hf://datasets/wikimedia/wikipedia",
-        # Only read Parquet files, skipping other files in the repo.
-        file_extensions=["parquet"],
-        filesystem=HfFileSystem(token=os.environ["HF_TOKEN"]),
-    )
-
-    print(f"Dataset count: {ds.count()}")
-    print(ds.schema())
-
-.. testoutput::
-
-    Dataset count: 61614907
-    Column  Type
-    ------  ----
-    id      string
-    url     string
-    title   string
-    text    string
-
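-The full dataset is large, so you might read only part of it. A hedged sketch: the
-``wikimedia/wikipedia`` repo appears to store each configuration in its own directory
-(for example ``20231101.en``); confirm the layout in the repo's file tree before
-relying on the path below.
-
-.. testcode::
-    :skipif: True
-
-    import os
-    import ray
-    from huggingface_hub import HfFileSystem
-
-    # Read a single configuration directory instead of the whole repo.
-    # "20231101.en" is an assumed path; check the repo for actual names.
-    ds = ray.data.read_parquet(
-        "hf://datasets/wikimedia/wikipedia/20231101.en",
-        file_extensions=["parquet"],
-        filesystem=HfFileSystem(token=os.environ["HF_TOKEN"]),
-    )
-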
 .. _loading_datasets_from_ml_libraries:
 
 Loading data from ML libraries