46 changes: 12 additions & 34 deletions doc/source/data/loading-data.rst
@@ -652,47 +652,25 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
{'col1': 1, 'col2': '1'}
{'col1': 2, 'col2': '2'}

.. _loading_datasets_from_ml_libraries:

Loading data from ML libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ray Data interoperates with HuggingFace, PyTorch, and TensorFlow datasets.
.. _loading_huggingface_datasets:

.. tab-set::

.. tab-item:: HuggingFace

To convert a HuggingFace Dataset to a Ray Dataset, call
:func:`~ray.data.from_huggingface`. This function accesses the underlying Arrow
table and converts it to a Dataset directly.

.. warning::
:func:`~ray.data.from_huggingface` only supports parallel reads in certain
instances, namely for untransformed public HuggingFace Datasets. For those datasets,
Ray Data uses `hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
to perform a distributed read; otherwise, Ray Data uses a single node read.
This behavior shouldn't be an issue with in-memory HuggingFace Datasets, but may cause a failure with
large memory-mapped HuggingFace Datasets. Additionally, HuggingFace `DatasetDict <https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict>`_ and
`IterableDatasetDict <https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDatasetDict>`_
objects aren't supported.
Loading Hugging Face datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. The snippet below is skipped because of https://github.com/ray-project/ray/issues/54837.
To read datasets from the Hugging Face Hub, use :func:`~ray.data.read_parquet` (or other
read functions) with the ``HfFileSystem`` filesystem. This approach provides better
performance and scalability than loading datasets into memory first.

.. testcode::
:skipif: True
First, install the required dependencies:

import ray.data
from datasets import load_dataset
.. _loading_datasets_from_ml_libraries:

hf_ds = load_dataset("wikitext", "wikitext-2-raw-v1")
ray_ds = ray.data.from_huggingface(hf_ds["train"])
ray_ds.take(2)
Loading data from ML libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. testoutput::
:options: +MOCK
Ray Data interoperates with PyTorch and TensorFlow datasets.

[{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]
.. tab-set::

.. tab-item:: PyTorch
