46 changes: 12 additions & 34 deletions doc/source/data/loading-data.rst
@@ -652,47 +652,25 @@ Ray Data interoperates with distributed data processing frameworks like `Daft <h
{'col1': 1, 'col2': '1'}
{'col1': 2, 'col2': '2'}

.. _loading_datasets_from_ml_libraries:

Loading data from ML libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ray Data interoperates with HuggingFace, PyTorch, and TensorFlow datasets.
.. _loading_huggingface_datasets:

.. tab-set::

.. tab-item:: HuggingFace

To convert a HuggingFace Dataset to a Ray Dataset, call
:func:`~ray.data.from_huggingface`. This function accesses the underlying Arrow
table and converts it to a Dataset directly.

.. warning::
:func:`~ray.data.from_huggingface` only supports parallel reads in certain
instances, namely for untransformed public HuggingFace Datasets. For those datasets,
Ray Data uses `hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
to perform a distributed read; otherwise, Ray Data uses a single node read.
This behavior shouldn't be an issue with in-memory HuggingFace Datasets, but may cause a failure with
large memory-mapped HuggingFace Datasets. Additionally, HuggingFace `DatasetDict <https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.DatasetDict>`_ and
`IterableDatasetDict <https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDatasetDict>`_
objects aren't supported.
Loading Hugging Face datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. The snippet below is skipped because of https://github.com/ray-project/ray/issues/54837.
To read datasets from the Hugging Face Hub, use :func:`~ray.data.read_parquet` (or other
read functions) with the ``HfFileSystem`` filesystem. This approach provides better
performance and scalability than loading datasets into memory first.

.. testcode::
:skipif: True
First, install the required dependencies:

import ray.data
from datasets import load_dataset
.. _loading_datasets_from_ml_libraries:

hf_ds = load_dataset("wikitext", "wikitext-2-raw-v1")
ray_ds = ray.data.from_huggingface(hf_ds["train"])
ray_ds.take(2)
Loading data from ML libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. testoutput::
:options: +MOCK
Ray Data interoperates with PyTorch and TensorFlow datasets.

[{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]
.. tab-set::

.. tab-item:: PyTorch
