diff --git a/doc/source/train/user-guides/data-loading-preprocessing.rst b/doc/source/train/user-guides/data-loading-preprocessing.rst
index ce1d2ec962e8..f20eb25aa0d6 100644
--- a/doc/source/train/user-guides/data-loading-preprocessing.rst
+++ b/doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -13,6 +13,10 @@ Key advantages include:
 
 For more details about Ray Data, including comparisons to alternatives, see :ref:`Ray Data Overview `.
 
+.. note::
+
+    In addition to Ray Data, you can continue to use framework-native data utilities with Ray Train, such as PyTorch Dataset, Hugging Face Dataset, and Lightning DataModule.
+
 In this guide, we will cover how to incorporate Ray Data into your Ray Train script, and different ways to customize your data ingestion pipeline.
 
 .. TODO: Replace this image with a better one.
@@ -258,8 +262,7 @@ Some frameworks provide their own dataset and data loading utilities. For exampl
 - **Hugging Face:** `Dataset `_
 - **PyTorch Lightning:** `LightningDataModule `_
 
-These utilities can still be used directly with Ray Train. In particular, you may want to do this if you already have your data ingestion pipeline set up.
-However, for more performant large-scale data ingestion we do recommend migrating to Ray Data.
+You can still use these framework data utilities directly with Ray Train.
 
 At a high level, you can compare these concepts as follows:
 
@@ -276,21 +279,25 @@ At a high level, you can compare these concepts as follows:
      - n/a
      - :meth:`ray.data.Dataset.iter_torch_batches`
 
-
-For more details, see the following sections for each framework.
+For more details, see the following sections for each framework:
 
 .. tab-set::
 
-    .. tab-item:: PyTorch Dataset and DataLoader
+    .. tab-item:: PyTorch DataLoader
 
-        **Option 1 (with Ray Data):** Convert your PyTorch Dataset to a Ray Dataset and pass it into the Trainer via ``datasets`` argument.
-        Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
-        You can convert this to replace the PyTorch DataLoader via :meth:`ray.data.DataIterator.iter_torch_batches`.
+        **Option 1 (with Ray Data):**
+
+        1. Convert your PyTorch Dataset to a Ray Dataset.
+        2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
+        3. Inside your ``train_loop_per_worker``, access the dataset shard via :meth:`ray.train.get_dataset_shard`.
+        4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.
 
         For more details, see the :ref:`Migrating from PyTorch Datasets and DataLoaders `.
 
-        **Option 2 (without Ray Data):** Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
-        You can use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
+        **Option 2 (without Ray Data):**
+
+        1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
+        2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
 
     .. tab-item:: LightningDataModule
 
@@ -298,12 +305,19 @@ For more details, see the following sections for each framework.
 
     .. tab-item:: Hugging Face Dataset
 
-        **Option 1 (with Ray Data):** Convert your Hugging Face Dataset to a Ray Dataset and pass it into the Trainer via the ``datasets`` argument.
-        Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
+        **Option 1 (with Ray Data):**
+
+        1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face `.
+        2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
+        3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
+        4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
+        5. Pass the iterable dataset into ``transformers.Trainer`` during initialization.
+        6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility.
 
-        For instructions, see :ref:`Ray Data for Hugging Face `.
+        **Option 2 (without Ray Data):**
 
-        **Option 2 (without Ray Data):** Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
+        1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
+        2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization.
 
 .. tip::
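The PyTorch "Option 1 (with Ray Data)" steps above map roughly onto the following script. This is a minimal sketch rather than text from the patched guide: the toy in-memory dataset, the batch size of 32, and the two-worker scaling config are illustrative assumptions.

.. code-block:: python

    import ray
    from ray import train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker():
        # Step 3: access this worker's shard of the "train" dataset.
        train_shard = train.get_dataset_shard("train")
        # Step 4: iterate over Torch batches instead of using a PyTorch DataLoader.
        for batch in train_shard.iter_torch_batches(batch_size=32):
            features, labels = batch["x"], batch["y"]  # forward/backward pass goes here


    # Step 1: build a Ray Dataset (here from a toy in-memory dataset).
    train_ds = ray.data.from_items([{"x": float(i), "y": 2.0 * i} for i in range(1000)])

    # Step 2: pass the Ray Dataset to the TorchTrainer via the `datasets` argument.
    trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": train_ds},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()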
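A comparable sketch of the PyTorch "Option 2 (without Ray Data)" flow, assuming a placeholder ``TensorDataset`` in place of a real dataset:

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    import ray.train.torch
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker():
        # Instantiate the Torch Dataset and DataLoader directly in the training loop.
        x = torch.arange(1000, dtype=torch.float32).unsqueeze(1)
        y = 2.0 * x
        dataloader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)
        # prepare_data_loader adds a DistributedSampler and moves batches to the right device.
        dataloader = ray.train.torch.prepare_data_loader(dataloader)
        for features, labels in dataloader:
            ...  # forward/backward pass goes here


    trainer = TorchTrainer(
        train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()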
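For the Hugging Face tab, the six "Option 1 (with Ray Data)" steps correspond roughly to the sketch below. The model and tokenizer name (``distilbert-base-uncased``), the toy pre-tokenized dataset, and the ``TrainingArguments`` values are illustrative assumptions; ``max_steps`` is set because the Ray Data iterable has no length that ``transformers.Trainer`` could use to size an epoch.

.. code-block:: python

    import datasets
    import transformers

    import ray
    from ray import train
    from ray.train import ScalingConfig
    from ray.train.huggingface.transformers import RayTrainReportCallback, prepare_trainer
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker():
        # Step 3: access this worker's shard of the "train" dataset.
        train_shard = train.get_dataset_shard("train")
        # Step 4: create an iterable of Torch batches.
        train_iterable = train_shard.iter_torch_batches(batch_size=8)

        model = transformers.AutoModelForSequenceClassification.from_pretrained(
            "distilbert-base-uncased", num_labels=2
        )
        args = transformers.TrainingArguments(
            output_dir="/tmp/hf_output",
            max_steps=10,  # required: the iterable dataset has no length
            report_to="none",
        )
        # Step 5: pass the iterable dataset into transformers.Trainer at initialization.
        hf_trainer = transformers.Trainer(model=model, args=args, train_dataset=train_iterable)
        hf_trainer.add_callback(RayTrainReportCallback())
        # Step 6: wrap the trainer for Ray Train's distributed execution.
        hf_trainer = prepare_trainer(hf_trainer)
        hf_trainer.train()


    # Step 1: convert a (toy, pre-tokenized) Hugging Face Dataset to a Ray Dataset.
    tokenizer = transformers.AutoTokenizer.from_pretrained("distilbert-base-uncased")
    hf_ds = datasets.Dataset.from_dict({"text": ["a tiny example"] * 64, "labels": [0, 1] * 32})
    hf_ds = hf_ds.map(
        lambda row: tokenizer(row["text"], truncation=True, padding="max_length", max_length=32)
    ).remove_columns(["text"])
    ray_ds = ray.data.from_huggingface(hf_ds)

    # Step 2: pass the Ray Dataset to the TorchTrainer via the `datasets` argument.
    trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": ray_ds},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()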