[Train][Doc] Update PyTorch Data Ingestion User Guide #45421

Merged · 4 commits · Jun 25, 2024
Changes from 1 commit
doc/source/train/user-guides/data-loading-preprocessing.rst (61 changes: 46 additions & 15 deletions)
@@ -258,8 +258,7 @@ Some frameworks provide their own dataset and data loading utilities. For example:
- **Hugging Face:** `Dataset <https://huggingface.co/docs/datasets/index>`_
- **PyTorch Lightning:** `LightningDataModule <https://lightning.ai/docs/pytorch/stable/data/datamodule.html>`_

-These utilities can still be used directly with Ray Train. In particular, you may want to do this if you already have your data ingestion pipeline set up.
-However, for more performant large-scale data ingestion we do recommend migrating to Ray Data.
+You can still use these framework data utilities directly with Ray Train.

At a high level, you can compare these concepts as follows:

@@ -276,34 +275,66 @@ At a high level, you can compare these concepts as follows:
- n/a
- :meth:`ray.data.Dataset.iter_torch_batches`

+Why using Ray Data?
+~~~~~~~~~~~~~~~~~~~

[Comment from Contributor]
Suggested change:
-Why using Ray Data?
+Comparison with Ray Data

+These framework data utilities work well for small datasets that require only light preprocessing.
+However, they can become a performance bottleneck when handling large-scale datasets with complex preprocessing logic.
+Ray Data is designed to address these challenges and provides efficient large-scale data ingestion.

+Specifically, you can benefit from the following features of Ray Data:

+**Streaming execution**:

+- The preprocessing pipeline executes lazily and streams data batches into the training workers.
+- Training can start immediately, without significant up-front preprocessing time.

+**Automatic data sharding**:

+- The dataset is automatically sharded across all training workers.

+**Leverage additional resources for preprocessing**:

+- Ray Data can utilize all resources in the Ray cluster for preprocessing, not just those on the training nodes.
[Comment from Contributor]
This is good content that I think everyone should read, regardless of whether or not they are starting with PyTorch data. Do you think we could bring this higher up in the guide (e.g. even in the introduction), and then reference it from here?

[Reply from Member Author]
OK. Sounds good to me.
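To make the streaming behavior concrete, here is a minimal sketch (the S3 path and the ``normalize`` function are hypothetical stand-ins): the read and ``map_batches`` calls only build a lazy plan, and data is loaded, preprocessed, and streamed batch by batch once iteration starts.

```python
import ray

# Lazy: this builds an execution plan but loads and transforms nothing yet.
ds = ray.data.read_parquet("s3://my-bucket/train/")  # hypothetical path

def normalize(batch):
    # Hypothetical per-batch preprocessing (batch is a dict of NumPy arrays).
    batch["image"] = batch["image"] / 255.0
    return batch

ds = ds.map_batches(normalize)  # still lazy

# Execution starts here: batches are preprocessed and streamed on demand,
# so iteration (and training) begins without a long up-front preprocessing pass.
for batch in ds.iter_torch_batches(batch_size=32):
    ...
```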

-For more details, see the following sections for each framework.
+For more details, see the following sections for each framework:

.. tab-set::

-.. tab-item:: PyTorch Dataset and DataLoader
+.. tab-item:: PyTorch
[Comment from Contributor]
The original names were more explicit to make it clear that this is referring to the dataset framework, rather than the training framework.


+**Option 1 (with Ray Data):**

-**Option 1 (with Ray Data):** Convert your PyTorch Dataset to a Ray Dataset and pass it into the Trainer via ``datasets`` argument.
-Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
-You can convert this to replace the PyTorch DataLoader via :meth:`ray.data.DataIterator.iter_torch_batches`.
+1. Convert your PyTorch Dataset to a Ray Dataset.

[Comment from Contributor]
nit:

Suggested change:
-1. Convert your PyTorch Dataset to a Ray Dataset and
+1. Convert your PyTorch Dataset to a Ray Dataset.

There are some other small typos/formatting errors that I'll review more thoroughly in a follow-up review.

+2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
+3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
+4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.

For more details, see :ref:`Migrating from PyTorch Datasets and DataLoaders <migrate_pytorch>`.
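Putting the four steps together, a minimal end-to-end sketch might look like the following (the ``TensorDataset`` stands in for your own PyTorch Dataset, and the training step is elided):

```python
import torch
from torch.utils.data import TensorDataset

import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# 1. Convert a PyTorch Dataset into a Ray Dataset.
torch_ds = TensorDataset(torch.randn(128, 4), torch.randint(0, 2, (128,)))
train_ds = ray.data.from_torch(torch_ds)

def train_loop_per_worker():
    # 3. Access this worker's shard of the dataset.
    shard = ray.train.get_dataset_shard("train")
    # 4. Iterate over Torch batches in place of a PyTorch DataLoader.
    for batch in shard.iter_torch_batches(batch_size=32):
        ...  # forward/backward pass goes here

# 2. Pass the Ray Dataset to the TorchTrainer via the ``datasets`` argument.
trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```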

-**Option 2 (without Ray Data):** Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
-You can use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
+**Option 2 (with PyTorch DataLoader):**

+1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
+2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training, as shown in the sketch below.
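A minimal sketch of this option, again with a stand-in ``TensorDataset``; note that both the Dataset and the DataLoader are created inside the worker function:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # 1. Instantiate the Torch Dataset and DataLoader inside the worker.
    dataset = TensorDataset(torch.randn(128, 4), torch.randint(0, 2, (128,)))
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    # 2. prepare_data_loader adds a DistributedSampler and moves batches
    # to the appropriate device for each worker.
    dataloader = ray.train.torch.prepare_data_loader(dataloader)
    for batch in dataloader:
        ...  # forward/backward pass goes here

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```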

-.. tab-item:: LightningDataModule
+.. tab-item:: Lightning

The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s. You can apply the same logic here.
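For instance, a hypothetical datamodule like the sketch below is just a wrapper around Torch ``Dataset`` and ``DataLoader`` objects, so either option from the PyTorch tab applies to the data it produces:

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class MyDataModule(pl.LightningDataModule):
    """Hypothetical datamodule wrapping plain Torch Dataset/DataLoader objects."""

    def setup(self, stage=None):
        self.train_ds = TensorDataset(torch.randn(128, 4), torch.randint(0, 2, (128,)))

    def train_dataloader(self):
        # Either convert self.train_ds with ray.data.from_torch (Option 1),
        # or wrap this DataLoader with prepare_data_loader (Option 2).
        return DataLoader(self.train_ds, batch_size=32)
```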

-.. tab-item:: Hugging Face Dataset
+.. tab-item:: Hugging Face

+**Option 1 (with Ray Data):**

-**Option 1 (with Ray Data):** Convert your Hugging Face Dataset to a Ray Dataset and pass it into the Trainer via the ``datasets`` argument.
-Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
+1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
+2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
+3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
+4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
+5. Pass the iterable dataset into ``transformers.Trainer`` during initialization (sketched below).

-For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
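A minimal sketch of these five steps, with a toy in-memory Hugging Face Dataset; the model and ``TrainingArguments`` setup for ``transformers.Trainer`` are elided:

```python
import datasets  # Hugging Face Datasets

import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# 1. Convert a Hugging Face Dataset into a Ray Dataset.
hf_ds = datasets.Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})
train_ds = ray.data.from_huggingface(hf_ds)

def train_loop_per_worker():
    # 3. Access this worker's shard of the dataset.
    shard = ray.train.get_dataset_shard("train")
    # 4. Create an iterable over Torch batches...
    train_iterable = shard.iter_torch_batches(batch_size=2)
    # 5. ...and pass it as the training data when initializing
    # transformers.Trainer (model and args setup elided here).
    ...

# 2. Pass the Ray Dataset to the TorchTrainer via the ``datasets`` argument.
trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```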
+**Option 2 (with Hugging Face Dataset):**
[Comment from Contributor]
nit: I understand why you chose to do this, but I'm also a little worried this might be confusing, since Option 1 technically also uses Hugging Face Datasets.

[Reply from Member Author (woshiyyya), May 18, 2024]
Oh, I realized the difference now. Previously, this section aimed to teach users how to convert their HF Dataset to a Ray Dataset and then do training. But this PR categorizes directly by what we eventually use in the training function.

# prev
HF Dataset -> Ray Data -> HF Transformers
            HF Dataset -> HF Transformers

# now
Ray Data -> HF Transformers
HF Dataset -> HF Transformers

My consideration here is that we'd better not force everyone to take the "HF Dataset -> Ray Data" conversion step.

For example, their original dataset format could be Parquet, and before onboarding Ray, they may have already built an HF Dataset from the Parquet files and fed it to the HF Trainer. In this case, they can build a Ray Dataset either from Parquet or from the HF Dataset.

# Before onboarding Ray
raw data -> HF dataset -> HF transformer

# After onboarding Ray
option 1: raw data -> HF dataset -> Ray Data -> HF transformer

v.s.

option 2: raw data -> Ray Data -> HF transformer

We can discuss more in person next week.


-**Option 2 (without Ray Data):** Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
+1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
+2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization, as shown in the sketch below.
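A minimal sketch of this option with a toy in-memory dataset; the model and ``transformers.TrainingArguments`` are placeholders you would fill in:

```python
import datasets  # Hugging Face Datasets
import transformers

def train_loop_per_worker():
    # 1. Instantiate the Hugging Face Dataset inside the training loop.
    hf_ds = datasets.Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})
    # 2. Pass it to transformers.Trainer during initialization.
    trainer = transformers.Trainer(
        model=...,   # your model
        args=...,    # your transformers.TrainingArguments
        train_dataset=hf_ds,
    )
    trainer.train()
```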

.. tip::
