From 83feb793ca385cdb943daff3f49c45e46ea1aab1 Mon Sep 17 00:00:00 2001
From: woshiyyya
Date: Fri, 17 May 2024 15:04:00 -0700
Subject: [PATCH 1/4] update

Signed-off-by: woshiyyya
---
 .../data-loading-preprocessing.rst | 61 ++++++++++++++-----
 1 file changed, 46 insertions(+), 15 deletions(-)

diff --git a/doc/source/train/user-guides/data-loading-preprocessing.rst b/doc/source/train/user-guides/data-loading-preprocessing.rst
index ce1d2ec962e8..0ca5683564b5 100644
--- a/doc/source/train/user-guides/data-loading-preprocessing.rst
+++ b/doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -258,8 +258,7 @@ Some frameworks provide their own dataset and data loading utilities. For exampl
 
 - **Hugging Face:** `Dataset `_
 - **PyTorch Lightning:** `LightningDataModule `_
 
-These utilities can still be used directly with Ray Train. In particular, you may want to do this if you already have your data ingestion pipeline set up.
-However, for more performant large-scale data ingestion we do recommend migrating to Ray Data.
+You can still use these framework data utilities directly with Ray Train.
 
 At a high level, you can compare these concepts as follows:
 
@@ -276,34 +275,66 @@ At a high level, you can compare these concepts as follows:
      - n/a
      - :meth:`ray.data.Dataset.iter_torch_batches`
 
+Why use Ray Data?
+~~~~~~~~~~~~~~~~~
+
+The framework's data utilities work well with small datasets requiring light preprocessing.
+However, they can become performance bottlenecks when handling large-scale datasets with complex preprocessing logic.
+Ray Data is designed to address these challenges, providing efficient large-scale data ingestion.
+
+Specifically, you can benefit from the following features of Ray Data:
+
+**Streaming execution**:
+
+- The preprocessing pipeline will be executed lazily and stream the data batches into training workers.
+- Training can start immediately without significant up-front preprocessing time.
+
+**Automatic data sharding**:
+
+- The dataset will be automatically sharded across all training workers.
 
-For more details, see the following sections for each framework.
+**Leverage additional resources for preprocessing**
+
+- Ray Data can utilize all resources in the Ray cluster for preprocessing, not just those on your training nodes.
+
+For more details, see the following sections for each framework:
 
 .. tab-set::
 
-    .. tab-item:: PyTorch Dataset and DataLoader
+    .. tab-item:: PyTorch
+
+      **Option 1 (with Ray Data):**
 
-        **Option 1 (with Ray Data):** Convert your PyTorch Dataset to a Ray Dataset and pass it into the Trainer via ``datasets`` argument.
-        Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
-        You can convert this to replace the PyTorch DataLoader via :meth:`ray.data.DataIterator.iter_torch_batches`.
+      1. Convert your PyTorch Dataset to a Ray Dataset and
+      2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
+      3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
+      4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.
 
         For more details, see the :ref:`Migrating from PyTorch Datasets and DataLoaders `.
 
-        **Option 2 (without Ray Data):** Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
-        You can use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
+      **Option 2 (with PyTorch DataLoader):**
+
+      1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
+      2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
 
-    .. tab-item:: LightningDataModule
+    .. tab-item:: Lightning
 
        The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s. You can apply the same logic here.
 
-    .. tab-item:: Hugging Face Dataset
+    .. tab-item:: Hugging Face
+
+      **Option 1 (with Ray Data):**
 
-        **Option 1 (with Ray Data):** Convert your Hugging Face Dataset to a Ray Dataset and pass it into the Trainer via the ``datasets`` argument.
-        Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
+      1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face `.
+      2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
+      3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
+      4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
+      5. Pass the iterable dataset into ``transformers.Trainer`` during initialization.
 
-        For instructions, see :ref:`Ray Data for Hugging Face `.
+      **Option 2 (with HuggingFace Dataset):**
 
-        **Option 2 (without Ray Data):** Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
+      1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
+      2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization.
 
 .. tip::

From d58ce3328383f8138feaa51c8f48f4a7759d8a89 Mon Sep 17 00:00:00 2001
From: woshiyyya
Date: Mon, 3 Jun 2024 09:59:12 -0700
Subject: [PATCH 2/4] update

Signed-off-by: woshiyyya
---
 .../data-loading-preprocessing.rst | 36 ++++++++++------------
 1 file changed, 10 insertions(+), 26 deletions(-)

diff --git a/doc/source/train/user-guides/data-loading-preprocessing.rst b/doc/source/train/user-guides/data-loading-preprocessing.rst
index 0ca5683564b5..ead3fc9fb371 100644
--- a/doc/source/train/user-guides/data-loading-preprocessing.rst
+++ b/doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -13,6 +13,11 @@ Key advantages include:
 
 For more details about Ray Data, including comparisons to alternatives, see :ref:`Ray Data Overview `.
 
+.. note::
+
+    In addition to Ray Data, you can continue using the data utilities provided by Machine Learning frameworks, such as PyTorch Dataset,
+    Hugging Face Dataset, and Lightning Data Module. Ray Train also integrates seamlessly with these tools.
+
 In this guide, we will cover how to incorporate Ray Data into your Ray Train script, and different ways to customize your data ingestion pipeline.
 
 .. TODO: Replace this image with a better one.
@@ -275,28 +280,6 @@ At a high level, you can compare these concepts as follows:
      - n/a
      - :meth:`ray.data.Dataset.iter_torch_batches`
 
-Why use Ray Data?
-~~~~~~~~~~~~~~~~~
-
-The framework's data utilities work well with small datasets requiring light preprocessing.
-However, they can become performance bottlenecks when handling large-scale datasets with complex preprocessing logic.
-Ray Data is designed to address these challenges, providing efficient large-scale data ingestion.
-
-Specifically, you can benefit from the following features of Ray Data:
-
-**Streaming execution**:
-
-- The preprocessing pipeline will be executed lazily and stream the data batches into training workers.
-- Training can start immediately without significant up-front preprocessing time.
-
-**Automatic data sharding**:
-
-- The dataset will be automatically sharded across all training workers.
-
-**Leverage additional resources for preprocessing**
-
-- Ray Data can utilize all resources in the Ray cluster for preprocessing, not just those on your training nodes.
-
 For more details, see the following sections for each framework:
 
 .. tab-set::
@@ -305,14 +288,14 @@ For more details, see the following sections for each framework:
 
       **Option 1 (with Ray Data):**
 
-      1. Convert your PyTorch Dataset to a Ray Dataset and
+      1. Convert your PyTorch Dataset to a Ray Dataset.
       2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
-      3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
+      3. Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
       4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.
 
        For more details, see the :ref:`Migrating from PyTorch Datasets and DataLoaders `.
 
-      **Option 2 (with PyTorch DataLoader):**
+      **Option 2 (without Ray Data):**
 
       1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
       2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
@@ -330,8 +313,9 @@ For more details, see the following sections for each framework:
       3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
       4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
       5. Pass the iterable dataset into ``transformers.Trainer`` during initialization.
+      6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility, so that it supports Ray Iterable Dataset.
 
-      **Option 2 (with HuggingFace Dataset):**
+      **Option 2 (without Ray Data):**
 
       1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
       2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization.

From 91f0474e6854511ef7ef896e14c68d7cb4ef522b Mon Sep 17 00:00:00 2001
From: yunxuanx
Date: Tue, 11 Jun 2024 18:11:54 +0000
Subject: [PATCH 3/4] fix

Signed-off-by: yunxuanx
---
 .../train/user-guides/data-loading-preprocessing.rst | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/doc/source/train/user-guides/data-loading-preprocessing.rst b/doc/source/train/user-guides/data-loading-preprocessing.rst
index ead3fc9fb371..0f2aae042d1b 100644
--- a/doc/source/train/user-guides/data-loading-preprocessing.rst
+++ b/doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -15,8 +15,7 @@ For more details about Ray Data, including comparisons to alternatives, see :ref
 
 .. note::
 
-    In addition to Ray Data, you can continue using the data utilities provided by Machine Learning frameworks, such as PyTorch Dataset,
-    Hugging Face Dataset, and Lightning Data Module. Ray Train also integrates seamlessly with these tools.
+    In addition to Ray Data, you can continue to use framework-native data utilities with Ray Train, such as PyTorch Dataset, Hugging Face Dataset, and Lightning DataModule.
 
 In this guide, we will cover how to incorporate Ray Data into your Ray Train script, and different ways to customize your data ingestion pipeline.
 
@@ -284,7 +283,7 @@ For more details, see the following sections for each framework:
 
 .. tab-set::
 
-    .. tab-item:: PyTorch
+    .. tab-item:: PyTorch DataLoader
 
       **Option 1 (with Ray Data):**
 
@@ -300,11 +299,11 @@ For more details, see the following sections for each framework:
       1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
       2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
 
-    .. tab-item:: Lightning
+    .. tab-item:: LightningDataModule
 
        The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s. You can apply the same logic here.
 
-    .. tab-item:: Hugging Face
+    .. tab-item:: Hugging Face Dataset
 
       **Option 1 (with Ray Data):**
 

From 6d2b097331560a773646587bdfb19bf948bd4e7b Mon Sep 17 00:00:00 2001
From: yunxuanx
Date: Tue, 11 Jun 2024 18:19:05 +0000
Subject: [PATCH 4/4] update

Signed-off-by: yunxuanx
---
 doc/source/train/user-guides/data-loading-preprocessing.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/source/train/user-guides/data-loading-preprocessing.rst b/doc/source/train/user-guides/data-loading-preprocessing.rst
index 0f2aae042d1b..f20eb25aa0d6 100644
--- a/doc/source/train/user-guides/data-loading-preprocessing.rst
+++ b/doc/source/train/user-guides/data-loading-preprocessing.rst
@@ -311,8 +311,8 @@ For more details, see the following sections for each framework:
       2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
       3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
       4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
-      5. Pass the iterable dataset into ``transformers.Trainer`` during initialization.
-      6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility, so that it supports Ray Iterable Dataset.
+      5. Pass the iterable dataset while initializing ``transformers.Trainer``.
+      6. Wrap your transformers trainer with the :meth:`ray.train.huggingface.transformers.prepare_trainer` utility.
 
       **Option 2 (without Ray Data):**
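
Taken together, the workflow that these patches document for "Option 1 (with Ray Data)" in a ``TorchTrainer`` looks roughly like the sketch below. This is only an illustration: the toy dataset, batch size, epoch count, and number of workers are placeholder values rather than anything specified in the patches, and the actual model and training step are omitted::

    import ray
    import ray.data
    import ray.train
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker(config):
        # Each worker receives its own shard of the "train" dataset that was
        # passed to the TorchTrainer via the ``datasets`` argument.
        shard = ray.train.get_dataset_shard("train")
        for _ in range(2):  # placeholder epoch count
            # iter_torch_batches takes the place of a PyTorch DataLoader and
            # yields batches as dicts of torch tensors.
            for batch in shard.iter_torch_batches(batch_size=32):
                pass  # the forward/backward pass on the batch would go here


    # A toy Ray Dataset; in practice this would come from ray.data.read_* or
    # from converting an existing framework dataset.
    train_ds = ray.data.range(1000)

    trainer = TorchTrainer(
        train_loop_per_worker,
        datasets={"train": train_ds},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()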