[Train][Doc] Update PyTorch Data Ingestion User Guide #45421

Merged · 4 commits · Jun 25, 2024
Changes from 1 commit
doc/source/train/user-guides/data-loading-preprocessing.rst (61 changes: 46 additions & 15 deletions)
@@ -258,8 +258,7 @@ Some frameworks provide their own dataset and data loading utilities. For example:
- **Hugging Face:** `Dataset <https://huggingface.co/docs/datasets/index>`_
- **PyTorch Lightning:** `LightningDataModule <https://lightning.ai/docs/pytorch/stable/data/datamodule.html>`_

-These utilities can still be used directly with Ray Train. In particular, you may want to do this if you already have your data ingestion pipeline set up.
-However, for more performant large-scale data ingestion we do recommend migrating to Ray Data.
+You can still use these framework data utilities directly with Ray Train.

At a high level, you can compare these concepts as follows:

@@ -276,34 +275,66 @@ At a high level, you can compare these concepts as follows:
- n/a
- :meth:`ray.data.Dataset.iter_torch_batches`

+Why using Ray Data?
+~~~~~~~~~~~~~~~~~~~

[Comment from Contributor]
Suggested change:
-Why using Ray Data?
+Comparison with Ray Data

+These framework data utilities work well for small datasets that require only light preprocessing.
+However, they can become a performance bottleneck when handling large-scale datasets with complex preprocessing logic.
+Ray Data is designed to address these challenges and provides efficient large-scale data ingestion.

+Specifically, you can benefit from the following features of Ray Data:

+**Streaming execution**:

+- The preprocessing pipeline executes lazily and streams data batches into the training workers.
+- Training can start immediately, without significant up-front preprocessing time.

+**Automatic data sharding**:

+- The dataset is automatically sharded across all training workers.

+**Leverage additional resources for preprocessing**:

+- Ray Data can utilize all resources in the Ray cluster for preprocessing, not just those on the training nodes.
[Comment from Contributor]
This is good content that I think everyone should read, regardless of whether or not they are starting with PyTorch data. Do you think we could bring this higher up in the guide (e.g. even in the introduction), and then reference it from here?

[Reply from Member Author]
OK. Sounds good to me.
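To make the streaming behavior concrete, here is a minimal sketch (the S3 path and the ``normalize`` function are hypothetical stand-ins): the read and ``map_batches`` calls only build a lazy plan, and data is loaded, preprocessed, and streamed batch by batch once iteration starts.

```python
import ray

# Lazy: this builds an execution plan but loads and transforms nothing yet.
ds = ray.data.read_parquet("s3://my-bucket/train/")  # hypothetical path

def normalize(batch):
    # Hypothetical per-batch preprocessing (batch is a dict of NumPy arrays).
    batch["image"] = batch["image"] / 255.0
    return batch

ds = ds.map_batches(normalize)  # still lazy

# Execution starts here: batches are preprocessed and streamed on demand,
# so iteration (and training) begins without a long up-front preprocessing pass.
for batch in ds.iter_torch_batches(batch_size=32):
    ...
```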

-For more details, see the following sections for each framework.
+For more details, see the following sections for each framework:

.. tab-set::

-.. tab-item:: PyTorch Dataset and DataLoader
+.. tab-item:: PyTorch
[Comment from Contributor]
The original names were more explicit to make it clear that this is referring to the dataset framework, rather than the training framework.


+**Option 1 (with Ray Data):**

-**Option 1 (with Ray Data):** Convert your PyTorch Dataset to a Ray Dataset and pass it into the Trainer via ``datasets`` argument.
-Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
-You can convert this to replace the PyTorch DataLoader via :meth:`ray.data.DataIterator.iter_torch_batches`.
+1. Convert your PyTorch Dataset to a Ray Dataset.

[Comment from Contributor]
nit:

Suggested change:
-1. Convert your PyTorch Dataset to a Ray Dataset and
+1. Convert your PyTorch Dataset to a Ray Dataset.

There are some other small typos/formatting errors that I'll review more thoroughly in a follow-up review.

+2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
+3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
+4. Create a dataset iterable via :meth:`ray.data.DataIterator.iter_torch_batches`.

For more details, see :ref:`Migrating from PyTorch Datasets and DataLoaders <migrate_pytorch>`.
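Putting the four steps together, a minimal end-to-end sketch might look like the following (the ``TensorDataset`` stands in for your own PyTorch Dataset, and the training step is elided):

```python
import torch
from torch.utils.data import TensorDataset

import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# 1. Convert a PyTorch Dataset into a Ray Dataset.
torch_ds = TensorDataset(torch.randn(128, 4), torch.randint(0, 2, (128,)))
train_ds = ray.data.from_torch(torch_ds)

def train_loop_per_worker():
    # 3. Access this worker's shard of the dataset.
    shard = ray.train.get_dataset_shard("train")
    # 4. Iterate over Torch batches in place of a PyTorch DataLoader.
    for batch in shard.iter_torch_batches(batch_size=32):
        ...  # forward/backward pass goes here

# 2. Pass the Ray Dataset to the TorchTrainer via the ``datasets`` argument.
trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```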

-**Option 2 (without Ray Data):** Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
-You can use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training.
+**Option 2 (with PyTorch DataLoader):**

+1. Instantiate the Torch Dataset and DataLoader directly in the ``train_loop_per_worker``.
+2. Use the :meth:`ray.train.torch.prepare_data_loader` utility to set up the DataLoader for distributed training, as shown in the sketch below.
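A minimal sketch of this option, again with a stand-in ``TensorDataset``; note that both the Dataset and the DataLoader are created inside the worker function:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker():
    # 1. Instantiate the Torch Dataset and DataLoader inside the worker.
    dataset = TensorDataset(torch.randn(128, 4), torch.randint(0, 2, (128,)))
    dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
    # 2. prepare_data_loader adds a DistributedSampler and moves batches
    # to the appropriate device for each worker.
    dataloader = ray.train.torch.prepare_data_loader(dataloader)
    for batch in dataloader:
        ...  # forward/backward pass goes here

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```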

-.. tab-item:: LightningDataModule
+.. tab-item:: Lightning

The ``LightningDataModule`` is created with PyTorch ``Dataset``\s and ``DataLoader``\s. You can apply the same logic here.
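For instance, a hypothetical datamodule like the sketch below is just a wrapper around Torch ``Dataset`` and ``DataLoader`` objects, so either option from the PyTorch tab applies to the data it produces:

```python
import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader, TensorDataset

class MyDataModule(pl.LightningDataModule):
    """Hypothetical datamodule wrapping plain Torch Dataset/DataLoader objects."""

    def setup(self, stage=None):
        self.train_ds = TensorDataset(torch.randn(128, 4), torch.randint(0, 2, (128,)))

    def train_dataloader(self):
        # Either convert self.train_ds with ray.data.from_torch (Option 1),
        # or wrap this DataLoader with prepare_data_loader (Option 2).
        return DataLoader(self.train_ds, batch_size=32)
```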

-.. tab-item:: Hugging Face Dataset
+.. tab-item:: Hugging Face

+**Option 1 (with Ray Data):**

-**Option 1 (with Ray Data):** Convert your Hugging Face Dataset to a Ray Dataset and pass it into the Trainer via the ``datasets`` argument.
-Inside your ``train_loop_per_worker``, you can access the dataset via :meth:`ray.train.get_dataset_shard`.
+1. Convert your Hugging Face Dataset to a Ray Dataset. For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
+2. Pass the Ray Dataset into the TorchTrainer via the ``datasets`` argument.
+3. Inside your ``train_loop_per_worker``, access the sharded dataset via :meth:`ray.train.get_dataset_shard`.
+4. Create an iterable dataset via :meth:`ray.data.DataIterator.iter_torch_batches`.
+5. Pass the iterable dataset into ``transformers.Trainer`` during initialization (sketched below).

-For instructions, see :ref:`Ray Data for Hugging Face <loading_datasets_from_ml_libraries>`.
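A minimal sketch of these five steps, with a toy in-memory Hugging Face Dataset; the model and ``TrainingArguments`` setup for ``transformers.Trainer`` are elided:

```python
import datasets  # Hugging Face Datasets

import ray
import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# 1. Convert a Hugging Face Dataset into a Ray Dataset.
hf_ds = datasets.Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})
train_ds = ray.data.from_huggingface(hf_ds)

def train_loop_per_worker():
    # 3. Access this worker's shard of the dataset.
    shard = ray.train.get_dataset_shard("train")
    # 4. Create an iterable over Torch batches...
    train_iterable = shard.iter_torch_batches(batch_size=2)
    # 5. ...and pass it as the training data when initializing
    # transformers.Trainer (model and args setup elided here).
    ...

# 2. Pass the Ray Dataset to the TorchTrainer via the ``datasets`` argument.
trainer = TorchTrainer(
    train_loop_per_worker,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=2),
)
trainer.fit()
```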
+**Option 2 (with Hugging Face Dataset):**
[Comment from Contributor]
nit: I understand why you chose to do this, but I'm also a little worried this might be confusing, since Option 1 technically also uses Hugging Face Datasets.

[Reply from Member Author (woshiyyya), May 18, 2024]
Oh, I realized the difference now. Previously, this section aimed to teach users how to convert their HF Dataset to a Ray Dataset and then do training. But this PR categorizes directly by what we eventually use in the training function.

# prev
HF Dataset -> Ray Data -> HF Transformers
            HF Dataset -> HF Transformers

# now
Ray Data -> HF Transformers
HF Dataset -> HF Transformers

My consideration here is that we'd better not force everyone to take the "HF Dataset -> Ray Data" conversion step.

For example, their original dataset format could be Parquet, and before onboarding Ray, they may have already built an HF Dataset from the Parquet files and fed it to the HF Trainer. In this case, they can build a Ray Dataset either from Parquet or from the HF Dataset.

# Before onboarding Ray
raw data -> HF dataset -> HF transformer

# After onboarding Ray
option 1: raw data -> HF dataset -> Ray Data -> HF transformer

v.s.

option 2: raw data -> Ray Data -> HF transformer

We can discuss more in person next week.


-**Option 2 (without Ray Data):** Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
+1. Instantiate the Hugging Face Dataset directly in the ``train_loop_per_worker``.
+2. Pass the Hugging Face Dataset into ``transformers.Trainer`` during initialization, as shown in the sketch below.
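A minimal sketch of this option with a toy in-memory dataset; the model and ``transformers.TrainingArguments`` are placeholders you would fill in:

```python
import datasets  # Hugging Face Datasets
import transformers

def train_loop_per_worker():
    # 1. Instantiate the Hugging Face Dataset inside the training loop.
    hf_ds = datasets.Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})
    # 2. Pass it to transformers.Trainer during initialization.
    trainer = transformers.Trainer(
        model=...,   # your model
        args=...,    # your transformers.TrainingArguments
        train_dataset=hf_ds,
    )
    trainer.train()
```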

.. tip::
