Update BridgeTower blog post #1295

Merged 2 commits on Jul 7, 2023

bridgetower.md: 24 changes (12 additions, 12 deletions)

@@ -12,7 +12,7 @@ authors:
<!-- {blog_metadata} -->
<!-- {authors} -->

-[Optimum Habana v1.6](https://github.com/huggingface/optimum-habana/tree/main) on Habana Gaudi2 achieves **more than x3 speedups compared to A100** when fine-tuning BridgeTower, a state-of-the-art vision-language model. Two new features contribute to the performance improvement: hardware-accelerated data loading and a fast DDP implementation.
+[Optimum Habana v1.6](https://github.com/huggingface/optimum-habana/tree/main) on Habana Gaudi2 achieves **almost x3 speedups compared to A100** when fine-tuning BridgeTower, a state-of-the-art vision-language model. Two new features contribute to the performance improvement: hardware-accelerated data loading and a fast DDP implementation.

*These techniques apply to any other workloads constrained by data loading, which is frequently the case for many types of vision models.* This post will take you through the process and benchmark we used to compare BridgeTower fine-tuning on Habana Gaudi2 and Nvidia A100 80GB. It also demonstrates how easy it is to take advantage of these features in transformers-based models.

@@ -57,11 +57,11 @@ Here are the throughputs we got on Gaudi2 and A100:
| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` |
|:----------:|:--------------------------:|:--------------------------:|
| Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s |
-| A100 GPU | 188.6 samples/s | 254.7 samples/s |
+| A100 GPU | 210.5 samples/s | 296.6 samples/s |

-We first see that **Gaudi2 is x2.51 faster than A100** with `dataloader_num_workers=1` and x2.82 faster with `dataloader_num_workers=0`, which is even better than [the speedups we previously reported](https://huggingface.co/blog/habana-gaudi-2-benchmark)!
+We first see that **Gaudi2 is x2.16 faster than A100** with `dataloader_num_workers=1` and x2.53 faster with `dataloader_num_workers=0`, which is on par with [the speedups we previously reported](https://huggingface.co/blog/habana-gaudi-2-benchmark)!

-Second, we see that **allocating more resources for data loading can lead to easy speedups**: x1.20 on Gaudi2 and x1.35 on A100.
+Second, we see that **allocating more resources for data loading can lead to easy speedups**: x1.20 on Gaudi2 and x1.41 on A100.

We also ran experiments with several dedicated subprocesses for data loading but performance was not better than with `dataloader_num_workers=1` for both Gaudi2 and A100.
Thus, **using `dataloader_num_workers=1` is usually a good first way of accelerating your runs involving images!**
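
In practice, this is a one-line change. Here is a minimal sketch, assuming a standard Transformers setup (the `output_dir` is illustrative; Optimum Habana's `GaudiTrainingArguments` accepts the same argument):

```python
from transformers import TrainingArguments

# Minimal sketch: one dedicated subprocess for data loading keeps the main
# process free to run the training loop. With the numbers above, this alone
# takes Gaudi2 from 532.4 to 639.7 samples/s (639.7 / 532.4 ≈ x1.20).
training_args = TrainingArguments(
    output_dir="/tmp/bridgetower",  # illustrative path
    dataloader_num_workers=1,       # spawn one data-loading subprocess
)
```
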
@@ -77,9 +77,9 @@ Before delving into how to perform hardware-accelerated data loading, let's look

Optimum Habana's fast DDP does not split parameter gradients into buckets as [DDP does](https://pytorch.org/docs/stable/notes/ddp.html#internal-design). It also uses [HPU graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html?highlight=hpu%20graphs) to collect gradients in all processes and then update them (after the [all_reduce](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce) operation is performed) with minimal host overhead. You can check this implementation [here](https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/distributed/fast_ddp.py).
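
To make the design concrete, here is a minimal sketch of the underlying idea only: averaging each parameter's gradient across processes with a plain `all_reduce`, without DDP's bucketing. This is not the Optimum Habana implementation, which additionally relies on HPU graphs to minimize host overhead:

```python
import torch
import torch.distributed as dist

def fast_ddp_style_grad_sync(model: torch.nn.Module) -> None:
    """Sketch of the core idea behind fast DDP (not the actual implementation).

    Assumes torch.distributed has already been initialized.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient across all processes in place...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide by the number of processes to get the average.
            param.grad.div_(world_size)
```
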

-Simply using `distribution_strategy="fast_ddp"` (and keeping `dataloader_num_workers=1`) on Gaudi2 gives us 705.9 samples/s. **This is x1.10 faster than with DDP and x2.77 faster than A100!**
+Simply using `distribution_strategy="fast_ddp"` (and keeping `dataloader_num_workers=1`) on Gaudi2 gives us 705.9 samples/s. **This is x1.10 faster than with DDP and x2.38 faster than A100!**

-So adding just two training arguments (`dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`) led to a x1.33 speedup on Gaudi2 and to a x2.77 speedup compared to A100 with `dataloader_num_workers=1`.
+So adding just two training arguments (`dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`) led to a x1.33 speedup on Gaudi2 and to a x2.38 speedup compared to A100 with `dataloader_num_workers=1`.
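
Assuming the `GaudiTrainingArguments` API from Optimum Habana (the `output_dir` below is illustrative), combining both arguments looks like this sketch:

```python
from optimum.habana import GaudiTrainingArguments

# Sketch: the two arguments discussed so far, together.
training_args = GaudiTrainingArguments(
    output_dir="/tmp/bridgetower",     # illustrative path
    use_habana=True,                   # run on HPU
    use_lazy_mode=True,                # Habana lazy execution mode
    dataloader_num_workers=1,          # one data-loading subprocess
    distribution_strategy="fast_ddp",  # Optimum Habana's fast DDP
)
```
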


### Hardware-accelerated data loading with Optimum Habana
@@ -120,17 +120,17 @@ We are now going to benchmark a run with `dataloader_num_workers=1`, `distributi
| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` | `dataloader_num_workers=1` + `distribution_strategy="fast_ddp"` | `dataloader_num_workers=1` + `distribution_strategy="fast_ddp"` + `mediapipe_dataloader` |
|:----------:|:--------------------------:|:--------------------------:|:---------------:|:---------------:|
| Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s | 705.9 samples/s | 802.1 samples/s |
-| A100 GPU | 188.6 samples/s | 254.7 samples/s | / | / |
+| A100 GPU | 210.5 samples/s | 296.6 samples/s | / | / |

We got an additional x1.14 speedup compared to the previous run with `dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`.
-This final run is thus x1.51 faster than our base run on Gaudi2 **simply adding 3 ready-to-use training arguments.** It is also **x3.15 faster than A100 with `dataloader_num_workers=1`!**
+This final run is thus x1.51 faster than our base run on Gaudi2 **simply adding 3 ready-to-use training arguments.** It is also **x2.70 faster than A100 with `dataloader_num_workers=1`!**
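
Note that `mediapipe_dataloader` is passed as a command-line flag to the `run_bridgetower.py` example script rather than being a `GaudiTrainingArguments` field. A hypothetical sketch of how such a flag can be declared in an `HfArgumentParser` dataclass (the names here are illustrative, not the script's actual definitions):

```python
from dataclasses import dataclass, field

from transformers import HfArgumentParser

@dataclass
class DataTrainingArguments:
    # Hypothetical flag mirroring the example script's option: when enabled,
    # images are decoded and augmented on HPU with Habana MediaPipe instead
    # of on CPU.
    mediapipe_dataloader: bool = field(
        default=False,
        metadata={"help": "Decode and transform images on HPU with MediaPipe."},
    )

parser = HfArgumentParser(DataTrainingArguments)
# A bool field with default False becomes a --mediapipe_dataloader switch.
(data_args,) = parser.parse_args_into_dataclasses()
```
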


### Reproducing this benchmark

To reproduce this benchmark, you first need to get access to Gaudi2 through the [Intel Developer Cloud](https://www.intel.com/content/www/us/en/secure/developer/devcloud/cloud-launchpad.html) (see [this guide](https://huggingface.co/blog/habana-gaudi-2-benchmark#how-to-get-access-to-gaudi2) for more information).

-Then, you need to install the latest version of Optimum Habana and to run `run_bridgetower.py` that you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/contrastive-image-text/run_bridgetower.py). Here is how to do it:
+Then, you need to install the latest version of Optimum Habana and run `run_bridgetower.py` that you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/contrastive-image-text/run_bridgetower.py). Here is how to do it:

```bash
pip install optimum[habana]
@@ -164,9 +164,9 @@ To push your model and Tensorboard logs to the Hugging Face Hub, you will have t
huggingface-cli login
```

-For A100, you can use the same `run_bridgetower.py` script with a couple of small changes:
+For A100, you can use the same `run_bridgetower.py` script with a few small changes:
- Replace `GaudiTrainer` and `GaudiTrainingArguments` with `Trainer` and `TrainingArguments` from Transformers
-- Remove references to `GaudiConfig` and `gaudi_config`
+- Remove references to `GaudiConfig`, `gaudi_config` and `HabanaDataloaderTrainer`
- Import `set_seed` directly from Transformers: `from transformers import set_seed`

The results displayed in this benchmark were obtained with an Nvidia A100 80GB GCP instance with 8 GPUs.
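
Concretely, the swap could look like the following sketch (the seed value, paths and commented-out Gaudi imports are illustrative, not copied from the script):

```python
# Gaudi2 version (Optimum Habana), shown for contrast:
# from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# A100 version (plain Transformers), per the changes listed above:
from transformers import Trainer, TrainingArguments, set_seed

set_seed(42)  # illustrative seed value

training_args = TrainingArguments(
    output_dir="/tmp/bridgetower",  # illustrative path
    dataloader_num_workers=1,
)
# Trainer takes no gaudi_config:
# trainer = Trainer(model=model, args=training_args, ...)
```
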
@@ -177,7 +177,7 @@ Note that `--distribution_strategy fast_ddp` and `--mediapipe_dataloader` are co
## Conclusion

When dealing with images, we presented two solutions to speed up your training workflows: allocating more resources to the dataloader, and decoding and augmenting images directly on accelerator devices rather than on CPU.
-We showed that it leads to dramatic speedups when training a SOTA vision-language model like BridgeTower: **Habana Gaudi2 with Optimum Habana is more than 3x faster than Nvidia A100 80GB with Transformers!**
+We showed that it leads to dramatic speedups when training a SOTA vision-language model like BridgeTower: **Habana Gaudi2 with Optimum Habana is almost 3x faster than Nvidia A100 80GB with Transformers!**
And this is super easy to use as you just need to provide a few additional training arguments.

To go further, we are looking forward to using HPU graphs for training models even faster and to presenting how to use DeepSpeed ZeRO-3 on Gaudi2 to accelerate the training of your LLMs. Stay tuned!