From 945fda0efb128c41437bdec07a60aca201f70a73 Mon Sep 17 00:00:00 2001
From: regisss
Date: Fri, 7 Jul 2023 02:21:32 +0200
Subject: [PATCH 1/2] Update BridgeTower blog post

---
 bridgetower.md | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/bridgetower.md b/bridgetower.md
index 82aed15bad..d8a1b28749 100644
--- a/bridgetower.md
+++ b/bridgetower.md
@@ -12,7 +12,7 @@ authors:

-[Optimum Habana v1.6](https://github.com/huggingface/optimum-habana/tree/main) on Habana Gaudi2 achieves **more than x3 speedups compared to A100** when fine-tuning BridgeTower, a state-of-the-art vision-language model. Two new features contribute to the performance improvement: hardware-accelerated data loading and a fast DDP implementation.
+[Optimum Habana v1.6](https://github.com/huggingface/optimum-habana/tree/main) on Habana Gaudi2 achieves **almost x3 speedups compared to A100** when fine-tuning BridgeTower, a state-of-the-art vision-language model. Two new features contribute to the performance improvement: hardware-accelerated data loading and a fast DDP implementation.

*These techniques apply to any other workloads constrained by data loading, which is frequently the case for many types of vision models.*
This post will take you through the process and benchmark we used to compare BridgeTower fine-tuning on Habana Gaudi2 and Nvidia A100 80GB. It also demonstrates how easy it is to take advantage of these features in transformers-based models.
@@ -57,11 +57,11 @@ Here are the throughputs we got on Gaudi2 and A100:
| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` |
|:----------:|:--------------------------:|:--------------------------:|
| Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s |
-| A100 GPU | 188.6 samples/s | 254.7 samples/s |
+| A100 GPU | 210.5 samples/s | 296.6 samples/s |

-We first see that **Gaudi2 is x2.51 faster than A100** with `dataloader_num_workers=1` and x2.82 faster with `dataloader_num_workers=0`, which is even better than [the speedups we previously reported](https://huggingface.co/blog/habana-gaudi-2-benchmark)!
+We first see that **Gaudi2 is x2.16 faster than A100** with `dataloader_num_workers=1` and x2.53 faster with `dataloader_num_workers=0`, which is on par with [the speedups we previously reported](https://huggingface.co/blog/habana-gaudi-2-benchmark)!

-Second, we see that **allocating more resources for data loading can lead to easy speedups**: x1.20 on Gaudi2 and x1.35 on A100.
+Second, we see that **allocating more resources for data loading can lead to easy speedups**: x1.20 on Gaudi2 and x1.41 on A100.

We also ran experiments with several dedicated subprocesses for data loading but performance was not better than with `dataloader_num_workers=1` for both Gaudi2 and A100.
Thus, **using `dataloader_num_workers=1` is usually a good first way of accelerating your runs involving images!**
@@ -77,9 +77,9 @@ Before delving into how to perform hardware-accelerated data loading, let's look

Optimum Habana's fast DDP does not split parameter gradients into buckets as [DDP does](https://pytorch.org/docs/stable/notes/ddp.html#internal-design). It also uses [HPU graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html?highlight=hpu%20graphs) to collect gradients in all processes and then update them (after the [all_reduce](https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_reduce) operation is performed) with minimal host overhead.
You can check this implementation [here](https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/distributed/fast_ddp.py).
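
To see where these options plug in, here is a minimal sketch (not taken from the benchmark code) of how `dataloader_num_workers=1` and `distribution_strategy="fast_ddp"` can be passed to Optimum Habana's `GaudiTrainer`. The checkpoint, output directory, Gaudi configuration and `train_dataset` below are illustrative placeholders; the actual runs rely on the `run_bridgetower.py` script described in the reproduction section.

```python
from transformers import BridgeTowerForContrastiveLearning
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# Illustrative sketch only: checkpoint, output directory, Gaudi configuration
# and dataset preparation are placeholders, not the benchmark script itself.
model = BridgeTowerForContrastiveLearning.from_pretrained(
    "BridgeTower/bridgetower-large-itm-mlm-itc"
)

training_args = GaudiTrainingArguments(
    output_dir="/tmp/bridgetower",
    use_habana=True,                   # run on Gaudi HPUs
    use_lazy_mode=True,
    gaudi_config_name="Habana/clip",   # placeholder Gaudi configuration from the Hub
    dataloader_num_workers=1,          # one dedicated data-loading subprocess
    distribution_strategy="fast_ddp",  # the lightweight DDP described above
)

trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: an already preprocessed image-text dataset
)
trainer.train()
```

In practice, the same options are exposed as command-line flags of `run_bridgetower.py` (for instance `--distribution_strategy fast_ddp`), as shown in the reproduction section below.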

-Simply using `distribution_strategy="fast_ddp"` (and keeping `dataloader_num_workers=1`) on Gaudi2 gives us 705.9 samples/s. **This is x1.10 faster than with DDP and x2.77 faster than A100!**
+Simply using `distribution_strategy="fast_ddp"` (and keeping `dataloader_num_workers=1`) on Gaudi2 gives us 705.9 samples/s. **This is x1.10 faster than with DDP and x2.38 faster than A100!**

-So adding just two training arguments (`dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`) led to a x1.33 speedup on Gaudi2 and to a x2.77 speedup compared to A100 with `dataloader_num_workers=1`.
+So adding just two training arguments (`dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`) led to a x1.33 speedup on Gaudi2 and to a x2.38 speedup compared to A100 with `dataloader_num_workers=1`.

### Hardware-accelerated data loading with Optimum Habana

@@ -120,17 +120,17 @@ We are now going to benchmark a run with `dataloader_num_workers=1`, `distributi
| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` | `dataloader_num_workers=1` + `distribution_strategy="fast_ddp"` | `dataloader_num_workers=1` + `distribution_strategy="fast_ddp"` + `mediapipe_dataloader` |
|:----------:|:--------------------------:|:--------------------------:|:---------------:|:---------------:|
| Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s | 705.9 samples/s | 802.1 samples/s |
-| A100 GPU | 188.6 samples/s | 254.7 samples/s | / | / |
+| A100 GPU | 210.5 samples/s | 296.6 samples/s | / | / |

We got an additional x1.14 speedup compared to the previous run with `dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`.

-This final run is thus x1.51 faster than our base run on Gaudi2 **simply adding 3 ready-to-use training arguments.** It is also **x3.15 faster than A100 with `dataloader_num_workers=1`!**
+This final run is thus x1.51 faster than our base run on Gaudi2 **simply adding 3 ready-to-use training arguments.** It is also **x2.70 faster than A100 with `dataloader_num_workers=1`!**

### Reproducing this benchmark

To reproduce this benchmark, you first need to get access to Gaudi2 through the [Intel Developer Cloud](https://www.intel.com/content/www/us/en/secure/developer/devcloud/cloud-launchpad.html) (see [this guide](https://huggingface.co/blog/habana-gaudi-2-benchmark#how-to-get-access-to-gaudi2) for more information).

-Then, you need to install the latest version of Optimum Habana and to run `run_bridgetower.py` that you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/contrastive-image-text/run_bridgetower.py). Here is how to do it:
+Then, you need to install the latest version of Optimum Habana and run `run_bridgetower.py`` that you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/contrastive-image-text/run_bridgetower.py). Here is how to do it:

```bash
pip install optimum[habana]
@@ -164,9 +164,9 @@ To push your model and Tensorboard logs to the Hugging Face Hub, you will have t
huggingface-cli login
```

-For A100, you can use the same `run_bridgetower.py` script with a couple of small changes:
+For A100, you can use the same `run_bridgetower.py` script with a few small changes:
- Replace `GaudiTrainer` and `GaudiTrainingArguments` with `Trainer` and `TrainingArguments` from Transformers
-- Remove references to `GaudiConfig` and `gaudi_config`
+- Remove references to `GaudiConfig`, `gaudi_config` and `HabanaDataloaderTrainer`
- Import `set_seed` directly from Transformers: `from transformers import set_seed`

The results displayed in this benchmark were obtained with a Nvidia A100 80GB GCP instance with 8 GPUS.

@@ -177,7 +177,7 @@ Note that `--distribution_strategy fast_ddp` and `--mediapipe_dataloader` are co
## Conclusion

When dealing with images, we presented two solutions to speed up your training workflows: allocating more resources to the dataloader, and decoding and augmenting images directly on accelerator devices rather than on CPU.
-We showed that it leads to dramatic speedups when training a SOTA vision-language model like BridgeTower: **Habana Gaudi2 with Optimum Habana is more than 3x faster than Nvidia A100 80GB with Transformers!**
+We showed that it leads to dramatic speedups when training a SOTA vision-language model like BridgeTower: **Habana Gaudi2 with Optimum Habana is almost 3x faster than Nvidia A100 80GB with Transformers!**
And this is super easy to use as you just need to provide a few additional training arguments.

To go further, we are looking forward to using HPU graphs for training models even faster and to presenting how to use DeepSpeed ZeRO-3 on Gaudi2 to accelerate the training of your LLMs. Stay tuned!

From 40bb4c3b11344b1280745a55decb7ea31c267caf Mon Sep 17 00:00:00 2001
From: regisss
Date: Fri, 7 Jul 2023 02:25:36 +0200
Subject: [PATCH 2/2] Fix typo

---
 bridgetower.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bridgetower.md b/bridgetower.md
index d8a1b28749..0b86b9a3eb 100644
--- a/bridgetower.md
+++ b/bridgetower.md
@@ -130,7 +130,7 @@ This final run is thus x1.51 faster than our base run on Gaudi2 **simply adding

To reproduce this benchmark, you first need to get access to Gaudi2 through the [Intel Developer Cloud](https://www.intel.com/content/www/us/en/secure/developer/devcloud/cloud-launchpad.html) (see [this guide](https://huggingface.co/blog/habana-gaudi-2-benchmark#how-to-get-access-to-gaudi2) for more information).

-Then, you need to install the latest version of Optimum Habana and run `run_bridgetower.py`` that you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/contrastive-image-text/run_bridgetower.py). Here is how to do it:
+Then, you need to install the latest version of Optimum Habana and run `run_bridgetower.py` that you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/contrastive-image-text/run_bridgetower.py). Here is how to do it:

```bash
pip install optimum[habana]