# Accelerating Vision-Language Models: BridgeTower on Habana Gaudi2
Optimum Habana v1.6 on Habana Gaudi2 achieves almost x3 speedups compared to A100 when fine-tuning BridgeTower, a state-of-the-art vision-language model. Two new features contribute to the performance improvement: hardware-accelerated data loading and a fast DDP implementation.
These techniques apply to any other workloads constrained by data loading, which is frequently the case for many types of vision models. This post will take you through the process and benchmark we used to compare BridgeTower fine-tuning on Habana Gaudi2 and Nvidia A100 80GB. It also demonstrates how easy it is to take advantage of these features in transformers-based models.
In the recent past, Vision-Language (VL) models have gained tremendous importance and shown dominance in a variety of VL tasks. The most common approaches leverage uni-modal encoders to extract representations from their respective modalities; those representations are then either fused together or fed into a cross-modal encoder. To efficiently handle some of the performance limitations and restrictions in VL representation learning, BridgeTower introduces multiple bridge layers that build a connection between the top layers of the uni-modal encoders and each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion between visual and textual representations at different semantic levels in the cross-modal encoder.
Pre-trained with only 4M images (see details below), BridgeTower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, BridgeTower achieves an accuracy of 78.73% on the VQAv2 test-std set, outperforming the previous state-of-the-art model (METER) by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaled, BridgeTower achieves an accuracy of 81.15%, surpassing models that are pre-trained on orders-of-magnitude larger datasets.
The Nvidia A100 Tensor Core GPU features third-generation Tensor Core technology. Although a newer generation (H100) was released recently, the A100 is still the fastest GPU you will find at most cloud providers. We use the 80GB-memory variant here, which also offers faster memory bandwidth than the 40GB one.
Habana Gaudi2 is the second-generation AI hardware accelerator designed by Habana Labs. A single server contains 8 accelerator devices called HPUs, each with 96GB of memory. Check out our previous blog post for a more in-depth introduction and a guide to accessing it through the Intel Developer Cloud. Unlike many AI accelerators on the market, Gaudi2's advanced features are very easy to apply with Optimum Habana, which enables users to port Transformers-compatible scripts to Gaudi with just a 2-line change.
To benchmark training, we are going to fine-tune a BridgeTower Large checkpoint consisting of 866M parameters. This checkpoint was pre-trained on English-language data using masked language modeling, image-text matching and image-text contrastive losses on Conceptual Captions, SBU Captions, MSCOCO Captions and Visual Genome.
We will further fine-tune this checkpoint on the New Yorker Caption Contest dataset, which consists of cartoons from The New Yorker along with their most highly voted captions.
Hyperparameters are the same for both accelerators, except the batch size: we managed to fit 40 samples per device on Gaudi2 against 32 on A100. You can check them out here for Gaudi2 and there for A100.
When dealing with datasets involving images, data loading is frequently a bottleneck because many costly operations are computed on CPU (image decoding, image augmentations) and then full images are sent to the training devices. Ideally, we would like to send only raw bytes to devices and perform decoding and the various image transformations on device. But first, let's see how to easily allocate more resources to data loading to accelerate your runs.
When image loading is done on CPU, a quick way to speed it up is to allocate more subprocesses for data loading. This is very easy to do with Transformers' `TrainingArguments` (or its Optimum Habana counterpart `GaudiTrainingArguments`): you can use the `dataloader_num_workers=N` argument to set the number of subprocesses (`N`) allocated on CPU for data loading.
The default is 0, which means that data is loaded in the main process. This may not be optimal as the main process has many things to manage. We can set it to 1 to have one fully dedicated subprocess for data loading. When several subprocesses are allocated, each one of them will be responsible for preparing a batch. This means that RAM consumption will increase with the number of workers. One recommendation would be to set it to the number of CPU cores, but those cores may not be fully free so you will have to try it out to find the best configuration.
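For instance, here is a minimal sketch of what this looks like with the plain Transformers `TrainingArguments` (the output directory and batch size simply mirror the benchmark settings; on Gaudi you would use `GaudiTrainingArguments` instead):

```python
from transformers import TrainingArguments

# One dedicated CPU subprocess for data loading; increase this if your
# CPU cores are idle, keeping in mind that RAM usage grows with each worker.
training_args = TrainingArguments(
    output_dir="/tmp/bridgetower-test",
    dataloader_num_workers=1,
    per_device_train_batch_size=40,  # 40 fits on Gaudi2, 32 on A100
    bf16=True,  # matches the mixed-precision (bfloat16/float) runs below
)
```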
Let's run the two following experiments:

- a mixed-precision (bfloat16/float) run distributed across 8 devices where data loading is performed by the same process as everything else (i.e. `dataloader_num_workers=0`)
- a mixed-precision (bfloat16/float) run distributed across 8 devices with 1 dedicated subprocess for data loading (i.e. `dataloader_num_workers=1`)
Here are the throughputs we got on Gaudi2 and A100:
| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` |
|:---|:---|:---|
| Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s |
| A100 GPU | 210.5 samples/s | 296.6 samples/s |
We first see that Gaudi2 is x2.16 faster than A100 with `dataloader_num_workers=1` and x2.53 faster with `dataloader_num_workers=0`, which is on par with the speedups we previously reported!

Second, we see that allocating more resources for data loading can lead to easy speedups: x1.20 on Gaudi2 and x1.41 on A100.

We also ran experiments with several dedicated subprocesses for data loading, but performance was not better than with `dataloader_num_workers=1` for both Gaudi2 and A100.

Thus, using `dataloader_num_workers=1` is usually a good first way of accelerating your runs involving images!
Tensorboard logs can be visualized here for Gaudi2 and there for A100.
Before delving into how to perform hardware-accelerated data loading, let's look at another very easy way of speeding up your distributed runs on Gaudi. The new release of Optimum Habana, version 1.6.0, introduced a new feature that allows users to choose the distribution strategy to use:
- `distribution_strategy="ddp"` to use PyTorch `DistributedDataParallel` (DDP)
- `distribution_strategy="fast_ddp"` to use a lighter and usually faster implementation
Optimum Habana's fast DDP does not split parameter gradients into buckets as DDP does. It also uses HPU graphs to collect gradients in all processes and then update them (after the all_reduce operation is performed) with minimal host overhead. You can check this implementation here.
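In other words, each process averages every parameter's gradient directly, without grouping gradients into buckets first. Here is a simplified sketch of that idea using plain PyTorch collectives; this is an illustration only, not Optimum Habana's actual implementation, which additionally relies on HPU graphs to minimize host overhead:

```python
import torch
import torch.distributed as dist

def all_reduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all processes, one parameter at a time,
    without DDP's gradient-bucketing machinery."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```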
Simply using `distribution_strategy="fast_ddp"` (and keeping `dataloader_num_workers=1`) on Gaudi2 gives us 705.9 samples/s. This is x1.10 faster than with DDP and x2.38 faster than A100!

So adding just two training arguments (`dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`) led to a x1.33 speedup on Gaudi2 and to a x2.38 speedup compared to A100 with `dataloader_num_workers=1`.
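Putting the two arguments together looks roughly like the following sketch, assuming `GaudiTrainingArguments` exposes fields matching the command-line arguments of the example script shown later in this post:

```python
from optimum.habana import GaudiTrainingArguments

# Both optimizations enabled together; the other values mirror the
# benchmark command given at the end of this post.
training_args = GaudiTrainingArguments(
    output_dir="/tmp/bridgetower-test",
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name="Habana/clip",
    dataloader_num_workers=1,          # dedicated data-loading subprocess
    distribution_strategy="fast_ddp",  # lighter, faster DDP alternative
    per_device_train_batch_size=40,
)
```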
For even larger speedups, we are now going to move as many data loading operations as possible from the CPU to the accelerator devices (i.e. HPUs on Gaudi2 or GPUs on A100). This can be done on Gaudi2 using Habana's media pipeline.
Given a dataset, most dataloaders proceed as follows:

1. Fetch data (e.g. where your JPEG images are stored on disk)
2. The CPU reads encoded images
3. The CPU decodes images
4. The CPU applies image transformations to augment images
5. Finally, images are sent to devices (although this is usually not done by the dataloader itself)
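To make the recipe concrete, here is what steps 2-5 typically look like with torchvision; the dataset path and transform values are placeholders:

```python
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import ImageFolder

# Steps 2-4 all run on CPU workers: read, decode and augment images.
cpu_pipeline = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

dataset = ImageFolder("/path/to/images", transform=cpu_pipeline)
loader = DataLoader(dataset, batch_size=40, num_workers=1)

# Step 5: each batch is moved to the device afterwards, e.g.:
# for images, labels in loader:
#     images = images.to(device)
```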
Instead of doing the whole process on CPU and sending ready-to-train data to devices, a more efficient workflow is to send encoded images to devices first and then perform image decoding and augmentations there:

1. Same as before
2. Same as before
3. Encoded images are sent to devices
4. Devices decode images
5. Devices apply image transformations to augment images
That way, we can benefit from the computing power of our devices to speed up image decoding and transformations. Note that there are two caveats to be aware of when doing this:
- Device memory consumption will increase, so you may have to reduce your batch size if there is not enough free memory. This may mitigate the speedup brought by this approach.
- If devices are intensively used (100% or close to it) when doing data loading on CPU, don't expect any speedup when doing it on devices as they already have their hands full.
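On Gaudi2, Habana's media pipeline (introduced just below) implements this workflow. Just to illustrate the pattern in familiar PyTorch terms, here is a minimal sketch using torchvision's GPU JPEG decoding on an Nvidia device; this is only an analogy, not what the media pipeline (or our A100 baseline) actually runs, and `image.jpg` is a placeholder:

```python
import torch
from torchvision.io import read_file, decode_jpeg

# Steps 1-2: the CPU only reads the raw, encoded JPEG bytes from disk.
encoded = read_file("image.jpg")             # uint8 tensor of raw bytes

# Steps 3-4: the bytes are shipped to the GPU and decoded there (nvJPEG).
image = decode_jpeg(encoded, device="cuda")  # uint8 CHW tensor, on GPU

# Step 5: augmentations then run on the device as regular tensor ops,
# e.g. a horizontal flip:
image = torch.flip(image, dims=[-1])
```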
To implement this on Gaudi2, we have got you covered: the contrastive image-text example in Optimum Habana now provides a ready-to-use media pipeline that you can use with COCO-like datasets that contain text and images! You will just have to add `--mediapipe_dataloader` to your command to use it.
For interested readers, a lower-level overview is given in the documentation of Gaudi here and the list of all supported operators is available there.
We are now going to benchmark a run with `dataloader_num_workers=1`, `distribution_strategy="fast_ddp"` and `mediapipe_dataloader`, since all these optimizations are compatible with each other:
| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` | `dataloader_num_workers=1` + `distribution_strategy="fast_ddp"` | `dataloader_num_workers=1` + `distribution_strategy="fast_ddp"` + `mediapipe_dataloader` |
|:---|:---|:---|:---|:---|
| Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s | 705.9 samples/s | 802.1 samples/s |
| A100 GPU | 210.5 samples/s | 296.6 samples/s | / | / |
We got an additional x1.14 speedup compared to the previous run with `dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`.

This final run is thus x1.51 faster than our base run on Gaudi2, simply by adding 3 ready-to-use training arguments. It is also x2.70 faster than A100 with `dataloader_num_workers=1`!
To reproduce this benchmark, you first need to get access to Gaudi2 through the Intel Developer Cloud (see this guide for more information).
Then, you need to install the latest version of Optimum Habana and run `run_bridgetower.py`, which you can find here. Here is how to do it:
```bash
pip install optimum[habana]
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana/examples/contrastive-image-text
pip install -r requirements.txt
```
The base command line to run the script is:
```bash
python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
  --output_dir /tmp/bridgetower-test \
  --model_name_or_path BridgeTower/bridgetower-large-itm-mlm-itc \
  --dataset_name jmhessel/newyorker_caption_contest --dataset_config_name matching \
  --image_column image --caption_column image_description \
  --remove_unused_columns=False \
  --do_train --do_eval --do_predict \
  --per_device_train_batch_size="40" --per_device_eval_batch_size="16" \
  --num_train_epochs 5 \
  --learning_rate="1e-5" \
  --push_to_hub --report_to tensorboard --hub_model_id bridgetower \
  --overwrite_output_dir \
  --use_habana --use_lazy_mode --use_hpu_graphs_for_inference --gaudi_config_name Habana/clip \
  --throughput_warmup_steps 3 \
  --logging_steps 10
```
This corresponds to the case `--dataloader_num_workers 0`. You can then add `--dataloader_num_workers 1`, `--distribution_strategy fast_ddp` and `--mediapipe_dataloader` to test other configurations.
To push your model and Tensorboard logs to the Hugging Face Hub, you will have to log in to your account beforehand with:
```bash
huggingface-cli login
```
For A100, you can use the same `run_bridgetower.py` script with a few small changes (sketched below):

- Replace `GaudiTrainer` and `GaudiTrainingArguments` with `Trainer` and `TrainingArguments` from Transformers
- Remove references to `GaudiConfig`, `gaudi_config` and `HabanaDataloaderTrainer`
- Import `set_seed` directly from Transformers: `from transformers import set_seed`
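In other words, the relevant imports change roughly like this:

```python
# Gaudi2 version (Optimum Habana):
# from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# A100 version (plain Transformers):
from transformers import Trainer, TrainingArguments, set_seed
```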
The results displayed in this benchmark were obtained with an Nvidia A100 80GB GCP instance with 8 GPUs.

Note that `--distribution_strategy fast_ddp` and `--mediapipe_dataloader` are compatible with Gaudi2 only and will not work with A100.
When dealing with images, we presented two solutions to speed up your training workflows: allocating more resources to the dataloader, and decoding and augmenting images directly on accelerator devices rather than on CPU. We showed that it leads to dramatic speedups when training a SOTA vision-language model like BridgeTower: Habana Gaudi2 with Optimum Habana is almost 3x faster than Nvidia A100 80GB with Transformers! And this is super easy to use as you just need to provide a few additional training arguments.
To go further, we are looking forward to using HPU graphs for training models even faster and to presenting how to use DeepSpeed ZeRO-3 on Gaudi2 to accelerate the training of your LLMs. Stay tuned!
If you are interested in accelerating your Machine Learning training and inference workflows using the latest AI hardware accelerators and software libraries, check out our Expert Acceleration Program. To learn more about Habana solutions, read about our partnership and contact them here. To learn more about Hugging Face efforts to make AI hardware accelerators easy to use, check out our Hardware Partner Program.