Adding multi gpu support for DPR inference #1414
Conversation
Hi @MichelBartels, thank you very much for your contribution and especially for providing the speed comparison with 1 GPU vs. 2 GPUs. We will need some time to think about and discuss your proposed changes to …
Hey @MichelBartels, thanks a lot for working on this! Let me address your questions:
I have now added the suggested changes. I have also added a warning in case the batch size is smaller than the number of devices.
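A rough sketch of what such a check could look like (function name and message here are illustrative assumptions, not the exact code in this PR):

```python
import logging

logger = logging.getLogger(__name__)

def warn_if_batch_smaller_than_devices(batch_size: int, devices: list) -> None:
    # torch.nn.DataParallel splits each batch across the listed devices, so a
    # batch smaller than the number of devices leaves some GPUs without data.
    if batch_size < len(devices):
        logger.warning(
            "batch_size (%d) is smaller than the number of devices (%d); "
            "some devices will not receive any input.",
            batch_size,
            len(devices),
        )
```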
The check seems to fail because https://github.com/deepset-ai/haystack/blob/master/test/test_retriever.py#L199 seems to keep the default value of use_gpu (True). The test server, however, does not seem to have GPU support. Usually, initialize_device_settings switches to CPU silently in this case (https://github.com/deepset-ai/haystack/blob/master/test/test_retriever.py#L199). Should I also implement this? Or is a warning/crash expected when use_gpu is set to True without a GPU available?
Yep, the behaviour on a CPU machine should be to silently "fall back" to CPU even when use_gpu is set to True.
I have now also implemented the fallback. I think you can do your thorough review now.
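For reference, a minimal sketch of that fallback behaviour (function name and signature are assumptions, not the exact code merged here):

```python
import torch

def resolve_devices(use_gpu: bool, devices=None):
    # Silently fall back to CPU when CUDA is unavailable, even if use_gpu=True.
    if use_gpu and torch.cuda.is_available():
        if devices:
            return [torch.device(d) for d in devices]
        return [torch.device(d) for d in range(torch.cuda.device_count())]
    return [torch.device("cpu")]
```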
Hey @MichelBartels,
This is all looking good to me. I only added minor updates to the docstrings, replaced the deprecated logger.warn() and made sure that the supplied devices values are also stored within the object config (important for saving/loading).
I think this is a pragmatic, first implementation of multi-GPU support.
Later on, we should probably rethink some implementation details (see comment) and potentially support other options like DDP, TPUs or similar. HF's accelerate lib might be interesting to keep in mind here.
=> Ready to merge. Congrats on your first contribution 🎉
    self.devices = [torch.device(device) for device in range(torch.cuda.device_count())]
else:
    self.devices = [torch.device("cpu")]
As we already have a helper method for device resolution (initialize_device_settings()), it might make more sense to adjust that method and then use it everywhere within Haystack rather than implementing it here in a custom way. However, as this method is currently part of FARM and we are just in the middle of migrating FARM to Haystack, let's not complicate things and move forward with the custom code here. (FYI @Timoeller )
Here is the link to the function - let's use this once the FARM migration PR is merged.
Proposed changes:
Status (please check what you already did):
This is an initial draft, which has raised a few questions for me:
I have added one optional parameter to DPR's constructor called devices. It takes a list of torch.device objects or integer/string GPU identifiers specifying the GPUs to use for inference. This is, however, a bit counterintuitive: despite its generic name, CPU devices are not valid values, and whether or not to use a GPU at all is specified with another parameter. Also, the specific devices usually don't matter, as normally only one GPU type is present. To address this, I have multiple ideas:
I am not sure which of these options is the best. (Or if there is a better option.)
The existing parameter batch_size has an ambiguous meaning when using multiple GPUs: it could mean either the per-device batch size or the total batch size. Its meaning should either be defined explicitly, or per_device_batch_size and/or total_batch_size could be added as parameters.
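For context on the current behaviour: torch.nn.DataParallel scatters the input along the batch dimension, so batch_size effectively acts as the total batch size today. A small sketch of the two interpretations (variable names are illustrative):

```python
import torch

# With DataParallel, a batch_size of 512 and two GPUs means each GPU
# processes roughly 256 examples per forward pass.
total_batch_size = 512
devices = [torch.device("cuda:0"), torch.device("cuda:1")]

per_device_batch_size = total_batch_size // len(devices)  # 256

# The alternative API would take per_device_batch_size and derive the total:
# total_batch_size = per_device_batch_size * len(devices)
```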
Benchmark:
I have extended the benchmark_index method, because adding another function seemed like it would create a lot of overlap. However, this further complicates the function by introducing two additional nested loops (number of GPUs and batch size). Should I perhaps split the benchmark in two and refactor the benchmark_index method so its functionality can be reused? Also, similarly to the question above, should the benchmark parameters specify the specific GPU devices or just the number of GPUs?
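Purely to illustrate the nesting being described (all names here are hypothetical, not the actual benchmark_index code):

```python
import time

def run_indexing_benchmark(n_docs, num_gpus, batch_size):
    # Placeholder for the existing single-configuration indexing benchmark.
    start = time.perf_counter()
    # ... index n_docs documents with the given settings ...
    return {"n_docs": n_docs, "num_gpus": num_gpus,
            "batch_size": batch_size, "seconds": time.perf_counter() - start}

def benchmark_index_multi_gpu(doc_counts, gpu_counts, batch_sizes):
    # The two new loops (over GPU counts and batch sizes) wrap the existing
    # loop over corpus sizes, which is what makes a single method hard to follow.
    return [run_indexing_benchmark(n_docs, num_gpus, batch_size)
            for num_gpus in gpu_counts
            for batch_size in batch_sizes
            for n_docs in doc_counts]
```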
Performance:
torch.nn.DataParallel seems to add a lot of overhead, which means that for small batch sizes using two GPUs is slower than using one. However, with a batch size of 512 I measured a performance improvement of 37.6% using two RTX 3060s. The gain is probably higher on GPUs with more VRAM.
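For reference, a generic sketch of the DataParallel wrapping being discussed (not the DPR-specific code from this PR); the scatter/gather on every forward pass is the overhead that only pays off once each GPU receives a sufficiently large slice of the batch:

```python
import torch
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Stand-in encoder; the real retriever wraps its query/passage encoders instead.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 128)).to(device)

if torch.cuda.device_count() > 1:
    # Replicates the module on every visible GPU and splits each batch along dim 0.
    # The per-batch scatter/gather is what makes small batches slower on two GPUs.
    model = nn.DataParallel(model)

with torch.no_grad():
    batch = torch.randn(512, 768, device=device)
    embeddings = model(batch)  # with 2 GPUs, each replica sees ~256 rows
```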