
[CPU] Support SHM based inference_all_reduce in TorchBackend #5391

Merged
tjruwase merged 10 commits into deepspeedai:master from delock:gma/gloo_shm_allreduce
Apr 17, 2024
Conversation

@delock (Collaborator) commented Apr 10, 2024

This PR adds an SHM based `inference_all_reduce` kernel to the `TorchBackend` communication backend. When running inference on a CPU server, this path replaces the default `torch.distributed.all_reduce`, which eventually uses the gloo backend. This PR improves inference performance with AutoTP when only stock PyTorch is installed, without Intel Extension for PyTorch.
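The dispatch idea can be pictured with a minimal Python sketch. All names here are illustrative assumptions, not the actual DeepSpeed API: the backend prefers a registered SHM kernel and otherwise falls back to the default all-reduce.

```python
# Illustrative sketch only -- `inference_all_reduce`, `shm_kernel`, and
# `fallback` are hypothetical names, not the DeepSpeed implementation.
def inference_all_reduce(buf, shm_kernel=None, fallback=None):
    """Return the reduced buffer, preferring an SHM kernel when present."""
    if shm_kernel is not None:
        # Fast path: intra-node shared-memory kernel.
        return shm_kernel(buf)
    # Slow path: the framework default (e.g. gloo via torch.distributed).
    return fallback(buf)

# Simulate a 2-rank sum with plain lists: with no SHM kernel registered,
# the call degrades to the fallback, mirroring the pre-PR behavior.
two_rank_sum = lambda b: [x * 2 for x in b]  # stand-in for gloo all_reduce
reduced = inference_all_reduce([1.0, 2.0], fallback=two_rank_sum)
```

With stock PyTorch only, the fallback branch is what previously always ran; the PR makes the fast path available on CPU without extra dependencies.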

Compared with the gloo backend, the SHM based `inference_all_reduce` kernel is a more direct path and performs much better on a single node.

| message size | gloo all_reduce (ms) | SHM all_reduce (ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |

In text generation with bloom-3b under AutoTP, average token latency improved 1.45x with this PR on a 2-socket Xeon node.

@delock delock changed the title Support SHM based inference_all_reduce in TorchBackend [CPU] Support SHM based inference_all_reduce in TorchBackend Apr 10, 2024
@delock (Collaborator, Author) commented Apr 11, 2024

Hi @loadams, the formatting error has been fixed, thanks!

@tjruwase tjruwase added this pull request to the merge queue Apr 17, 2024
Merged via the queue into deepspeedai:master with commit b22706a Apr 17, 2024
rraminen pushed a commit to ROCm/DeepSpeed that referenced this pull request May 9, 2024
umchand pushed a commit to umchand/DeepSpeed that referenced this pull request May 20, 2024
dbyoung18 pushed a commit to dbyoung18/DeepSpeed that referenced this pull request Jun 11, 2024