Fix mqa parallelization #51

thomasw21 · 2023-05-11T12:25:00Z

There's no reason to run a second communication for the kv layers in MQA. The previous ColumnLinear should already handle that all_gather for us, so we just need to run a typical Linear layer. It does mean that we run that kv linear in duplicated fashion, but it seems not to matter (it's my understanding that sequence parallel wasn't used during the training of starcoder).
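To illustrate why a duplicated kv linear needs no extra communication, here is a minimal pure-Python sketch, not the code from this PR. It simulates tensor-parallel ranks in a single process: each rank holds a column shard of the query weights (its own heads) and a full copy of the kv weights. All names (`mqa`, `column_shard`, `w_q`, `w_kv`) and sizes are hypothetical. It does not model sequence parallel or the output projection.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, k, v):
    """Single-head attention (no mask): q, k, v are [seq, head_dim]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(row) for row in scores], v)


def mqa(x, w_q, w_kv, n_heads, head_dim):
    """Multi-query attention: n_heads query heads share one key/value head."""
    q = matmul(x, w_q)    # [seq, n_heads * head_dim]
    kv = matmul(x, w_kv)  # [seq, 2 * head_dim], a plain (non-parallel) linear
    k = [row[:head_dim] for row in kv]
    v = [row[head_dim:] for row in kv]
    out = [[] for _ in x]
    for h in range(n_heads):
        q_h = [row[h * head_dim:(h + 1) * head_dim] for row in q]
        for out_row, head_row in zip(out, attend(q_h, k, v)):
            out_row.extend(head_row)
    return out


def column_shard(w, rank, world_size):
    """Return the block of columns owned by `rank` (column-parallel split)."""
    cols = len(w[0]) // world_size
    return [row[rank * cols:(rank + 1) * cols] for row in w]


random.seed(0)
seq, hidden, n_heads, head_dim, world_size = 3, 8, 4, 2, 2
x = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(seq)]
w_q = [[random.uniform(-1, 1) for _ in range(n_heads * head_dim)] for _ in range(hidden)]
w_kv = [[random.uniform(-1, 1) for _ in range(2 * head_dim)] for _ in range(hidden)]

# Reference: unsharded multi-query attention.
reference = mqa(x, w_q, w_kv, n_heads, head_dim)

# Simulated ranks: sharded query weights, replicated kv weights.
# No rank needs anything from another rank to compute its own heads.
per_rank = [
    mqa(x, column_shard(w_q, r, world_size), w_kv, n_heads // world_size, head_dim)
    for r in range(world_size)
]

# Concatenating the per-rank head outputs recovers the unsharded result.
sharded = [sum((per_rank[r][i] for r in range(world_size)), []) for i in range(seq)]
max_diff = max(abs(a - b) for ra, rb in zip(reference, sharded) for a, b in zip(ra, rb))
```

The cost is that every simulated rank repeats the same `matmul(x, w_kv)`. That projection is only `2 * head_dim` wide, which is the duplication the description accepts in exchange for skipping a communication step.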

The reason I decided not to build another module is backward compatibility. This PR should give a free throughput improvement while still being able to load previous checkpoints.

I have not tested this PR. cc @loubnabnl

If it does provide the same improvement I've seen on my own codebase, then we need to propagate this PR to our inference models, typically text-generation-inference and such. (EDIT: this already seems to be supported in text-generation-inference, with even an extra merging of qkv.)

thomasw21

Okay it seems the implementation is not correct for sequence parallel. I'll try to fix this.

Don't need to run communication for kv

33d14e6

thomasw21 requested a review from loubnabnl May 11, 2023 12:26

Woops

67f733c

lvwerra requested a review from jlamypoirier May 11, 2023 12:30

thomasw21 commented May 11, 2023

I think this fixes sequence parallel

080d1c0
