Description
As pointed out by @szhengac, the current logic that uses `.chunk(3, dim=-1)` to split qkv assumes different data layouts for the TP=1 and TP>1 cases. Specifically, when TP=1, we assume the qkv weight is contiguous, meaning the weight layout is `[q0, q1, ..., k0, k1, ..., v0, v1, ...]`. However, when TP>1, the weight is sharded along axis=0, so each partition has size `[3 * H // TP]`, which implicitly assumes the qkv layout is interleaved (i.e., `[q0, k0, v0, ...]`).

This is not a problem as long as the model is always run in the same case, but it produces incorrect results if, for example, we trained the model with TP=2 and now want to fine-tune it with TP=1. Transposing the trained weights would also resolve the issue, but that is not straightforward for users.
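To make the layout mismatch concrete, here is a minimal sketch (illustrative sizes and a per-head interleaving convention are assumptions, not code from this PR) showing that `.chunk(3, dim=-1)` only recovers q/k/v directly for the contiguous layout, while the interleaved layout needs a per-head split:

```python
import torch

# Illustrative sizes (hypothetical, not from the PR): 2 attention heads with head_dim 4,
# so H = 8 and the packed qkv dimension has width 3 * H = 24.
num_heads, head_dim = 2, 4
H = num_heads * head_dim
w = torch.arange(3 * H, dtype=torch.float32)  # stand-in for one packed qkv weight column

# Contiguous layout [q | k | v]: .chunk(3, dim=-1) recovers q, k, v directly.
q_c, k_c, v_c = w.chunk(3, dim=-1)

# Interleaved layout groups q/k/v per head:
# [q_head0, k_head0, v_head0, q_head1, k_head1, v_head1].
w_interleaved = (
    w.view(3, num_heads, head_dim)  # the contiguous [q | k | v] blocks
     .transpose(0, 1)               # -> [num_heads, 3, head_dim]
     .reshape(-1)
)

# A plain .chunk(3, dim=-1) on the interleaved tensor would mix q/k/v across heads;
# the split has to respect the per-head grouping instead.
q_i, k_i, v_i = w_interleaved.view(num_heads, 3, head_dim).unbind(dim=1)
q_i, k_i, v_i = (t.reshape(-1) for t in (q_i, k_i, v_i))

# Both paths recover the same q, k, v.
assert torch.equal(q_c, q_i) and torch.equal(k_c, k_i) and torch.equal(v_c, v_i)
```

With the interleaved layout, chunking the packed dimension into `3 * H // TP` shards leaves every TP rank holding complete q/k/v blocks for its own heads, which is why the two code paths were silently incompatible before this change.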
This PR fixes the issue by always assuming the qkv weights are interleaved, which is also the convention used in Megatron-LM. Accordingly, the unit test has to manually transpose the weights to match the GPT-2 attention results.
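For reference, converting a contiguous checkpoint (e.g., GPT-2-style packed attention weights) into the interleaved layout amounts to a permutation along the packed dimension. The helper below is a hypothetical sketch of that transpose, not the PR's actual test code:

```python
import torch

def contiguous_to_interleaved(qkv_weight: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Permute a packed qkv weight of shape [3 * H, H_in] from contiguous [q | k | v]
    blocks into a per-head interleaved layout along dim 0.

    Hypothetical helper for illustration only; names and shapes are assumptions,
    not code from this PR.
    """
    three_h, h_in = qkv_weight.shape
    head_dim = three_h // (3 * num_heads)
    return (
        qkv_weight.reshape(3, num_heads, head_dim, h_in)
                  .transpose(0, 1)                  # -> [num_heads, 3, head_dim, H_in]
                  .reshape(three_h, h_in)
    )
```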
Checklist