Add GQA support to MPT (and GPT) models #205
Conversation
examples/mpt/weight.py
Outdated
  n_embd // tensor_parallel +
- (n_embd // n_head) * 2)
+ (head_dim * n_kv_head * 2) // tensor_parallel)
This might be wrong for the MQA case. I'll need to find a model to verify this.
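For reference, here is a rough numeric sketch of what the new expression computes per tensor-parallel rank; the values below are illustrative, not taken from any actual model config:

```python
# Illustrative values only (loosely GQA-like); not taken from a real config.
n_embd, n_head, n_kv_head, tensor_parallel = 4096, 32, 8, 2
head_dim = n_embd // n_head  # 128

# Width of the fused QKV weight slice that each tensor-parallel rank loads:
qkv_width = (n_embd // tensor_parallel                           # Q heads split across ranks
             + (head_dim * n_kv_head * 2) // tensor_parallel)    # K/V heads split across ranks
print(qkv_width)  # 2048 + 1024 = 3072

# Note: for MQA (n_kv_head == 1) this splits the single KV head across ranks,
# which is exactly the case being questioned above.
```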
Looks great :-)
LGTM, just one comment: there are a few restrictions on the number of KV heads and TP size for GQA/MQA; n_kv_heads must be divisible by tp_size, and num_heads must be divisible by n_kv_heads. I'd suggest we add an assertion to ensure this is satisfied.
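For illustration, the suggested check could look roughly like this (the variable names are assumptions, not necessarily the ones used in build.py):

```python
# Sketch of the suggested sanity checks; variable names are illustrative.
assert n_kv_heads % tp_size == 0, \
    "GQA/MQA requires n_kv_heads to be divisible by tp_size"
assert num_heads % n_kv_heads == 0, \
    "GQA/MQA requires num_heads to be divisible by n_kv_heads"
```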
Also, if it helps, I'll remove the dependency on FT conversion for MPT models in a new PR, converting directly from HF instead.
Hi @bheilbrun, thanks a lot for the pull request. Can you rebase the PR against the main branch, please? We are not going to do updates to the …

@nv-guomingz, can you take a look at this PR, please? Thanks
Sure, I'll take a look at this PR today.
Hi @bheilbrun, could you please give me full step-by-step instructions for building an engine with the replit-code-v1.5 model? I managed to convert the weights via the command below:

    python3 build.py --model_dir=./ft_ckpts/replit/bf16-gqa/1-gpu \

Error msg:

    [11/09/2023-11:31:46] [TRT-LLM] [I] Loading weights from FT...
Another issue is that this PR breaks the original MPT weight conversion support.
Ohhh, I think I know what happened. In this PR, I tried to finish support for `no_bias`. I'll see if this has a quick fix or if I should back out the `no_bias` change.
Force-pushed from 0554ec4 to 975726a
Hi @bheilbrun, it seems you pushed a new commit and it fixed the original MPT model building issue. However, I still hit an issue when verifying the replit-v1.5 model. Specifically, if I build the engine with the command below:

    python3 build.py --model_dir=./ft_ckpts/replit/bf16-gqa/1-gpu \
        --max_batch_size 64 \
        --use_gpt_attention_plugin \
        --use_gemm_plugin \
        --output_dir ./trt_engines/replit/bf16/1-gpu --n_kv_head 8

I get an error message like:

    TypeError: GPTLMHeadModel.__init__() got an unexpected keyword argument 'num_kv_heads'

I think the root cause is that we may need to apply a similar change to the one here. Could you please take a look at this issue? By the way, could you please provide the full command to reproduce your local results, in case our usage of your PR differs from yours?
Heya @nv-guomingz, thanks again for looking. I added my test commands to the PR description; hope that helps. I also tested mpt-7b with 1 and 2 GPUs. The latter required a small fix to …

This error surprises me because I added that kwarg in this PR here: https://github.com/NVIDIA/TensorRT-LLM/pull/205/files#diff-1767dd0367b35551b6031983a93a636d50efca440e69bbdc17f8e0ac3d147151R341 . Could you double-check your local checkout of …? Thanks for testing.
Hi @bheilbrun, thanks for the update; the issue is gone with a clean build. I've verified correctness for both the TP1 and TP2 cases on H100/A100/L40S platforms. We're going to merge your PR into the internal repo first and credit your great work in the next weekly release if everything goes well. Thanks
@nv-guomingz great news, appreciate the help!
@megha95 that'd be a great improvement. Hopping through the "old" FasterTransformer format is definitely a pain. It's working now but is also a maintenance headache. Let me know if I can help.
Force-pushed from 416eee2 to 1b92a21
…orRT-LLM into bheilbrun/mpt-gqa
Hi @bheilbrun, I saw you've updated the commit to 78b1b03. Checking the git history, I guess you wanted to update this branch with the latest main code. I think that's not necessary if there are no feature changes, since we've already rebased 416eee2 onto the internal main branch successfully 😄 Thanks
Thanks! Out of convenience, I was using this branch to share code between a few different machines. :) I'll do this on a different branch if I need to update again, to avoid the notification noise for y'all.
@@ -90,10 +90,6 @@ def convert_weight_to_ft_each(out_dir: str, tensor_parallelism: int,
     for j in range(tensor_parallelism):
         save_path = os.path.join(out_dir, f'model.{tensor_name}.{j}.bin')
         split_vals[j].tofile(save_path)
-    if config['no_bias']:
Hi @bheilbrun, may I know why we need to remove lines 93 to 96?
This is related to the `no_bias` change I mentioned in the PR description. I translate MPT's `no_bias=True` option to GPT's `bias=False`. When this is set, GPT doesn't load bias tensors for many layers.
However, there is one implementation difference between MPT and GPT: MPT has no bias in any layer, whereas GPT still expects biases for the layernorm layers, based on my reading and experimentation.
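To make that concrete, here is a rough sketch of the intended conversion behavior; the helper name and the layernorm-name check are invented for illustration and are not the actual script code:

```python
import numpy as np

def maybe_write_zero_bias(tensor_name: str, shape, out_path: str, no_bias: bool):
    """Illustrative only: when no_bias is set, skip bias tensors except the
    layernorm biases that the GPT implementation still expects (written as zeros)."""
    if not no_bias:
        return  # real bias tensors are handled by the normal conversion path
    if "layernorm" in tensor_name or "ln_" in tensor_name:
        # GPT with bias=False still loads layernorm biases, so write zeros here.
        np.zeros(shape, dtype=np.float32).tofile(out_path)
    # all other bias tensors are simply not written
```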
Hope that clears it up and that it's not causing problems.
Hi @bheilbrun, we pushed an update to the main branch and added you as a co-author, which is also mentioned in the announcement. We're going to close this PR; please let us know if you have any questions. Thanks again for the great contribution.
Why
TensorRT-LLM currently supports MPT models with MHA and MQA, but not GQA. However, there is at least one MPT-based model in the wild that uses GQA (replit-code-v1.5). It's my understanding that others may exist in the future.
What
TensorRT-LLM already supports GQA, so the delta in this PR is mostly about plumbing 'num KV heads' through a few layers, including the generic GPT model implementation. As such, GPT models should also support GQA but I didn't deeply test it (beyond the pre-existing unit and e2e tests).
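As a rough illustration of the kind of plumbing involved (the classes and signatures below are simplified stand-ins, not the actual TensorRT-LLM API):

```python
# Simplified stand-ins; not the actual TensorRT-LLM classes or signatures.
class Attention:
    def __init__(self, hidden_size, num_heads, num_kv_heads=None):
        # MHA: num_kv_heads == num_heads; MQA: num_kv_heads == 1;
        # GQA: 1 < num_kv_heads < num_heads.
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads
        self.head_dim = hidden_size // num_heads


class GPTLMHeadModel:
    def __init__(self, num_layers, num_heads, hidden_size, num_kv_heads=None):
        # num_kv_heads is threaded down to every attention layer.
        self.layers = [
            Attention(hidden_size, num_heads, num_kv_heads)
            for _ in range(num_layers)
        ]
```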
Additionally, this PR improves support for the MPT `no_bias` option by not writing empty bias tensors (in most cases) when no bias is present in the model. I also removed the unused `examples.mpt.weights.load_from_hf_gpt` function; the existing example scripts use only `load_from_ft` in the same file.

Testing
- `replit-code-v1.5` from HuggingFace checkpoints (commands below).
- `mosaicml/mpt-7b` with `--world_size` set to 1 and 2.
- Pre-existing tests under `testing/`.
I'm not sure how much we need to maintain backwards compatibility with existing FasterTransformer configs or implementations, so let me know if you see any problems in this area.
Similarly, if there are any other models I should test, let me know.