Slow inference performance for large Llama models compared to naive MP #66
Hi @sgsdxzy!
Which is not quite double the speed, but it gets better on larger batches. About your case: if you're sure those numbers are valid, maybe it's somehow connected to the fact that you're using 4 cards. What's the data bandwidth between them? Are all 4 cards using enough PCIe lanes?
@BlackSamorez I can confirm that 2-card TP provides a small speedup over 2-card MP. The 4 cards are all running at PCIe 3.0 x16 on an X99 board. Here's my P2P connectivity test (I have two NVLinks, between [0,1] and [2,3]).
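For reference, here is a minimal sketch (an assumption, not the exact test run above) of checking pairwise P2P accessibility from PyTorch; actual bandwidth and NVLink topology still need tools such as `nvidia-smi topo -m` or the CUDA p2pBandwidthLatencyTest sample:

```python
# Sketch: query whether each GPU pair can access the other's memory directly.
# This only reports P2P capability, not bandwidth.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'unavailable'}")
```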
I think the Kaggle T4s are not using NVLink, so that's not the problem here, and I don't think 4 cards would suddenly hit a communication bottleneck and drastically reduce performance. I think it's more of a misconfiguration or a bug. Where would you suggest I look?
@sgsdxzy Thanks!
during forward passes. Also, could you please benchmark
@BlackSamorez Here are the results:
So the problem here is:
The updated script, for reference:
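The script itself isn't preserved in this thread; below is a hypothetical sketch of the kind of generation-latency benchmark being discussed (model name, prompt, and token counts are placeholders):

```python
# Hypothetical benchmark sketch: time .generate() for a fixed number of new tokens.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(model.device)

# Warm-up to exclude one-time CUDA initialization from the measurement.
model.generate(**inputs, max_new_tokens=8)

torch.cuda.synchronize()
start = time.time()
model.generate(**inputs, min_new_tokens=128, max_new_tokens=128)
torch.cuda.synchronize()
print(f"generation time: {time.time() - start:.2f}s")
```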
@BlackSamorez Here are the results for OPT-6.7B, almost the same as LLaMA-7B.
Are you testing in int8 or fp16? Can you get any cards other than dual T4s? And I don't think I'm having a GPU communication problem, as the TP provided by deepspeed-inference is boosting performance for me on OPT (LLaMA is not well supported yet): 2-card fp16 is 65% faster than 1-card fp16. oobabooga/text-generation-webui#561 (comment)
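For context, a rough sketch of the deepspeed-inference tensor parallelism referred to here (argument names vary between DeepSpeed versions; this assumes launching with the DeepSpeed launcher, e.g. `deepspeed --num_gpus 2 script.py`, and uses a placeholder model):

```python
# Sketch: shard an OPT model across 2 GPUs with DeepSpeed inference kernels.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-6.7b", torch_dtype=torch.float16)

ds_engine = deepspeed.init_inference(
    model,
    mp_size=2,                       # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=True, # use DeepSpeed's fused inference kernels
)
model = ds_engine.module  # use this module for generate() as usual
```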
I find
@sgsdxzy Hi!
@BlackSamorez I upgraded
@BlackSamorez it's here. This is a conda environment; tell me if you suspect any specific package that doesn't have its version listed by
@sgsdxzy By the way, here's what I get on my setup with
Only
I've tested pure forward passes and they look good:
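(The exact timing code from this exchange isn't preserved; the following is a minimal sketch, with a placeholder model, of timing a single forward pass separately from `.generate()` to isolate generation-loop overhead.)

```python
# Sketch: compare one forward pass against a full .generate() call.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"  # placeholder; swap in the model under test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()
input_ids = tokenizer("Hello, world", return_tensors="pt").input_ids.to(model.device)

def timed(fn, label):
    torch.cuda.synchronize()
    start = time.time()
    fn()
    torch.cuda.synchronize()
    print(f"{label}: {time.time() - start:.3f}s")

with torch.no_grad():
    timed(lambda: model(input_ids), "single forward pass")
    timed(lambda: model.generate(input_ids, max_new_tokens=64), "generate, 64 new tokens")
```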
On the same
@BlackSamorez is that
I'm not sure. There is a different data structure for ungathered tensors called
Have you identified the issue?
I also observed a slowdown with tensor_parallel 1.2.1 compared to native performance on a single GPU.
Setup: Llama-7b on 8 x A100 80GB (NVLink)
Prompt: set up so that the number of newly generated tokens is a fixed value (155); see the sketch after this comment.
Inference performance:
1-GPU w/o TP: inference time 7.08s, GPU-util by
vs.
Any hints on what might have gone wrong?
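The prompt and generation call aren't preserved above; here is a minimal sketch (assumed setup, placeholder model path) of pinning the number of generated tokens so that runs with and without TP are directly comparable:

```python
# Sketch: force exactly 155 new tokens per generation for a fair comparison.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    min_new_tokens=155,  # lower bound equals upper bound -> fixed output length
    max_new_tokens=155,
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```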
Thank you for sharing your findings on the performance of LLaMA 13B on Kaggle 2x T4. Good to know that you've identified the .generate() issue. I appreciate your efforts in looking into it and eagerly await the release of a fix. Keep up the good work!
Hi @BlackSamorez, have you been able to identify and fix the issue? I am having similar issues, where using 2-way or even 4-way TP slows down inference times, while using
Would love to know if there is any update on this issue @BlackSamorez.
@eric-mitchell @dmgcsilva Sadly, I have neither the time nor the resources to properly test and benchmark this right now. I'll do it in a month or so.
Has anyone found an alternative efficient TP solution yet?
Also found that 4-GPU TP is much slower than 2-GPU TP, while the latter is still a bit faster than 2-GPU PP.
This work is very meaningful. I followed @sgsdxzy and conducted the same test on a 3090.
But performance seems to be the same. Are there any other useful tensor-parallel tools?
@dutsc I use Aphrodite-engine or vLLM for TP inference.
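For anyone landing here, a sketch of what tensor-parallel inference with vLLM looks like (model path and sampling settings are placeholders, not taken from this thread):

```python
# Sketch: 4-way tensor parallelism with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="huggyllama/llama-13b", tensor_parallel_size=4)  # placeholder model
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["The meaning of life is"], params)
print(outputs[0].outputs[0].text)
```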
Thank you for your answer.
The inference speed of naive model parallelism is much better than with tensor parallelism:
Setup: Llama-30b on 2080Ti 22G x4
Naive: 31.64s
4-way TP, main branch: 177.78s
4-way TP, llama branch: 102.22s
The code for naive inference:
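(The original snippet isn't preserved in this thread; below is a sketch of what such a naive model-parallel setup typically looks like, assuming Accelerate's `device_map="auto"` and a placeholder model path.)

```python
# Sketch: naive model parallelism -- layers spread across all visible GPUs
# via device_map="auto"; only one GPU computes at a time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-30b"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```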
The code for TP:
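(Again, the original snippet isn't preserved; this is a sketch of the tensor_parallel wrapper usage, with a placeholder model path.)

```python
# Sketch: shard the same model across 4 GPUs with the tensor_parallel package.
import torch
import tensor_parallel as tp
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-30b"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1", "cuda:2", "cuda:3"])

inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```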