TensorRT-LLM pipeline parallelism is broken #259
Comments
Hello, does this same configuration work for you outside of the context of
@IlyasMoutawwakil when running without mpi, I get
Here's the sample configuration I use:
And the command line that successfully launches the benchmark:
I will investigate this. I remember launching distributed (TP) trt-llm without mpirun, but that was a while ago.
I was able to run trt-llm with TP and PP without the mpirun runner; I believe that's only needed for multi-node.
Very strange. I just tried to reproduce the CLI tests on my machine using the optimum-nvidia:latest container, and still got the same error:
In the logs, I see that world_size is indeed 1:
Here are my steps:
Sorry, I double-checked the logs and figured out that I'm using pre-built engines from single-GPU runs 🤦‍♂️
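For anyone hitting the same thing, a rough sanity check along these lines might catch a stale engine before the benchmark starts. This is only a sketch: the function names are hypothetical, it assumes the world size recorded for the engine is already known (for example, read from the logs), and it assumes the expected world size is simply tp_size * pp_size. It does not call any tensorrt-llm or optimum-nvidia API.

```python
def expected_world_size(tp_size: int, pp_size: int) -> int:
    """World size implied by the requested layout (assumption: tp * pp only)."""
    if tp_size < 1 or pp_size < 1:
        raise ValueError("tp_size and pp_size must be >= 1")
    return tp_size * pp_size


def engine_matches_request(engine_world_size: int, tp_size: int, pp_size: int) -> bool:
    """Return True if an engine built for `engine_world_size` ranks fits the request.

    `engine_world_size` is whatever the pre-built engine reports (hypothetical input);
    a single-GPU engine would report 1 and fail this check for any tp/pp > 1.
    """
    return engine_world_size == expected_world_size(tp_size, pp_size)


if __name__ == "__main__":
    # Engine built on one GPU, but pipeline parallelism over two GPUs requested.
    print(engine_matches_request(engine_world_size=1, tp_size=1, pp_size=2))
    # Engine rebuilt for two ranks.
    print(engine_matches_request(engine_world_size=2, tp_size=1, pp_size=2))
```

If the check fails, rebuilding the engines for the requested layout (rather than reusing the cached single-GPU ones) would be the thing to try first.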
Also, I see this in the GitHub CI log (e.g. https://github.com/huggingface/optimum-benchmark/actions/runs/11008321942/job/30565746560):
In my "local" tests (on an A100) I see equal usage on both GPUs until the KV cache starts being allocated; that's when one of them uses more than the other (it almost gets saturated). I guess that's weird, but it sounds like an issue in tensorrt-llm. I also don't get
I also checked the optimum-nvidia code, and it's using the LLM helper class at:
Tell me if this makes sense. I admit it is weird and confusing that the logs show the MPI size as 1.
No, it's actually wrong to sum throughputs with TP or PP. These two strategies split the model, not the data. With TP, tensors are split and only half of the computation is performed on each GPU, but you can't have different inputs on each process (unlike DP). That's why batch_size=1 works with TP and PP, while the minimum batch size with DP is 2. It makes sense to me that TP gives as much performance as a single GPU here; in fact, I'm surprised it reaches that, as it's a strategy optimized for compute-bound problems (big weights + prefill = big matmuls), with a bit of communication overhead.
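To make the distinction concrete, here is a minimal sketch of how per-process throughput numbers could be combined under each strategy. The helper names and the "dp"/"tp"/"pp" strategy strings are made up for illustration and are not optimum-benchmark's actual API. It assumes every process reports a throughput for the workload it saw: with DP each process handles different inputs, so the numbers add up; with TP or PP all processes cooperate on the same inputs, so the numbers describe the same work and are averaged instead.

```python
from statistics import mean
from typing import Sequence


def aggregate_throughput(per_process: Sequence[float], strategy: str) -> float:
    """Combine per-process throughputs (e.g. tokens/s) into one figure.

    Hypothetical helper: "dp" sums (different data per process),
    "tp"/"pp" average (same data, model split across processes).
    """
    if not per_process:
        raise ValueError("per_process must not be empty")
    if strategy == "dp":
        return float(sum(per_process))
    if strategy in ("tp", "pp"):
        return float(mean(per_process))
    raise ValueError(f"unknown strategy: {strategy!r}")


def min_batch_size(strategy: str, world_size: int) -> int:
    """Smallest global batch size that gives every process work.

    DP needs at least one sample per process; TP and PP share one sample.
    """
    return world_size if strategy == "dp" else 1


if __name__ == "__main__":
    measured = [100.0, 100.0]  # illustrative numbers only, not real results
    print(aggregate_throughput(measured, "dp"))
    print(aggregate_throughput(measured, "tp"))
    print(min_batch_size("dp", 2), min_batch_size("pp", 2))
```

Summing the two TP/PP figures would double-count the same tokens, which is why a two-GPU TP run matching single-GPU throughput is not a reporting bug.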
@asesorov I can also easily implement an
Problem Description
When trying to use pipeline parallelism in tensorrt-llm on 2+ NVIDIA GPUs, I encounter the following error:
AssertionError: Expected but not provided tensors:{'transformer.vocab_embedding.weight'}
I tried other models, but the error is the same.
Environment
Optimum Benchmark configuration
Logs
With mpirun:
trt-llm_2gpus_pp_mpirun_n2.log
Without mpirun:
trt-llm_2gpus_pp.log
Preview of the error: