RuntimeError: Can't enable access between nodes 1 and 0 #2066
Labels:
- not a bug: Some known limitation, but not a bug.
- triaged: Issue has been triaged by maintainers.
- waiting for feedback
Hi, I tried to convert a T5 model to TensorRT on a machine with 4 GPUs. In the `python convert_checkpoint.py` step I set tp_size=4 and pp_size=1, and the TensorRT model was built successfully. However, when I ran `mpirun --allow-run-as-root -np 4 python3 run.py`, I got the RuntimeError shown in the title.
When I set tp_size=1 and pp_size=1 in the `python convert_checkpoint.py` step, `python3 run.py` runs successfully.
How can I fix this problem? It seems to be related to the GPU configuration, but I don't know what to change.
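To narrow down whether this is a GPU configuration issue, here is a minimal sketch that lists which GPU pairs report no direct peer access. It assumes the error concerns CUDA peer-to-peer access between devices, which is not confirmed here, and that PyTorch with CUDA support is installed. `find_blocked_pairs` is a hypothetical helper written for this illustration and is not part of TensorRT-LLM.

```python
from itertools import permutations
from typing import Callable, List, Tuple


def find_blocked_pairs(
    num_devices: int,
    can_access: Callable[[int, int], bool],
) -> List[Tuple[int, int]]:
    """Return ordered (src, dst) device pairs for which can_access is False.

    Hypothetical helper for illustration; not part of TensorRT-LLM.
    """
    return [
        (src, dst)
        for src, dst in permutations(range(num_devices), 2)
        if not can_access(src, dst)
    ]


def main() -> None:
    # Assumption: PyTorch is available; skip the check otherwise.
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed; skipping the check")
        return

    count = torch.cuda.device_count()
    if count < 2:
        print(f"Only {count} GPU(s) visible; nothing to check")
        return

    # torch.cuda.can_device_access_peer(device, peer_device) reports whether
    # `device` can directly access memory on `peer_device`.
    blocked = find_blocked_pairs(count, torch.cuda.can_device_access_peer)
    for src, dst in blocked:
        print(f"GPU {src} cannot access GPU {dst} directly")
    if not blocked:
        print("All GPU pairs report peer access")


if __name__ == "__main__":
    main()
```

If any pair is reported as blocked, that would be consistent with the failure appearing only when tp_size is greater than 1, though it would not by itself prove the cause.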
I also found a similar issue, but when I added `--use_custom_all_reduce disable` to `trtllm-build`, it reported unrecognized arguments.