🚀 The feature, motivation and pitch
Hello, I notice that trl (vllm-ascend) now supports running inference on an independent node during training. Does it also allow inference to run colocated on the same devices as training (i.e. sharing NPUs under TP or PP)?
I think this would be useful, since dedicating a separate node/NPU to inference can be wasteful, reducing both inference throughput and training efficiency.
I think the core challenge is how to update the grouped parameters, adapted to NPU (though I am not very familiar with that part). If there is any possibility, I would like to help.
See the issue implemented on GPUs: huggingface/trl#3162
Alternatives
No response
Additional context
To be specific, I guess the difference should lie in the init and broadcast process:

init:

```python
pg = StatelessProcessGroup.create(host=self.host, port=self.group_port, rank=self.rank, world_size=world_size)  # --> replace with corresponding group manager?
self.pynccl_comm = PyNcclCommunicator(pg, device="cuda:0")  # --> replace with NPUCommunicator?
```

broadcast:

```python
self.pynccl_comm.broadcast(weights, src=self.rank, stream=torch.cuda.current_stream())
self.pynccl_comm.group.barrier()  # --> replace with some NPU broadcast method?
```
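To make the idea concrete, here is a minimal sketch of what the NPU-adapted init/broadcast pair could look like. This is an assumption, not the actual trl or vllm-ascend implementation: it swaps `StatelessProcessGroup`/`PyNcclCommunicator` for plain `torch.distributed`, which on Ascend would use the `"hccl"` backend (provided by `torch_npu`) instead of NCCL; the fallback to `"gloo"` just keeps the sketch runnable on CPU.

```python
import torch
import torch.distributed as dist

def init_weight_sync_group(host: str, port: int, rank: int, world_size: int) -> None:
    # Hypothetical replacement for StatelessProcessGroup.create + PyNcclCommunicator.
    # On Ascend, torch_npu registers the "hccl" backend; "gloo" is a CPU fallback
    # so this sketch can be exercised without NPU hardware.
    backend = "hccl" if hasattr(torch, "npu") and torch.npu.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://{host}:{port}",
        rank=rank,
        world_size=world_size,
    )

def broadcast_weights(weights, src: int = 0) -> None:
    # Replaces pynccl_comm.broadcast(...): dist.broadcast is backend-agnostic,
    # so the same call works for NCCL on GPU and HCCL on NPU.
    for tensor in weights:
        dist.broadcast(tensor, src=src)
    # Replaces pynccl_comm.group.barrier().
    dist.barrier()
```

With this shape, the training process would call `broadcast_weights` after each optimizer step to push updated parameters into the colocated inference workers; the real integration would still need to handle grouped (e.g. TP-sharded) parameters per rank.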