[Feature]: Co-locating NPU support for GRPO training with trl #467

@Switchsyj

🚀 The feature, motivation and pitch

Hello, I notice that trl (with vllm-ascend) now supports running inference on an independent node during training. Does it also allow inference to run alongside the training workload on the same devices (i.e., co-located, sharing the NPUs under TP or PP)?

I think this would be useful, since reserving dedicated nodes/NPUs for inference can be wasteful, reducing both inference throughput and training efficiency.

I think the core issue is how to update the grouped parameters (i.e., broadcasting the trained weights to the inference workers) in an NPU-compatible way, but I am not very familiar with that part. If there is any possibility, I would like to help.

See the issue implemented on GPUs: huggingface/trl#3162
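For illustration only, here is a minimal, hedged sketch of the co-located weight-sync idea on Ascend using plain torch.distributed with the HCCL backend. This is not the trl implementation; the function name and setup below are made up for the example:

```python
# Minimal sketch, assuming torch_npu is installed and torch.distributed's
# "hccl" backend is available. Launch with torchrun so RANK/WORLD_SIZE are set.
import torch
import torch.distributed as dist
import torch_npu  # noqa: F401  (registers the NPU device and the "hccl" backend)


def sync_weights(model: torch.nn.Module, src_rank: int = 0) -> None:
    """Broadcast every parameter from the training rank to the other ranks."""
    for param in model.parameters():
        dist.broadcast(param.data, src=src_rank)


if __name__ == "__main__":
    dist.init_process_group(backend="hccl")
    torch.npu.set_device(dist.get_rank() % torch.npu.device_count())
    model = torch.nn.Linear(16, 16).npu()
    sync_weights(model, src_rank=0)
    dist.destroy_process_group()
```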

Alternatives

No response

Additional context

To be specific, I guess the difference should lie in the init and broadcast processes:
init:

pg = StatelessProcessGroup.create(host=self.host, port=self.group_port, rank=self.rank, world_size=world_size) --> replace with a corresponding group manager?
self.pynccl_comm = PyNcclCommunicator(pg, device="cuda:0")  --> replace with NPUCommunicator
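A hedged sketch of what the init path might look like on Ascend, assuming vllm-ascend provides a PyHcclCommunicator that mirrors vLLM's PyNcclCommunicator API (the import path below is a guess and may differ):

```python
from vllm.distributed.utils import StatelessProcessGroup
# Assumed import path -- the actual location in vllm-ascend may differ.
from vllm_ascend.distributed.device_communicators.pyhccl import PyHcclCommunicator


def init_communicator(self, world_size: int) -> None:
    # StatelessProcessGroup is TCP-store based and device-agnostic,
    # so it can probably stay as-is.
    pg = StatelessProcessGroup.create(
        host=self.host,
        port=self.group_port,
        rank=self.rank,
        world_size=world_size,
    )
    # Swap the NCCL communicator for an HCCL-backed one on the local NPU.
    self.pyhccl_comm = PyHcclCommunicator(pg, device="npu:0")
```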

broadcast:

self.pynccl_comm.broadcast(weights, src=self.rank, stream=torch.cuda.current_stream())
self.pynccl_comm.group.barrier() --> replace with some NPU broadcast method?
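Correspondingly, a hedged sketch of the broadcast path, assuming the HCCL communicator above exposes the same broadcast()/group.barrier() interface and that torch_npu's torch.npu.current_stream() stands in for torch.cuda.current_stream():

```python
import torch
import torch_npu  # noqa: F401  (provides the torch.npu namespace)


def broadcast_weights(self, weights: torch.Tensor) -> None:
    # Same call shape as the NCCL version, but with the NPU stream instead of
    # the CUDA stream; whether a stream argument is accepted is an assumption.
    self.pyhccl_comm.broadcast(
        weights, src=self.rank, stream=torch.npu.current_stream()
    )
    self.pyhccl_comm.group.barrier()
```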
