🚀 The feature, motivation and pitch
Hello, I notice that trl (vllm-ascend) now supports running inference on an independent node during training. Does it also allow inference to run colocated on the same devices as training (i.e. sharing NPUs under TP or PP)?
I think this would be useful, since dedicating a separate node/NPU to inference can be wasteful, reducing both inference throughput and training efficiency.
I think the core challenge is how to update the grouped parameters, adapted to NPU (though I am not very familiar with that part). If there is any possibility, I would like to help.
See the issue implemented on GPUs: huggingface/trl#3162
Alternatives
No response
Additional context
To be specific, I guess the difference should lie in the init and broadcast process:

init:

```python
pg = StatelessProcessGroup.create(host=self.host, port=self.group_port, rank=self.rank, world_size=world_size)  # --> replace with corresponding group manager?
self.pynccl_comm = PyNcclCommunicator(pg, device="cuda:0")  # --> replace with NPUCommunicator?
```

broadcast:

```python
self.pynccl_comm.broadcast(weights, src=self.rank, stream=torch.cuda.current_stream())
self.pynccl_comm.group.barrier()  # --> replace with some NPU broadcast method?
```
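To make the idea concrete, here is a minimal sketch of what the NPU-adapted init/broadcast pair could look like. This is an assumption, not the actual trl or vllm-ascend implementation: it swaps `StatelessProcessGroup`/`PyNcclCommunicator` for plain `torch.distributed`, which on Ascend would use the `"hccl"` backend (provided by `torch_npu`) instead of NCCL; the fallback to `"gloo"` just keeps the sketch runnable on CPU.

```python
import torch
import torch.distributed as dist

def init_weight_sync_group(host: str, port: int, rank: int, world_size: int) -> None:
    # Hypothetical replacement for StatelessProcessGroup.create + PyNcclCommunicator.
    # On Ascend, torch_npu registers the "hccl" backend; "gloo" is a CPU fallback
    # so this sketch can be exercised without NPU hardware.
    backend = "hccl" if hasattr(torch, "npu") and torch.npu.is_available() else "gloo"
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://{host}:{port}",
        rank=rank,
        world_size=world_size,
    )

def broadcast_weights(weights, src: int = 0) -> None:
    # Replaces pynccl_comm.broadcast(...): dist.broadcast is backend-agnostic,
    # so the same call works for NCCL on GPU and HCCL on NPU.
    for tensor in weights:
        dist.broadcast(tensor, src=src)
    # Replaces pynccl_comm.group.barrier().
    dist.barrier()
```

With this shape, the training process would call `broadcast_weights` after each optimizer step to push updated parameters into the colocated inference workers; the real integration would still need to handle grouped (e.g. TP-sharded) parameters per rank.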