Method description
Would it be possible to implement an RL environment that does multi-turn tool calling inside the GRPO training loop? Right now it appears to be a single one-shot inference before the output is passed to the custom reward function. I'd like to have a multi-turn interaction, with tool-calling steps, before the final result is passed to the reward function.
Online tool calling would enable RL over a simulation with feedback from the environment, along the lines of the sketch below. vLLM already seems to support all manner of tool calling, so could this be added as part of GRPO?
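A rough sketch of what I have in mind (this is not TRL's actual API; `parse_tool_call`, the `<tool_call>`/`<tool_response>` tag convention, and the `tools` mapping are placeholders I made up for illustration — only the vLLM `LLM`/`SamplingParams` calls are real):

```python
import json
import re

from vllm import LLM, SamplingParams


def parse_tool_call(text):
    """Extract the first <tool_call>{...}</tool_call> JSON block, if any.

    The tag convention is just an assumption here (Hermes-style); any
    tool-call parser the model's chat template uses would do.
    """
    m = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    return json.loads(m.group(1)) if m else None


def multi_turn_rollout(llm, prompt, tools, max_turns=4):
    """Roll out up to `max_turns` generate -> tool-execute rounds.

    `tools` maps tool names to plain Python callables that act as the
    environment and return an observation string. The full transcript
    (prompt + completions + tool responses) is what would finally be
    handed to the GRPO reward function instead of a one-shot completion.
    """
    params = SamplingParams(max_tokens=512, temperature=1.0)
    transcript = prompt
    for _ in range(max_turns):
        completion = llm.generate([transcript], params)[0].outputs[0].text
        transcript += completion
        call = parse_tool_call(completion)
        if call is None:
            # Model produced a final answer rather than a tool call: stop.
            break
        observation = tools[call["name"]](**call["arguments"])
        # Feed the environment feedback back into the context for the next turn.
        transcript += f"\n<tool_response>{observation}</tool_response>\n"
    return transcript
```

One detail this raises for GRPO itself: the reward would be computed on the whole trajectory, so the tokens coming from tool responses (environment feedback, not model output) would presumably need to be masked out of the policy loss.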
@qgallouedec
Open source status
Provide useful links for the implementation
No response