Method description
Would it be possible to implement an RL environment that does multi-turn tool calling inside the GRPO training loop? Right now it appears to be a single one-shot inference before the output is passed to the custom reward function. I'd like to have a multi-turn interaction, with tool-calling steps, before the final result is passed to the reward function.
Online tool calling would enable RL over a simulation with feedback from the environment, along the lines of the sketch below. vLLM already seems to support all manner of tool calling, so could this be added as part of GRPO?
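A rough sketch of what I have in mind (this is not TRL's actual API; `parse_tool_call`, the `<tool_call>`/`<tool_response>` tag convention, and the `tools` mapping are placeholders I made up for illustration — only the vLLM `LLM`/`SamplingParams` calls are real):

```python
import json
import re

from vllm import LLM, SamplingParams


def parse_tool_call(text):
    """Extract the first <tool_call>{...}</tool_call> JSON block, if any.

    The tag convention is just an assumption here (Hermes-style); any
    tool-call parser the model's chat template uses would do.
    """
    m = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    return json.loads(m.group(1)) if m else None


def multi_turn_rollout(llm, prompt, tools, max_turns=4):
    """Roll out up to `max_turns` generate -> tool-execute rounds.

    `tools` maps tool names to plain Python callables that act as the
    environment and return an observation string. The full transcript
    (prompt + completions + tool responses) is what would finally be
    handed to the GRPO reward function instead of a one-shot completion.
    """
    params = SamplingParams(max_tokens=512, temperature=1.0)
    transcript = prompt
    for _ in range(max_turns):
        completion = llm.generate([transcript], params)[0].outputs[0].text
        transcript += completion
        call = parse_tool_call(completion)
        if call is None:
            # Model produced a final answer rather than a tool call: stop.
            break
        observation = tools[call["name"]](**call["arguments"])
        # Feed the environment feedback back into the context for the next turn.
        transcript += f"\n<tool_response>{observation}</tool_response>\n"
    return transcript
```

One detail this raises for GRPO itself: the reward would be computed on the whole trajectory, so the tokens coming from tool responses (environment feedback, not model output) would presumably need to be masked out of the policy loss.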
@qgallouedec
Open source status
Provide useful links for the implementation
No response