[VLM] end2end geo3k multi-turn RL of VLM Recipe #1141
zhaochenyang20 merged 39 commits into THUDM:main
Conversation
slime/utils/arguments.py
    type=int,
    default=None,
    help="Maximum turns for multi-turn custom rollout (e.g., Sokoban). Defaults to rollout implementation config.",
)
Is it possible to pass these 2 configs through --custom-config-path?
Yes, it sounds neater. I have pushed the change, thanks!
Nicely done, Xiaole!
@gxlvera are you working on the OpenCUA part? Can I help with it?
Great job so far!
Hi, you could try to support OpenCUA's AgentNet dataset. Note that if you want to implement online interaction, you may need an OS sandbox for simulation. It's OK to stick with offline mode (without interaction), although I personally don't think that would work well.
Sure @gxlvera can try that
@gxlvera can you help with OpenCUA a bit? Where can I reach you?
Hi, you could DM me at gxlvera@gmail.com~ |
We should also have a Megatron version. But FSDP works well!
Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

Goal
VLM Multi-turn (related to #1075)
TODO / Status
Rollout
- examples/vlm_multi_turn/rollout.py
- max_turns (specified via rollout argument --custom-config-path)
- loss_mask / rollout_log_probs
  - loss_mask = 1 on assistant tokens
  - loss_mask = 0 on user/observation tokens
  - rollout_log_probs padded to match
  - sample.prompt stays unmasked
Interactive environment
- examples/vlm_multi_turn/env_geo3k.py
- build_env / reset / step / format_observation functions for per-turn feedback
Data & dataset
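To make the masking rule above concrete, here is a minimal sketch of a multi-turn loop that builds loss_mask and a padded log-prob list alongside the tokens. It is not the code in rollout.py or env_geo3k.py: ToyEnv, tokenize, scripted_policy, and the method signatures are hypothetical stand-ins, a whitespace split replaces the real tokenizer, and padding observation positions with 0.0 is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical environment exposing reset / step / format_observation."""
    answer: str
    max_turns: int = 3
    turn: int = 0

    def reset(self) -> str:
        self.turn = 0
        return "Solve the problem."

    def step(self, response: str) -> tuple[str, bool]:
        """Return (feedback, done); done on a correct answer or when turns run out."""
        self.turn += 1
        correct = response.strip() == self.answer
        done = correct or self.turn >= self.max_turns
        return ("Correct." if correct else "Incorrect, try again."), done

    def format_observation(self, feedback: str) -> str:
        return f"[observation] {feedback}"


def tokenize(text: str) -> list[str]:
    # Whitespace split stands in for a real tokenizer (assumption).
    return text.split()


def scripted_policy(turn: int):
    """Hypothetical policy: wrong on the first turn, right afterwards."""
    reply = ["7", "12"][min(turn, 1)]
    return reply, [-0.5] * len(tokenize(reply))


def rollout(env, policy):
    """Run one episode and return (tokens, loss_mask, log_probs) of equal length."""
    tokens, loss_mask, log_probs = [], [], []
    env.reset()  # the initial prompt is not included in these lists (assumption)
    done = False
    while not done:
        reply, reply_log_probs = policy(env.turn)
        reply_tokens = tokenize(reply)
        tokens += reply_tokens
        loss_mask += [1] * len(reply_tokens)  # assistant tokens are trained on
        log_probs += reply_log_probs
        feedback, done = env.step(reply)
        if not done:
            obs_tokens = tokenize(env.format_observation(feedback))
            tokens += obs_tokens
            loss_mask += [0] * len(obs_tokens)  # observation tokens are masked out
            log_probs += [0.0] * len(obs_tokens)  # pad so lengths stay aligned
    return tokens, loss_mask, log_probs


tokens, loss_mask, log_probs = rollout(ToyEnv(answer="12"), scripted_policy)
```

The point of the padding is that all three lists stay index-aligned, so a per-token loss can be multiplied by loss_mask without any extra bookkeeping.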
Experiment Result
Trained Qwen3-VL-2B-Instruct on the geo3k dataset with multi-turn reasoning, using GRPO.
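For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses for the same prompt. The sketch below shows the commonly described group-normalized advantage; the function name and the eps value are hypothetical, the population standard deviation is an assumption (implementations differ), and this is not taken from the training code in this PR.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one prompt: two reached the right answer, two did not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group gets the same reward, all advantages are zero and that group contributes no gradient signal.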
