- Follow token-in-token-out (a minimal sketch contrasting the retokenize and token-in-token-out codepaths is below)
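A minimal sketch of the distinction, assuming a HuggingFace tokenizer; this is illustrative, not SkyRL's actual implementation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def next_input_ids_by_retokenizing(chat_history: list[dict]) -> list[int]:
    # Retokenize codepath: rebuild the prompt from strings every turn. The chat
    # template may re-encode earlier turns differently from what was sampled.
    return tokenizer.apply_chat_template(chat_history, add_generation_prompt=True)

def next_input_ids_token_in_token_out(
    input_ids: list[int], response_ids: list[int], obs_ids: list[int]
) -> list[int]:
    # Token-in-token-out codepath: only ever append token IDs, so the tokens
    # the model sampled are exactly the tokens it is trained on.
    return input_ids + response_ids + obs_ids
```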
- Add a documentation page detailing the codepaths and behaviors of `SkyRLGymGenerator`
- Make `agent_loop` output a dataclass: [cleanup] Make agent_loop output a dataclass #194
- Add a config for a custom chat template, especially for Qwen3
  - [SkyRLGymGenerator] Add generator.chat_template and change Qwen3 default behavior to accumulate thinking tokens #178
  - Consider where to add the custom chat template. What if users are not using `SkyRLGymGenerator`? Should we apply that custom chat template to `InferenceEngineClient.tokenizer`? (A sketch of threading a template override through the tokenizer follows.)
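One possible wiring, as a hedged sketch: HuggingFace's `apply_chat_template` accepts a per-call `chat_template` override, so the generator (or `InferenceEngineClient.tokenizer`) could thread a configured template string through without mutating the tokenizer. The config key and template content here are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Hypothetical: a Qwen3 template edited so <think> blocks from earlier turns
# are kept rather than stripped when re-rendering the conversation.
custom_template = "..."  # e.g., loaded from cfg.generator.custom_chat_template

messages = [{"role": "user", "content": "What is 2 + 2?"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template=custom_template,  # per-call override of tokenizer.chat_template
    add_generation_prompt=True,
)
```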
- Separate configuration of `cfg.generator.retokenize` and `cfg.generator.custom_chat_template`
- Improve unit tests for `SkyRLGymGenerator` (e.g., do not mock the tokenizer, for more realistic tests, and add tests beyond Qwen2.5, Qwen3, and Llama)
- For token-in-token-out, also maintain a string-based `chat_history` as a sanity check: after each turn, compare the token-in-token-out-maintained token IDs against a retokenization of the string-based chat history (a sketch of this check appears below)
- Deprecate post-processed actions in SkyRL Gym to ensure token-in-token-out. Instead, ask users to do such post-processing in token space
  - This PR removes it from our official gym (search and txt2sql): [Generator][Env] Add stop str, remove need for post-processed action in search and txt2sql #190
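A minimal sketch of the string-based sanity check mentioned above, assuming a HuggingFace tokenizer and a lossless chat template (for templates like Qwen3's default, which strips thinking tokens, exact equality will not hold, which is itself useful signal):

```python
def check_token_consistency(
    tokenizer, chat_history: list[dict], accumulated_ids: list[int]
) -> None:
    # Re-tokenize the full string chat history and compare against the IDs
    # accumulated turn by turn via token-in-token-out.
    retokenized = tokenizer.apply_chat_template(chat_history)
    if retokenized != accumulated_ids:
        raise AssertionError(
            f"token-in-token-out drifted from the string chat history: "
            f"{len(retokenized)} retokenized vs {len(accumulated_ids)} accumulated tokens"
        )
```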
- Add a multi-turn GSM8K example for testing purposes (the tool can simply check the answer). We usually run GSM8K to compare curves, but we should have a multi-turn equivalent that is cheaper to run than search / text2sql (rough sketch below)
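A rough sketch of such an environment; the interface below is generic and hypothetical, not the actual skyrl-gym base class:

```python
import re

class MultiTurnGSM8kEnv:
    """Toy multi-turn GSM8K env: the 'tool' checks the final answer and, if it
    is wrong, gives the model another turn. Interface is illustrative only."""

    def __init__(self, question: str, answer: str, max_turns: int = 3):
        self.question, self.answer, self.max_turns = question, answer, max_turns
        self.turns = 0

    def step(self, action: str) -> tuple[list[dict], float, bool]:
        self.turns += 1
        # GSM8K-style final answers are conventionally marked with "####".
        match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", action)
        correct = match is not None and match.group(1).replace(",", "") == self.answer
        done = correct or self.turns >= self.max_turns
        obs = [] if done else [{"role": "user", "content": "Incorrect, try again."}]
        return obs, (1.0 if correct else 0.0), done
```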
- Currently for multi-turn in `SkyRLGymGenerator`, if the stop reason is `length` for an intermediate turn, users have no way to know, since the stop reason is a single string rather than a list of strings (one per turn). Additionally, if we do not follow the retokenize codepath, no EOS token is manually attached, while the retokenize codepath attaches the EOS token via the chat template. Should revisit (a dataclass sketch follows the related links below).
  - Related to this CI fix: [Fix][CI] Address generator CI test fails when model stop reason is length #269
  - Related to this too: [train][CI] Fix flaky GPU skyrlgymgenerator test due to stop_reason=length #456
  - Can be addressed by fixing this: [SkyRLGymGenerator] Cleaner generation length handling #279
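One way to surface per-turn stop reasons, sketched here as an assumption layered on the agent-loop dataclass idea from #194 (field names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class AgentLoopOutput:
    response_ids: list[int]
    reward: float
    # One entry per turn (e.g. "stop" or "length") instead of a single string,
    # so callers can tell when an intermediate turn was truncated.
    stop_reasons: list[str] = field(default_factory=list)

    @property
    def truncated_mid_rollout(self) -> bool:
        # "length" on any turn before the last indicates a silently cut turn.
        return "length" in self.stop_reasons[:-1]
```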
- Support turn-level rewards for the `retokenize_chat_history` codepath, as shown by this TODO (a sketch of per-token reward placement follows the related PR below)
  - As described in this comment, from SkyRL/skyrl-train/skyrl_train/generators/skyrl_gym_generator.py, lines 228 to 234 at f0015a7:

    ```python
    if retokenize_chat_history:
        # a. We always re-tokenize the entire chat history every turn and at the end.
        chat_history, chat_end_index, input_ids = self._get_next_input_ids_by_retokenizing_chat_history(
            chat_history, chat_end_index, output, new_obs
        )
        # TODO(tgriggs): Support turn-level rewards for multi-turn chat template
        per_step_rewards.append((step_reward, None))
    ```

  - Related PR: [Generator] Support turn-level rewards in SkyRLGymGenerator #226
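A sketch of what turn-level reward placement could look like once per-turn token spans are tracked (the span bookkeeping here is an assumption, not existing SkyRL code): place each turn's scalar reward on the final token of that turn's response.

```python
def build_per_token_rewards(
    num_tokens: int,
    turn_spans: list[tuple[int, int]],  # (start, end) token indices per assistant turn
    turn_rewards: list[float],
) -> list[float]:
    per_token = [0.0] * num_tokens
    for (_start, end), reward in zip(turn_spans, turn_rewards):
        per_token[end - 1] = reward  # attribute the turn's reward to its last token
    return per_token
```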
- See the issue described in this comment: [Metrics] Add back pass_at_n computation for (per-token) rewards #317 (comment)
- Cut down the number of lines of code in `SkyRLGymGenerator`