- Follow token-in-token-out (a minimal sketch contrasting the retokenize and token-in-token-out codepaths is below)
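A minimal sketch of the distinction, assuming a HuggingFace tokenizer; this is illustrative, not SkyRL's actual implementation:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

def next_input_ids_by_retokenizing(chat_history: list[dict]) -> list[int]:
    # Retokenize codepath: rebuild the prompt from strings every turn. The chat
    # template may re-encode earlier turns differently from what was sampled.
    return tokenizer.apply_chat_template(chat_history, add_generation_prompt=True)

def next_input_ids_token_in_token_out(
    input_ids: list[int], response_ids: list[int], obs_ids: list[int]
) -> list[int]:
    # Token-in-token-out codepath: only ever append token IDs, so the tokens
    # the model sampled are exactly the tokens it is trained on.
    return input_ids + response_ids + obs_ids
```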
- Add a documentation page detailing the codepaths and behaviors of `SkyRLGymGenerator`
- Make `agent_loop` output a dataclass: [cleanup] Make agent_loop output a dataclass #194
- Add a config for a custom chat template, especially for Qwen3
  - [SkyRLGymGenerator] Add generator.chat_template and change Qwen3 default behavior to accumulate thinking tokens #178
  - Consider where to add the custom chat template. What if users are not using `SkyRLGymGenerator`? Should we apply that custom chat template to `InferenceEngineClient.tokenizer`? (A sketch of threading a template override through the tokenizer follows.)
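One possible wiring, as a hedged sketch: HuggingFace's `apply_chat_template` accepts a per-call `chat_template` override, so the generator (or `InferenceEngineClient.tokenizer`) could thread a configured template string through without mutating the tokenizer. The config key and template content here are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Hypothetical: a Qwen3 template edited so <think> blocks from earlier turns
# are kept rather than stripped when re-rendering the conversation.
custom_template = "..."  # e.g., loaded from cfg.generator.custom_chat_template

messages = [{"role": "user", "content": "What is 2 + 2?"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template=custom_template,  # per-call override of tokenizer.chat_template
    add_generation_prompt=True,
)
```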
- Separate configuration of `cfg.generator.retokenize` and `cfg.generator.custom_chat_template`
- Improve unit tests for `SkyRLGymGenerator` (e.g., do not mock the tokenizer, for more realistic tests, and add tests beyond Qwen2.5, Qwen3, and Llama)
- For token-in-token-out, also maintain a string-based `chat_history` as a sanity check: after each turn, compare the token-in-token-out-maintained token IDs against a retokenization of the string-based chat history (a sketch of this check appears below)
- Deprecate post-processed actions in SkyRL Gym to ensure token-in-token-out. Instead, ask users to do such post-processing in token space
  - This PR removes it from our official gym (search and txt2sql): [Generator][Env] Add stop str, remove need for post-processed action in search and txt2sql #190
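A minimal sketch of the string-based sanity check mentioned above, assuming a HuggingFace tokenizer and a lossless chat template (for templates like Qwen3's default, which strips thinking tokens, exact equality will not hold, which is itself useful signal):

```python
def check_token_consistency(
    tokenizer, chat_history: list[dict], accumulated_ids: list[int]
) -> None:
    # Re-tokenize the full string chat history and compare against the IDs
    # accumulated turn by turn via token-in-token-out.
    retokenized = tokenizer.apply_chat_template(chat_history)
    if retokenized != accumulated_ids:
        raise AssertionError(
            f"token-in-token-out drifted from the string chat history: "
            f"{len(retokenized)} retokenized vs {len(accumulated_ids)} accumulated tokens"
        )
```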
- Add a multi-turn GSM8K example for testing purposes (the tool can simply check the answer). We usually run GSM8K to compare curves, but we should have a multi-turn equivalent that is cheaper to run than search / text2sql (rough sketch below)
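A rough sketch of such an environment; the interface below is generic and hypothetical, not the actual skyrl-gym base class:

```python
import re

class MultiTurnGSM8kEnv:
    """Toy multi-turn GSM8K env: the 'tool' checks the final answer and, if it
    is wrong, gives the model another turn. Interface is illustrative only."""

    def __init__(self, question: str, answer: str, max_turns: int = 3):
        self.question, self.answer, self.max_turns = question, answer, max_turns
        self.turns = 0

    def step(self, action: str) -> tuple[list[dict], float, bool]:
        self.turns += 1
        # GSM8K-style final answers are conventionally marked with "####".
        match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", action)
        correct = match is not None and match.group(1).replace(",", "") == self.answer
        done = correct or self.turns >= self.max_turns
        obs = [] if done else [{"role": "user", "content": "Incorrect, try again."}]
        return obs, (1.0 if correct else 0.0), done
```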
- Currently for multi-turn in `SkyRLGymGenerator`, if the stop reason is `length` for an intermediate turn, users have no way to know, since the stop reason is a single string rather than a list of strings (one per turn). Additionally, if we do not follow the retokenize codepath, no EOS token is manually attached, while the retokenize codepath attaches the EOS token via the chat template. Should revisit (a dataclass sketch follows the related links below).
  - Related to this CI fix: [Fix][CI] Address generator CI test fails when model stop reason is length #269
  - Related to this too: [train][CI] Fix flaky GPU skyrlgymgenerator test due to stop_reason=length #456
  - Can be addressed by fixing this: [SkyRLGymGenerator] Cleaner generation length handling #279
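One way to surface per-turn stop reasons, sketched here as an assumption layered on the agent-loop dataclass idea from #194 (field names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class AgentLoopOutput:
    response_ids: list[int]
    reward: float
    # One entry per turn (e.g. "stop" or "length") instead of a single string,
    # so callers can tell when an intermediate turn was truncated.
    stop_reasons: list[str] = field(default_factory=list)

    @property
    def truncated_mid_rollout(self) -> bool:
        # "length" on any turn before the last indicates a silently cut turn.
        return "length" in self.stop_reasons[:-1]
```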
- Support turn-level rewards for the `retokenize_chat_history` codepath, as shown by this TODO (a sketch of per-token reward placement follows the related PR below)
  - As described in this comment, from SkyRL/skyrl-train/skyrl_train/generators/skyrl_gym_generator.py, lines 228 to 234 at f0015a7:

    ```python
    if retokenize_chat_history:
        # a. We always re-tokenize the entire chat history every turn and at the end.
        chat_history, chat_end_index, input_ids = self._get_next_input_ids_by_retokenizing_chat_history(
            chat_history, chat_end_index, output, new_obs
        )
        # TODO(tgriggs): Support turn-level rewards for multi-turn chat template
        per_step_rewards.append((step_reward, None))
    ```

  - Related PR: [Generator] Support turn-level rewards in SkyRLGymGenerator #226
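A sketch of what turn-level reward placement could look like once per-turn token spans are tracked (the span bookkeeping here is an assumption, not existing SkyRL code): place each turn's scalar reward on the final token of that turn's response.

```python
def build_per_token_rewards(
    num_tokens: int,
    turn_spans: list[tuple[int, int]],  # (start, end) token indices per assistant turn
    turn_rewards: list[float],
) -> list[float]:
    per_token = [0.0] * num_tokens
    for (_start, end), reward in zip(turn_spans, turn_rewards):
        per_token[end - 1] = reward  # attribute the turn's reward to its last token
    return per_token
```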
- See the issue described in this comment: [Metrics] Add back pass_at_n computation for (per-token) rewards #317 (comment)
- Cut down the number of lines of code in `SkyRLGymGenerator`