[SkyRLGymGenerator] Add generator.chat_template and change Qwen3 default behavior to accumulate thinking tokens #178
Conversation
tyler-griggs left a comment
Thanks for sending this in! Left a couple of comments, primarily around moving chat templates out of the config. There was also a pretty big PR that was just merged that touches some similar code in skyrl_gym_generator.py, so it'd be good to merge in those changes!
… commits as right now, users have to manually add in parser support). Provided thinking masking on/off support for Qwen3 Models
Done!
…. Final checks are done and ready!
Force-pushed from 15b5eb9 to 7a22332
Here's the wandb for the updated run with file and name support: https://wandb.ai/devpatelio/updated_template_system
/gemini review
Code Review
This pull request introduces a flexible system for custom chat templates and masking for "thinking" tokens, which is a valuable enhancement. The changes are supported by new test scripts and a mini-dataset for validation. However, the current implementation has several critical issues, including NameError exceptions from undefined variables that will lead to runtime crashes, unreachable code due to flawed logic, and a bug in batched generation that results in incorrect loss masks. I have also identified several medium-severity issues such as hardcoded paths in scripts, duplicated template files, and leftover debug print statements. My review provides specific suggestions to address these problems to ensure the new functionality is robust and maintainable.
@devpatelio quick check: is this PR ready for review? I see several debugging print statements still in the diff.
Code Review
This pull request introduces a flexible system for custom chat templates and token masking, which is a significant improvement over the previous hard-coded approach. The changes are well-structured, providing new configuration options in ppo_base_config.yaml and updating the generator logic accordingly while maintaining backward compatibility. The addition of tests for the new functionality is also a positive step. My review includes a couple of suggestions to enhance maintainability by addressing file duplication and to improve robustness by refining exception handling.
@tyler-griggs Sorry about that, I forgot to remove some of the debug print statements and fix some variable names from the rebase. The PR is ready for review now!
…t list instead of List[List[int]]
Code Review
This pull request introduces a flexible system for using custom chat templates, including support for loading them from files and masking "thinking" tokens during training. The changes are well-structured, with new configurations in ppo_base_config.yaml and updated logic in the generator and utility functions. My review includes a couple of suggestions to improve code readability and maintainability, such as refactoring Jinja template definitions to be more readable and using a list comprehension for conciseness.
/gemini review
Code Review
This pull request introduces a flexible system for using custom chat templates, including loading them from files and masking "thinking" tokens for training. The changes are well-implemented, with robust error handling for template loading and comprehensive tests that cover the new functionality, including file-based templates and loss masking. The addition of qwen3_with_thinking and qwen3_without_thinking templates, along with the configuration in ppo_base_config.yaml, effectively addresses the goal of supporting custom templates and token masking.
/gemini review
Code Review
This pull request introduces a flexible mechanism for using custom chat templates, allowing them to be specified by name or loaded from a file. The changes are well-implemented, particularly the new get_custom_chat_template function in utils.py which includes robust error handling. The accompanying tests have been updated effectively to cover the new functionality, including loading templates from files and verifying their behavior. I have a couple of minor suggestions to improve type hinting and simplify a template's logic. Overall, this is a great enhancement to the project's flexibility.
For better type safety and code clarity, the type hint for chat_template_config should reflect that it can be an omegaconf.DictConfig object, which is what it receives from the configuration. Using a string literal for the type hint avoids needing to import it here if it's not desired.
You can add from omegaconf import DictConfig at the top of the file to resolve the type hint.
```diff
- def get_custom_chat_template(chat_template_config: Optional[dict] = None) -> Optional[str]:
+ def get_custom_chat_template(chat_template_config: Optional[Union[dict, "DictConfig"]] = None) -> Optional[str]:
```
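For reference, here is a rough sketch of how a function with this signature might resolve a template, based only on the config keys described in this PR (`source`, `name_or_path`). The body below is illustrative, not the PR's actual implementation, and the registry contents are placeholders:

```python
from typing import Optional, Union

from omegaconf import DictConfig

# Placeholder standing in for skyrl_train.generators.utils.CUSTOM_CHAT_TEMPLATES;
# the real one maps names like "qwen3_with_thinking" to Jinja template strings.
CUSTOM_CHAT_TEMPLATES = {"qwen3_with_thinking": "...", "qwen3_without_thinking": "..."}


def get_custom_chat_template(
    chat_template_config: Optional[Union[dict, DictConfig]] = None,
) -> Optional[str]:
    # Both dict and DictConfig support .get() and item access, so one code path covers both.
    if chat_template_config is None or chat_template_config.get("name_or_path") is None:
        return None  # fall back to the tokenizer's built-in chat template
    source = chat_template_config["source"]  # "name" or "file"
    name_or_path = chat_template_config["name_or_path"]
    if source == "name":
        return CUSTOM_CHAT_TEMPLATES[name_or_path]
    elif source == "file":
        with open(name_or_path) as f:  # load a Jinja template, e.g. /path/to/template.j2
            return f.read()
    raise ValueError(f"Unknown chat_template source: {source}")
```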
@devpatelio this suggestion is actually correct. We should do `from omegaconf import DictConfig` at the top, and change it to `Union[dict, DictConfig]`. This is because `generator_cfg.chat_template` is a `DictConfig` and not a `dict`.
Would be good to quickly check that run_gsm8k.sh works with:
- Not specifying anything (i.e. using the default `chat_template` values)
- Specifying something (e.g. from name)
You can leave this unchanged in the PR; the `if` and `else` share the same two lines.
```python
expected_user_loss_mask = [0] * len(empty_user) + [0]  # extra 0 for single observation token

if custom_chat_template is not None:
    # For custom_chat_template, the first generation prompt IDs are part of `resp_str`, hence have a corresponding mask
```
Un-indent these two lines so they are shared by the `if` and `else`.
/gemini review
Code Review
This pull request introduces a flexible system for using custom chat templates, allowing them to be specified by name from a predefined set or loaded from a file. This is a great enhancement for supporting different model behaviors, like masking 'thinking' tokens. The implementation is solid, with robust handling of configurations and file operations in utils.py. The tests have been effectively updated to cover the new functionality, including loading templates from files.
I've added a couple of minor suggestions to improve the clarity and reduce redundancy in the test files. Overall, this is a well-executed feature addition.
…_templating.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…token out (#438)

This is a follow-up PR to #178. In that PR, we changed the default codepath of Qwen3 in `SkyRLGymGenerator` to always keep all the thinking tokens for multi-turn generation and subsequent turn generation. Users can return to the previous behavior (strip thinking tokens for both inference and training) by specifying `config.generator.chat_template`, introduced in that PR. This PR does a couple of things.

### 1. Doc update on Qwen3 in SkyRLGymGenerator

First, we update the doc to explain the current behavior with Qwen3 in `SkyRLGymGenerator`.

### 2. Bug fix with fixed-base method motivated by Qwen3

Then I noticed a bug when using Qwen3 with codepath 1 (TI/TO with strictly appending token sequences) due to how we have `<|im_start|>assistant\nI am an assistant.<|im_end|>\n` in our fixed-base method's `base_conversation_token_ids`. This, when templated by Qwen3's chat template, adds dummy empty thinking tokens `<think>\n\n</think>\n\n`, making the delta not what we expect. So I exclude the assistant message from the base conversation for both `SkyRLGymGenerator` and our helper `encode_messages_subset()` for the custom generator. As a result, we will have consecutive user messages when constructing observation token IDs, which should be fine.

That is, before this PR, we use `base_conversation_token_ids`:

```
In [7]: tokenizer.decode(base_conversation_token_ids)
Out[7]: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nI am a user.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nI am an assistant.<|im_end|>\n'
```

but adding any message right after it will remove the thinking tokens by default, making the indexing incorrect.

### 3. Add more unit tests

In `test_skyrl_gym_generator_chat_templating_exact()`, we used to test Qwen3 with our custom chat template that strips thinking tokens; I now also test codepath 1's Qwen3 (accumulate thinking tokens). Did the same for `test_append_eos_after_stop_multi_turn()`. Then, since the `_exact()` test does not check thinking token handling with Qwen3, I added `test_skyrl_gym_generator_chat_templating_qwen3_thinking`, which is essentially the same test but with a Qwen3 model and with mock LLM generation of `"<think>\nmock thinking\n</think>\n\nb"` instead of just `"b"`, making the test more realistic.
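To make the fixed-base bug concrete, here is a small repro sketch of the prefix mismatch described above, assuming a Qwen3 tokenizer from Hugging Face (the checkpoint name is an assumption; the behavior comes from Qwen3's stock chat template as described in this commit message):

```python
from transformers import AutoTokenizer

# Assumed checkpoint; any Qwen3 model with the stock chat template should behave the same.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

base = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."},
    {"role": "assistant", "content": "I am an assistant."},
]
base_ids = tok.apply_chat_template(base, tokenize=True)
# As reported above: the final assistant turn is rendered with an empty
# '<think>\n\n</think>\n\n' block.
print(tok.decode(base_ids))

extended = base + [{"role": "user", "content": "Another user message."}]
extended_ids = tok.apply_chat_template(extended, tokenize=True)
# Once another message follows, the (now non-final) assistant turn is rendered
# WITHOUT the thinking block, so base_ids is no longer a token prefix of
# extended_ids, and delta indexing off the fixed base breaks.
print(tok.decode(extended_ids))
```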
…ult behavior on thinking tokens accumulation (NovaSky-AI#178)

In `ppo_base_config`, add `generator.chat_template` to allow users to define a custom chat template. This is particularly helpful for Qwen3 models, where you might want to keep the thinking tokens or remove them.

```yaml
chat_template:
  source: "name"      # "name" or "file"
  name_or_path: null  # e.g., "qwen3_with_thinking" or "/path/to/template.j2"
```

You can either specify a chat template defined in `skyrl_train.generators.utils.CUSTOM_CHAT_TEMPLATES` (we currently have `qwen3_with_thinking` and `qwen3_without_thinking`) or load one from a file.

Before this PR, Qwen3 models would not keep the non-last-turn thinking tokens, and we always retokenized (every turn and at the end of the trajectory). After this PR, Qwen3 models by default accumulate in a TI/TO manner, just like any other model. Users can set `name_or_path` to `qwen3_without_thinking` to keep the old behavior.

Related to NovaSky-AI#179 and addresses NovaSky-AI#104

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Griggs <tgriggs@berkeley.edu>
Co-authored-by: Charlie Ruan <charlieruan@berkeley.edu>
Prior to this PR, `SkyRLGymGenerator` always uses a chat template that strips the thinking tokens for Qwen3 models.

This is really implicit behavior. Users might want to keep the thinking tokens, or strip them. One prior PR that tries to address this: #117

In this PR, in `ppo_base_config`, we add `generator.chat_template` to allow users to define a custom chat template. This is particularly helpful for Qwen3 models, where you might want to keep the thinking tokens or remove them.

You can either specify a chat template defined in `skyrl_train.generators.utils.CUSTOM_CHAT_TEMPLATES` (we currently have `qwen3_with_thinking` and `qwen3_without_thinking`; the latter is the old one we use) or load one from a file.
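For intuition, the "without thinking" behavior boils down to stripping everything up to the closing `</think>` tag when re-rendering stored assistant turns. A minimal Jinja fragment of that step might look like the following (illustrative only, modeled on the Qwen3-style approach; not necessarily the PR's exact `qwen3_without_thinking` template):

```jinja
{#- Illustrative: drop the <think>...</think> block from a stored assistant
    message before re-templating it for the next turn. -#}
{%- if message.role == 'assistant' -%}
    {%- set content = message.content.split('</think>')[-1].lstrip('\n') -%}
    {{- '<|im_start|>assistant\n' + content + '<|im_end|>\n' -}}
{%- endif -%}
```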
Before this PR, Qwen3 models would not keep the non-last-turn thinking tokens, and we always retokenized (every turn and at the end of the trajectory). After this PR, Qwen3 models by default accumulate in a TI/TO manner, just like any other model. Users can modify this by setting `name_or_path` to `qwen3_without_thinking` to keep the old behavior.

Related to #179 and addresses #104
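For concreteness, restoring the old stripping behavior would look something like this in the config (a sketch based on the `chat_template` block quoted in the commit message above; the surrounding `generator:` nesting is assumed from the option name `generator.chat_template`):

```yaml
generator:
  chat_template:
    source: "name"                          # use a built-in template, by name
    name_or_path: "qwen3_without_thinking"  # restores the pre-PR strip-thinking behavior
```

Leaving `name_or_path: null` keeps the new default: thinking tokens accumulate turn over turn, the same TI/TO path every other model takes.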