[SkyRLGymGenerator] Add generator.chat_template and change Qwen3 default behavior to accumulate thinking tokens #178

Merged

CharlieFRuan merged 65 commits into NovaSky-AI:main from devpatelio:devpatel/skyrl-issue-104 on Oct 5, 2025
Conversation

@devpatelio (Collaborator) commented on Aug 21, 2025:

Prior to this PR, for SkyRLGymGenerator we do:

```python
def get_custom_chat_template(model_name: str) -> str:
    if "Qwen3" in model_name:
        return CUSTOM_CHAT_TEMPLATES["qwen3_thinking"]
    else:
        return None
```

That is, we always use a chat template that strips the thinking tokens for Qwen3 models.

This behavior is overly implicit: users might want to keep the thinking tokens, or to strip them. One prior PR that tried to address this: #117

In this PR, we add `generator.chat_template` to `ppo_base_config`, allowing users to define a custom chat template. This is particularly helpful for Qwen3 models, where you might want to keep the thinking tokens or remove them.

```yaml
  chat_template:
    source: "name"  # "name" or "file"
    name_or_path: null  # e.g., "qwen3_with_thinking" or "/path/to/template.j2"
```

You can either specify a chat template defined in `skyrl_train.generators.utils.CUSTOM_CHAT_TEMPLATES` (we currently have `qwen3_with_thinking` and `qwen3_without_thinking`; the latter is the template we previously used) or load one from a file.
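
For illustration, here is a minimal sketch of how such a name-or-file resolver could behave (hypothetical: the actual `get_custom_chat_template` in `skyrl_train.generators.utils` may differ, and the registry contents below are placeholders):

```python
from typing import Optional

# Hypothetical registry contents; the real Jinja templates live in
# skyrl_train.generators.utils.CUSTOM_CHAT_TEMPLATES.
CUSTOM_CHAT_TEMPLATES = {
    "qwen3_with_thinking": "{# Jinja template that keeps <think> blocks #}",
    "qwen3_without_thinking": "{# Jinja template that strips <think> blocks #}",
}

def resolve_chat_template(chat_template_config: Optional[dict]) -> Optional[str]:
    """Return a Jinja chat template string, or None to use the tokenizer default."""
    if chat_template_config is None or chat_template_config.get("name_or_path") is None:
        return None
    source = chat_template_config.get("source", "name")
    name_or_path = chat_template_config["name_or_path"]
    if source == "name":
        if name_or_path not in CUSTOM_CHAT_TEMPLATES:
            raise ValueError(f"Unknown chat template name: {name_or_path}")
        return CUSTOM_CHAT_TEMPLATES[name_or_path]
    if source == "file":
        with open(name_or_path, "r") as f:
            return f.read()
    raise ValueError(f"chat_template.source must be 'name' or 'file', got: {source}")
```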

Before this PR, the behavior for Qwen3 models was that they did not keep the non-last-turn thinking tokens, and we always retokenized (every turn and at the end of the trajectory).

After this PR, Qwen3 models by default accumulate tokens in a TI/TO (token-in/token-out) manner, just like any other model. Users can restore the old behavior by setting `name_or_path` to `qwen3_without_thinking`.
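
For example, to restore the old stripping behavior, the generator config would look like this (a sketch mirroring the `ppo_base_config` snippet above):

```yaml
generator:
  chat_template:
    source: "name"
    name_or_path: "qwen3_without_thinking"  # strip non-last-turn thinking tokens (old behavior)
```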

Related to #179; addresses #104.

CharlieFRuan self-requested a review on August 21, 2025
devpatelio marked this pull request as ready for review on August 21, 2025
devpatelio changed the title from "Added config for custom chat template (will make easier in subsequent…" to "Custom Chat Template + Thinking Token Masking" on Aug 21, 2025
@tyler-griggs (Member) left a comment:

Thanks for sending this in! Left a couple comments, primarily around moving chat templates out of the config. There was also a pretty big PR that was just merged that touches some similar code in skyrl_gym_generator.py, so it'd be good to merge in those changes!

devpatelio force-pushed the devpatel/skyrl-issue-104 branch from 15b5eb9 to 7a22332 on August 23, 2025
@devpatelio (Collaborator, Author) commented:

Here's the wandb for the updated run with the file and name support: https://wandb.ai/devpatelio/updated_template_system

@tyler-griggs (Member) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a flexible system for custom chat templates and masking for "thinking" tokens, which is a valuable enhancement. The changes are supported by new test scripts and a mini-dataset for validation. However, the current implementation has several critical issues, including NameError exceptions from undefined variables that will lead to runtime crashes, unreachable code due to flawed logic, and a bug in batched generation that results in incorrect loss masks. I have also identified several medium-severity issues such as hardcoded paths in scripts, duplicated template files, and leftover debug print statements. My review provides specific suggestions to address these problems to ensure the new functionality is robust and maintainable.

@tyler-griggs (Member) commented:

@devpatelio quick check: is this PR ready for review? I see several debugging print statements, and a couple of the files in examples/gsm8k look like they are dev-only tests. Should these all be removed?

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a flexible system for custom chat templates and token masking, which is a significant improvement over the previous hard-coded approach. The changes are well-structured, providing new configuration options in ppo_base_config.yaml and updating the generator logic accordingly while maintaining backward compatibility. The addition of tests for the new functionality is also a positive step. My review includes a couple of suggestions to enhance maintainability by addressing file duplication and to improve robustness by refining exception handling.

@devpatelio (Collaborator, Author) commented on Aug 24, 2025:

@tyler-griggs Sorry about that, I forgot to remove some of the debug print statements + fix some of the rebasing variable names. The PR is ready to review now!

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a flexible system for using custom chat templates, including support for loading them from files and masking "thinking" tokens during training. The changes are well-structured, with new configurations in ppo_base_config.yaml and updated logic in the generator and utility functions. My review includes a couple of suggestions to improve code readability and maintainability, such as refactoring Jinja template definitions to be more readable and using a list comprehension for conciseness.

@devpatelio (Collaborator, Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a flexible system for using custom chat templates, including loading them from files and masking "thinking" tokens for training. The changes are well-implemented, with robust error handling for template loading and comprehensive tests that cover the new functionality, including file-based templates and loss masking. The addition of qwen3_with_thinking and qwen3_without_thinking templates, along with the configuration in ppo_base_config.yaml, effectively addresses the goal of supporting custom templates and token masking.

@devpatelio (Collaborator, Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a flexible mechanism for using custom chat templates, allowing them to be specified by name or loaded from a file. The changes are well-implemented, particularly the new get_custom_chat_template function in utils.py which includes robust error handling. The accompanying tests have been updated effectively to cover the new functionality, including loading templates from files and verifying their behavior. I have a couple of minor suggestions to improve type hinting and simplify a template's logic. Overall, this is a great enhancement to the project's flexibility.

@gemini-code-assist bot left an inline comment (severity: medium):

For better type safety and code clarity, the type hint for `chat_template_config` should reflect that it can be an `omegaconf.DictConfig` object, which is what it receives from the configuration. Using a string literal for the type hint avoids needing to import it here if it's not desired.

You can add `from omegaconf import DictConfig` at the top of the file to resolve the type hint.

Suggested change:

```diff
-def get_custom_chat_template(chat_template_config: Optional[dict] = None) -> Optional[str]:
+def get_custom_chat_template(chat_template_config: Optional[Union[dict, "DictConfig"]] = None) -> Optional[str]:
```

@CharlieFRuan (Collaborator) replied:

@devpatelio this suggestion is actually correct. We should do `from omegaconf import DictConfig` at the top, and change it to `Union[dict, DictConfig]`. This is because `generator_cfg.chat_template` is a `DictConfig` and not a `dict`.

Would be good to quickly check that `run_gsm8k.sh` works with:

  • Not specifying anything (i.e. using the default chat_template values)
  • Specifying something (e.g. from name)
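
A minimal sketch of the accepted fix, assuming the function normalizes a `DictConfig` to a plain dict up front (the real implementation may handle it differently):

```python
from typing import Optional, Union

from omegaconf import DictConfig, OmegaConf

def get_custom_chat_template(
    chat_template_config: Optional[Union[dict, DictConfig]] = None,
) -> Optional[str]:
    # generator_cfg.chat_template arrives as an omegaconf DictConfig, so
    # convert it to a plain dict before the dict-style access below.
    if isinstance(chat_template_config, DictConfig):
        chat_template_config = OmegaConf.to_container(chat_template_config, resolve=True)
    ...  # existing name/file resolution logic continues here
```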

@CharlieFRuan (Collaborator) left a comment:
Last nits...

@CharlieFRuan (Collaborator) inline comment: You can leave this unchanged in the PR; the if and else share the same two lines.

```python
expected_user_loss_mask = [0] * len(empty_user) + [0]  # extra 0 for single observation token
```

```python
if custom_chat_template is not None:
    # For custom_chat_template, the first generation prompt IDs are part of `resp_str`, hence has corresponding mask
```

@CharlieFRuan (Collaborator) inline comment: Keep this comment.

@CharlieFRuan (Collaborator) inline comment: un-indent these two lines to be shared by the if and else.


@devpatelio (Collaborator, Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces a flexible system for using custom chat templates, allowing them to be specified by name from a predefined set or loaded from a file. This is a great enhancement for supporting different model behaviors, like masking 'thinking' tokens. The implementation is solid, with robust handling of configurations and file operations in utils.py. The tests have been effectively updated to cover the new functionality, including loading templates from files.

I've added a couple of minor suggestions to improve the clarity and reduce redundancy in the test files. Overall, this is a well-executed feature addition.

Commit: …_templating.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
CharlieFRuan changed the title from "Custom Chat Template + Thinking Token Masking" to "[SkyRLGymGenerator] Add generator.chat_template and change Qwen3 default behavior on thinking tokens accumulation" on Oct 5, 2025
CharlieFRuan merged commit 345a996 into NovaSky-AI:main on Oct 5, 2025
3 checks passed
CharlieFRuan changed the title from "[SkyRLGymGenerator] Add generator.chat_template and change Qwen3 default behavior on thinking tokens accumulation" to "[SkyRLGymGenerator] Add generator.chat_template and change Qwen3 default behavior to accumulate thinking tokens" on Oct 5, 2025
CharlieFRuan added a commit that referenced this pull request Oct 10, 2025
…token out (#438)

This is a follow-up PR to #178

In that PR, we changed the default codepath of Qwen3 in
`SkyRLGymGenerator` to always keep all the thinking tokens for
multi-turn generation and subsequent turn generation. Users can return
to the previous behavior (stripping thinking tokens for both inference and training) via the `config.generator.chat_template` option introduced in that PR.

This PR does a couple of things:

### 1. Doc update on Qwen3 in SkyRLGymGenerator

First we update the doc to explain the current behavior with Qwen3 in
SkyRLGymGenerator.

### 2. Bug fix with fixed base method motivated by Qwen3

Then I noticed a bug when using Qwen3 with codepath 1 (TI/TO with a strictly appended token sequence), due to how we have `<|im_start|>assistant\nI am an assistant.<|im_end|>\n` in our fixed-base method's `base_conversation_token_ids`.

This, when templated by Qwen3's chat template, adds a dummy empty thinking block `<think>\n\n</think>\n\n`, making the delta not what we expect. So I exclude the assistant message from the base conversation, both in SkyRLGymGenerator and in our helper `encode_messages_subset()` for the custom generator. As a result, we will have consecutive user messages when constructing observation token IDs, which should be fine.

That is, before this PR, we used `base_conversation_token_ids`:

```
In [7]: tokenizer.decode(base_conversation_token_ids)
Out[7]: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nI am a user.<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nI am an assistant.<|im_end|>\n'
```

but adding any message right after it causes the template to strip those thinking tokens by default, making the delta indexing incorrect.
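
To illustrate the prefix assumption the fixed-base delta method relies on, here is a simplified sketch (variable names are illustrative and the model ID is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

# Base conversation without an assistant turn (the fix in this PR).
base = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "I am a user."},
]
new_msg = {"role": "user", "content": "An observation."}

base_ids = tokenizer.apply_chat_template(base, tokenize=True)
full_ids = tokenizer.apply_chat_template(base + [new_msg], tokenize=True)

# The delta method assumes `full_ids` starts with `base_ids`, so the new
# message's tokens are exactly the suffix.
assert full_ids[: len(base_ids)] == base_ids
observation_ids = full_ids[len(base_ids):]

# With an assistant turn in `base`, Qwen3's template renders an empty
# <think>\n\n</think>\n\n block that disappears once another message
# follows, so the assertion above would fail; excluding the assistant
# turn keeps the prefix stable.
```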


### 3. Add more unit tests.

In `test_skyrl_gym_generator_chat_templating_exact()`, we used to test Qwen3 only with our custom chat template that strips thinking tokens; I now also test codepath 1's Qwen3 (accumulating thinking tokens). I did the same for `test_append_eos_after_stop_multi_turn()`.

Then, since the `_exact()` test does not check thinking-token handling with Qwen3, I added `test_skyrl_gym_generator_chat_templating_qwen3_thinking`, which is essentially the same test but with a Qwen3 model and with the mock LLM generation set to `"<think>\nmock thinking\n</think>\n\nb"` instead of just `"b"`, making the test more realistic.
Lucas-Fernandes-Martins pushed a commit to Lucas-Fernandes-Martins/SkyRL that referenced this pull request Oct 11, 2025
…token out (NovaSky-AI#438)

li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
…ult behavior on thinking tokens accumulation (NovaSky-AI#178)

li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
…token out (NovaSky-AI#438)

dzorlu referenced this pull request in fleet-ai/SkyRL Feb 4, 2026
…ult behavior on thinking tokens accumulation (#178)

dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…token out (NovaSky-AI#438)
