Auto-set pad_token_id when the default is None and not set in the buffer config. #188
Conversation
Summary of Changes
Hello @yaochaorui, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses an issue where the sampler_strategy could be incorrect when using models like Llama, which default pad_token_id to None. It introduces logic to automatically set the pad_token_id in the configuration after tokenizer loading if it's not explicitly defined, ensuring consistency with the pad_token_id determined by the tokenizer itself (e.g., by Verl/vLLM). This change aims to maintain alignment between the internal configuration and the actual tokenizer setup, preventing potential issues with sampling.
Highlights
- Configuration Update: The system now automatically updates the pad_token_id in the global configuration if it's None when the tokenizer is loaded.
- Sampler Strategy Consistency: Ensures that the sampler_strategy correctly utilizes the pad_token_id that the tokenizer (e.g., from Verl/vLLM) has automatically determined.
- Config Alignment: Improves the alignment between the internal configuration and the actual tokenizer's settings.
- Logging: A warning message is now logged when the pad_token_id is automatically set.
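For context, a minimal illustration of the underlying behavior (the model name is just an example, and the EOS fallback shown is the common convention rather than necessarily the exact logic Verl/vLLM apply):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(tokenizer.pad_token_id)  # None -- Llama defines no pad token

# Typical fallback applied during tokenizer loading: reuse the EOS token for padding.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

print(tokenizer.pad_token_id)  # now equals tokenizer.eos_token_id
```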
Code Review
This pull request correctly addresses an issue where pad_token_id might not be set for certain models, causing inconsistencies. The proposed solution is to automatically set it from the tokenizer after it's loaded, which is a sensible approach.
My feedback focuses on making the implementation more robust and readable by adding a check to ensure the tokenizer's pad_token_id is not None before assignment and by improving the clarity of the corresponding warning message.
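A sketch of what the review's suggestion could look like (the function name, config attribute, and logging setup are hypothetical, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def sync_pad_token_id(config, tokenizer) -> None:
    """Copy the tokenizer's pad_token_id into the config when the config has none."""
    if config.pad_token_id is not None:
        return  # an explicitly configured value always wins
    if tokenizer.pad_token_id is None:
        # Robustness check suggested in review: never write None back into the config.
        logger.warning("Tokenizer has no pad_token_id; config left unchanged.")
        return
    config.pad_token_id = tokenizer.pad_token_id
    logger.warning(
        "pad_token_id was unset in the config; auto-set to %d from the tokenizer "
        "so the sampler strategy stays consistent.",
        tokenizer.pad_token_id,
    )
```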
Force-pushed from 1f13ee3 to 486dbbe
Description
Background:
When using models like Llama where the default pad_token_id is None, the sampler_strategy could be incorrect if pad_token_id wasn't explicitly set in the config. Both Verl and vLLM automatically handle this during tokenizer loading by setting appropriate pad_token_id values.
Purpose:
Update the config after tokenizer loading to ensure the pad_token_id is properly set, making the sampler_strategy consistent and correct.
Changes made:
- Auto-set pad_token_id when it's None in the original config
- Use the pad_token_id that Verl/vLLM have automatically determined during tokenizer initialization
How to test:
- Run with a model like Llama that has no pad_token_id in the initial config
- Verify that the config is updated with the tokenizer's pad_token_id
- Run with a config that explicitly sets pad_token_id to ensure no regression (see the sketch below)
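A self-contained sketch of these checks (BufferConfig, FakeTokenizer, and sync_pad_token_id are stand-ins for the project's real types, included only so the tests run on their own):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeTokenizer:
    pad_token_id: Optional[int]

@dataclass
class BufferConfig:
    pad_token_id: Optional[int] = None

def sync_pad_token_id(config: BufferConfig, tokenizer: FakeTokenizer) -> None:
    # Stand-in for the PR's logic: only fill in a missing config value.
    if config.pad_token_id is None and tokenizer.pad_token_id is not None:
        config.pad_token_id = tokenizer.pad_token_id

def test_pad_token_id_auto_set():
    config = BufferConfig(pad_token_id=None)
    sync_pad_token_id(config, FakeTokenizer(pad_token_id=128001))
    assert config.pad_token_id == 128001  # config now mirrors the tokenizer

def test_explicit_pad_token_id_not_overwritten():
    config = BufferConfig(pad_token_id=0)
    sync_pad_token_id(config, FakeTokenizer(pad_token_id=128001))
    assert config.pad_token_id == 0  # explicit value preserved (no regression)
```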
Checklist
Please check the following items before the code is ready to be reviewed.