
Auto-set pad_token_id when the default is None and not set in the buffer config.#188

Merged
yanxi-chen merged 6 commits into agentscope-ai:main from yaochaorui:cr/pad_token
Aug 20, 2025
Conversation

@yaochaorui (Contributor)

Description

Background:
When using models like Llama, whose tokenizer's default pad_token_id is None, the sampler_strategy could behave incorrectly if pad_token_id was not explicitly set in the config. Both Verl and vLLM handle this automatically during tokenizer loading by setting an appropriate pad_token_id.

Purpose:
Update the config after tokenizer loading to ensure the pad_token_id is properly set, making the sampler_strategy consistent and correct.

Changes made:

  • Added config-update logic after tokenizer loading that sets pad_token_id when it is None in the original config (a sketch follows this list)
  • Ensures sampler_strategy uses the pad_token_id that Verl/vLLM determine automatically during tokenizer initialization
  • Keeps our config aligned with the actual tokenizer configuration
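
A minimal sketch of the change, assuming a Hugging Face tokenizer; the config attribute names and the `load_tokenizer_and_sync_config` helper are hypothetical and may not match the actual Trinity-RFT code:

```python
import logging

from transformers import AutoTokenizer

logger = logging.getLogger(__name__)


def load_tokenizer_and_sync_config(config):
    """Load the tokenizer and back-fill pad_token_id into the config.

    Hypothetical config layout; the real attribute path in Trinity-RFT
    may differ.
    """
    tokenizer = AutoTokenizer.from_pretrained(config.model_path)
    # Llama-style tokenizers ship with pad_token_id=None; Verl/vLLM pick a
    # value (typically the EOS token) while loading the tokenizer, so mirror
    # it here to keep sampler_strategy consistent with the actual tokenizer.
    if config.pad_token_id is None:
        config.pad_token_id = tokenizer.pad_token_id
        logger.warning(
            "pad_token_id is not set in the config; using the tokenizer's "
            "value: %s", tokenizer.pad_token_id,
        )
    return tokenizer
```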

How to test:

  1. Load a Llama tokenizer without specifying pad_token_id in the initial config
  2. Verify that, after tokenizer loading, the config is updated with the correct pad_token_id
  3. Confirm that sampler_strategy works correctly with the updated config
  4. Test with tokenizers that already have an explicit pad_token_id to ensure there is no regression (see the test sketch after this list)
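
An illustrative test along these lines, again using the hypothetical `load_tokenizer_and_sync_config` helper from the sketch above (model identifiers are placeholders):

```python
from types import SimpleNamespace


def test_pad_token_id_backfilled():
    # Llama tokenizers have no default pad token (placeholder model path).
    config = SimpleNamespace(model_path="meta-llama/Llama-2-7b-hf",
                             pad_token_id=None)
    tokenizer = load_tokenizer_and_sync_config(config)
    assert config.pad_token_id == tokenizer.pad_token_id


def test_explicit_pad_token_id_is_preserved():
    # An explicitly configured value must survive tokenizer loading (step 4).
    config = SimpleNamespace(model_path="gpt2", pad_token_id=0)
    load_tokenizer_and_sync_config(config)
    assert config.pad_token_id == 0
```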

Checklist

Please check the following items before the code is ready for review.

  • [x] Code has passed all tests
  • [x] Docstrings have been added/updated in Google Style
  • [x] Code is ready for review

@gemini-code-assist (bot) left a comment

Summary of Changes

Hello @yaochaorui, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where the sampler_strategy could be incorrect when using models like Llama, which default pad_token_id to None. It introduces logic to automatically set the pad_token_id in the configuration after tokenizer loading if it's not explicitly defined, ensuring consistency with the pad_token_id determined by the tokenizer itself (e.g., by Verl/vLLM). This change aims to maintain alignment between the internal configuration and the actual tokenizer setup, preventing potential issues with sampling.

Highlights

  • Configuration Update: The system now automatically updates the pad_token_id in the global configuration if it's None when the tokenizer is loaded.
  • Sampler Strategy Consistency: Ensures that the sampler_strategy correctly utilizes the pad_token_id that the tokenizer (e.g., from Verl/vLLM) has automatically determined.
  • Config Alignment: Improves the alignment between the internal configuration and the actual tokenizer's settings.
  • Logging: A warning message is now logged when the pad_token_id is automatically set.

@gemini-code-assist (bot) left a comment

Code Review

This pull request correctly addresses an issue where pad_token_id might not be set for certain models, causing inconsistencies. The proposed solution is to automatically set it from the tokenizer after it's loaded, which is a sensible approach.

My feedback focuses on making the implementation more robust and readable by adding a check to ensure the tokenizer's pad_token_id is not None before assignment and by improving the clarity of the corresponding warning message.
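
A sketch of what that hardened version could look like, continuing the hypothetical helper from the description above:

```python
if config.pad_token_id is None:
    if tokenizer.pad_token_id is not None:
        config.pad_token_id = tokenizer.pad_token_id
        logger.warning(
            "pad_token_id is not set in the buffer config; falling back to "
            "the tokenizer's value: %s", tokenizer.pad_token_id,
        )
    else:
        # Neither source defines a pad token; make the failure mode visible.
        logger.warning(
            "pad_token_id is unset in both the config and the tokenizer; "
            "sampler_strategy may behave incorrectly."
        )
```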

@yanxi-chen (Collaborator) left a comment

lgtm

@yanxi-chen merged commit fac0e88 into agentscope-ai:main on Aug 20, 2025
1 check passed
@yaochaorui deleted the cr/pad_token branch on August 26, 2025 at 21:48
yaochaorui added a commit to yaochaorui/Trinity-RFT that referenced this pull request Aug 27, 2025