Auto-set pad_token_id when the default is None and not set in the buffer config. #188
Conversation
Summary of Changes
Hello @yaochaorui, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses an issue where the sampler_strategy could be incorrect when using models like Llama, which default pad_token_id to None. It introduces logic to automatically set the pad_token_id in the configuration after tokenizer loading if it's not explicitly defined, ensuring consistency with the pad_token_id determined by the tokenizer itself (e.g., by Verl/vLLM). This change aims to maintain alignment between the internal configuration and the actual tokenizer setup, preventing potential issues with sampling.
Highlights
- Configuration Update: The system now automatically updates the pad_token_id in the global configuration if it's None when the tokenizer is loaded.
- Sampler Strategy Consistency: Ensures that the sampler_strategy correctly utilizes the pad_token_id that the tokenizer (e.g., from Verl/vLLM) has automatically determined.
- Config Alignment: Improves the alignment between the internal configuration and the actual tokenizer's settings.
- Logging: A warning message is now logged when the pad_token_id is automatically set.
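For context, a minimal illustration of the underlying behavior (the model name is just an example, and the EOS fallback shown is the common convention rather than necessarily the exact logic Verl/vLLM apply):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(tokenizer.pad_token_id)  # None -- Llama defines no pad token

# Typical fallback applied during tokenizer loading: reuse the EOS token for padding.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

print(tokenizer.pad_token_id)  # now equals tokenizer.eos_token_id
```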
Code Review
This pull request correctly addresses an issue where pad_token_id might not be set for certain models, causing inconsistencies. The proposed solution is to automatically set it from the tokenizer after it's loaded, which is a sensible approach.
My feedback focuses on making the implementation more robust and readable by adding a check to ensure the tokenizer's pad_token_id is not None before assignment and by improving the clarity of the corresponding warning message.
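A sketch of what the review's suggestion could look like (the function name, config attribute, and logging setup are hypothetical, not the PR's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def sync_pad_token_id(config, tokenizer) -> None:
    """Copy the tokenizer's pad_token_id into the config when the config has none."""
    if config.pad_token_id is not None:
        return  # an explicitly configured value always wins
    if tokenizer.pad_token_id is None:
        # Robustness check suggested in review: never write None back into the config.
        logger.warning("Tokenizer has no pad_token_id; config left unchanged.")
        return
    config.pad_token_id = tokenizer.pad_token_id
    logger.warning(
        "pad_token_id was unset in the config; auto-set to %d from the tokenizer "
        "so the sampler strategy stays consistent.",
        tokenizer.pad_token_id,
    )
```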
Force-pushed from 1f13ee3 to 486dbbe
Description
Background:
When using models like Llama where the default pad_token_id is None, the sampler_strategy could be incorrect if pad_token_id wasn't explicitly set in the config. Both Verl and vLLM automatically handle this during tokenizer loading by setting appropriate pad_token_id values.
Purpose:
Update the config after tokenizer loading to ensure the pad_token_id is properly set, making the sampler_strategy consistent and correct.
Changes made:
- Auto-set pad_token_id when it's None in the original config
- Use the pad_token_id that Verl/vLLM have automatically determined during tokenizer initialization
How to test:
- Run with a model like Llama that has no pad_token_id in the initial config
- Verify that the config is updated with the tokenizer's pad_token_id
- Run with a config that explicitly sets pad_token_id to ensure no regression (see the sketch below)
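A self-contained sketch of these checks (BufferConfig, FakeTokenizer, and sync_pad_token_id are stand-ins for the project's real types, included only so the tests run on their own):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FakeTokenizer:
    pad_token_id: Optional[int]

@dataclass
class BufferConfig:
    pad_token_id: Optional[int] = None

def sync_pad_token_id(config: BufferConfig, tokenizer: FakeTokenizer) -> None:
    # Stand-in for the PR's logic: only fill in a missing config value.
    if config.pad_token_id is None and tokenizer.pad_token_id is not None:
        config.pad_token_id = tokenizer.pad_token_id

def test_pad_token_id_auto_set():
    config = BufferConfig(pad_token_id=None)
    sync_pad_token_id(config, FakeTokenizer(pad_token_id=128001))
    assert config.pad_token_id == 128001  # config now mirrors the tokenizer

def test_explicit_pad_token_id_not_overwritten():
    config = BufferConfig(pad_token_id=0)
    sync_pad_token_id(config, FakeTokenizer(pad_token_id=128001))
    assert config.pad_token_id == 0  # explicit value preserved (no regression)
```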
Checklist
Please check the following items before the code is ready to be reviewed.