[Bugfix] Fix Incremental Detokenization with tokenizers == 0.22.0
#24159
Conversation
Signed-off-by: Fanli Lin <fanli.lin@intel.com>
Code Review
This pull request addresses a compatibility issue with tokenizers==0.22.0 which changed an error message format, causing failures in incremental detokenization. The changes correctly adapt the error handling to work with both old and new versions of the library by using a substring check instead of an exact match on the error message, and by using a keyword argument for DecodeStream which is now required. My review includes a suggestion to make the error message check even more robust by using startswith instead of in, to reduce the chance of incorrectly handling unrelated exceptions.
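The review's suggestion can be illustrated with a minimal sketch. The error message text below is a hypothetical placeholder, not the exact tokenizers string; the point is only the matching strategy: anchoring with startswith tolerates version-specific detail appended to the message, while still rejecting unrelated exceptions that merely mention the phrase mid-message (which a plain `in` check would misclassify).

```python
# Hypothetical error prefix; the real tokenizers message differs.
ERR_PREFIX = "invalid utf-8 sequence"

def looks_like_decode_err(exc: Exception) -> bool:
    # startswith anchors the match at the start of the message, so an
    # unrelated exception that only mentions the phrase mid-message is
    # not misclassified, unlike a substring (`in`) check.
    return str(exc).startswith(ERR_PREFIX)

old_err = ValueError("invalid utf-8 sequence")                  # short, older form
new_err = ValueError("invalid utf-8 sequence at byte offset 7") # longer, newer form
unrelated = ValueError("wrapped: invalid utf-8 sequence was logged")

assert looks_like_decode_err(old_err)
assert looks_like_decode_err(new_err)
assert not looks_like_decode_err(unrelated)  # `in` would wrongly match here
```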
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Fanli Lin <fanli0116@gmail.com>
      " for request %s, resetting decode stream.", self.request_id)
-     self.stream = DecodeStream(self.skip_special_tokens)
+     self.stream = DecodeStream(
+         skip_special_tokens=self.skip_special_tokens)
Is this change necessary? Does DecodeStream introduce more args?
Yes, as can be seen from: https://github.com/huggingface/tokenizers/pull/1856/files#diff-780be0b9b76e7260ed2be2249d17c4a879b7e2e98e9f30f26dc9c65501f775d1R672. Otherwise, we would get an error that ids should not be of Boolean type.
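The discussion above can be made concrete with a hypothetical stand-in for DecodeStream (the real class lives in the tokenizers library; the `ids` parameter name below is taken from the linked PR, but the class here is a mock, not the actual implementation). If a new leading parameter is added, the old positional call binds the boolean to it, producing the "ids should not be Boolean type" error, while the keyword call stays unambiguous under either signature.

```python
# Mock of the hypothesized tokenizers 0.22.0 signature, for illustration only.
class FakeDecodeStream:
    def __init__(self, ids=None, skip_special_tokens=False):
        # Mirrors the reported failure mode: a positional bool lands in `ids`.
        if isinstance(ids, bool):
            raise TypeError("ids should not be Boolean type")
        self.ids = ids
        self.skip_special_tokens = skip_special_tokens

# Old positional call style now fails:
try:
    FakeDecodeStream(True)
    raised = False
except TypeError:
    raised = True
assert raised

# Keyword call works regardless of where the parameter sits:
stream = FakeDecodeStream(skip_special_tokens=True)
assert stream.skip_special_tokens is True
```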
cc @njhill Please take a review, thanks!
Thanks @faaany!
tokenizers == 0.22.0
…llm-project#24159) Signed-off-by: Fanli Lin <fanli.lin@intel.com> Signed-off-by: Fanli Lin <fanli0116@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…llm-project#24159) Signed-off-by: Fanli Lin <fanli.lin@intel.com> Signed-off-by: Fanli Lin <fanli0116@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
This PR updates the _protected_step method to incorporate the latest change of DecodeStream in tokenizers 0.22.0 to avoid the following UT failure:
Error Log:
Since the latest transformers v4.56 automatically installs the latest tokenizers 0.22.0, we need to update the code logic introduced in PR #19449 so that it works with both tokenizers 0.21.4 and 0.22.0.
Test Result
With the fix in this PR, test_fast_inc_detok_invalid_utf8_err_case can pass:
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.