Add support for non-rust implemented tokenization for `__getitem__` method.
#24039
We can re-run that for you 😉
I am in favor of these changes, as I can confirm that fast tokenizers in Python support this indexing while slow ones don't. This further narrows the gap between the two.
Thanks for adding this. Let's just wait for the tests to pass. Also, can you update the error message, since this will allow indexing with integers?
The documentation is not available anymore as the PR was closed or merged.
Request for review :)
Thanks for adding this!
@jacklanda Could you update the error message as requested by @ArthurZucker?
@amyeroberts I have updated the error messages mentioned by @ArthurZucker.
Asking for a final review :)
Thanks again for adding!
…ethod. (huggingface#24039)
* Add support for non-rust implemented tokenization for `__getitem__` method.
* Update for error message on adding new sub-branch for `__item__` method.
---------
Co-authored-by: liuyang17 <liuyang17@zhihu.com>
Overview
This PR adds support for getting a slice from batch-tokenized sequences.
Without this PR, slicing appears to raise a KeyError with the following message: KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'
P.S. This scenario can be reproduced with newly uploaded models that do not support Rust-implemented tokenization, such as fnlp/moss-moon-003-sft. The issue can also be reproduced by running the following example script:
Error Message
All in all, I think it would be useful to support this in the `__getitem__` method on the Python side :)
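To make the intended behavior concrete, here is a minimal, self-contained sketch of a dict-like batch container whose `__getitem__` accepts string keys, slices, and (only when backend encodings exist) integers. `ToyBatchEncoding`, its `encodings` argument, and the error text are hypothetical names made up for illustration; this is not the transformers implementation, only an approximation of the idea under those assumptions.

```python
from collections import UserDict


class ToyBatchEncoding(UserDict):
    """Hypothetical stand-in for a batch-tokenizer output (not the transformers class)."""

    def __init__(self, data, encodings=None):
        super().__init__(data)
        # Assumption: backend encodings exist only for Rust-backed ("fast") tokenizers.
        self._encodings = encodings

    def __getitem__(self, item):
        if isinstance(item, str):
            # String key: return the whole field, e.g. "input_ids".
            return self.data[item]
        if isinstance(item, slice):
            # Slice: apply the same slice to every field of the batch.
            return {key: values[item] for key, values in self.data.items()}
        if isinstance(item, int) and self._encodings is not None:
            # Integer: return the backend encoding for that batch index.
            return self._encodings[item]
        # Hypothetical error text, written for this sketch only.
        raise KeyError(
            "Invalid key: expected a string, a slice, or an integer "
            "(integers require backend encodings)."
        )


batch = ToyBatchEncoding(
    {
        "input_ids": [[101, 7592, 102], [101, 2088, 102], [101, 999, 102]],
        "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    }
)
first_two = batch[:2]  # works even without backend encodings
```

In this sketch, slicing never touches the backend encodings, so it works for Python-based ("slow") tokenizers too, while integer indexing still fails with a KeyError when no backend encodings are available.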
Note that this PR is related to a previously closed one:
#23645