Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for non-rust implemented tokenization for __getitem__ method. #24039

Merged
merged 2 commits into from
Jun 7, 2023

Conversation

jacklanda
Copy link
Contributor

@jacklanda jacklanda commented Jun 6, 2023

Overview

This PR is going to add a support for the usage scenario of "getting a slice from the batch-tokenized sequences".

Without this PR, it seems to raise KeyError with the following message KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'

P.S. The above scenario could be reproduced by using some models new uploaded but not support to Rust-implemented tokenization, such as fnlp/moss-moon-003-sft. Also we can run the following examplar script for reproducing this issue:

# test script `/home/workspace/test.py` for this PR. 
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fnlp/moss-moon-003-sft", trust_remote_code=True)
tok.add_special_tokens({"pad_token": "[PAD]"})

texts = ["Today is a good day!", "It's a good idea!", "How's going?"]
batch_tok = tok(texts, padding=True)
print(batch_tok[0:3])  # report `KeyError` here

Error Message

Traceback (most recent call last):
  File "/home/workspace/test.py", line 8, in <module>
    print(batch_tok[0:3])  # report `KeyError` here
  File "/home/app/anaconda3/envs/test/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 242, in __getitem__
    raise KeyError(
KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'

All in all, I think it seems useful to implement getitem method behind it in Python side :)

Note that this PR is associative with the previous closed one.
#23645

@jacklanda
Copy link
Contributor Author

It seems that failed in workflow due to Read Timeout on T5-relevant testing.
How can I rerun for this?

图片
图片

@ArthurZucker
Copy link
Collaborator

We can re-run that for you 😉

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favor of these changes as I can confirm that fast tokenizer in python support this indexing while slow ones don't. This is narrowing further the gap between the two.
Thanks for adding this, let's just wait for the tests to pass + can you update the error message since this will allow indexing with integers.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jun 6, 2023

The documentation is not available anymore as the PR was closed or merged.

@jacklanda
Copy link
Contributor Author

Request for review :)

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this!

@amyeroberts
Copy link
Collaborator

@jacklanda Could you update the error message as requested by @ArthurZucker?

@amyeroberts amyeroberts self-requested a review June 6, 2023 15:00
@jacklanda jacklanda changed the title Add support for non-rust implemented tokenization for __getitem__ m… Add support for non-rust implemented tokenization for __getitem__ method. Jun 6, 2023
@jacklanda
Copy link
Contributor Author

@jacklanda Could you update the error message as requested by @ArthurZucker?

@amyeroberts Have updated the mentioned error messages by @ArthurZucker
Thanks.

@jacklanda
Copy link
Contributor Author

Ask for final review :)

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for adding!

@amyeroberts amyeroberts merged commit 1e4a773 into huggingface:main Jun 7, 2023
novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
…ethod. (huggingface#24039)

* Add support for non-rust implemented tokenization for `__getitem__` method.

* Update for error message on adding new sub-branch for `__item__` method.

---------

Co-authored-by: liuyang17 <liuyang17@zhihu.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants