Add support for non-rust implemented tokenization for `__getitem__` method. #24039

jacklanda · 2023-06-06T04:43:29Z

Overview

This PR adds support for getting a slice from batch-tokenized sequences.

Without this PR, slicing raises a `KeyError` with the following message: KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'

P.S. The scenario above can be reproduced with recently uploaded models that do not support Rust-implemented tokenization, such as fnlp/moss-moon-003-sft. The following example script also reproduces the issue:

# test script `/home/workspace/test.py` for this PR. 
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("fnlp/moss-moon-003-sft", trust_remote_code=True)
tok.add_special_tokens({"pad_token": "[PAD]"})

texts = ["Today is a good day!", "It's a good idea!", "How's going?"]
batch_tok = tok(texts, padding=True)
print(batch_tok[0:3])  # report `KeyError` here

Error Message

Traceback (most recent call last):
  File "/home/workspace/test.py", line 8, in <module>
    print(batch_tok[0:3])  # report `KeyError` here
  File "/home/app/anaconda3/envs/test/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 242, in __getitem__
    raise KeyError(
KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not available when using Python based tokenizers'

All in all, I think it would be useful to support slicing in the `__getitem__` method on the Python side :)

Note that this PR is related to the previously closed one, #23645.
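
As a rough illustration of the slicing behaviour proposed in the overview, the dispatch inside `__getitem__` could look something like the sketch below. This is not the code from this PR or from transformers: `SketchBatch`, its `encodings` argument, and the wording of the final error message are hypothetical names chosen for the example, and the class only imitates a container that holds per-key lists plus optional backend encodings.

```python
from collections import UserDict


class SketchBatch(UserDict):
    """Hypothetical stand-in for a batch-encoding container (not the transformers class)."""

    def __init__(self, data, encodings=None):
        super().__init__(data)
        # Assumption: `encodings` plays the role of backend (Rust) Encoding objects;
        # it stays None for Python-based tokenizers.
        self._encodings = encodings

    def __getitem__(self, item):
        if isinstance(item, str):
            # String key: return the whole column, e.g. "input_ids".
            return self.data[item]
        if self._encodings is not None:
            # Backend encodings are available: index them directly.
            return self._encodings[item]
        if isinstance(item, slice):
            # No backend: slice every column along the batch dimension.
            return {key: values[item] for key, values in self.data.items()}
        # Illustrative message only; the real wording may differ.
        raise KeyError(
            "Invalid key: expected a string, an integer (backend encodings only), or a slice."
        )


batch = SketchBatch(
    {
        "input_ids": [[1, 2], [3, 4], [5, 6]],
        "attention_mask": [[1, 1], [1, 1], [1, 0]],
    }
)
subset = batch[0:2]  # dict holding the first two rows of every column
```

In this sketch, an integer index such as `batch[0]` still raises `KeyError` when no backend encodings exist, which is why the error message needs to mention every accepted key type.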

jacklanda · 2023-06-06T05:08:38Z

It seems the workflow failed due to a Read Timeout in the T5-related tests.
How can I rerun it?

ArthurZucker · 2023-06-06T08:25:28Z

We can re-run that for you 😉

ArthurZucker

I am in favor of these changes, as I can confirm that fast tokenizers in Python support this indexing while slow ones don't. This further narrows the gap between the two.
Thanks for adding this. Let's just wait for the tests to pass. Also, can you update the error message, since this will allow indexing with integers?

HuggingFaceDocBuilderDev · 2023-06-06T08:41:21Z

The documentation is not available anymore as the PR was closed or merged.

jacklanda · 2023-06-06T14:03:06Z

Request for review :)

amyeroberts

Thanks for adding this!

amyeroberts · 2023-06-06T15:00:02Z

@jacklanda Could you update the error message as requested by @ArthurZucker?

jacklanda · 2023-06-06T16:45:19Z

> @jacklanda Could you update the error message as requested by @ArthurZucker?

@amyeroberts I have updated the error message as requested by @ArthurZucker.
Thanks.

jacklanda · 2023-06-07T09:25:14Z

Requesting a final review :)

amyeroberts

Thanks again for adding!

…ethod. (huggingface#24039)
* Add support for non-rust implemented tokenization for `__getitem__` method.
* Update for error message on adding new sub-branch for `__item__` method.
---------
Co-authored-by: liuyang17 <liuyang17@zhihu.com>

Add support for non-rust implemented tokenization for __getitem__ method.

0bd0dd5

ArthurZucker reviewed Jun 6, 2023

amyeroberts approved these changes Jun 6, 2023

amyeroberts self-requested a review June 6, 2023 15:00

Update for error message on adding new sub-branch for __item__ method.

4e92bf8

jacklanda changed the title ~~Add support for non-rust implemented tokenization for __getitem__ m…~~ Add support for non-rust implemented tokenization for __getitem__ method. Jun 6, 2023

amyeroberts approved these changes Jun 7, 2023

amyeroberts merged commit 1e4a773 into huggingface:main Jun 7, 2023
