Add support for non-rust implemented tokenization for `__getitem__` method. (huggingface#24039)

* Add support for non-rust implemented tokenization for `__getitem__` method.

* Update the error message after adding a new sub-branch to the `__getitem__` method.

---------

Co-authored-by: liuyang17 <liuyang17@zhihu.com>
2 people authored and novice03 committed Jun 23, 2023
1 parent d1b7753 commit 2c5f1f1
Showing 1 changed file with 7 additions and 2 deletions.
9 changes: 7 additions & 2 deletions src/transformers/tokenization_utils_base.py
@@ -233,15 +233,20 @@ def __getitem__(self, item: Union[int, str]) -> Union[Any, EncodingFast]:
         etc.).
 
         If the key is an integer, get the `tokenizers.Encoding` for batch item with index `key`.
+
+        If the key is a slice, returns the value of the dict associated to `key` ('input_ids', 'attention_mask', etc.)
+        with the constraint of slice.
         """
         if isinstance(item, str):
             return self.data[item]
         elif self._encodings is not None:
             return self._encodings[item]
+        elif isinstance(item, slice):
+            return {key: self.data[key][item] for key in self.data.keys()}
         else:
             raise KeyError(
-                "Indexing with integers (to access backend Encoding for a given batch index) "
-                "is not available when using Python based tokenizers"
+                "Invalid key. Only three types of key are available: "
+                "(1) string, (2) integers for backend Encoding, and (3) slices for data subsetting."
             )
 
     def __getattr__(self, item: str):
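
The dispatch this change introduces can be tried in isolation with the minimal sketch below. `MiniBatchEncoding` is a hypothetical stand-in written for illustration, not the library class: it models only the `data` dict, the `_encodings` attribute (left as `None` to mimic a Python-based tokenizer), and the `__getitem__` branches. The sketch indexes each field with `item`, the slice object passed by the caller. The token ids are made-up sample values.

```python
from typing import Any, Dict, List, Optional, Union


class MiniBatchEncoding:
    """Hypothetical stand-in for a batch encoding; only __getitem__ is modeled."""

    def __init__(self, data: Dict[str, List[Any]], encodings: Optional[List[Any]] = None):
        self.data = data
        # None mimics a Python-based (non-Rust) tokenizer with no backend encodings.
        self._encodings = encodings

    def __getitem__(self, item: Union[int, str, slice]) -> Any:
        if isinstance(item, str):
            # String key: return the whole field, e.g. "input_ids".
            return self.data[item]
        elif self._encodings is not None:
            # Backend encodings available: integers and slices index into them.
            return self._encodings[item]
        elif isinstance(item, slice):
            # No backend encodings: apply the slice to every field.
            return {key: self.data[key][item] for key in self.data.keys()}
        else:
            raise KeyError(
                "Invalid key. Only three types of key are available: "
                "(1) string, (2) integers for backend Encoding, and (3) slices for data subsetting."
            )


# Made-up sample batch of three sequences.
batch = MiniBatchEncoding(
    {
        "input_ids": [[101, 7592, 102], [101, 2088, 102], [101, 999, 102]],
        "attention_mask": [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
    }
)

by_name = batch["input_ids"]  # string key
first_two = batch[:2]         # slice key, applied to every field

# Integer key without backend encodings falls through to the KeyError branch.
try:
    batch[0]
    int_error = None
except KeyError as exc:
    int_error = exc
```

Branch order matters here: because the `_encodings` check precedes the slice check, a slice on an object that does have backend encodings returns a slice of those encodings rather than a dict of sliced fields.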
