remove enforcement of non special when adding tokens #1521

Merged
10 commits merged on Apr 30, 2024

Conversation

ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Apr 30, 2024

Fix the issue that prevented us from adding special and non-special tokens with a single call to add_tokens.
This is important because calling add_tokens multiple times adds a huge slowdown (the added-token regex is re-compiled on every call).

>>> from transformers import AutoTokenizer, AddedToken
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

>>> tokenizer._tokenizer.add_tokens([AddedToken("<me-and-you>", special=True, normalized=False)])
1

>>> tokenizer._tokenizer.decode([128256], skip_special_tokens=True)
''

Before this change, it would output '<me-and-you>', and the token's special flag was always overridden when it was added.
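For illustration, here is a minimal sketch of the new behaviour (assumptions: an empty BPE model from the tokenizers library, made-up token strings, and default space-joined decoding since no decoder is set) mixing special and non-special tokens in a single add_tokens call:

>>> from tokenizers import Tokenizer, AddedToken
>>> from tokenizers.models import BPE
>>> tok = Tokenizer(BPE())  # hypothetical empty tokenizer, just to show the API
>>> tok.add_tokens([
...     AddedToken("<my-special>", special=True, normalized=False),
...     AddedToken("my_plain_word", special=False),
... ])
2
>>> ids = [tok.token_to_id("<my-special>"), tok.token_to_id("my_plain_word")]
>>> tok.decode(ids, skip_special_tokens=True)
'my_plain_word'
>>> tok.decode(ids, skip_special_tokens=False)
'<my-special> my_plain_word'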
This will allow us to do the following cleanup: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py#L169-L192

-        encoder = list(self.added_tokens_encoder.keys()) + [str(token) for token in tokens_to_add]
-        # if some of the special tokens are strings, we check if we don't already have a token
-        tokens_to_add += [
-            token for token in self.all_special_tokens_extended if token not in encoder and token not in tokens_to_add
-        ]
-        if len(tokens_to_add) > 0:
-            # super hack: if a token.special is set, tokenizer ignores it for now
-            # Accumulate added tokens into batches of special/non-special tokens, because calling add_tokens() for
-            # individual tokens would repeatedly rebuild a trie, which can be slow.
-            is_last_special = None
-            tokens = []
-            special_tokens = self.all_special_tokens
-            for token in tokens_to_add:
-                is_special = (
-                    (token.special or str(token) in special_tokens)
-                    if isinstance(token, AddedToken)
-                    else str(token) in special_tokens
-                )
-                if is_last_special is None or is_last_special == is_special:
-                    tokens.append(token)
-                else:
-                    self._add_tokens(tokens, special_tokens=is_last_special)
-                    tokens = [token]
-                is_last_special = is_special
-            if tokens:
-                self._add_tokens(tokens, special_tokens=is_last_special)
+        self._add_tokens(tokens_to_add)

This issue comes from c02d4e2, which was introduced in v0.8.
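For context on the slowdown mentioned above, here is a rough, hypothetical micro-benchmark (made-up token names; timings are machine-dependent) comparing a single batched add_tokens call with per-token calls on an empty tokenizer:

import time
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

# 500 made-up added tokens, alternating special / non-special.
toks = [AddedToken(f"<extra_{i}>", special=(i % 2 == 0)) for i in range(500)]

batched = Tokenizer(BPE())
t0 = time.perf_counter()
batched.add_tokens(toks)              # one call: the added-token matcher is rebuilt once
print("batched:    ", time.perf_counter() - t0)

one_by_one = Tokenizer(BPE())
t0 = time.perf_counter()
for t in toks:                        # 500 calls: the matcher is rebuilt on every call
    one_by_one.add_tokens([t])
print("one by one: ", time.perf_counter() - t0)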

@ArthurZucker ArthurZucker marked this pull request as ready for review April 30, 2024 13:28
Collaborator

@Narsil Narsil left a comment

LGTM.

Not sure if we should mark it as a breaking change (it technically is, although we're now more respectful of user intent, which could categorize it as a bugfix).

@ArthurZucker
Collaborator Author

We'll do a "breaking" release soon anyway! More a bug fix IMO!
