remove enforcement of non special when adding tokens #1521

Merged
10 commits merged on Apr 30, 2024

Conversation

ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Apr 30, 2024

Fix the issue that prevented us from adding special and non-special tokens with a single call to add_tokens.
This is important because calling add_tokens multiple times adds a huge slowdown (the added-token regex is re-compiled on every call).

>>> from transformers import AutoTokenizer, AddedToken
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

>>> tokenizer._tokenizer.add_tokens([AddedToken("<me-and-you>", special=True, normalized=False)])
1

>>> tokenizer._tokenizer.decode([128256], skip_special_tokens=True)
''

Before this change, it would output '<me-and-you>', and the token's special flag was always overridden when it was added.
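For illustration, here is a minimal sketch of the new behaviour (assumptions: an empty BPE model from the tokenizers library, made-up token strings, and default space-joined decoding since no decoder is set) mixing special and non-special tokens in a single add_tokens call:

>>> from tokenizers import Tokenizer, AddedToken
>>> from tokenizers.models import BPE
>>> tok = Tokenizer(BPE())  # hypothetical empty tokenizer, just to show the API
>>> tok.add_tokens([
...     AddedToken("<my-special>", special=True, normalized=False),
...     AddedToken("my_plain_word", special=False),
... ])
2
>>> ids = [tok.token_to_id("<my-special>"), tok.token_to_id("my_plain_word")]
>>> tok.decode(ids, skip_special_tokens=True)
'my_plain_word'
>>> tok.decode(ids, skip_special_tokens=False)
'<my-special> my_plain_word'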
This will allow us to do the following cleanup: https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_fast.py#L169-L192

-        encoder = list(self.added_tokens_encoder.keys()) + [str(token) for token in tokens_to_add]
-        # if some of the special tokens are strings, we check if we don't already have a token
-        tokens_to_add += [
-            token for token in self.all_special_tokens_extended if token not in encoder and token not in tokens_to_add
-        ]
-        if len(tokens_to_add) > 0:
-            # super hack: if a token.special is set, tokenizer ignores it for now
-            # Accumulate added tokens into batches of special/non-special tokens, because calling add_tokens() for
-            # individual tokens would repeatedly rebuild a trie, which can be slow.
-            is_last_special = None
-            tokens = []
-            special_tokens = self.all_special_tokens
-            for token in tokens_to_add:
-                is_special = (
-                    (token.special or str(token) in special_tokens)
-                    if isinstance(token, AddedToken)
-                    else str(token) in special_tokens
-                )
-                if is_last_special is None or is_last_special == is_special:
-                    tokens.append(token)
-                else:
-                    self._add_tokens(tokens, special_tokens=is_last_special)
-                    tokens = [token]
-                is_last_special = is_special
-            if tokens:
-                self._add_tokens(tokens, special_tokens=is_last_special)
+        self._add_tokens(tokens_to_add)

This issue comes from c02d4e2, which was introduced in v0.8.
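For context on the slowdown mentioned above, here is a rough, hypothetical micro-benchmark (made-up token names; timings are machine-dependent) comparing a single batched add_tokens call with per-token calls on an empty tokenizer:

import time
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

# 500 made-up added tokens, alternating special / non-special.
toks = [AddedToken(f"<extra_{i}>", special=(i % 2 == 0)) for i in range(500)]

batched = Tokenizer(BPE())
t0 = time.perf_counter()
batched.add_tokens(toks)              # one call: the added-token matcher is rebuilt once
print("batched:    ", time.perf_counter() - t0)

one_by_one = Tokenizer(BPE())
t0 = time.perf_counter()
for t in toks:                        # 500 calls: the matcher is rebuilt on every call
    one_by_one.add_tokens([t])
print("one by one: ", time.perf_counter() - t0)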

@ArthurZucker ArthurZucker marked this pull request as ready for review April 30, 2024 13:28
Collaborator

@Narsil Narsil left a comment

LGTM.

Not sure if we should mark it as a breaking change (it technically is, although we're now more respectful of user intent, which could categorize it as a bugfix).

@ArthurZucker
Collaborator Author

We'll do a "breaking" release soon anyway! More a bug fix IMO!
