Tokenizer ignoring multiple spaces #40

Closed
jorgemcgomes opened this issue Jun 7, 2023 · 11 comments
@jorgemcgomes

It appears the tokenizer is ignoring more than one consecutive space.
This behaviour is not observed with the original LLaMA tokenizer. See the examples below.

Is this some issue with the configuration of the HF tokenizer? Or has the model really been trained like this?
This seems like a very big deal for everything concerning code understanding/generation.

OpenLLaMA

from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained('openlm-research/open_llama_3b', use_fast=False)

>>> tokenizer("hello world")
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}
>>> tokenizer("hello     world")
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}

>>> tokenizer("hello\nworld")
{'input_ids': [1, 27701, 13, 7904], 'attention_mask': [1, 1, 1, 1]}

>>> tokenizer("hello\n world")
{'input_ids': [1, 27701, 13, 924], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello\n       world")
{'input_ids': [1, 27701, 13, 924], 'attention_mask': [1, 1, 1, 1]}

# line breaks seem fine
>>> tokenizer("hello\nworld")
{'input_ids': [1, 27701, 13, 7904], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello\n\nworld")
{'input_ids': [1, 27701, 13, 13, 7904], 'attention_mask': [1, 1, 1, 1, 1]}
>>> tokenizer("hello\n\n\nworld")
{'input_ids': [1, 27701, 13, 13, 13, 7904], 'attention_mask': [1, 1, 1, 1, 1, 1]}

Original LLaMA

tokenizer = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf', use_fast=False)
>>> tokenizer("hello world")
{'input_ids': [0, 22172, 3186], 'attention_mask': [1, 1, 1]}
>>> tokenizer("hello  world")
{'input_ids': [0, 22172, 29871, 3186], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello   world")
{'input_ids': [0, 22172, 259, 3186], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello    world")
{'input_ids': [0, 22172, 1678, 3186], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer("hello     world")
{'input_ids': [0, 22172, 268, 3186], 'attention_mask': [1, 1, 1, 1]}
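A decode round-trip makes the collapse visible too; a minimal sketch (the decoded string is what I'd expect from the ids above, not something re-verified here):

# decode round-trip with the OpenLLaMA slow tokenizer from the first block
>>> tok_open = LlamaTokenizer.from_pretrained('openlm-research/open_llama_3b', use_fast=False)
>>> ids = tok_open("hello     world")["input_ids"]
>>> tok_open.decode(ids, skip_special_tokens=True)
'hello world'  # the run of spaces comes back as a single space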
@danielhanchen

Interestingly, all the old model checkpoints also have the same issue if one uses use_fast = False. use_fast = True succeeds, although multiple spaces are then tokenized independently (are multiple spaces supposed to be tokenized independently, though?)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b_350bt_preview', use_fast = False)
tokenizer("hello    world")

returns
{'input_ids': [0, 27701, 924], 'attention_mask': [1, 1, 1]}

tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b_600bt_preview', use_fast = False)
tokenizer("hello     world")

returns
{'input_ids': [1, 27701, 924], 'attention_mask': [1, 1, 1]}

tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_3b_600bt_preview', use_fast = True)
tokenizer("hello     world")

returns
{'input_ids': [1, 27701, 31822, 31822, 31822, 31822, 924], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

At first I thought my contributions, which were to enable use_fast = True (https://huggingface.co/openlm-research/open_llama_3b_600bt_preview/discussions/3) (https://huggingface.co/openlm-research/open_llama_7b_700bt_preview/discussions/2), might have caused the error, but I did not contribute to open_llama_3b_350bt_preview, yet the error persists there as well.

I already opened 3 PRs, one to each of the 3B, 7B and 13B models, which allow use_fast = True to load in seconds rather than 5 minutes (since HF otherwise converts the slow tokenizer to a fast one under the hood), and that should coincidentally also solve the space tokenization issue.

https://huggingface.co/openlm-research/open_llama_13b_600bt/discussions/1
https://huggingface.co/openlm-research/open_llama_3b/discussions/2
https://huggingface.co/openlm-research/open_llama_7b/discussions/1

But my main question is still whether spaces are supposed to be tokenized independently. I.e. are 2 spaces just 2 space tokens and 3 spaces 3 individual space tokens, rather than the original LLaMA behaviour where 2 spaces = token id X, 3 spaces = token id Y, etc.?

PS: in the meantime, you can use my tokenizers, which already implement use_fast = True:

tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b")
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_7b")
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_13b_600bt")

@jorgemcgomes

jorgemcgomes commented Jun 8, 2023

Thanks @danielhanchen. I was just trying your tokenizers, and they do "solve" the spaces issue.

As for the space merging, I think it depends on whether the vocab has a token for multiple spaces or not (and how many).

Testing the original LLaMA tokenizer, we can see it does have them, up to 16 (!) spaces.
In the examples below, note that ▁ is not an underscore; it is the Unicode character U+2581 used by the tokenizer to represent a space.

from transformers import LlamaTokenizer
tok_original = LlamaTokenizer.from_pretrained('decapoda-research/llama-7b-hf', use_fast=True)

>>> tok_original.get_vocab()["▁"]
29871
>>> tok_original.get_vocab()["▁▁"]
259
>>> tok_original.get_vocab()["▁▁▁"]
1678
>>> tok_original.get_vocab()["▁▁▁▁"]
268
>>> tok_original.get_vocab()["▁▁▁▁▁"]
418
>>> tok_original.get_vocab()["▁▁▁▁▁▁"]
539
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁"]
4706
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁"]
308
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁"]
3986
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁"]
965
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁"]
9651
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁"]
632
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁"]
795
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁▁"]
1669
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁"]
18884
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁"]
462
>>> tok_original.get_vocab()["▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁'

The OpenLLaMA tokenizer only has the single-space token, though:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast=False)

>>> tok.get_vocab()["▁"]
31822
>>> tok.get_vocab()["▁▁"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: '▁▁'
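The same check can be written as a small loop; a sketch that finds the longest run of space tokens in a vocab (assuming the ▁ / U+2581 convention shown above):

def max_space_run(tokenizer):
    # longest run of "▁" (U+2581) characters that exists as a single vocab entry
    vocab = tokenizer.get_vocab()
    n = 0
    while "\u2581" * (n + 1) in vocab:
        n += 1
    return n

>>> max_space_run(tok_original)
16
>>> max_space_run(tok)
1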

I think it is smart to have multiple spaces tokenized as a single token. When it comes to code data, for example, it represents an enormous saving of tokens. Just think of all the tokens spent to encode simple indentation...

If OpenLLaMA was indeed trained like this, that's very unfortunate.

@danielhanchen

Coolies on trying out my temporary tokenizers! :)

Interesting find on LLaMA's support for up to 16 spaces! I think OpenLLaMA did do the individual digit splitting correctly, just maybe not the spaces.

Quote from https://arxiv.org/pdf/2302.13971.pdf:

Tokenizer. We tokenize the data with the byte-pair encoding (BPE) algorithm (Sennrich et al., 2015), using the implementation from SentencePiece (Kudo and Richardson, 2018). Notably, we split all numbers into individual digits, and fallback to bytes to decompose unknown UTF-8 characters.

The original LLaMA paper doesn't really mention spaces, so presumably they're just treated like any other tokens.
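A quick way to see the digit splitting the paper describes (a sketch, not a test of the actual training setup):

from transformers import AutoTokenizer

# if numbers are split into individual digits, "12345" should come back as
# one token per digit (possibly with a leading ▁ marker on the first one)
tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast = False)
print(tok.convert_ids_to_tokens(tok("12345", add_special_tokens = False)["input_ids"]))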

@joytianya

joytianya commented Jun 9, 2023

When fine-tuning on code data downstream with https://github.com/young-geng/EasyLM/tree/main, there are significant issues. Spaces are usually used for indentation, and the result is that the indentation disappears.
Is there any way to solve this?

The code comes out without indentation, for example:

def bubble_sort(arr):
 n = len(arr)
 for i in range(n-1):
 for j in range(n-i-1):
 if arr[j] > arr[j+1]:
 arr[j], arr[j+1] = arr[j+1], arr[j]
 return arr
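A quick way to check whether a tokenizer will drop indentation before committing to a fine-tuning run is an encode/decode round-trip; a minimal sketch:

from transformers import AutoTokenizer

code = "def f(x):\n    return x + 1\n"
tok = AutoTokenizer.from_pretrained("openlm-research/open_llama_3b", use_fast=False)
ids = tok(code, add_special_tokens=False)["input_ids"]
# if the 4-space indent survives the round-trip, spaces are not being collapsed
print(repr(tok.decode(ids)))
print(repr(code))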

@danielhanchen

@joytianya I coincidentally opened 3 PRs to fix the 3B, 7B and 13B tokenizers :) If you're in a rush, you can temporarily use my tokenizers, which are the ones I pushed to the OpenLLaMA team's repos:

tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b")
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_7b")
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_13b_600bt")

@young-geng
Contributor

young-geng commented Jun 9, 2023

This is indeed a mistake on our side, as we have misconfigured the tokenizer to remove repeated spaces. I've updated that configuration and now the tokenizer should preserve all spaces. Please try it out.
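For anyone who wants to verify this kind of configuration locally, and assuming the setting lives in the SentencePiece model itself rather than the HF config, the normalizer spec's remove_extra_whitespaces flag is one place to look; a hedged sketch using the sentencepiece protobuf API on a downloaded tokenizer.model:

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:  # path to a locally downloaded model file
    m.ParseFromString(f.read())
# True means SentencePiece collapses runs of whitespace before tokenization
print(m.normalizer_spec.remove_extra_whitespaces)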

@belladoreai

belladoreai commented Jun 9, 2023

@danielhanchen What are the differences between the 3B, 7B, and 13B tokenizers? I ask because I've been working for a few days to create a client-side JavaScript tokenizer for LLaMA, and I used the 13B tokenizer as a reference. I assumed that the tokenizer is the same for these different LLaMA versions, but maybe it's not?

@codesoap

codesoap commented Jun 9, 2023

When I compare the three tokenizers, they seem to be the same:

$ curl -L https://huggingface.co/openlm-research/open_llama_3b/resolve/main/tokenizer.model -o tokenizer.model.3b
$ curl -L https://huggingface.co/openlm-research/open_llama_7b/resolve/main/tokenizer.model -o tokenizer.model.7b
$ curl -L https://huggingface.co/openlm-research/open_llama_13b_600bt/resolve/main/tokenizer.model -o tokenizer.model.13b

$ sha256 tokenizer.model.*
SHA256 (tokenizer.model.13b) = 81c4a3c9a9bbad64636d93660b6982940cec979a398f42684ba7194d118a3f21
SHA256 (tokenizer.model.3b) = 81c4a3c9a9bbad64636d93660b6982940cec979a398f42684ba7194d118a3f21
SHA256 (tokenizer.model.7b) = 81c4a3c9a9bbad64636d93660b6982940cec979a398f42684ba7194d118a3f21

@danielhanchen

danielhanchen commented Jun 9, 2023

@belladoreai Yep, as @codesoap showed, it seems the OpenLLaMA team most likely trained one tokenizer on the entire 1T-token RJ (RedPajama) dataset and then used it for all 3 models.

But anyway, it seems @young-geng has successfully fixed the tokenizers - I just checked all 3 (3B, 7B, 13B).

For example:

tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_13b_600bt", pad_token = "</s>", use_fast = False)
tokenizer("Hello 1  2   3    4")

successfully returns:
{'input_ids': [1, 16644, 31822, 31853, 31822, 31822, 31855, 31822, 31822, 31822, 31878, 31822, 31822, 31822, 31822, 31882], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

I also updated the use_fast = True alternatives, which enable Hugging Face's batch processing, on my tokenizer-only repos for those who need fast tokenization:
https://huggingface.co/danielhanchen/open_llama_3b
https://huggingface.co/danielhanchen/open_llama_7b
https://huggingface.co/danielhanchen/open_llama_13b_600bt
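For reference, a batched call with the fast tokenizer looks like this (a usage sketch; the pad_token follows the pattern above and return_tensors assumes PyTorch is installed):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b", pad_token = "</s>")
batch = tokenizer(["hello  world", "def f():\n    pass"], padding = True, return_tensors = "pt")
print(batch["input_ids"].shape)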

@young-geng
Contributor

We've just released a 7B v2 model with a better tokenizer, pretrained on a lot of code data. Check it out!

@danielhanchen

@young-geng Congrats on the 7B v2 release! I can see multiple spaces are now tokenized properly! Good work!
