Missing BOS tokens for HF tokenizer #11297

Open
domenVres opened this issue Nov 15, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug

The current NeMo wrapper around the HF AutoTokenizer does not behave the same way as the original HF tokenizer. I noticed the problem with the BOS token (which is extremely important for the Gemma 2 model). Specifically, the problem occurs in the text_to_ids function,

def text_to_ids(self, text):
    tokens = self.text_to_tokens(text)
    ids = self.tokens_to_ids(tokens)
    return ids
which is relevant for preprocess_data_for_megatron.py. Because this function first converts text to tokens and then tokens to IDs, it does not prepend the BOS token the way the HF encode method does (I assume the problem applies to other special tokens as well).
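To make the difference between the two paths concrete without downloading a model, here is a minimal sketch using a toy stand-in. ToyTokenizer, its vocabulary, and its IDs are hypothetical and only mimic the shape of the HF methods tokenize, convert_tokens_to_ids, and encode; it is not the real Gemma 2 tokenizer.

```python
class ToyTokenizer:
    """Hypothetical stand-in that mimics the shape of the HF tokenizer API."""

    def __init__(self):
        # Made-up vocabulary; the IDs are illustrative only.
        self.vocab = {"<bos>": 2, "Text": 10, "to": 11, "tokenize": 12}
        self.bos_token_id = self.vocab["<bos>"]

    def tokenize(self, text):
        # Whitespace split; no special tokens are produced at this stage.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        # Plain vocabulary lookup; nothing is added here either.
        return [self.vocab[token] for token in tokens]

    def encode(self, text, add_special_tokens=True):
        # The one-step path is the only place where BOS gets prepended.
        ids = self.convert_tokens_to_ids(self.tokenize(text))
        if add_special_tokens:
            ids = [self.bos_token_id] + ids
        return ids


toy = ToyTokenizer()
text = "Text to tokenize"

two_step_ids = toy.convert_tokens_to_ids(toy.tokenize(text))
encoded_ids = toy.encode(text)

print("two-step path:", two_step_ids)
print("encode path:  ", encoded_ids)
```

In this toy, the two-step path can never produce the BOS ID because neither step knows about special tokens, which mirrors the behavior reported for text_to_ids.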

Steps/Code to reproduce bug

from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer as ATNemo
from transformers import AutoTokenizer as ATHF


tokenizer_path = "google/gemma-2-9b"
hf_tokenizer = ATHF.from_pretrained(tokenizer_path)
nemo_tokenizer = ATNemo(tokenizer_path)

text = "Text to tokenize"

# Common HF tokenization
ids = hf_tokenizer.encode(text)
print("HF tokenization:", ids)

# NeMo tokenization
ids = nemo_tokenizer.text_to_ids(text)
print("NeMo tokenization:", ids)

# HF tokenization using NeMo steps
tokens = hf_tokenizer.tokenize(text)
ids = hf_tokenizer.convert_tokens_to_ids(tokens)
print("HF tokenization using NeMo steps:", ids)

The output of the snippet above is:

HF tokenization: [2, 1637, 577, 223491]
NeMo tokenization: [1637, 577, 223491]
HF tokenization using NeMo steps: [1637, 577, 223491]

Note that the HF tokenization has token ID 2 at the beginning, which corresponds to the BOS token.

Expected behavior

I expect the NeMo HF wrapper to behave the same way as the HF tokenizer. The solution for this case would be to change the text_to_ids function to:

def text_to_ids(self, text):
    ids = self.tokenizer.encode(text)
    return ids

Is there any reason why it isn't implemented this way?
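If some callers rely on the current output without special tokens, one option might be an explicit flag rather than a hard switch. The following is only a sketch under assumptions: TokenizerWrapperSketch and _StubTokenizer are hypothetical names, the add_special_tokens parameter on text_to_ids is not part of the existing NeMo wrapper, and the only real API relied on is the add_special_tokens keyword of the HF encode method.

```python
class TokenizerWrapperSketch:
    """Hypothetical wrapper; assumes the HF tokenizer is stored as self.tokenizer."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def text_to_ids(self, text, add_special_tokens=True):
        # Delegate to encode so special tokens such as BOS are handled by
        # the underlying tokenizer; callers can opt out via the flag.
        return self.tokenizer.encode(text, add_special_tokens=add_special_tokens)


class _StubTokenizer:
    """Hypothetical stub so the sketch runs without downloading a model."""

    bos_token_id = 2

    def encode(self, text, add_special_tokens=True):
        # Fake IDs: the length of each whitespace-separated word.
        ids = [len(word) for word in text.split()]
        return [self.bos_token_id] + ids if add_special_tokens else ids


wrapper = TokenizerWrapperSketch(_StubTokenizer())
with_bos = wrapper.text_to_ids("Text to tokenize")
without_bos = wrapper.text_to_ids("Text to tokenize", add_special_tokens=False)

print("with special tokens:   ", with_bos)
print("without special tokens:", without_bos)
```

Defaulting the flag to True would match the HF encode behavior, while passing False would reproduce what text_to_ids returns today.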

Environment overview (please complete the following information)

I was using the official NeMo container (24.09). However, I see that the code for the HF wrapper is still the same on the main branch, so the issue should still apply.

domenVres added the bug label on Nov 15, 2024