Missing BOS tokens for HF tokenizer #11297

domenVres · 2024-11-15T14:37:18Z

Describe the bug

In the current implementation, NeMo's wrapper around the HF AutoTokenizer does not behave the same way as the HF tokenizer it wraps. I noticed the problem with the BOS token (which is extremely important for the Gemma 2 model). Specifically, the problem occurs in the text_to_ids function,

NeMo/nemo/collections/common/tokenizers/huggingface/auto_tokenizer.py

Lines 222 to 225 in ed244d9

    def text_to_ids(self, text):
        tokens = self.text_to_tokens(text)
        ids = self.tokens_to_ids(tokens)
        return ids

which is relevant for preprocess_data_for_megatron.py. The problem is that this function first converts the text to tokens and then the tokens to IDs, so it does not prepend the BOS token the way the HF encode method does (I assume the same applies to other special tokens).
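To illustrate the mechanism without downloading a model, here is a minimal toy sketch. `ToyTokenizer` and its vocabulary are hypothetical stand-ins, not the HF or NeMo implementation; only the method names (`tokenize`, `convert_tokens_to_ids`, `encode` with `add_special_tokens`) mirror the HF tokenizer API. It shows why a two-step path that goes through a plain vocabulary lookup cannot add a BOS token, assuming special tokens are attached only inside `encode`.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with a BOS token (not HF or NeMo code)."""

    def __init__(self):
        # Made-up vocabulary for illustration only.
        self.vocab = {"<bos>": 2, "Text": 10, "to": 11, "tokenize": 12}

    def tokenize(self, text):
        # Splits text into tokens; no special tokens are added here.
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        # Pure vocabulary lookup; it has no way to know a BOS token is expected.
        return [self.vocab[token] for token in tokens]

    def encode(self, text, add_special_tokens=True):
        ids = self.convert_tokens_to_ids(self.tokenize(text))
        # Special tokens are attached only in this step.
        return [self.vocab["<bos>"]] + ids if add_special_tokens else ids


toy = ToyTokenizer()
two_step_ids = toy.convert_tokens_to_ids(toy.tokenize("Text to tokenize"))
encode_ids = toy.encode("Text to tokenize")
print("Two-step path:", two_step_ids)  # [10, 11, 12]
print("encode path:  ", encode_ids)    # [2, 10, 11, 12]
```

In this toy, the two-step path matches `encode(..., add_special_tokens=False)`, which is the same pattern I see with the real tokenizer.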

Steps/Code to reproduce bug

from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer as ATNemo
from transformers import AutoTokenizer as ATHF


tokenizer_path = "google/gemma-2-9b"
hf_tokenizer = ATHF.from_pretrained(tokenizer_path)
nemo_tokenizer = ATNemo(tokenizer_path)

text = "Text to tokenize"

# Common HF tokenization
ids = hf_tokenizer.encode(text)
print("HF tokenization:", ids)

# NeMo tokenization
ids = nemo_tokenizer.text_to_ids(text)
print("NeMo tokenization:", ids)

# HF tokenization using NeMo steps
tokens = hf_tokenizer.tokenize(text)
ids = hf_tokenizer.convert_tokens_to_ids(tokens)
print("HF tokenization using NeMo steps:", ids)

The output of the code snippet above is:

HF tokenization: [2, 1637, 577, 223491]
NeMo tokenization: [1637, 577, 223491]
HF tokenization using NeMo steps: [1637, 577, 223491]

Notice that the HF tokenization has token ID 2 at the beginning, which corresponds to the BOS token.
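As a quick sanity check that the leading ID is the only difference, the two ID lists from the output above can be compared directly. This is just a small sketch; `extra_leading_ids` is a hypothetical helper I wrote for this comparison, not part of NeMo or HF.

```python
def extra_leading_ids(reference_ids, candidate_ids):
    """Return the IDs that reference_ids has in front of candidate_ids.

    Returns None if the two lists differ anywhere other than a leading prefix.
    """
    n_extra = len(reference_ids) - len(candidate_ids)
    if n_extra < 0 or reference_ids[n_extra:] != candidate_ids:
        return None
    return reference_ids[:n_extra]


# ID lists copied from the output reported above.
hf_ids = [2, 1637, 577, 223491]
nemo_ids = [1637, 577, 223491]

missing = extra_leading_ids(hf_ids, nemo_ids)
print("IDs missing from the NeMo output:", missing)  # [2]
```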

Expected behavior

I expect the NeMo HF wrapper to behave the same way as the HF tokenizer. One solution for this case would be to change the text_to_ids function to:

    def text_to_ids(self, text):
        ids = self.tokenizer.encode(text)
        return ids

Is there any reason why it isn't implemented this way?

Environment overview (please complete the following information)

I was using the official NeMo container (24.09). However, the code for the HF wrapper is still the same on the main branch, so the issue should still apply there.


domenVres added the "bug" label (Something isn't working) on Nov 15, 2024
