In the current implementation of the HF AutoTokenizer wrapper, the tokenizer does not behave the same way as the HF version. I noticed the problem with the BOS token (which is extremely important for the Gemma 2 model). Specifically, the problem occurs in the text_to_ids function,
which is used by preprocess_data_for_megatron.py. The problem is that this function first transforms text to tokens and then tokens to IDs, and consequently does not prepend the BOS token the way the HF encode method does (I assume the problem applies to other special tokens as well).
Steps/Code to reproduce bug
from nemo.collections.common.tokenizers.huggingface.auto_tokenizer import AutoTokenizer as ATNemo
from transformers import AutoTokenizer as ATHF
tokenizer_path = "google/gemma-2-9b"
hf_tokenizer = ATHF.from_pretrained(tokenizer_path)
nemo_tokenizer = ATNemo(tokenizer_path)
text = "Text to tokenize"
# Common HF tokenization
ids = hf_tokenizer.encode(text)
print("HF tokenization:", ids)
# NeMo tokenization
ids = nemo_tokenizer.text_to_ids(text)
print("NeMo tokenization:", ids)
# HF tokenization using NeMo steps
tokens = hf_tokenizer.tokenize(text)
ids = hf_tokenizer.convert_tokens_to_ids(tokens)
print("HF tokenization using NeMo steps:", ids)
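For a self-contained illustration of the discrepancy (no model download or HF access needed), a toy tokenizer behaves the same way; the vocabulary and the BOS id of 2 below are invented for illustration, with 2 chosen to match Gemma 2's BOS id:

```python
class ToyTokenizer:
    """Invented stand-in for an HF tokenizer; vocabulary and BOS id are made up."""
    bos_token_id = 2
    vocab = {"Text": 10, "to": 11, "tokenize": 12}

    def tokenize(self, text):
        return text.split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

    def encode(self, text):
        # HF-style encode: special tokens (here just BOS) are prepended
        return [self.bos_token_id] + self.convert_tokens_to_ids(self.tokenize(text))


tok = ToyTokenizer()
text = "Text to tokenize"
print(tok.encode(text))                               # [2, 10, 11, 12]
print(tok.convert_tokens_to_ids(tok.tokenize(text)))  # [10, 11, 12] -- no BOS
```

The second line mirrors the NeMo text_to_ids path: because it goes tokenize → convert_tokens_to_ids, the special-token step never runs.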
Is there any reason why it isn't implemented this way?
Environment overview (please complete the following information)
I was using the official NeMo container (24.09). However, I see that the code for the HF wrapper is still the same on the main branch, so the issue should still apply.
(The relevant code is in NeMo/nemo/collections/common/tokenizers/huggingface/auto_tokenizer.py, lines 222 to 225, commit ed244d9.)
The outputs for the provided code snippet show that the HF tokenization has token ID 2 at the beginning, which corresponds to the BOS token, while the NeMo tokenization does not.
Expected behavior
I expect the NeMo HF wrapper to behave the same way as the HF tokenizer. The solution for this case would be to change the text_to_ids function so that special tokens are handled as in the HF encode method.
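A minimal sketch of one possible shape for that change, assuming the wrapper stores the underlying HF tokenizer as self.tokenizer; the stub tokenizer and its vocabulary are invented for illustration, and PatchedAutoTokenizer is a hypothetical class, not the actual NeMo code:

```python
class StubHFTokenizer:
    """Invented stand-in for an HF tokenizer with BOS id 2."""
    bos_token_id = 2

    def encode(self, text):
        # HF-style encode: prepend BOS before the plain token ids
        vocab = {"Text": 10, "to": 11, "tokenize": 12}
        return [self.bos_token_id] + [vocab[t] for t in text.split()]


class PatchedAutoTokenizer:
    """Sketch of the proposed fix: delegate to the HF encode method."""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def text_to_ids(self, text):
        # Let HF handle special tokens instead of tokenize + convert_tokens_to_ids
        return self.tokenizer.encode(text)


patched = PatchedAutoTokenizer(StubHFTokenizer())
print(patched.text_to_ids("Text to tokenize"))  # [2, 10, 11, 12] -- BOS included
```

One caveat: calling encode unconditionally changes behavior for models that should not receive special tokens, so a real patch would likely need to expose HF's add_special_tokens flag rather than always enabling it.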