Whisper is not learning a new tokenizer, even when i make test and train dataset the same #27583
Comments
Hey 🤗 thanks a lot for opening an issue and using transformers! We try to keep the GitHub issues for bugs/feature requests. Otherwise you should follow the tutorial resources on how to train a Whisper model, see:
Thanks!
Hello @ArthurZucker, I shall post it on the Hugging Face forums as you request. I saw that second post about training on the custom tokenizer; however, the fix they used was to switch back to the regular pretrained tokenizer and just train for longer, so that doesn't seem like it would have much effect for me. The other issue I looked at was on the Hugging Face bugs page, so I decided to post it here as well. They also had a similar issue, but they needed help getting the model to train, and had no information on the results after the code was correct. Maybe I should leave a comment for the author of that issue, asking if he got it to work. Anyway, thanks for the info, I'll post it on the forums.
I am not sure why you need to train a new tokenizer, but I don't recommend it. You completely lose the mapping from input_ids to tokens, so the pretrained model is rendered useless. You should add tokens to the tokenizer rather than train a new one from scratch if you want to leverage the pretrained checkpoint.
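For reference, a minimal sketch of that approach (adding domain tokens to the existing tokenizer instead of retraining it); the checkpoint name and the jargon list are placeholders, not anything from this thread:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# hypothetical workplace jargon / acronyms
new_tokens = ["ACRN", "FOOBAR-3000"]

# only tokens the tokenizer does not already know are added
num_added = processor.tokenizer.add_tokens(new_tokens)

# grow the (tied) embedding matrix so the new ids get vectors;
# the new rows are randomly initialised and still need fine-tuning
if num_added > 0:
    model.resize_token_embeddings(len(processor.tokenizer))
```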
Do you know ahead of time what kind of jargon it is? You could first try Whisper prompting by putting your jargon as the prompt:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# input_speech: a raw audio array (e.g. one sample from a dataset)
input_features = processor(input_speech, return_tensors="pt").input_features

# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0]))
# "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Leighton")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0]))
# "<|startofprev|> Leighton<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Leighton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"
```

Your next best method would be fine-tuning using the original tokenizer on your dataset, using as much data as possible: https://huggingface.co/blog/fine-tune-whisper

If you're in a low-data regime, freezing the encoder is recommended; call the corresponding line before you start training (a sketch is below).
After that, see this issue for recommendations for custom vocabulary: https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311?u=nbroad. Note that this will require more data than standard fine-tuning, so you should be completely sure standard fine-tuning with the original tokenizer doesn't work before trying this. Also note that, as @ArthurZucker mentioned, it is not recommended to completely reset the tokenizer, but rather to append the new vocabulary to the tokenizer.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey! I would recommend you to use
The issue is that you might have to train the model as well, which is much more complicated.
Conclusion
This assumption is not necessarily true. The most important thing is that it stays at the same position if you want to re-use the tokenizer. When training the tokenizer, you don't need the special token, so you should either add it afterwards or give {"": 58200} as the initial vocab to your tokenizer. Another thing: you should not use the ByteLevel pretokenizer but the normalizer.
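A minimal sketch of the "add it afterwards" option with the `tokenizers` library; the corpus, vocab size, pre-tokenizer choice, and special-token string are all placeholders rather than the exact setup discussed above:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["example sentence with domain jargon", "another example sentence"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # placeholder choice
trainer = trainers.BpeTrainer(vocab_size=200)

# train without the special token ...
tokenizer.train_from_iterator(corpus, trainer=trainer)

# ... then add it afterwards, so it sits at a known, stable id
tokenizer.add_special_tokens(["<|endoftext|>"])
print(tokenizer.token_to_id("<|endoftext|>"))
```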
FYI @itazap
Honestly it's a bit complicated 😅
Would you mind sharing what unblocked you?! 🤗 I am super curious
(sorry my bad)
System Info
transformers version: 4.35.2

Who can help?
@sanchit-gandhi

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Hello, I want to take the audio at my workplace and transform it into a transcription; however, base Whisper doesn't seem to be very good at it. So I have been wanting to create my own tokenizer that can understand jargon and output that jargon better, stuff similar to acronyms. Below I have shown my steps.
I run this using the Hugging Face Trainer, with the generate option. Is it my data size? I have scoured online to try and find some sort of solution, but they all just say it works. I am at my wit's end and would appreciate any help getting this tokenizer to learn my jargon.
Thank you in advance :)
Creating the tokenizer
len(tokenizer) == 193
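For reference, a minimal sketch of what such a tokenizer-training step could look like; the corpus is a placeholder and the target vocab size is an assumption based on len(tokenizer) == 193:

```python
from transformers import WhisperTokenizerFast

base_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

# placeholder in-domain corpus
corpus = ["transcript with workplace jargon", "another transcript"]

# train a brand-new (much smaller) vocabulary from the corpus
new_tokenizer = base_tokenizer.train_new_from_iterator(corpus, vocab_size=193)
print(len(new_tokenizer))  # roughly 193, plus any special tokens carried over
```

Note that, as the replies above point out, a from-scratch vocabulary no longer matches the pretrained model's embeddings, which is why this route is discouraged.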
Preprocessing steps
len(train_dataset) == 4000
len(test_dataset) == 1000
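For context, a minimal sketch of a typical Whisper preprocessing function (in the style of the fine-tune-whisper blog); the column names ("audio", "sentence") and the checkpoint are assumptions:

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

def prepare_example(batch):
    audio = batch["audio"]
    # log-mel spectrogram features from the raw waveform
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # tokenised transcription as decoder labels
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

# train_dataset = train_dataset.map(prepare_example, remove_columns=train_dataset.column_names)
```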
Model Config
Huggingface Trainer
Here I have made the train and test datasets the same 30 examples to see if it would give me complete overprediction, but even with train and test set to the same data, it is not overfitting at all.
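For context, a minimal sketch of a Seq2SeqTrainer setup with generation-based evaluation; the hyperparameters are placeholders, and `model`, `processor`, the datasets, and `data_collator` are assumed to come from the earlier steps:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-jargon",   # placeholder path
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=2,
    predict_with_generate=True,      # evaluate by decoding with generate()
)

trainer = Seq2SeqTrainer(
    model=model,                     # from the model-config step above
    args=training_args,
    train_dataset=train_dataset,     # here the same 30 examples ...
    eval_dataset=test_dataset,       # ... used as the eval set
    tokenizer=processor.feature_extractor,
    data_collator=data_collator,     # a speech seq2seq padding collator
)
trainer.train()
```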
Outputs after second epoch
Expected behavior
More understandable text descriptions