Error finetuning Whisper using new tokenizer #25503
Comments
This error is most probably indicating that the …
Thank you so much for the quick reply. This is a show-stopper for me. When training with `no_cuda=True` I get the following error:

You're using a WhisperTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[13], line 2
1 ### print("Start training")
----> 2 trainer.train()
3 #trainer.evaluate()
4 print("Done training")
File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:1662, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1657 self.model_wrapped = self.model
1659 inner_training_loop = find_executable_batch_size(
1660 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1661 )
-> 1662 return inner_training_loop(
1663 args=args,
1664 resume_from_checkpoint=resume_from_checkpoint,
1665 trial=trial,
1666 ignore_keys_for_eval=ignore_keys_for_eval,
1667 )
File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:1929, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1927 tr_loss_step = self.training_step(model, inputs)
1928 else:
-> 1929 tr_loss_step = self.training_step(model, inputs)
1931 if (
1932 args.logging_nan_inf_filter
1933 and not is_torch_tpu_available()
1934 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1935 ):
1936 # if loss is nan or inf simply add the average of previous logged losses
1937 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:2699, in Trainer.training_step(self, model, inputs)
2696 return loss_mb.reduce_mean().detach().to(self.args.device)
2698 with self.compute_loss_context_manager():
-> 2699 loss = self.compute_loss(model, inputs)
2701 if self.args.n_gpu > 1:
2702 loss = loss.mean() # mean() to average on multi-gpu parallel training
File ~/anaconda3/lib/python3.10/site-packages/transformers/trainer.py:2731, in Trainer.compute_loss(self, model, inputs, return_outputs)
2729 else:
2730 labels = None
-> 2731 outputs = model(**inputs)
2732 # Save past state if it exists
2733 # TODO: this needs to be fixed and made cleaner later.
2734 if self.args.past_index >= 0:
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1414, in WhisperForConditionalGeneration.forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1409 if decoder_input_ids is None and decoder_inputs_embeds is None:
1410 decoder_input_ids = shift_tokens_right(
1411 labels, self.config.pad_token_id, self.config.decoder_start_token_id
1412 )
-> 1414 outputs = self.model(
1415 input_features,
1416 attention_mask=attention_mask,
1417 decoder_input_ids=decoder_input_ids,
1418 encoder_outputs=encoder_outputs,
1419 decoder_attention_mask=decoder_attention_mask,
1420 head_mask=head_mask,
1421 decoder_head_mask=decoder_head_mask,
1422 cross_attn_head_mask=cross_attn_head_mask,
1423 past_key_values=past_key_values,
1424 decoder_inputs_embeds=decoder_inputs_embeds,
1425 use_cache=use_cache,
1426 output_attentions=output_attentions,
1427 output_hidden_states=output_hidden_states,
1428 return_dict=return_dict,
1429 )
1430 lm_logits = self.proj_out(outputs[0])
1432 loss = None
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1279, in WhisperModel.forward(self, input_features, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, decoder_inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
1272 encoder_outputs = BaseModelOutput(
1273 last_hidden_state=encoder_outputs[0],
1274 hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
1275 attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
1276 )
1278 # decoder outputs consists of (dec_features, past_key_value, dec_hidden, dec_attn)
-> 1279 decoder_outputs = self.decoder(
1280 input_ids=decoder_input_ids,
1281 attention_mask=decoder_attention_mask,
1282 encoder_hidden_states=encoder_outputs[0],
1283 head_mask=decoder_head_mask,
1284 cross_attn_head_mask=cross_attn_head_mask,
1285 past_key_values=past_key_values,
1286 inputs_embeds=decoder_inputs_embeds,
1287 use_cache=use_cache,
1288 output_attentions=output_attentions,
1289 output_hidden_states=output_hidden_states,
1290 return_dict=return_dict,
1291 )
1293 if not return_dict:
1294 return decoder_outputs + encoder_outputs
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/anaconda3/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1030, in WhisperDecoder.forward(self, input_ids, attention_mask, encoder_hidden_states, head_mask, cross_attn_head_mask, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
1027 past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
1029 if inputs_embeds is None:
-> 1030 inputs_embeds = self.embed_tokens(input_ids)
1032 attention_mask = self._prepare_decoder_attention_mask(
1033 attention_mask, input_shape, inputs_embeds, past_key_values_length
1034 )
1036 # embed positions
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/lib/python3.10/site-packages/torch/nn/modules/sparse.py:162, in Embedding.forward(self, input)
161 def forward(self, input: Tensor) -> Tensor:
--> 162 return F.embedding(
163 input, self.weight, self.padding_idx, self.max_norm,
164 self.norm_type, self.scale_grad_by_freq, self.sparse)
File ~/.local/lib/python3.10/site-packages/torch/nn/functional.py:2210, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2204 # Note [embedding_renorm set_grad_enabled]
2205 # XXX: equivalent to
2206 # with torch.no_grad():
2207 # torch.embedding_renorm_
2208 # remove once script supports set_grad_enabled
2209 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

This confuses me because I'm training the new tokenizer like this:

```python
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(),
    old_tokenizer.vocab_size,
    special_tokens_map=old_tokenizer.special_tokens_map,
    new_special_tokens=old_tokenizer.all_special_tokens)
```

saying that its vocab_size should be the same as the old one. The commands

```python
print(old_tokenizer.vocab_size)  # 50257
print(len(old_tokenizer.vocab))  # 50364
```

tell me that the old tokenizer has the 107 special tokens appended at the end of its vocab, whereas the commands

```python
print(new_tokenizer.vocab_size)  # 50257
print(len(new_tokenizer.vocab))  # 50257
```

tell me that the new tokenizer has prepended(?) them.
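One direct way to see the consequence of that difference (a minimal sketch, assuming the `old_tokenizer` and `new_tokenizer` objects from the snippet above) is to compare the id each tokenizer assigns to the same special token:

```python
# Compare where each tokenizer maps the Whisper control tokens.
# old_tokenizer / new_tokenizer are assumed to be the objects built above.
for token in ("<|startoftranscript|>", "<|endoftext|>"):
    old_id = old_tokenizer.convert_tokens_to_ids(token)
    new_id = new_tokenizer.convert_tokens_to_ids(token)
    print(f"{token}: old id = {old_id}, new id = {new_id}")
```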
Okay, you might find help in huggingface/tokenizers#1277.
I've been trying to understand how issue 1277 can help, but unsuccessfully. The problem seems too different from what I'm trying to achieve.

```python
def test_tokenizer(tokenizer):
    idxs = [tokenizer.vocab[special_token] for special_token in tokenizer.all_special_tokens]
    is_wrong = all([idx < tokenizer.vocab_size for idx in idxs])
    print(f"Are special tokens after normal tokens? {not is_wrong}")
    print(f"bos_token: {tokenizer.vocab['<|startoftranscript|>']} eos_token: {tokenizer.vocab['<|endoftext|>']}")
    print("Special token ids: " + ", ".join([str(idx) for idx in idxs]))

def max_key_val(tokenizer):
    d = tokenizer.vocab
    key = max(d, key=d.get)
    return key, d[key]

def min_key_val(tokenizer):
    d = tokenizer.vocab
    key = min(d, key=d.get)
    return key, d[key]

print(f"Old tokenizer: \n{len(old_tokenizer)=} | {old_tokenizer.vocab_size=} | {min_key_val(old_tokenizer)=} | {max_key_val(old_tokenizer)=}")
test_tokenizer(old_tokenizer)

print(f"\nNew tokenizer: \n{len(new_tokenizer)=} | {new_tokenizer.vocab_size=} | {min_key_val(new_tokenizer)=} | {max_key_val(new_tokenizer)=}")
test_tokenizer(new_tokenizer)
```

The model expects the bos and eos tokens at indices 50258 and 50257 (see `model.config`), but after using `train_new_from_iterator` these indices are wrong.
I can make the error go away by setting vocab_size = len(old_tokenizer), but the ids will still not line up. Maybe I should use a SentencePiece tokenizer to create a vocab file, but there are some problems with that too.
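To make the mismatch with the model concrete, one could print the ids the checkpoint's config expects next to the ids the retrained tokenizer produces (a sketch; `model` is assumed to be the WhisperForConditionalGeneration being fine-tuned and `new_tokenizer` the retrained tokenizer):

```python
# Ids hard-wired in the model config vs. ids coming out of the new tokenizer.
print("config.decoder_start_token_id:", model.config.decoder_start_token_id)
print("config.eos_token_id:", model.config.eos_token_id)
print("config.pad_token_id:", model.config.pad_token_id)
print("new <|startoftranscript|>:", new_tokenizer.convert_tokens_to_ids("<|startoftranscript|>"))
print("new <|endoftext|>:", new_tokenizer.convert_tokens_to_ids("<|endoftext|>"))
```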
I'll try to have a look 😉
Okay, let's just take this step by step as the reproducer is huge and involved.
```python
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(),
    old_tokenizer.vocab_size,
    special_tokens_map=old_tokenizer.special_tokens_map,
    new_special_tokens=old_tokenizer.all_special_tokens)
```

For me, this is problematic, because the content of … Also, this was not in the training example provided, so I'm not really sure why you are adding it?
Could you share a pushed version of the tokenizers?
With

```python
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(),
    old_tokenizer.vocab_size,
    special_tokens_map=old_tokenizer.special_tokens_map,
    new_special_tokens=old_tokenizer.all_special_tokens)
```

and

```python
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(),
    old_tokenizer.vocab_size,
    special_tokens_map=old_tokenizer.special_tokens_map)
```

and

```python
new_tokenizer = old_tokenizer.train_new_from_iterator(
    get_training_corpus(),
    old_tokenizer.vocab_size)
```

I get the same error. In all cases, the special tokens are placed at the beginning of new_tokenizer.vocab and not at the end as in old_tokenizer.vocab.
Do you need me to share the folder containing vocab.json, tokenizer.json, merges.txt, etc.?
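One way to see that placement directly (a sketch, reusing the `new_tokenizer` object from the calls above) is to look at the id range the special tokens ended up in:

```python
# If the special tokens sit at the start of the vocab, their ids will be small;
# in the original Whisper tokenizer they sit above vocab_size (>= 50257).
ids = new_tokenizer.convert_tokens_to_ids(new_tokenizer.all_special_tokens)
print("special token id range:", min(ids), "to", max(ids))
print("vocab_size:", new_tokenizer.vocab_size, "| len(tokenizer):", len(new_tokenizer))
```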
Yes, push the tokenizer to the hub and I'll be able to have a look at the internal state 😉
This is my first time using this feature. It should be available at peterBagnegaard/new_tokenizer. I made it using the following lines:

```python
whisper = WhisperTokenizerFast.from_pretrained("openai/whisper-medium", language="danish")
whisper_new = whisper.train_new_from_iterator(
    get_training_corpus(),
    whisper.vocab_size)
whisper_new.push_to_hub("new_tokenizer")
```
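For anyone following along, the pushed tokenizer can then be pulled back down to inspect its state (a minimal sketch; the repo id is the one mentioned above):

```python
from transformers import WhisperTokenizerFast

# Load the tokenizer straight from the Hub repo shared above and inspect it.
tok = WhisperTokenizerFast.from_pretrained("peterBagnegaard/new_tokenizer")
print("len:", len(tok), "| vocab_size:", tok.vocab_size)
print("<|startoftranscript|> id:", tok.convert_tokens_to_ids("<|startoftranscript|>"))
```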
Thanks! We actually have a few tests on our CI that should ensure that we can train a tokenizer from an old tokenizer, so if this is indeed a bug we'll have to fix it!
This might confuse more than it helps, but I've tried training my own tokenizer using the BpeTrainer, inspired by huggingface/tokenizers#1277.

```python
from tokenizers import trainers, AddedToken
from transformers import WhisperTokenizerFast

# Based either on jstoone or openai
old_tokenizer = WhisperTokenizerFast.from_pretrained("jstoone/whisper-medium-da", language="danish")
# old_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-medium", language="danish")
tokenizer = old_tokenizer.backend_tokenizer

# Either adding special tokens to the trainer or not.
# get_training_corpus() is the corpus iterator defined elsewhere in the notebook.
trainer = trainers.BpeTrainer(vocab_size=old_tokenizer.vocab_size)
# trainer = trainers.BpeTrainer(vocab_size=old_tokenizer.vocab_size, special_tokens=old_tokenizer.all_special_tokens)
tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
tokenizer.save("tokenizer.json")

fast_tokenizer = WhisperTokenizerFast(
    tokenizer_file="tokenizer.json",
    model_max_length=old_tokenizer.model_max_length,
    language='danish',
    task='transcribe',
    predict_timestamps=True)

special_tokens = {"bos_token": AddedToken(old_tokenizer.bos_token or "", normalized=True),
                  "eos_token": AddedToken(old_tokenizer.eos_token or "", normalized=True),
                  "unk_token": AddedToken(old_tokenizer.unk_token or "[UNK]", normalized=True),
                  "sep_token": old_tokenizer.sep_token or "",
                  "pad_token": old_tokenizer.pad_token or "",
                  "cls_token": old_tokenizer.cls_token or "",
                  "mask_token": old_tokenizer.mask_token or "",
                  "additional_special_tokens": old_tokenizer.additional_special_tokens}
fast_tokenizer.add_special_tokens(special_tokens)
fast_tokenizer.set_prefix_tokens(task='transcribe', language='danish')
```

I've been experimenting with both OpenAI's tokenizer and the tokenizer made by jstoone (the one I'm fine-tuning further).
So while I can technically continue, this seems like a problem (I am so confused!)
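A quick sanity check on the rebuilt tokenizer could compare the ids it assigns to the key special tokens against the positions the Whisper checkpoint expects (50257 and 50258, as noted earlier). This is only an illustrative sketch using the `fast_tokenizer` built above:

```python
# The pretrained Whisper decoder expects <|endoftext|> at 50257 and
# <|startoftranscript|> at 50258; check where the rebuilt tokenizer put them.
expected = {"<|endoftext|>": 50257, "<|startoftranscript|>": 50258}
for token, expected_id in expected.items():
    actual_id = fast_tokenizer.convert_tokens_to_ids(token)
    status = "OK" if actual_id == expected_id else "MISMATCH"
    print(f"{token}: expected {expected_id}, got {actual_id} -> {status}")
```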
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Glad to know that this worked. A few major changes were recently pushed to the …
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@PeterBagnegaard Did you ever get this to work? I am doing the same thing as you, but my model is predicting gibberish at the end. Were you able to get Whisper to correctly learn a new tokenizer, and if so, how did you do it?
If you train a new tokenizer, the model will have to be trained from scratch, as you are learning a new mapping from tokens to ids that is literally miles away from the one it was trained on.
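As a rough sketch of what that implies in code (assuming the `model` and `new_tokenizer` objects from the earlier snippets): if the vocabulary really is replaced, the embedding matrix at least has to cover every id the new tokenizer can emit. This only removes the IndexError; it does not avoid the need to retrain the model on the new mapping.

```python
# Resize the (tied) embeddings so every id from the new tokenizer is in range,
# and keep the config's special-token ids consistent with the new tokenizer.
# The model still needs (re)training to produce sensible text afterwards.
model.resize_token_embeddings(len(new_tokenizer))
model.config.pad_token_id = new_tokenizer.pad_token_id
model.config.decoder_start_token_id = new_tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
```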
System Info
transformers version: 4.28.0.dev0
Who can help?
@ArthurZucker
Information
I am using whisper-medium-da and I've based my code on the tutorials "Training a new tokenizer from an old one" (https://huggingface.co/learn/nlp-course/chapter6/2) and "Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers" (https://huggingface.co/blog/fine-tune-whisper).
I'm trying to fine-tune Whisper using a tokenizer other than the one provided by Whisper (but based on it).
This gives the IndexError shown in the traceback above.
The tokenizer from whisper-medium-da has its special tokens added at the very end of the vocab dict (with indices around 50000), whereas new_tokenizer has them at the very beginning (with indices around 0).
I expect the error arises because tokens like <|endoftext|> and <|startoftranscript|> don't have the same indices in the two tokenizers.
It seems that whenever I try to train my own tokenizer, even when using train_new_from_iterator, the special tokens move to the beginning of the vocabulary dict.
I'm under the impression that I don't have to retrain Whisper from scratch when retraining the tokenizer, and that I can simply set the new_tokenizer as explained above and fine-tune whisper-medium-da on my own data.
Who can help?
No response
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
The trainer.train() call should run smoothly without errors, just as it does when using the tokenizer provided by Whisper.