Whisper is not learning a new tokenizer, even when i make test and train dataset the same #27583
Comments
Hey 🤗 thanks a lot for opening an issue and using transformers! We try to keep the GitHub issues for bugs/feature requests. Otherwise you should follow the tutorial resources on how to train a Whisper model, see:
Thanks!
Hello @ArthurZucker, I shall post it on the Hugging Face forums as you request. I saw that second post about training on the custom tokenizer; however, the fix they used was to switch back to the regular pretrained tokenizer and just train for longer, so that doesn't seem like it would have much effect for me. The other issue I looked at was on the Hugging Face bugs page, so I decided to post it here as well. They also had a similar issue, but they needed help getting the model to train, and had no information on the results after the code was correct. Maybe I should leave a comment for the author of that issue, asking if he got it to work. Anyway, thanks for the info, I'll post it on the forums.
I am not sure why you need to train a new tokenizer, but I don't recommend it. You completely lose the mapping from input_ids to tokens, so the pretrained model is rendered useless. You should add tokens to the tokenizer rather than train a new one from scratch if you want to leverage the pretrained checkpoint.
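For reference, a minimal sketch of that approach (adding domain tokens to the existing tokenizer instead of retraining it); the checkpoint name and the jargon list are placeholders, not anything from this thread:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# hypothetical workplace jargon / acronyms
new_tokens = ["ACRN", "FOOBAR-3000"]

# only tokens the tokenizer does not already know are added
num_added = processor.tokenizer.add_tokens(new_tokens)

# grow the (tied) embedding matrix so the new ids get vectors;
# the new rows are randomly initialised and still need fine-tuning
if num_added > 0:
    model.resize_token_embeddings(len(processor.tokenizer))
```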
Do you know ahead of time what kind of jargon it is? You could first try Whisper prompting by putting your jargon as the prompt:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# input_speech: a raw audio array (e.g. one sample from a dataset)
input_features = processor(input_speech, return_tensors="pt").input_features

# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0]))
# "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Leighton")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0]))
# "<|startofprev|> Leighton<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Leighton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"
```

Your next best method would be fine-tuning using the original tokenizer on your dataset, using as much data as possible: https://huggingface.co/blog/fine-tune-whisper

If you're in a low-data regime, freezing the encoder is recommended; call the corresponding line before you start training (a sketch is below).
After that, see this issue for recommendations for custom vocabulary: https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311?u=nbroad. Note that this will require more data than standard fine-tuning, so you should be completely sure standard fine-tuning with the original tokenizer doesn't work before trying this. Also note that, as @ArthurZucker mentioned, it is not recommended to completely reset the tokenizer, but rather to append the new vocabulary to the tokenizer.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey! I would recommend you to use
The issue is that you might have to train the model as well, which is much more complicated.
Conclusion
This assumption is not necessarily true. The most important thing is that it stays at the same position if you want to re-use the tokenizer. When training the tokenizer, you don't need the special token, so you should either add it afterwards or give {"": 58200} as the initial vocab to your tokenizer. Another thing: you should not use the ByteLevel pretokenizer but the normalizer.
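A minimal sketch of the "add it afterwards" option with the `tokenizers` library; the corpus, vocab size, pre-tokenizer choice, and special-token string are all placeholders rather than the exact setup discussed above:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = ["example sentence with domain jargon", "another example sentence"]

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # placeholder choice
trainer = trainers.BpeTrainer(vocab_size=200)

# train without the special token ...
tokenizer.train_from_iterator(corpus, trainer=trainer)

# ... then add it afterwards, so it sits at a known, stable id
tokenizer.add_special_tokens(["<|endoftext|>"])
print(tokenizer.token_to_id("<|endoftext|>"))
```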
FYI @itazap
Honestly it's a bit complicated 😅
Would you mind sharing what unblocked you?! 🤗 I am super curious
(sorry my bad)
System Info
transformers version: 4.35.2

Who can help?
@sanchit-gandhi

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Hello, I want to take the audio at my workplace and transform it into a transcription; however, base Whisper doesn't seem to be very good at it. So I have been wanting to create my own tokenizer that can understand jargon and output that jargon better, stuff similar to acronyms. Below I have shown my steps.
I run this using the Hugging Face Trainer, with the generate option. Is it my data size? I have scoured online to try and find some sort of solution, but they all just say it works. I am at my wit's end and would appreciate any help getting this tokenizer to learn my jargon.
Thank you in advance :)
Creating the tokenizer
len(tokenizer) == 193
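For reference, a minimal sketch of what such a tokenizer-training step could look like; the corpus is a placeholder and the target vocab size is an assumption based on len(tokenizer) == 193:

```python
from transformers import WhisperTokenizerFast

base_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

# placeholder in-domain corpus
corpus = ["transcript with workplace jargon", "another transcript"]

# train a brand-new (much smaller) vocabulary from the corpus
new_tokenizer = base_tokenizer.train_new_from_iterator(corpus, vocab_size=193)
print(len(new_tokenizer))  # roughly 193, plus any special tokens carried over
```

Note that, as the replies above point out, a from-scratch vocabulary no longer matches the pretrained model's embeddings, which is why this route is discouraged.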
Preprocessing steps
len(train_dataset) == 4000
len(test_dataset) == 1000
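For context, a minimal sketch of a typical Whisper preprocessing function (in the style of the fine-tune-whisper blog); the column names ("audio", "sentence") and the checkpoint are assumptions:

```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

def prepare_example(batch):
    audio = batch["audio"]
    # log-mel spectrogram features from the raw waveform
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # tokenised transcription as decoder labels
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

# train_dataset = train_dataset.map(prepare_example, remove_columns=train_dataset.column_names)
```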
Model Config
Huggingface Trainer
Here I have made the train and test datasets the same 30 examples to see if it would give me complete overprediction, but even with train and test set to the same data, it is not overfitting at all.
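For context, a minimal sketch of a Seq2SeqTrainer setup with generation-based evaluation; the hyperparameters are placeholders, and `model`, `processor`, the datasets, and `data_collator` are assumed to come from the earlier steps:

```python
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-jargon",   # placeholder path
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=2,
    predict_with_generate=True,      # evaluate by decoding with generate()
)

trainer = Seq2SeqTrainer(
    model=model,                     # from the model-config step above
    args=training_args,
    train_dataset=train_dataset,     # here the same 30 examples ...
    eval_dataset=test_dataset,       # ... used as the eval set
    tokenizer=processor.feature_extractor,
    data_collator=data_collator,     # a speech seq2seq padding collator
)
trainer.train()
```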
Outputs after second epoch
Expected behavior
More understandable text descriptions