Whisper is not learning a new tokenizer, even when i make test and train dataset the same #27583

Closed

P-Sood opened this issue Nov 19, 2023 · 13 comments

P-Sood commented Nov 19, 2023

System Info

  • transformers version: 4.35.2
  • Platform: Linux-5.15.120+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.19.3
  • Safetensors version: 0.4.0
  • Accelerate version: 0.24.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu118 (False)
  • Tensorflow version (GPU?): 2.14.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
  • Jax version: 0.4.20
  • JaxLib version: 0.4.20
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Who can help?

@sanchit-gandhi

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Hello, I want to take the audio at my workplace and transcribe it; however, base Whisper doesn't seem to handle it well. So I have been wanting to create my own tokenizer that can understand our jargon (things like acronyms) and output that jargon correctly. Below I have shown my steps:

  1. Creating Tokenizer
  2. Preprocessing data pipeline
  3. Model init, and configuration
  4. Model outputs

I run this using the Hugging Face Trainer with the generate option (predict_with_generate). Is it my data size? I have scoured online to try and find some sort of solution, but everything I find just says it works. I am at my wits' end and would appreciate any help getting this tokenizer to learn my jargon.

Thank you in advance :)

Creating the tokenizer

import json

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
from transformers import WhisperTokenizer

# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE())

# Pre-tokenizer that splits the text on whitespace and punctuation before BPE is applied
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # alternative: ByteLevel(add_prefix_space=False)

# Decoder responsible for converting the tokens back to a string
tokenizer.decoder = decoders.ByteLevel()

# Trainer responsible for training the BPE model
# (spec_tok is my list of special token strings, defined elsewhere)
trainer = trainers.BpeTrainer(vocab_size=1000, min_frequency=2, special_tokens=spec_tok)

# Training the tokenizer (the trainer must be passed in, otherwise it is not used)
tokenizer.train(["file.txt"], trainer)

# Save the tokenizer
tokenizer.save("NewWhisperTokenizer.json")

# Load the saved tokenizer JSON and split it into the vocab/merges files WhisperTokenizer expects
with open("NewWhisperTokenizer.json") as f:
    data = json.load(f)

with open("vocab.json", "w") as outfile:
    json.dump(data["model"]["vocab"], outfile)

# merges.txt must contain one merge pair per line, not a JSON dump
with open("merges.txt", "w") as outfile:
    for merge in data["model"]["merges"]:
        outfile.write((" ".join(merge) if isinstance(merge, (list, tuple)) else merge) + "\n")

tokenizer = WhisperTokenizer(
    "vocab.json",
    "merges.txt",
    errors="replace",
    unk_token="<|endoftext|>",
    bos_token="<|endoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|endoftext|>",
)
tokenizer.add_special_tokens(WhisperTokenizer.from_pretrained("openai/whisper-tiny").special_tokens_map_extended)
tokenizer.save_pretrained("new_tok")

len(tokenizer) == 193

Preprocessing steps

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    temp_labels = tokenizer(batch["phonetic_detail"]["utterance"]).input_ids
    batch["label"] = [label for sentence_labels in temp_labels for label in sentence_labels]
    return batch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    tokenizer: Any
    feature_extractor: Any
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["label"]} for feature in features]
        labels_batch = self.tokenizer.pad(label_features, return_tensors="pt")


        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        if (labels[:, 0] == self.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(tokenizer , feature_extractor)

len(train_dataset) == 4000
len(test_dataset) == 1000
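
A minimal sketch of how the preprocessing above would typically be applied, assuming the splits are datasets.Dataset objects with the columns used in prepare_dataset:

# Sketch (assumption): apply the preprocessing with datasets.Dataset.map before training,
# dropping the raw columns so only input_features and label remain.
train_dataset = train_dataset.map(prepare_dataset, remove_columns=train_dataset.column_names)
test_dataset = test_dataset.map(prepare_dataset, remove_columns=test_dataset.column_names)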

Model Config

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

voc = tokenizer.get_vocab()

model_Gen = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model_Gen = model_Gen.to(device)

model_Gen.resize_token_embeddings(len(tokenizer))

model_Gen.config.pad_token_id = tokenizer.pad_token_id
model_Gen.config.decoder_start_token_id = voc['<|startoftranscript|>']
model_Gen.config.eos_token_id = tokenizer.eos_token_id
model_Gen.config.bos_token_id = tokenizer.bos_token_id
model_Gen.config.suppress_tokens = []
model_Gen.config.forced_decoder_ids = None
model_Gen.config.begin_suppress_tokens = [
    tokenizer.pad_token_id
  ]

model_Gen.generation_config.pad_token_id = tokenizer.pad_token_id
model_Gen.generation_config.decoder_start_token_id = voc['<|startoftranscript|>']
model_Gen.generation_config.eos_token_id = tokenizer.eos_token_id
model_Gen.generation_config.bos_token_id = tokenizer.bos_token_id
model_Gen.generation_config.suppress_tokens = []
model_Gen.generation_config.forced_decoder_ids = None
model_Gen.generation_config.begin_suppress_tokens = [
    tokenizer.pad_token_id
  ]

model_Gen.generation_config.no_timestamps_token_id = voc['<|notimestamps|>']

Huggingface Trainer

Here I have made the train and eval datasets the same 30 examples to see if it would completely overfit and memorize them, but even with train and test set to the same data, it is not overfitting at all.

training_args = Seq2SeqTrainingArguments(
  output_dir='training_output',
  logging_dir='./logs',
  group_by_length=True,
  per_device_train_batch_size=1,
  gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
  per_device_eval_batch_size=1,
  num_train_epochs=8,
  gradient_checkpointing=True,
  lr_scheduler_type = "cosine_with_restarts",
  save_strategy='epoch',
  evaluation_strategy='epoch',
  logging_strategy='epoch',
  learning_rate=1e-2,
  weight_decay=0.005,
  # warmup_steps=36,
  save_total_limit=4,
  push_to_hub=False,
  predict_with_generate=True,
  generation_max_length=225,
  load_best_model_at_end=True,
  greater_is_better=False,
  generation_num_beams = 4,
  # fp16 = True,

  report_to="wandb", # Turn this off for pdb debug

)

trainer = CustomTrainer(
    compute_metrics=compute_metrics,
    args=training_args,
    model=model_Gen,
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    train_dataset=new_test['test'] ,
    eval_dataset=new_test['test'],
)

trainer.evaluate()

Outputs after second epoch

tokenizer.batch_decode(pred.predictions , skip_special_tokens = True)
['', '', 'uwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuw', 'k', '', 'k', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk', 
'awawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawaw', 'awawawaw', '', '', '', 'jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj', '', 'jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj', 'uweuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuw', '', 
'axaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxax', '', 
'kuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhk', 
'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 
'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 'awawawaw', 
'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 
'awawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawaw',
 '', 
'jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj']

Expected behavior

More understandable text descriptions

@ArthurZucker (Collaborator)

Hey 🤗 thanks a lot for opening an issue and using transformers!

We try to keep the github issues for bugs/feature requests.
Could you ask your question on the forum instead? I'm sure the community will be of help!

Otherwise you should follow the tutorial resources on how to train a Whisper model, see:

Thanks!

@P-Sood (Author) commented Nov 20, 2023

Hello @ArthurZucker, I shall post it on the Hugging Face forums as you requested.

I saw that second post about training with a custom tokenizer. However, the fix they used was to switch back to the regular pretrained tokenizer and just train for longer, so that doesn't seem like it would help much in my case.

The other issue I looked at here was on the Hugging Face bugs page, so I decided to post it here as well.

They also had a similar issue, but they needed help just getting the model to train, and there was no information on the results after the code was corrected. Maybe I should leave a comment for the author of that issue to see if they got it to work.

Anyway, thanks for the info, I'll post it on the forums.

@ArthurZucker (Collaborator)

I am not sure why you need to train a new tokenizer, but I don't recommend it. You completely lose the mapping from input_ids to tokens, so the pretrained model is rendered useless. You should add tokens to the tokenizer rather than train a new one from scratch if you want to leverage the pretrained checkpoint.
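
A minimal sketch of this approach (appending tokens to the existing tokenizer rather than training one from scratch); the new_words list is a hypothetical placeholder for the jargon terms:

from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Hypothetical list of jargon/acronym strings to add; replace with your own terms.
new_words = ["ACRONYM1", "ACRONYM2"]

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Append the new tokens to the existing vocabulary, then grow the embedding matrix so the
# new ids get (randomly initialised) rows; the pretrained id-to-token mapping is preserved,
# and only the new rows need to be learned during fine-tuning.
num_added = processor.tokenizer.add_tokens(new_words)
if num_added > 0:
    model.resize_token_embeddings(len(processor.tokenizer))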

@sanchit-gandhi (Contributor)

Do you know ahead of time what kind of jargon it is? You could first try Whisper prompting by putting your 'jargon' as the prompt:

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
input_features = processor(input_speech, return_tensors="pt").input_features

# --- Without prompt ---
prompt_ids = processor.get_prompt_ids("Leighton")
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0]))
# "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Leighton")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0]))
# "<|startofprev|> Leighton<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Leighton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"

Your next best method would be fine-tuning using the original tokenizer on your dataset, using as much data as possible: https://huggingface.co/blog/fine-tune-whisper

If you're in a low-data regime, freezing the encoder is recommended. Call this line before you do trainer.train():

model.freeze_encoder()

After that, see this issue for recommendations for custom vocabulary: https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311?u=nbroad. Note that this will require more data than standard fine-tuning, so you should be completely sure standard fine-tuning with the original tokenizer doesn't work before trying this. Also note that as @ArthurZucker mentioned, it is not recommended to completely reset the tokenizer, but rather append the new vocabulary to the tokenizer.

github-actions bot commented Jan 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed Jan 9, 2024

@ArthurZucker (Collaborator)

Hey! I would recommend using tokenizer.train_new_from_iterator, for example! See https://huggingface.co/learn/nlp-course/en/chapter6/2 for more details.
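
For reference, a minimal sketch of that approach; the corpus list is a hypothetical placeholder for the domain transcripts, and the checkpoint is the openai/whisper-tiny one used earlier:

from transformers import WhisperTokenizerFast

# Hypothetical in-memory corpus of domain transcripts; replace with your own text.
corpus = ["first transcript with jargon ...", "second transcript ..."]

old_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

# Learns new vocab/merges from the corpus while re-using the original tokenizer's algorithm
# and special tokens. As noted in the next comment, the model itself would still have to be
# (re)trained to understand the new ids.
new_tokenizer = old_tokenizer.train_new_from_iterator(corpus, vocab_size=1000)
new_tokenizer.save_pretrained("new_whisper_tokenizer")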

@ArthurZucker (Collaborator)

The issue is that you might have to train the model as well, which is much more complicated.

@ArthurZucker (Collaborator)

  1. Is it possible to train a tokenizer using tokenizer.train_new_from_iterator() and avoid model training?
    Technically, yes, you can train a tokenizer using train_new_from_iterator() without re-training the model, but this usually isn't advisable. The reason is that the tokenizer and the model are tightly coupled. The model is trained on a specific vocabulary, which corresponds to the token IDs generated by the tokenizer. When you train a new tokenizer, the vocabulary and tokenization strategy change, which means that the tokens the model expects and those generated by the new tokenizer might not align. This misalignment leads to incorrect inputs to the model, which in turn can result in poor or nonsensical predictions.

  2. Why did I get an empty prediction with the new tokenizer?
    You got an empty prediction because the new tokenizer's output doesn't match what the model was trained to process. When you changed the tokenizer, the token IDs and the sequence of tokens fed into the model were different from what the model expects. The model likely received token IDs or sequences it was never trained on, causing it to fail in generating any meaningful output, hence the empty prediction. Additionally, if the special tokens used by the model (like <|endoftext|>, <|startoftranscript|>, etc.) have different IDs in the new tokenizer, the model might misinterpret these tokens, leading to the generation of no output.

  3. Is it okay that <|endoftext|> has different IDs in the old and new tokenizers?
    No, it's not okay if the model was trained with the assumption that <|endoftext|> has a specific token ID (like 50257) and now it has a different ID (like 0) in the new tokenizer. The model relies on specific token IDs to understand the input correctly. If the IDs are changed, the model's internal mechanisms (which depend on these IDs) will no longer function as intended. This misalignment can cause the model to either generate incorrect predictions or fail entirely, as seen in your case.

  4. Is it okay to have extra special tokens in the new tokenizer's vocabulary?
    Having extra special tokens in the new tokenizer's vocabulary is fine if the model is designed to recognize and utilize these tokens. However, if the model wasn't trained with these special tokens, they will likely be ignored or cause issues. For instance, if the model encounters these tokens but doesn't know how to interpret them, it may fail to generate appropriate predictions. On the other hand, if these special tokens are necessary for the model's functionality (e.g., indicating language or specific tasks), then having them is crucial. The problem arises when there's a mismatch between the special tokens the model expects and those provided by the tokenizer.

Conclusion
In summary, the root of the issues you're encountering is the misalignment between the tokenizer and the model. When you train a new tokenizer, the token IDs and tokenization strategies change, which can cause the model to malfunction if it was not retrained with this new tokenizer. For the best results, you should either retrain the model with the new tokenizer or, if retraining isn't feasible, stick to using the tokenizer that the model was originally trained with.
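
A minimal sketch for making this id misalignment visible, assuming the custom tokenizer was saved to new_tok as in the original post:

from transformers import WhisperTokenizer

original = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
custom = WhisperTokenizer.from_pretrained("new_tok")  # path used when saving the custom tokenizer above

# The pretrained model only knows the ids produced by the original tokenizer; if the custom
# tokenizer maps the same special tokens to different ids, the decoder is driven with ids it
# was never trained on.
for tok in ["<|endoftext|>", "<|startoftranscript|>", "<|notimestamps|>"]:
    print(tok, original.convert_tokens_to_ids(tok), custom.convert_tokens_to_ids(tok))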

@ArthurZucker (Collaborator)

As I understand, the "<|endoftext|>" special token id must be the last one (or one of the last ones if other special tokens are used as well) in the vocab.

This assumption is not necessarily true. The most important thing is that it stays at the same position if you want to re-use the tokenizer.

Now, when training the tokenizer, you don't need the special token. So you should add it afterwards, or give {"<|endoftext|>": 58200} as the initial vocab to your tokenizer.

Another thing is, you should not use the ByteLevel pre-tokenizer but the normalizer.
If you try to decode 0, you will see that it will not be "<|endoftext|>" 😉
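
A small sketch of that check, assuming the custom tokenizer saved to new_tok earlier and the 50257 id for "<|endoftext|>" mentioned above:

from transformers import WhisperTokenizer

original = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
custom = WhisperTokenizer.from_pretrained("new_tok")

# Decode id 0 with each tokenizer and compare with the id the pretrained checkpoint actually
# uses for "<|endoftext|>" (50257, per the discussion above).
print(custom.decode([0]))
print(original.decode([0]))
print(original.convert_tokens_to_ids("<|endoftext|>"))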

@ArthurZucker (Collaborator) commented Sep 6, 2024

FYI @itazap

@ArthurZucker (Collaborator)

Honestly it's a bit complicated 😅
TLDR:

  • if you train a new tokenizer entirely, you are doomed
  • if you train from an old one, then okay, some tokens will match, but some will not -> some words will be completely unseen by the model
  • if you resize the embedding without the proper function, you are doomed as well

@ArthurZucker (Collaborator)

Would you mind sharing what unblocked you?! 🤗 I am super curious

@ArthurZucker (Collaborator)

(sorry my bad)
