Replies: 2 comments
-
I'm looking to produce a verbatim transcription, word for word, including filler words. Any pointers on a recipe for either training or fine-tuning a Conformer model would be appreciated. Is this not currently supported in Conformer models, and would it be better to try other models such as Whisper?
-
Following my post, I had some success adapting the model. If you want to try fine-tuning the model, there are also multiple resources available in the repo. This would allow you to change the tokenizer, enabling the model to learn additional filler words.
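Roughly, the tokenizer-swap + fine-tuning route looks like this in NeMo. This is only a sketch, not an official recipe: the paths, batch sizes, and epoch count are placeholders, and it assumes you have already built a new SentencePiece/BPE tokenizer offline on verbatim text (e.g. with NeMo's process_asr_text_tokenizer.py script) so that filler words are covered by the vocabulary.

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Load the pretrained Conformer-Transducer (RNNT, BPE) checkpoint.
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "stt_en_conformer_transducer_xlarge"
)

# Swap in a tokenizer built on verbatim transcripts (directory is a placeholder).
# Note: this rebuilds the decoder/joint output layers for the new vocabulary,
# so they need to be retrained on your data.
model.change_vocabulary(
    new_tokenizer_dir="tokenizers/verbatim_bpe",   # placeholder path
    new_tokenizer_type="bpe",
)

# JSON-lines manifests with verbatim (disfluency-preserving) transcripts;
# each line holds audio_filepath, duration, and text.
model.setup_training_data(
    train_data_config={
        "manifest_filepath": "manifests/train_verbatim.json",  # placeholder
        "sample_rate": 16000,
        "batch_size": 8,
        "shuffle": True,
    }
)
model.setup_validation_data(
    val_data_config={
        "manifest_filepath": "manifests/dev_verbatim.json",    # placeholder
        "sample_rate": 16000,
        "batch_size": 8,
        "shuffle": False,
    }
)

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=50)
model.set_trainer(trainer)
trainer.fit(model)
```

As far as I understand, the trade-off versus adapters is that changing the vocabulary replaces the output layers, so you need more data and training time, but you gain the ability to add dedicated tokens for fillers.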
-
Hi,
I'm very new to ASR, so please bear with me :)

For a project, we are using the stt_en_conformer_transducer_xlarge model. In general, we are quite happy with the model performance, but we have noticed that it does not seem to output speech disfluencies, such as filler words and stutters [at least most of the time]. I would like to adapt / fine-tune the model to make it better at transcribing disfluencies. I can make use of around 50 hours of audio, paired with transcriptions of the stt_en_conformer_transducer_xlarge model, post-edited by humans to include speech disfluencies in the cases where they were present in the audio. Not all audio files actually contain disfluencies, but it's important for us to transcribe them when present.

I'm looking for general pointers / advice you might have for this specific use-case. I'm planning to start my experiments using the same model and adapters, but I'm not sure if this is the best approach, or if another model [size] might make more sense in this context.
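To make the adapter idea concrete, here is roughly what I have in mind, loosely following NeMo's ASR adapter tutorial. It is an illustration, not a tested configuration: the adapter name, bottleneck dimension, manifest paths, and trainer settings are placeholders.

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from nemo.collections.common.parts.adapter_modules import LinearAdapterConfig

model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "stt_en_conformer_transducer_xlarge"
)

# Add a small linear adapter to the encoder; sizes are illustrative, not tuned.
adapter_cfg = LinearAdapterConfig(
    in_features=model.cfg.encoder.d_model,  # Conformer hidden size
    dim=64,                                 # adapter bottleneck width
)
model.add_adapter(name="disfluency_adapter", cfg=adapter_cfg)
model.set_enabled_adapters(enabled=False)                         # disable all adapters
model.set_enabled_adapters(name="disfluency_adapter", enabled=True)

# Train only the adapter parameters; keep the pretrained weights frozen.
model.freeze()
model.unfreeze_enabled_adapters()

# ~50 h of audio with human-corrected verbatim transcripts
# (JSON-lines manifest: audio_filepath / duration / text per line).
model.setup_training_data(
    train_data_config={
        "manifest_filepath": "manifests/train_disfluent.json",  # placeholder
        "sample_rate": 16000,
        "batch_size": 8,
        "shuffle": True,
    }
)

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=20)
model.set_trainer(trainer)
trainer.fit(model)
```

The appeal of adapters here is that the original tokenizer is kept and only a small number of parameters are trained, which seems like a better match for ~50 hours of data than full fine-tuning, but I'd appreciate confirmation.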