Hi @wookladin,
While creating the training data, G2P produces phonemes based on how a word is canonically supposed to be pronounced, but the actual audio may have a slightly different pronunciation due to accent variation. I understand that you've used a proprietary G2P for better results, but G2P models only use the transcript information.
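For example, an open-source G2P such as `g2p_en` (used here purely as an illustrative stand-in for the proprietary one) maps text straight to the dictionary pronunciation, with no way to see the audio:

```python
# Minimal sketch: text-only G2P with the open-source g2p_en package
# (an illustrative stand-in; the proprietary G2P is not public).
from g2p_en import G2p

g2p = G2p()

# The output is the canonical (dictionary) pronunciation. A speaker
# with a different accent may actually say T AH M AA T OW, but G2P
# cannot know that from the transcript alone.
print(g2p("tomato"))  # e.g. ['T', 'AH0', 'M', 'EY1', 'T', 'OW2']
```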
A speech+transcript conditioned phoneme recognizer would give better results, wouldn't it?
Phoneme error rates are still high even in the latest ASR acoustic models. Usually the acoustic model predicts a noisy phoneme sequence and the language model then recovers the transcript from those phonemes. Here, though, we want the reverse: accurate phonemes given both the audio and the transcript. I couldn't find any literature on that. Any leads/ideas?
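For what it's worth, here is the rough shape such a model could take, sketched in PyTorch. Everything below is an assumption for illustration (module choices, dimensions, CTC as the objective), not something from this repo: audio frames cross-attend to the transcript's canonical G2P phonemes, so the network only has to learn deviations from the canonical pronunciation.

```python
# Sketch of a phoneme recognizer conditioned on both speech and
# transcript. All modules and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionedPhonemeRecognizer(nn.Module):
    def __init__(self, n_phonemes: int, d_model: int = 256, n_mels: int = 80):
        super().__init__()
        self.audio_proj = nn.Linear(n_mels, d_model)          # mel frames -> model dim
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)  # canonical G2P phonemes
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=4)
        self.head = nn.Linear(d_model, n_phonemes + 1)        # +1 for CTC blank

    def forward(self, mels, canon_phonemes):
        # mels: (B, T_audio, n_mels); canon_phonemes: (B, T_text) int ids
        a = self.audio_proj(mels)
        p = self.phoneme_emb(canon_phonemes)
        # each audio frame looks up the canonical pronunciation
        attended, _ = self.cross_attn(query=a, key=p, value=p)
        h = self.encoder(a + attended)
        return self.head(h).log_softmax(-1)  # (B, T_audio, n_phonemes+1)
```

The catch, and probably why there is little literature, is supervision: training with CTC needs *realized*-phoneme targets, which standard corpora don't provide, so they would have to come from manual annotation of an accented subset or from forced alignment.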
Yes. It would be better if there were a phoneme recognizer that takes both speech and transcript as input, but I've never seen such an approach.
I think it could help if the ASR language model and a second model that reconstructs the original speech from the ASR output were trained jointly. However, that training process would be unstable, since gradients can't flow through the discrete phoneme output; I think using RL might help in this case.
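If I'm reading the idea right, RL works around the discrete bottleneck by using the reconstruction error as a negative reward. A toy REINFORCE sketch under that assumption (`recognizer` and `reconstructor` are hypothetical placeholders, not models from this repo):

```python
# Toy REINFORCE sketch of the joint-training idea; both modules
# below are hypothetical placeholders.
import torch

def reinforce_step(recognizer, reconstructor, mels, optimizer):
    # recognizer: mels (B, T, n_mels) -> phoneme logits (B, T, n_phonemes)
    logits = recognizer(mels)
    dist = torch.distributions.Categorical(logits=logits)
    phonemes = dist.sample()            # discrete: blocks backprop here

    # reconstructor: phoneme ids (B, T) -> reconstructed mels (B, T, n_mels)
    recon = reconstructor(phonemes)
    recon_loss = ((recon - mels) ** 2).mean()

    # REINFORCE: the reconstruction error becomes a negative reward that
    # reaches the recognizer through the log-probability of its samples.
    reward = -((recon - mels) ** 2).mean(dim=(1, 2)).detach()   # (B,)
    baseline = reward.mean()            # crude variance reduction
    log_prob = dist.log_prob(phonemes).sum(dim=1)               # (B,)
    pg_loss = -((reward - baseline) * log_prob).mean()

    optimizer.zero_grad()
    (recon_loss + pg_loss).backward()   # reconstructor: MSE; recognizer: PG
    optimizer.step()
```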