-
Hi, I want to transcribe audio into IPA symbols. Can I do it with Whisper (assuming there is a proper dataset)?
-
From a semi-enthusiast linguist's perspective, this is totally possible, and yes, Whisper can do it. The problem is coming up with the dataset, mainly because phones can vary between accents. That said, if you just want phonemes, you could use the pronunciation from any standard dictionary as the training target instead of the word itself. But then it wouldn't be language-agnostic, which I assume isn't what you're looking for. I don't know whether that's different from the current system, although apparently it is word-based, and you would want to break it down to at least the syllable level.
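As a rough sketch of the dictionary idea, assuming English and CMUdict via the pronouncing package (everything here is illustrative, not from this thread): map each word in a transcript to its dictionary phones and train against those instead of the words.

```python
# pip install pronouncing
# Sketch: turn a word-level transcript into a phone-level training target
# using CMUdict (ARPAbet phones; a separate ARPAbet-to-IPA mapping is needed).
import pronouncing

def words_to_arpabet(sentence: str) -> str:
    phones = []
    for word in sentence.lower().split():
        candidates = pronouncing.phones_for_word(word)
        # Fall back to an <unk> marker for out-of-vocabulary words.
        phones.append(candidates[0] if candidates else "<unk>")
    return " ".join(phones)

print(words_to_arpabet("transcribe this audio"))
```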
-
You may use Phonemizer:
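For reference, a minimal sketch of a typical phonemizer call; the espeak backend, language, and example text are assumptions, not from this thread:

```python
# pip install phonemizer   (the espeak backend also needs the espeak-ng package)
from phonemizer import phonemize

text = ["hello world", "transcribe this audio"]
ipa = phonemize(
    text,
    language="en-us",
    backend="espeak",  # espeak-ng covers many languages and emits IPA
    strip=True,
)
print(ipa)  # one IPA string per input sentence
```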
-
This should be a Feature Request, an important Feature Request. The whisper-timestamped project (https://github.com/linto-ai/whisper-timestamped) has done some work on this, for example with this command:

I'm dreaming of a realtime Syllable Recognition Engine that can achieve the precision of a mechanical keyboard. ref: #2 (comment)
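As a hedged illustration (not the specific command referenced above), whisper-timestamped is typically driven from Python roughly like this; the model size, language, and file name are assumptions:

```python
# pip install whisper-timestamped
import whisper_timestamped as whisper

audio = whisper.load_audio("speech.wav")
model = whisper.load_model("tiny")

# Same result dict as openai-whisper, with per-word timestamps added
# under each segment's "words" key.
result = whisper.transcribe(model, audio, language="zh")

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["start"], word["end"], word["text"])
```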
-
@diyism if you only need Chinese, you can achieve the same thing with a streaming pretrained Icefall model (from the next-generation Kaldi stack). The latest models trained on WenetSpeech have quite good performance, on AISHELL at least. I can't actually find Icefall's CER on FLEURS or Whisper's CER on AISHELL, so it's difficult to compare the two; if anyone has the capacity, it would be great to run both evaluations to get a baseline comparison.
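For completeness, a rough sketch of decoding with a streaming transducer exported from Icefall to sherpa-onnx; the file names are placeholders and the exact Python API can differ between sherpa-onnx versions:

```python
# pip install sherpa-onnx soundfile numpy
import numpy as np
import sherpa_onnx
import soundfile as sf

# Placeholder paths: a streaming transducer exported from Icefall,
# e.g. one of the pretrained WenetSpeech models.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="tokens.txt",
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    sample_rate=16000,
    feature_dim=80,
)

samples, sample_rate = sf.read("speech.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
# Tail padding so the last frames get decoded, then mark the stream finished.
stream.accept_waveform(sample_rate, np.zeros(int(0.5 * sample_rate), dtype=np.float32))
stream.input_finished()

while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
print(recognizer.get_result(stream))
```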
-
Daniel Povey, the developer of Kaldi 2, believes that a "Proactive Mandarin Syllable Recognition Engine" can't be achieved, but I still feel it can. The task of syllable recognition should be left to the speech recognition engine.
-
@averkij @Arlen22 what is the latest state of this problem (speech audio -> IPA transcriptions, across languages)? Has any more progress been made? Did you all figure out a solution that worked nicely?
-
We indie hackers are very close to getting a precise, realtime voice-to-syllable/pinyin solution, but I think we currently lack a syllable-level VAD for this, since sherpa-onnx-kws can already precisely recognize every single-syllable WAV (manually sliced first):
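Not a real syllable-level VAD, but as a stopgap the manual slicing could be roughly approximated with onset detection, e.g. via librosa (all file names here are illustrative):

```python
# pip install librosa soundfile
# Rough pre-slicing of an utterance into syllable-sized chunks at detected
# onsets; each chunk could then be passed to a keyword spotter or recognizer.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=16000, mono=True)
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time", backtrack=True)
boundaries = list(onsets) + [len(y) / sr]

for i, start in enumerate(onsets):
    end = boundaries[i + 1]
    chunk = y[int(start * sr):int(end * sr)]
    sf.write(f"chunk_{i:02d}.wav", chunk, sr)
```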
-
It seems the allosaurus project (https://github.com/xinjli/allosaurus) can produce IPA symbols and correct timestamps, but it missed the syllable "bei" (out of the 5 syllables in "jiang3 you3 bo2 bei4 pai1"):
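For anyone wanting to try the same thing, the allosaurus Python API looks roughly like this; the file name is a placeholder and the timestamp flag is my reading of the project's README, so it may differ by version:

```python
# pip install allosaurus
# python -m allosaurus.bin.download_model -n latest   # fetch the universal model
from allosaurus.app import read_recognizer

model = read_recognizer()

# Plain IPA phone sequence for the whole file.
print(model.recognize("speech.wav"))

# With timestamps: one "start duration phone" line per recognized phone.
print(model.recognize("speech.wav", timestamp=True))
```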
-
pyannote-onnx (with the segmentation-3.0.onnx model) can segment some of the Mandarin syllables/pinyins:
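A minimal sketch of running such a segmentation ONNX model directly with onnxruntime, assuming mono 16 kHz input and a (batch, channel, samples) layout; the actual pyannote-onnx wrapper and tensor shapes may differ:

```python
# pip install onnxruntime soundfile
import onnxruntime as ort
import soundfile as sf

sess = ort.InferenceSession("segmentation-3.0.onnx")
inp = sess.get_inputs()[0]

audio, sr = sf.read("speech.wav", dtype="float32")  # expected: mono, 16 kHz
x = audio.reshape(1, 1, -1)                          # assumed (batch, channel, samples)

outputs = sess.run(None, {inp.name: x})
frames = outputs[0]
print(frames.shape)  # frame-level activation probabilities to threshold/segment
```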