-
Hi, I want to transcribe audio into IPA symbols. Can I do it with Whisper (assuming there is a proper dataset)?
-
From a semi-enthusiast linguist's perspective, this is totally possible, and yes, Whisper can do it. The problem is coming up with the dataset, mainly because phones can vary between accents. That said, if you just want phonemes, you could use the pronunciation from any standard dictionary as the training target instead of the word itself. But then it wouldn't be language-agnostic, which I assume isn't what you're looking for. I don't know whether that's different from the current system, although apparently it is word-based, and you would want to break it down to at least the syllable level.
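As a rough sketch of the dictionary idea, assuming English and CMUdict via the pronouncing package (everything here is illustrative, not from this thread): map each word in a transcript to its dictionary phones and train against those instead of the words.

```python
# pip install pronouncing
# Sketch: turn a word-level transcript into a phone-level training target
# using CMUdict (ARPAbet phones; a separate ARPAbet-to-IPA mapping is needed).
import pronouncing

def words_to_arpabet(sentence: str) -> str:
    phones = []
    for word in sentence.lower().split():
        candidates = pronouncing.phones_for_word(word)
        # Fall back to an <unk> marker for out-of-vocabulary words.
        phones.append(candidates[0] if candidates else "<unk>")
    return " ".join(phones)

print(words_to_arpabet("transcribe this audio"))
```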
-
You may use Phonemizer:
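For reference, a minimal sketch of a typical phonemizer call; the espeak backend, language, and example text are assumptions, not from this thread:

```python
# pip install phonemizer   (the espeak backend also needs the espeak-ng package)
from phonemizer import phonemize

text = ["hello world", "transcribe this audio"]
ipa = phonemize(
    text,
    language="en-us",
    backend="espeak",  # espeak-ng covers many languages and emits IPA
    strip=True,
)
print(ipa)  # one IPA string per input sentence
```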
-
This should be a Feature Request, an important Feature Request. The whisper-timestamped project (https://github.com/linto-ai/whisper-timestamped) has done some work on this, for example with this command:

I'm dreaming of a realtime Syllable Recognition Engine that can achieve the precision of a mechanical keyboard. ref: #2 (comment)
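As a hedged illustration (not the specific command referenced above), whisper-timestamped is typically driven from Python roughly like this; the model size, language, and file name are assumptions:

```python
# pip install whisper-timestamped
import whisper_timestamped as whisper

audio = whisper.load_audio("speech.wav")
model = whisper.load_model("tiny")

# Same result dict as openai-whisper, with per-word timestamps added
# under each segment's "words" key.
result = whisper.transcribe(model, audio, language="zh")

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["start"], word["end"], word["text"])
```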
-
@diyism if you only need Chinese, you can achieve the same thing with a streaming pretrained Icefall model (from the next-generation Kaldi stack). The latest models trained on WenetSpeech have quite good performance, on AISHELL at least. I can't actually find Icefall's CER on FLEURS or Whisper's CER on AISHELL, so it's difficult to compare the two; if anyone has the capacity, it would be great to run both evaluations to get a baseline comparison.
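For completeness, a rough sketch of decoding with a streaming transducer exported from Icefall to sherpa-onnx; the file names are placeholders and the exact Python API can differ between sherpa-onnx versions:

```python
# pip install sherpa-onnx soundfile numpy
import numpy as np
import sherpa_onnx
import soundfile as sf

# Placeholder paths: a streaming transducer exported from Icefall,
# e.g. one of the pretrained WenetSpeech models.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="tokens.txt",
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    sample_rate=16000,
    feature_dim=80,
)

samples, sample_rate = sf.read("speech.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
# Tail padding so the last frames get decoded, then mark the stream finished.
stream.accept_waveform(sample_rate, np.zeros(int(0.5 * sample_rate), dtype=np.float32))
stream.input_finished()

while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
print(recognizer.get_result(stream))
```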
-
Daniel Povey, the developer of Kaldi 2, believes that a "Proactive Mandarin Syllable Recognition Engine" can't be achieved, but I still feel it can. The task of syllable recognition should be left to the speech recognition engine.
-
@averkij @Arlen22 what is the latest state of this problem (speech audio -> IPA transcriptions, across languages)? Has any more progress been made? Did you all figure out a solution that worked nicely?
-
We indie hackers are very close to getting a precise, realtime voice-to-syllable/pinyin solution, but I think we currently lack a syllable-level VAD for this, since sherpa-onnx-kws can already precisely recognize every single-syllable WAV (manually sliced first):
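Not a real syllable-level VAD, but as a stopgap the manual slicing could be roughly approximated with onset detection, e.g. via librosa (all file names here are illustrative):

```python
# pip install librosa soundfile
# Rough pre-slicing of an utterance into syllable-sized chunks at detected
# onsets; each chunk could then be passed to a keyword spotter or recognizer.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=16000, mono=True)
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time", backtrack=True)
boundaries = list(onsets) + [len(y) / sr]

for i, start in enumerate(onsets):
    end = boundaries[i + 1]
    chunk = y[int(start * sr):int(end * sr)]
    sf.write(f"chunk_{i:02d}.wav", chunk, sr)
```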
-
It seems the allosaurus project (https://github.com/xinjli/allosaurus) can produce IPA symbols and correct timestamps, but it missed the syllable "bei" (out of the 5 syllables in "jiang3 you3 bo2 bei4 pai1"):
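For anyone wanting to try the same thing, the allosaurus Python API looks roughly like this; the file name is a placeholder and the timestamp flag is my reading of the project's README, so it may differ by version:

```python
# pip install allosaurus
# python -m allosaurus.bin.download_model -n latest   # fetch the universal model
from allosaurus.app import read_recognizer

model = read_recognizer()

# Plain IPA phone sequence for the whole file.
print(model.recognize("speech.wav"))

# With timestamps: one "start duration phone" line per recognized phone.
print(model.recognize("speech.wav", timestamp=True))
```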
-
pyannote-onnx (with the segmentation-3.0.onnx model) can segment some of the Mandarin syllables/pinyins:
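A minimal sketch of running such a segmentation ONNX model directly with onnxruntime, assuming mono 16 kHz input and a (batch, channel, samples) layout; the actual pyannote-onnx wrapper and tensor shapes may differ:

```python
# pip install onnxruntime soundfile
import onnxruntime as ort
import soundfile as sf

sess = ort.InferenceSession("segmentation-3.0.onnx")
inp = sess.get_inputs()[0]

audio, sr = sf.read("speech.wav", dtype="float32")  # expected: mono, 16 kHz
x = audio.reshape(1, 1, -1)                          # assumed (batch, channel, samples)

outputs = sess.run(None, {inp.name: x})
frames = outputs[0]
print(frames.shape)  # frame-level activation probabilities to threshold/segment
```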