
Steps to transcribe in French #1

Closed · Ca-ressemble-a-du-fake opened this issue Dec 16, 2022 · 14 comments

@Ca-ressemble-a-du-fake (Contributor)

Hi,

Thanks for sharing this work. You wrote that it still needs testing... can I test it in French 😉?
I am not sure what I should change. I saw that the wav2vec2 model could be passed in as a parameter (see the readme), but in the code there are some hardcoded pipelines referring to the English model. There is a wav2vec2 model for French (which I have never tested, since I was relying on Whisper only).

Looking forward to testing this in French!

@m-bain (Owner) commented Dec 16, 2022

Hi you can try:
--align_model VOXPOPULI_ASR_BASE_10K_FR
https://pytorch.org/audio/stable/generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR

This is a wav2vec 2.0 model fine-tuned on French. Let me know how it goes -- I can start setting default models per language.
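
For reference, loading that bundle directly with torchaudio looks roughly like this (a minimal sketch based on the linked pipeline docs; the variable names are just for illustration):

```python
import torchaudio

# Minimal sketch: load the French VoxPopuli wav2vec2 bundle suggested above.
bundle = torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR
model = bundle.get_model()      # acoustic model used for alignment
labels = bundle.get_labels()    # character dictionary (for French this should include accented letters)
print(bundle.sample_rate, labels)
```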

@Ca-ressemble-a-du-fake (Contributor, Author)

Thanks for your reply. I used whisperx test.wav --model large --output_dir test_whisperx --align_model large --align_extend 2 --language fr because it did not accept the French model:
whisperx: error: argument --align_model: invalid choice: 'VOXPOPULI_ASR_BASE_10K_FR' (choose from 'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large')
But then it wanted to download wav2vec2_fairseq_large_lv60k_asr_ls960.pth.

@m-bain (Owner) commented Dec 16, 2022

Ah sorry there was a bug in the argument parser, git pull and try again now :)

@Ca-ressemble-a-du-fake (Contributor, Author)

Ok thanks for the update!
I could test it. If I leave the code as is, there is no improvement over the Whisper large model (timestamps are exactly the same, with the same inaccuracy). But if I replace the line bundle = torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H with bundle = torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR, then it can sometimes find intermediate timestamps. For example, a long sentence originally spanning (whisper large) 26.000 to 35.000 was split into 26.000 to 29.280 and then 29.280 to 35.000.

Yet I could not test further, because it then failed during "Alignment" at tokens = [model_dictionary[c] for c in transcription_cleaned].

I think it has something to do with the apostrophe '. Moreover, accented letters are missing. Here is what transcription_cleaned looks like: C'EST|UNE|NOUVELLE|DITION, where DITION should be ÉDITION (note the missing leading É).

Consequently, so far I cannot tell you whether it brings improvements in French over stock Whisper 😉!

@RaulKite

Hi,

I can lend a hand with Spanish. I started yesterday with normalization and spanish.py.

I will continue next week.

@m-bain (Owner) commented Dec 17, 2022

@Ca-ressemble-a-du-fake thanks for pointing this out, I see the culprit was this line

t_words_clean = [re.sub(r"[^a-zA-Z' ]", "", x) for x in t_words]

which hard-codes the cleaning of the whisper transcript to only contain a-z and apostrophes (which is what the English wav2vec2.0 dictionary contains).
I have now changed this cleaning to depend on the alignment model's dictionary in this commit, so for French this will include accented letters.
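
A minimal sketch of what dictionary-driven cleaning can look like (this is an illustration, not the exact code from the commit; the variable names are assumptions):

```python
import re
import torchaudio

# Build the cleaning pattern from the alignment model's own dictionary
# instead of hard-coding [^a-zA-Z' ].
bundle = torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR
labels = bundle.get_labels()
allowed = "".join(re.escape(c) for c in labels if len(c) == 1 and c != "|")
clean_re = re.compile(rf"[^{allowed} ]", flags=re.IGNORECASE)

t_words = ["C'est", "une", "nouvelle", "édition", "!"]
t_words_clean = [clean_re.sub("", x) for x in t_words]
# With a French dictionary, accented letters such as é should now survive the cleaning.
```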

@RaulKite keep me updated on spanish!

@Ca-ressemble-a-du-fake (Contributor, Author)

Thanks for the update! Why is it converting the text to upper case? Because of that, it shows a KeyError on the third character ("E").

If I print model_dictionary it shows lower-case letters. So lower-casing the characters from transcription_cleaned makes it find all the letters: tokens = [model_dictionary[c.lower()] for c in transcription_cleaned], but that contradicts the upper() a couple of lines earlier!

Then in utils, while writing the transcript to disk in .ass format, prev can be None (line 194). Consequently I had to add another check: if prev is not None and word['start'] > prev:

With these quick-and-dirty workarounds, the output, even with the large whisper model, looks far more accurate!
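
A toy illustration of the case mismatch described above (the dictionary here is made up for the example; real wav2vec2 dictionaries such as the French VoxPopuli one are larger):

```python
# Lower-case wav2vec2 dictionary (made-up subset for illustration).
model_dictionary = {"|": 0, "c": 1, "'": 2, "e": 3, "s": 4, "t": 5}
transcription_cleaned = "C'EST"

# tokens = [model_dictionary[c] for c in transcription_cleaned]  # KeyError on upper-case letters
tokens = [model_dictionary[c.lower()] for c in transcription_cleaned]
print(tokens)  # [1, 2, 3, 4, 5]
```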

@m-bain (Owner) commented Dec 18, 2022

Thanks for this, I have pushed a big fix and it seems to be working for French now (with varying degrees of accuracy). Please see the examples in the README.
Performance on non-English languages could be improved, and I will be investigating this. But I'm closing this for now :)
Feel free to re-open if you find any further issues with French.

m-bain closed this as completed on Dec 18, 2022
@Ca-ressemble-a-du-fake (Contributor, Author)

Thanks for this fix! It works well now! Punctuation is missing, but I am not sure this is specific to French, is it?

@m-bain (Owner) commented Dec 19, 2022

@Ca-ressemble-a-du-fake I think this is a whisper issue. In some examples it does output punctuation: https://github.com/m-bain/whisperX#french

From my experience, non-English Whisper sometimes tends to transcribe without punctuation, sorry.

@RaulKite

> Thanks for this fix! It works well now! Punctuation is missing, but I am not sure this is specific to French, is it?

It is usually fixed by giving an initial prompt containing punctuation marks. For example: "And now, we continue with the next video."
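
As an illustration, a minimal sketch of passing such a prompt through openai-whisper directly (assuming a whisper version that exposes an initial_prompt argument; the file name and prompt text are placeholders):

```python
import whisper

model = whisper.load_model("large")
# A prompt with normal punctuation can nudge the decoder towards punctuated
# output; write it in the language you are transcribing.
result = model.transcribe(
    "test.wav",
    language="fr",
    initial_prompt="Et maintenant, nous continuons avec la vidéo suivante.",
)
print(result["text"])
```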

@Ca-ressemble-a-du-fake (Contributor, Author)

@m-bain I only tried with medium, and the examples use the large model (which shows much worse timestamps). By the way, whisper alone does output punctuation (with all models).

Furthermore, I noticed that starting timestamps are often in the middle of the starting words. Is there a way to make them start earlier? Also, would you mind explaining very briefly what the wav2vec2 model is used for? I mean, what your algorithm does (I can read the code but I still don't understand it)?

Thanks @RaulKite, I'll try with an initial prompt too!

@m-bain (Owner) commented Dec 19, 2022

@Ca-ressemble-a-du-fake yes, I know whisper outputs punctuation, but sometimes the model gets stuck in a "non-punctuation" mode; this is a whisper error, not an error with the alignment algorithm. You can see this by printing the whisper output before alignment here:

result = transcribe(model, audio_path, temperature=temperature, **args)

One thing that might be causing a discrepancy is that this uses whisper with --condition_on_previous_text False, because otherwise whisper timestamps can be off by 5 seconds or something crazy like that.

Anyway, Raul's solution should help (you might want to write that prompt in French).

Re: starting timestamps being in the middle of the word:
Thanks for letting me know. I recommend just subtracting a fixed amount from the starting timestamp as a post-processing step, because I am trying to minimise the number of heuristics added.

Re: the algorithm:
wav2vec2 is a model fine-tuned for a phoneme prediction task. That is, given input audio, it outputs a probability matrix (N x K), where N is the temporal axis (with a resolution of 0.02 s) and K is the number of tokens (letters) in the dictionary {'a': 0, 'b': 1, ..., 'z': 25}. This tells us the chance of a letter being said at any moment in time.
The algorithm I use assumes that the spoken words (and therefore letters) are known a priori, by taking them from the Whisper output. This just leaves the task of aligning the Whisper output to the wav2vec2 probability matrix.

E.g.
whisper output: 00:03:00->00:08:00 "Hi, my name is Jane"
wav2vec input: audio between "00:01:00->00:10:00" (timestamps are extended by 2 seconds).
wav2vec output: N x K matrix

Then we just look for the most likely monotonically increasing path of the following letters only:
[H,I,|,M,Y,|,N,A,M,E,|,I,S,|,J,A,N,E]
over the N x K matrix.
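
To make that concrete, here is a hedged, toy sketch of the monotonic search over the N x K emission matrix, in the spirit of the torchaudio forced-alignment tutorial linked below (this is an illustration, not whisperX's actual implementation; the blank index and function name are assumptions):

```python
import torch

def forced_align(emission: torch.Tensor, tokens: list, blank: int = 0):
    """Toy monotonic alignment of a known token sequence to an (N x K)
    frame-wise log-probability matrix. Returns (token, frame) pairs; with a
    0.02 s frame resolution, frame * 0.02 gives a timestamp in seconds."""
    N = emission.size(0)
    T = len(tokens)
    assert N >= T, "need at least one frame per token"
    # trellis[t, j]: best score for aligning the first j tokens to the first t frames
    trellis = torch.full((N + 1, T + 1), float("-inf"))
    trellis[0, 0] = 0.0
    for t in range(1, N + 1):
        for j in range(T + 1):
            stay = trellis[t - 1, j] + emission[t - 1, blank]           # emit blank, keep position
            move = (trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
                    if j > 0 else float("-inf"))                        # emit the next token
            trellis[t, j] = max(stay, move)
    # Backtrack to find the frame where each token was emitted.
    path, j = [], T
    for t in range(N, 0, -1):
        move = (trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
                if j > 0 else float("-inf"))
        if j > 0 and trellis[t, j] == move:
            path.append((tokens[j - 1], t - 1))
            j -= 1
    return list(reversed(path))
```

In this toy setup, the letters [H,I,|,M,Y,...] would first be mapped to their dictionary indices, and each returned frame (times 0.02 s, plus the offset of the extended audio window) becomes a character-level timestamp from which word boundaries can be read off.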

For more, see:
https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html
and
https://en.wikipedia.org/wiki/Dynamic_time_warping

@Ca-ressemble-a-du-fake (Contributor, Author)

Awesome! Thanks a ton @m-bain, I can look at your code again and learn how you did that 😄!
