
Steps to transcribe in French #1

Closed · Ca-ressemble-a-du-fake opened this issue Dec 16, 2022 · 14 comments

@Ca-ressemble-a-du-fake (Contributor)

Hi,

Thanks for sharing this work. You wrote that it still needs testing... can I test it in French 😉?
I am not sure what I should change. I saw that the wav2vec2 model could be passed in as a parameter (see the readme), but in the code there are some hardcoded pipelines referring to the English model. There is a wav2vec2 model for French (which I have never tested, since I was relying on Whisper only).

Looking forward to testing this in French!

@m-bain (Owner) commented Dec 16, 2022

Hi you can try:
--align_model VOXPOPULI_ASR_BASE_10K_FR
https://pytorch.org/audio/stable/generated/torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR.html#torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR

This is a wav2vec 2.0 model fine-tuned on French. Let me know how it goes -- I can start setting default models per language.
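
For reference, loading that bundle directly with torchaudio looks roughly like this (a minimal sketch based on the linked pipeline docs; the variable names are just for illustration):

```python
import torchaudio

# Minimal sketch: load the French VoxPopuli wav2vec2 bundle suggested above.
bundle = torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR
model = bundle.get_model()      # acoustic model used for alignment
labels = bundle.get_labels()    # character dictionary (for French this should include accented letters)
print(bundle.sample_rate, labels)
```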

@Ca-ressemble-a-du-fake (Contributor, Author)

Thanks for your reply. I used whisperx test.wav --model large --output_dir test_whisperx --align_model large --align_extend 2 --language fr because it did not accept the French model:
whisperx: error: argument --align_model: invalid choice: 'VOXPOPULI_ASR_BASE_10K_FR' (choose from 'tiny.en', 'tiny', 'base.en', 'base', 'small.en', 'small', 'medium.en', 'medium', 'large-v1', 'large-v2', 'large')
But then it wanted to download wav2vec2_fairseq_large_lv60k_asr_ls960.pth.

@m-bain (Owner) commented Dec 16, 2022

Ah sorry there was a bug in the argument parser, git pull and try again now :)

@Ca-ressemble-a-du-fake (Contributor, Author)

Ok thanks for the update!
I could test it. If I leave the code as is, there is no improvement over the Whisper large model (timestamps are exactly the same, with the same inaccuracy). But if I replace the line bundle = torchaudio.pipelines.WAV2VEC2_ASR_LARGE_LV60K_960H with bundle = torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR, then it can sometimes find intermediate timestamps. For example, a long sentence originally spanning (whisper large) 26.000 to 35.000 was split into 26.000 to 29.280 and then 29.280 to 35.000.

Yet I could not test further, because it then failed during "Alignment" at tokens = [model_dictionary[c] for c in transcription_cleaned].

I think it has something to do with the apostrophe '. Moreover, accented letters are missing. Here is what transcription_cleaned looks like: C'EST|UNE|NOUVELLE|DITION, where DITION should be ÉDITION (note the missing leading É).

Consequently, so far I cannot tell you whether it brings improvements in French over stock Whisper 😉!

@RaulKite

Hi,

I can lend a hand with Spanish. I started yesterday with normalization and spanish.py.

I will continue next week.

@m-bain (Owner) commented Dec 17, 2022

@Ca-ressemble-a-du-fake thanks for pointing this out, I see the culprit was this line

t_words_clean = [re.sub(r"[^a-zA-Z' ]", "", x) for x in t_words]

which hard-codes the cleaning of the whisper transcript to only contain a-z and apostrophes (which is what the English wav2vec2.0 dictionary contains).
I have now changed this cleaning to depend on the alignment model's dictionary in this commit, so for French this will include accented letters.
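
A minimal sketch of what dictionary-driven cleaning can look like (this is an illustration, not the exact code from the commit; the variable names are assumptions):

```python
import re
import torchaudio

# Build the cleaning pattern from the alignment model's own dictionary
# instead of hard-coding [^a-zA-Z' ].
bundle = torchaudio.pipelines.VOXPOPULI_ASR_BASE_10K_FR
labels = bundle.get_labels()
allowed = "".join(re.escape(c) for c in labels if len(c) == 1 and c != "|")
clean_re = re.compile(rf"[^{allowed} ]", flags=re.IGNORECASE)

t_words = ["C'est", "une", "nouvelle", "édition", "!"]
t_words_clean = [clean_re.sub("", x) for x in t_words]
# With a French dictionary, accented letters such as é should now survive the cleaning.
```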

@RaulKite keep me updated on spanish!

@Ca-ressemble-a-du-fake (Contributor, Author)

Thanks for the update! Why is it converting the text to upper case? Because of that, it shows a KeyError on the third character ("E").

If I print model_dictionary it shows lower-case letters. So lower-casing the characters from transcription_cleaned makes it find all the letters: tokens = [model_dictionary[c.lower()] for c in transcription_cleaned], but that contradicts the upper() a couple of lines earlier!

Then in utils, while writing the transcript to disk in .ass format, prev can be None (line 194). Consequently I had to add another check: if prev is not None and word['start'] > prev:

With these quick-and-dirty workarounds, the output, even with the large whisper model, looks far more accurate!
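
A toy illustration of the case mismatch described above (the dictionary here is made up for the example; real wav2vec2 dictionaries such as the French VoxPopuli one are larger):

```python
# Lower-case wav2vec2 dictionary (made-up subset for illustration).
model_dictionary = {"|": 0, "c": 1, "'": 2, "e": 3, "s": 4, "t": 5}
transcription_cleaned = "C'EST"

# tokens = [model_dictionary[c] for c in transcription_cleaned]  # KeyError on upper-case letters
tokens = [model_dictionary[c.lower()] for c in transcription_cleaned]
print(tokens)  # [1, 2, 3, 4, 5]
```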

@m-bain (Owner) commented Dec 18, 2022

Thanks for this, I have pushed a big fix and it seems to be working for French now (with varying degrees of accuracy). Please see the examples in the README.
Performance on non-English languages could be improved, and I will be investigating this. But I'm closing this for now :)
Feel free to re-open if you find any further issues with French.

m-bain closed this as completed on Dec 18, 2022
@Ca-ressemble-a-du-fake (Contributor, Author)

Thanks for this fix! It works well now! Punctuation is missing, but I am not sure this is specific to French, is it?

@m-bain (Owner) commented Dec 19, 2022

@Ca-ressemble-a-du-fake I think this is a whisper issue. In some examples it does output punctuation: https://github.com/m-bain/whisperX#french

From my experience, non-English Whisper sometimes tends to transcribe without punctuation, sorry.

@RaulKite

> Thanks for this fix! It works well now! Punctuation is missing, but I am not sure this is specific to French, is it?

It is usually fixed by giving an initial prompt containing punctuation marks. For example: "And now, we continue with the next video."
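
As an illustration, a minimal sketch of passing such a prompt through openai-whisper directly (assuming a whisper version that exposes an initial_prompt argument; the file name and prompt text are placeholders):

```python
import whisper

model = whisper.load_model("large")
# A prompt with normal punctuation can nudge the decoder towards punctuated
# output; write it in the language you are transcribing.
result = model.transcribe(
    "test.wav",
    language="fr",
    initial_prompt="Et maintenant, nous continuons avec la vidéo suivante.",
)
print(result["text"])
```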

@Ca-ressemble-a-du-fake (Contributor, Author)

@m-bain I only tried with medium, and the examples use the large model (which shows much worse timestamps). By the way, whisper alone does output punctuation (with all models).

Furthermore, I noticed that starting timestamps are often in the middle of the starting words. Is there a way to make them start earlier? Also, would you mind explaining very briefly what the wav2vec2 model is used for? I mean, what your algorithm does (I can read the code but I still don't understand it)?

Thanks @RaulKite, I'll try with an initial prompt too!

@m-bain (Owner) commented Dec 19, 2022

@Ca-ressemble-a-du-fake yes, I know whisper outputs punctuation, but sometimes the model gets stuck in a "non-punctuation" mode; this is a whisper error, not an error with the alignment algorithm. You can see this by printing the whisper output before alignment here:

result = transcribe(model, audio_path, temperature=temperature, **args)

One thing that might be causing a discrepancy is that this uses whisper with --condition_on_previous_text False, because otherwise whisper timestamps can be off by 5 seconds or something crazy like that.

Anyway, Raul's solution should help (you might want to write that prompt in French).

Re: starting timestamps being in the middle of the word:
Thanks for letting me know. I recommend just subtracting a fixed amount from the starting timestamp as a post-processing step, because I am trying to minimise the number of heuristics added.

Re: the algorithm:
wav2vec2 is a model fine-tuned for a phoneme prediction task. That is, given input audio, it outputs a probability matrix (N x K), where N is the temporal axis (with a resolution of 0.02 s) and K is the number of tokens (letters) in the dictionary {'a': 0, 'b': 1, ..., 'z': 25}. This tells us the chance of a letter being said at any moment in time.
The algorithm I use assumes that the spoken words (and therefore letters) are known a priori, by taking them from the Whisper output. This just leaves the task of aligning the Whisper output to the wav2vec2 probability matrix.

E.g.
whisper output: 00:03:00->00:08:00 "Hi, my name is Jane"
wav2vec input: audio between "00:01:00->00:10:00" (timestamps are extended by 2 seconds).
wav2vec output: N x K matrix

Then we just look for the most likely monotonically increasing path of the following letters only:
[H,I,|,M,Y,|,N,A,M,E,|,I,S,|,J,A,N,E]
over the N x K matrix.
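
To make that concrete, here is a hedged, toy sketch of the monotonic search over the N x K emission matrix, in the spirit of the torchaudio forced-alignment tutorial linked below (this is an illustration, not whisperX's actual implementation; the blank index and function name are assumptions):

```python
import torch

def forced_align(emission: torch.Tensor, tokens: list, blank: int = 0):
    """Toy monotonic alignment of a known token sequence to an (N x K)
    frame-wise log-probability matrix. Returns (token, frame) pairs; with a
    0.02 s frame resolution, frame * 0.02 gives a timestamp in seconds."""
    N = emission.size(0)
    T = len(tokens)
    assert N >= T, "need at least one frame per token"
    # trellis[t, j]: best score for aligning the first j tokens to the first t frames
    trellis = torch.full((N + 1, T + 1), float("-inf"))
    trellis[0, 0] = 0.0
    for t in range(1, N + 1):
        for j in range(T + 1):
            stay = trellis[t - 1, j] + emission[t - 1, blank]           # emit blank, keep position
            move = (trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
                    if j > 0 else float("-inf"))                        # emit the next token
            trellis[t, j] = max(stay, move)
    # Backtrack to find the frame where each token was emitted.
    path, j = [], T
    for t in range(N, 0, -1):
        move = (trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
                if j > 0 else float("-inf"))
        if j > 0 and trellis[t, j] == move:
            path.append((tokens[j - 1], t - 1))
            j -= 1
    return list(reversed(path))
```

In this toy setup, the letters [H,I,|,M,Y,...] would first be mapped to their dictionary indices, and each returned frame (times 0.02 s, plus the offset of the extended audio window) becomes a character-level timestamp from which word boundaries can be read off.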

For more, see:
https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html
and
https://en.wikipedia.org/wiki/Dynamic_time_warping

@Ca-ressemble-a-du-fake (Contributor, Author)

Awesome! Thanks a ton @m-bain, I can look at your code again and learn how you did that 😄!
