Some problems with large-v3 #100
I ran some more tests, this time focusing on adjusting some parameters to see if I could improve the problem I had above with the large-v3 model. The results are still not good: the short-audio results are acceptable, but the long-audio results are not as good as with the large-v2 model.
Can you run reference Whisper large-v3? How are the results there?
Reference Whisper's large-v3 runs even worse. I tried large-v3 on the day it came out, and even Japanese audio under ten minutes long would repeat sentences. I also saw a test of the converted model in whisper.cpp yesterday, and it had the same problem. For example, here: ggerganov/whisper.cpp#1444
By the way, there are some problems with punctuation as well. Even using --initial_prompt doesn't correct these issues.
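For reference, a minimal sketch of how an initial prompt is passed through faster-whisper's Python API, which the CLI's --initial_prompt presumably maps to; the file name and prompt text here are made-up placeholders:

```python
from faster_whisper import WhisperModel

# Load large-v3; device/compute_type are just examples, pick what fits your hardware.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# initial_prompt biases the decoder toward the prompt's style (e.g. punctuated
# text), but as noted above it does not reliably fix punctuation errors.
segments, info = model.transcribe(
    "sample.wav",  # hypothetical file name
    language="ja",
    initial_prompt="こんにちは。今日は、いい天気ですね。",  # punctuated example text
)

for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```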
I did only a few quick English tests; some places are better, some worse, but I noticed some weird repetitions with no reason too... I wonder if the model was converted correctly. I can't run large in reference Whisper, could you do some tests for me? Test these short audios in reference Whisper's console exe with the command below [use ffmpeg v5, not v6]: And upload all the output files somewhere.
Sorry, I don't know some programming terms, and sometimes I may not understand some sentences translated from the web page. Can you check if it's these output files?
The results of the reference are somewhat similar, somewhat not; something weird is going on at the chunk's end/start. Not sure if it's the model or faster-whisper's code. Needs more investigation.
Ok, I have a request: is it possible to add the download option for large-v3-fp16 to Subtitle Edit before the official release of Subtitle Edit 4.02 goes live? large-v3-fp16 will be better than large-v3-int8 when selecting words for some sentences. At least, that's what I saw when I tested it on Japanese audio.
What compute type do you use?
False alarm, the discrepancies were because of the bug.
I choose a different --compute_type depending on the language I'm transcribing: for English I use the auto-selected int8_float16, and for Japanese and Korean I choose bfloat16. This is because English transcription is already very good, producing sentences with good punctuation and segmentation, but for Asian languages only bfloat16 gets the same quality of sentence segmentation and punctuation as English. Sometimes float16 or float32 is the better choice, depending on how the transcription turns out in a given language; there are a lot of differences between languages. Using --initial_prompt also gives some results, but after testing a lot of audio, I've concluded that using --initial_prompt causes some sentence splits and punctuation to be placed incorrectly when transcribing. This results in the wrong meaning being conveyed in the final translation.
So far I'm not convinced that large-v3 is better than v2, but so far I've tested only English. By the way, there shouldn't be much of an improvement for English according to OpenAI's tests. I see that it's a bit more accurate in some places, but v3 hits more fallbacks, it wants to repeat things a lot, it wants to hallucinate... I have a feeling that v3 is a flop [for English], same as v1... maybe it performs better in other languages.
Choosing the compute type by language is not right; you could get such an impression only by looking at some short samples. Extensive tests would show you that accuracy is almost the same between all the different types. Just don't bother with it. Choose the fastest type for your hardware, that's it. Disable the fallback when benchmarking different types; the "Transcription speed" printed at the end is an accurate benchmark of transcription.
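A minimal sketch of such a benchmark through the faster-whisper Python API, with a made-up file name; passing a single temperature value (instead of the default fallback schedule) is what disables the fallback:

```python
import time

from faster_whisper import WhisperModel

AUDIO = "benchmark.wav"  # hypothetical long test file (~50 min, per the advice later in the thread)

for compute_type in ("float16", "int8_float16", "int8"):
    model = WhisperModel("large-v2", device="cuda", compute_type=compute_type)
    start = time.perf_counter()
    # A single temperature (instead of the default schedule) disables fallback.
    segments, info = model.transcribe(AUDIO, beam_size=5, temperature=0.0)
    text = "".join(s.text for s in segments)  # segments is lazy; consume it to actually decode
    elapsed = time.perf_counter() - start
    print(f"{compute_type}: {info.duration / elapsed:.2f} audio seconds/s")
```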
Here are OpenAI's tests, v2 vs v3:
[image: OpenAI's per-language WER comparison chart, large-v2 vs large-v3]
According to that, Cantonese has gone from a 30% error rate, which I'm told is pretty much unusable, to 10%, putting it close to English performance.
Actually, it involves another issue: no matter whether the audio is transcribed in Japanese, English, Korean or any other language, eventually many people have to translate those subtitles into their native language. I use DeepL Pro, ChatGPT 3.5, and Google to do the translation. Machine translation is different from human translation in that it can only translate sentence by sentence, not in context like a human translator. So I use a different --compute_type for different languages, because the sentence and punctuation segmentation is more accurate and complete. For example, if I use int8_float16, which works when transcribing English, to transcribe Japanese, the Japanese sentences in the result will not be punctuated and will be very fragmented. But when I use bfloat16 to transcribe Japanese, the resulting sentences are well separated and punctuated.

The reason for this is to let the machine translator better understand what each line of the subtitles means. After all, every language has homophones, and if the sentences in the transcription result are too fragmented, the machine translator will translate these homophones with the wrong meaning; if the sentence breaks are too bad, that affects the machine translation result as well. Although filling --initial_prompt with some example sentences can produce seemingly normal sentence division and punctuation, in fact it still affects the segmentation, causing some sentences to be forcibly broken with a punctuation mark inserted in the middle, which also changes the meaning in machine translation. As a simple example, the second one will have one more punctuation mark than the first one:

In addition, I will say that large-v2 is actually already very good at present, especially faster-whisper's work with large-v2, and your whisper-standalone-win project makes faster-whisper much simpler and easier to use. The disadvantage of large-v2 is that it can't select the better of two homophones, which is forgivable when producing English subtitles, but not for other languages. For example, the Japanese words 髪 and 紙 are written differently but pronounced exactly the same; one means hair and the other means paper. (You can listen to it on Google Translate...) Of course, it's also a matter of the amount of training data. large-v3 does select some homophones better than large-v2, and if large-v3 didn't have the hallucination and timestamp problems that its current transcriptions show so badly, it would be a perfect model upgrade.
By the way, you are right about one thing: large-v3's transcription turns out to be very close to large-v1's. This is confirmed in my multilingual transcription tests.
I don't think that "punctuation" somehow relates to some particular compute type.
How much data did you test for this conclusion?
I think if you want a little improvement then you could increase beam_size to 8 or 10, and the same for best_of.
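In faster-whisper's Python API those two knobs look like this (a minimal sketch with a hypothetical file name; both parameters default to 5, and best_of only applies when the decoder falls back to sampling):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# beam_size widens the beam search at temperature 0;
# best_of sets the number of candidates when sampling on fallback.
segments, info = model.transcribe(
    "audio.wav",   # hypothetical file
    beam_size=10,  # up from the default 5
    best_of=10,
)
for s in segments:
    print(s.text)
```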
I've tested close to 300 hours of different audio so far. The audio is categorized differently: from speakers with orderly pauses, to speakers with intermittent pauses, or very short pauses from the beginning to the end of a sentence, and so on. I also tried the same audio in different audio formats, for example WAV, MP3, AAC and so on. I have tried both beam_size and best_of, and the improvement is very small; most of the homophones stay wrong. The most interesting thing is the performance of large-v1: in my tests, where large-v2's transcription has the wrong homophone, large-v1 has the correct one, and likewise, where large-v1 makes a homophone mistake, large-v2 gets it right. Sometimes I wonder if there is any way to make large-v1 and large-v2 complement each other's deficiencies, or to load them at the same time for transcription, so that the correct rate would be much higher.
I don't know how to code, but I'd love to know if there's a way for large-v2 and large-v1 to transcribe an audio at the same time, taking the best sentence based on plausibility? Or other ways for both models to transcribe an audio at the same time to get a better result... This is just a personal guess... If it's not right, please take it as a joke I told :)
Then we need a 3rd model which can evaluate that "plausibility". If there were such a model, then we would use IT to transcribe; we wouldn't need those "v1" and "v2" 😉
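For what it's worth, a rough sketch of the "plausibility" idea at the whole-file level, using the avg_logprob that faster-whisper reports per segment. This is only an illustration of the guess above, under the assumption that a whole-file confidence comparison is meaningful; it is not a supported feature and does not pick the best model per sentence:

```python
from faster_whisper import WhisperModel


def transcribe_scored(model_name: str, audio: str):
    # Each call loads its own model; with two large models, expect heavy VRAM use.
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments = list(model.transcribe(audio)[0])  # consume the lazy generator
    # Mean decoder log-probability across segments, as a crude confidence score.
    score = sum(s.avg_logprob for s in segments) / max(len(segments), 1)
    return segments, score


AUDIO = "audio.wav"  # hypothetical file
v1, v1_score = transcribe_scored("large-v1", AUDIO)
v2, v2_score = transcribe_scored("large-v2", AUDIO)

# Keep the transcript the decoder itself was more confident about.
best = v1 if v1_score > v2_score else v2
for s in best:
    print(s.text)
```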
I will continue to wait for a more stable version and also look forward to faster-whisper's updates; I can only help by testing. Please don't hesitate to let me know if you have any requests for testing!
Hmm, maybe I would be interested in some tests, but not related to large-v3.
I would like to see benchmarks of all 3 versions of the cuBLAS and cuDNN libs. Tested in a console, not in SE.
Use one of the two audio files from before?
Just use audio long enough that the test runs at least ~3-5 minutes.
Audio duration: 6 minutes 38 seconds
I didn't mean "audio duration", I meant "test duration"; looking at the speeds, you should use ~50 minutes of audio. Looks like V1's speed is much worse; I will delete it from the repo after SE updates to the new version.
Nevermind that previous test... [probably the Google Translate issues 😉]
Thx for the retest, now we can see that V3 is a bit faster. Longer tests are more accurate.
I'm interested in testing the speeds of different compute_types. Post only the "Transcription speed", no need for images. Benchmark with these settings:
int8_float16: Transcription speed: 25.48 audio seconds/s
int16: reports an error: Traceback (most recent call last):
Thanks for these
No need to test
Could you do the same tests but with added
Tomorrow; because of the time difference, it's late at night here and I'm going to bed soon.
CPU: 13700K
bfloat16: Not supported
Did you forget
I didn't forget, I just forgot to copy it when I was replying to you. To be honest, I don't recommend using a CPU for transcription... My 13700K took around 12 minutes to transcribe this 6-minute audio sample.
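That "Not supported" result can be checked programmatically: CTranslate2, the engine underneath faster-whisper, can report which compute types a device actually supports. A minimal sketch:

```python
import ctranslate2

# List the compute types each device can actually execute;
# bfloat16 is typically absent on CPU, which explains "Not supported".
print("cpu :", ctranslate2.get_supported_compute_types("cpu"))
print("cuda:", ctranslate2.get_supported_compute_types("cuda"))
```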
Switched to the fp16 model by default in
Closing this.
@despairTK You can check if the new fixed config for v3 improves anything: https://we.tl/t-13kKOAs9Vi
I tested ten audio samples, including Japanese and English. The long audio samples were about 30 minutes and the short ones less than 10 minutes. The result is that there is no improvement; on short audio samples, the results from both configs are exactly the same. The results for long audio files are slightly different, with only slight changes in timestamps and sentence segmentation, but no change in overall accuracy. Errors in repeated sentences, timestamps, and punctuation still exist. In addition, after I woke up this morning, I saw two configuration changes at https://huggingface.co/openai/whisper-large-v3. I wonder if they will be helpful to you.
Same here, no difference.
Hi! I see faster-whisper has finally been updated. If you have also updated here, please don't forget to @ me. I'd love to test if it solves the previous problem.
Nothing new there; we've had v3 already, for almost a month.
All right... I thought something new had been updated...
"NO TEXT FOUND" error with the models
large-v3-fp16 and large-v3-int8 are not supported anymore, you can delete them.
I'm glad you're quick to support large-v3. Your work is outstanding!
I had some problems testing large-v3-int8 and large-v3-fp16.
I have transcribed videos in English, Japanese, Korean and Russian, dozens of videos in total, each of them varying in length from less than 10 minutes to more than 1 hour. The settings for transcription remain unchanged except for the model selection. The following problems occur when transcribing these videos.