large-v3
release
#1762
Replies: 58 comments 92 replies
-
omg THANK YOU gwook
-
let's goo
-
Incredible work
-
Let's gooooo
-
Extremely cool, interesting to see Dutch as a high accuracy language. Wonder if it's the commonality with English mixed with the fact it hasn't been modified so many times by Roman, Viking, Frankish and Norman conquerors 🧐 Analysing high error rate words and categorising them by their ancient root could be a fun undergrad paper for someone in computer science with an interest in linguistics (or an avenue on which to focus future work)
-
Congratulations @jongwook and team. Your great work is much appreciated!
-
Awesome @jongwook, looking forward to trying it. Edit: I was trying to convert the model to CTranslate2, but it fails because the large-v3 repository doesn't seem to exist on Hugging Face (or is set to private). Are there plans to publish the model on HF like large-v2?
-
@jongwook Congrats to the team on this exciting new release! With respect to the pseudolabeled data, were they treated the same as the weakly labeled data during training, or did they carry a lower weight? Are there any plans for a new paper for this release? Thanks!
-
That's cool, what about streaming?
-
That's so cool. Great Job!!!
-
That's great news. Will it be available on Hugging Face?
-
Are there plans to make new versions of other models to match large-v3 in terms of number of supported languages (100 in v3 vs 99 in others)?
-
Thank you!
-
you are not closeai
-
Great Job
-
Hi guys, I'm a little new here.
-
A question: how did you evaluate against Common Voice and/or Fleurs? Did you take just the test set, or the whole validated set?
-
It performs so badly with Albanian! I wish we could help with that and preserve more languages that are not commonly used these days.
-
In English, V3 is unusable for me. I basically never got hallucinations in V2, except one time at the end it added "[Translated by ...]" or something. But V3 hallucinates multiple times per 10 minutes.
-
The sample notebook LibriSpeech.ipynb cannot run with the large-v3 model.
-
Great job!!! I plan to fine-tune based on different regional accents in China. Before that, I would like to reproduce the current benchmark results. Could you please provide the test code?
-
When I evaluate whisper-large-v3 on Fleurs for "afkk, mr, ne, sw", the results I get are not consistent with the official release, and they are very different. Why?
-
large-v3 is bad: Whisper-v3 Hallucinations on Real World Data
-
In my own transcription experience with large-v3 (for Japanese), hallucinations most commonly occur in silent or music-only sections. After multiple transcriptions, I observed certain characteristics in the duration, segment_last, avg_logprob, compression_ratio, and no_speech_prob values of the transcription results. The following features may indicate hallucinated results:
And I have set several conditions that I believe can filter out hallucinated results:
Due to the limited number of tests, these filtering conditions may not be applicable to everyone, but they are provided for reference. Additionally, I have implemented the aforementioned filtering in the whisper-webui-translate Space on Hugging Face. Note 1: this Space is built on the aadnk/whisper-webui version.
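The filtering approach described above can be sketched as a simple segment filter over the per-segment statistics Whisper returns. The threshold values below are illustrative assumptions, not the commenter's exact settings:

```python
def is_likely_hallucination(segment,
                            min_avg_logprob=-1.0,
                            max_compression_ratio=2.4,
                            max_no_speech_prob=0.6):
    """Flag a Whisper segment that looks hallucinated.

    Heuristics (thresholds are illustrative): a very low avg_logprob,
    an unusually high compression_ratio (repetitive text), or a high
    no_speech_prob (text emitted over probable silence) are suspicious.
    """
    return (segment["avg_logprob"] < min_avg_logprob
            or segment["compression_ratio"] > max_compression_ratio
            or segment["no_speech_prob"] > max_no_speech_prob)


def filter_segments(segments):
    """Keep only segments that do not look hallucinated."""
    return [s for s in segments if not is_likely_hallucination(s)]
```

In practice the thresholds need tuning per language and audio domain, which is why the commenter hedges that their conditions may not apply to everyone.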
-
Hi, I was calculating the word error rates and I got around
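For anyone reproducing such numbers: word error rate is the word-level edit distance divided by the number of reference words. A minimal self-contained implementation (libraries like `jiwer` do this plus the text normalization Whisper's published numbers rely on):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference words,
    via a standard Levenshtein dynamic program over words.
    Assumes a non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that without the same normalization (lowercasing, punctuation removal, etc.) raw WER figures will not match the officially reported ones.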
-
What's the difference between "weakly labeled audio" and "pseudolabeled audio"?
-
The presence of unrelated text in transcriptions is not due to hallucination, or at least not in the sense that word carries for LLMs. It's because OpenAI uses a data crawler to extract (audio, subtitle) pairs from websites like YouTube and other video-sharing sites that provide subtitles. Contributors supply these subtitles along with notes crediting their contribution, for example 'contributed by ... community'. However, these notes aren't paired with any corresponding audio content, and they aren't removed from the transcript. Hence the end-to-end ASR model, whether trained with CTC or attention, learns to associate the silence at the end of the audio with these contributor notes. This is why, during inference, silence tends to be transcribed as contribution notes.
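One mitigation this explanation implies is scrubbing credit lines from subtitle-derived transcripts before training. A rough sketch, with a deliberately small and hypothetical pattern list (a real cleaning pass would need a much larger, multilingual one):

```python
import re

# Hypothetical patterns for common subtitle credit lines; illustrative only.
CREDIT_PATTERNS = [
    re.compile(r"subtitles?\s+(?:by|contributed by)\s+.+", re.IGNORECASE),
    re.compile(r"contributed by\s+.+?\s+community", re.IGNORECASE),
    re.compile(r"translated by\s+.+", re.IGNORECASE),
]


def strip_credit_lines(transcript: str) -> str:
    """Drop lines that look like contributor credits rather than speech."""
    kept = [line for line in transcript.splitlines()
            if not any(p.search(line) for p in CREDIT_PATTERNS)]
    return "\n".join(kept)
```

Such filtering only works when the credits are textually recognizable; misaligned ordinary subtitle text over silence would still slip through.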
-
Is large-v3 pretrained on Common Voice and Fleurs, or is it just evaluated on those datasets?
-
Hi, where is the downloaded model saved on my computer? I downloaded the large model (2.8 GB) but I don't know where it is.
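For reference, the `openai-whisper` Python package stores downloaded checkpoints under `$XDG_CACHE_HOME/whisper` when that variable is set, else `~/.cache/whisper`. That default location can be reproduced like this:

```python
import os


def default_whisper_cache_dir() -> str:
    """Mirror openai-whisper's default download location:
    $XDG_CACHE_HOME/whisper if set, otherwise ~/.cache/whisper."""
    cache_home = os.getenv(
        "XDG_CACHE_HOME",
        os.path.join(os.path.expanduser("~"), ".cache"),
    )
    return os.path.join(cache_home, "whisper")


print(default_whisper_cache_dir())
```

A different location can be chosen by passing `download_root` to `whisper.load_model`.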
-
We're pleased to announce the latest iteration of Whisper, called `large-v3`. Whisper `large-v3` has the same architecture as the previous `large` models except the following minor differences:

- The spectrogram input uses 128 Mel frequency bins instead of 80
- A new language token for Cantonese

The `large-v3` model is trained on 1 million hours of weakly labeled audio and 4 million hours of pseudolabeled audio collected using `large-v2`. The model was trained for 2.0 epochs over this mixture dataset.

The `large-v3` model shows improved performance over a wide variety of languages, and the plot below includes all languages where Whisper `large-v3` performs lower than 60% error rate on Common Voice 15 and Fleurs, showing 10% to 20% reduction of errors compared to `large-v2`. For Common Voice 15, evaluation used the `transcription` column, which contains labels that are pre-processed and normalized from the `raw_transcription` column, with two exceptions.

The `large-v3` model is available in `openai-whisper==20231106` and later. To use `large-v3`, update the Whisper package with `pip install -U openai-whisper` and load the model using the name `"large-v3"`. The name `"large"` now aliases the latest model in the series, `"large-v3"`.
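Putting the upgrade and model name together, a minimal way to try the new model from the command line (the audio file name is a placeholder):

```shell
# Upgrade to a release that knows about large-v3, then transcribe a file.
pip install -U openai-whisper
whisper audio.mp3 --model large-v3 --language en
```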