tests : add WER benchmarks #2454
Hi Grigory, perhaps we can use LibriSpeech for measuring long audio (roughly 1000 hours, though we could trim it to fit the requirements). For short audio, we can use Libri-Light. Alternatively, there are other audio datasets for measuring WER: https://github.com/jim-schwoebel/voice_datasets
I could start making small sample scripts to see how whisper.cpp fares on these datasets.
Thanks. Yes, I'm not sure what is typically used, but in general I think any dataset would work. The main goal here is not to compare against other implementations, but to catch regressions. Ideally, we can have scripts that perform heavier benchmarks that developers would use locally. But we also need a mode where the scripts run just a few fast benchmarks that can be added to the CI without overloading it, so that these would be computed on every commit.
@harvestingmoon are you working on this?
@foldl hi, yes, I'm looking at it. Most likely I'll start after the 12th, as it's currently the Chinese New Year period...
I think we need a tiny dataset (~10 MB) just contained in this repo. WER can then be measured on-the-fly.
Sorry, please ignore the WER calculation above. I will develop another script since the calculations are completely off from what they should be. I will also look for a smaller, lightweight dataset so that audio can be measured on the fly.
I have created a better and more robust lightweight script that meets the requirements, @foldl @ggerganov. The measured WER is 0.3 and it uses a lightweight dataset. My script calculates the WER for each individual audio file as well as the overall average. Here is the pull request: #2824. Link for reference: https://huggingface.co/learn/audio-course/en/chapter5/evaluation
The pull request contains the script as well as the full ~10 MB dataset, making it fairly lightweight for measuring on the fly as well.
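For reference, the metric described in the linked evaluation guide reduces to word-level edit-distance counts over the reference transcript. In my own notation (not taken from the PR):

$$\mathrm{WER} = \frac{S + D + I}{N}$$

where $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words and $N$ is the number of words in the reference. Note that an "overall average" can mean either the mean of the per-file WERs or the pooled ratio $\sum_f (S_f + D_f + I_f) / \sum_f N_f$ over the whole dataset; the two differ when file lengths vary, so it is worth stating which one a script reports.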
Hi @harvestingmoon, thank you for the effort, but I expect more attention to detail. Will close the PR for now and let someone else give this a try.
Shouldn't the very first step be to add a minimal edit-distance implementation (used to compute WER/TER), perhaps as header-only source code, to measure it?
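For illustration, here is a minimal header-only sketch of such a word-level edit-distance and WER helper. The file name, function names, and the bare whitespace tokenization (no normalization of casing or punctuation) are my own assumptions, not existing whisper.cpp code:

```cpp
// wer.hpp - hypothetical header-only WER helper (illustration only).
#pragma once

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Split a transcript into whitespace-separated words (no normalization).
inline std::vector<std::string> split_words(const std::string & text) {
    std::istringstream iss(text);
    return { std::istream_iterator<std::string>(iss),
             std::istream_iterator<std::string>() };
}

// Word-level Levenshtein distance (substitution/insertion/deletion cost 1),
// computed with two rolling rows of the DP table.
inline size_t edit_distance(const std::vector<std::string> & ref,
                            const std::vector<std::string> & hyp) {
    std::vector<size_t> prev(hyp.size() + 1), cur(hyp.size() + 1);
    for (size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= ref.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= hyp.size(); ++j) {
            const size_t sub = prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            cur[j] = std::min({ sub, prev[j] + 1, cur[j - 1] + 1 });
        }
        std::swap(prev, cur);
    }
    return prev[hyp.size()];
}

// WER = edit distance / number of reference words.
inline double word_error_rate(const std::string & reference,
                              const std::string & hypothesis) {
    const auto ref = split_words(reference);
    const auto hyp = split_words(hypothesis);
    if (ref.empty()) {
        return hyp.empty() ? 0.0 : 1.0;
    }
    return double(edit_distance(ref, hyp)) / double(ref.size());
}
```

Because the distance is computed over generic string tokens, the same routine could be reused for other token granularities, and the header-only form keeps it trivial to include from both a test binary and a benchmark tool.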
@ggerganov I'm not sure there is a reasonably sized dataset containing short audio, long audio, English, and non-English content. What do you think about an approach like this? It would be lightweight for each commit. We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or a consistent audio duration.
Yes, sounds good. The CI should download the audio files.
Yes, a much bigger dataset for local testing would be useful.
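To make the local/CI split concrete, here is a rough sketch (not the actual benchmark code from any PR) of how a per-file measurement could be glued to the whisper.cpp API. It assumes the hypothetical `wer.hpp` helper sketched above and that the audio has already been decoded to 16 kHz mono float PCM, which the existing examples handle during WAV loading:

```cpp
// bench_wer_sketch.cpp - illustration only.
#include <cstdio>
#include <string>
#include <vector>

#include "whisper.h"
#include "wer.hpp" // hypothetical helper from the sketch above

// Transcribe one clip and score it against the reference transcript.
static double transcribe_and_score(struct whisper_context * ctx,
                                   const std::vector<float> & pcm16khz,
                                   const std::string & reference) {
    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.print_progress = false;

    if (whisper_full(ctx, params, pcm16khz.data(), (int) pcm16khz.size()) != 0) {
        return 1.0; // count a failed run as a fully wrong hypothesis
    }

    std::string hypothesis;
    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        hypothesis += whisper_full_get_segment_text(ctx, i);
    }
    return word_error_rate(reference, hypothesis);
}

int main() {
    struct whisper_context * ctx = whisper_init_from_file_with_params(
            "models/ggml-base.en.bin", whisper_context_default_params());
    if (!ctx) {
        return 1;
    }

    // A real script would loop over the downloaded dataset here; a CI mode
    // could simply restrict the loop to a handful of short files.
    std::vector<float> pcm16khz;               // fill from a 16 kHz mono WAV
    const std::string reference = "expected transcript of the clip";
    printf("WER: %.3f\n", transcribe_and_score(ctx, pcm16khz, reference));

    whisper_free(ctx);
    return 0;
}
```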
@ggerganov Hi. I have been working on this ticket for a while. Here is a summary of my measurement results:
Comparison with OpenAI Whisper

To illustrate the result shown above, the following table compares whisper.cpp's accuracy against OpenAI's reference Whisper implementation. In short, the performance was pretty much comparable!
How I performed the benchmark test

I submitted the code I wrote for the benchmark test in PR #2999. The testing process is fairly automated (using the power of Makefile). Please tell me if anything is unclear! I hope it's interesting for you.
@fujimotos Thank you, this is very interesting! Will be taking a look in the next few days.
@ggerganov Thank you!

Technical note: how long it took to perform the full benchmark

This time, I rented an EC2 c8g.xlarge instance from AWS to perform the benchmark. It took roughly 80 hours to benchmark all eight model sizes.
Observation: tradeoff between speed and accuracy

Looking at this from a different angle, I think it confirms the existence of a tradeoff between transcription speed and accuracy. The following graph should illustrate the relationship:
It would be interesting to perform these benchmarks with Flash Attention enabled and with quantized models.
Here are some results on M2 Ultra with Flash Attention enabled:
The timings might be a bit off because I was using the computer while the computations were running. But overall, there is no degradation in quality when going to the Q8 models, which is expected, but good to confirm.
It would be nice to start measuring the word error rate (WER) of whisper.cpp across some representative datasets. This will help us catch regressions in the future. I'm not familiar with what is typically used for speech-to-text WER benchmarks, so I'm looking for help from the community.