Report: Good Japanese ASR results with Fast Conformer #8473
Replies: 3 comments · 2 replies
-
This is excellent news! Very glad Fast Conformer achieved strong results while being much more efficient than the largest Whisper models.
-
What batch size are you using for RTF? Is the batch size 1? For RNN-T greedy decoding inference, we have noticed that the majority of the time is spent in the greedy decoder. My PR #8191, which proposes a fix by reducing the time spent waiting on the CPU in each iteration, will go in shortly. At batch size 16, for a 600 million parameter model, I see a 3.125x speedup. There is also complementary work from @artbataev in #8286 to reduce the number of decoding iterations, though we have yet to combine the two efforts into a single implementation.
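For reference, a minimal sketch of batched greedy RNN-T inference through NeMo's transcribe API; the model name and audio paths below are placeholders, not taken from the experiments in this thread:

```python
# Sketch: batched RNN-T greedy decoding via NeMo's transcribe API.
# The checkpoint name and audio paths are placeholders.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
model.eval()

audio_files = ["utt_000.wav", "utt_001.wav"]  # placeholder 16 kHz mono WAVs
# batch_size controls how many utterances the greedy decoder advances per step.
hypotheses = model.transcribe(audio_files, batch_size=16)
print(hypotheses)
```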
-
@fujimotos I was wondering whether the above Whisper results are without any fine-tuning? Do we have WER results for Whisper after fine-tuning on the ReazonSpeech v2.0 corpus?
-
Hi, we are the speech research team at Reazon Human Interaction Lab. 1
Recently, we experimented with Fast Conformer on Japanese datasets and
confirmed that it delivers excellent performance, so we would like to
share our findings. We hope this post is of interest to the NeMo team.
Method
We trained a Fast Conformer model using the NeMo framework with the
following configuration:
- Subword-based RNN-T model with 619M parameters.
- The encoder uses Longformer-style attention with a limited context size of [128, 128]. We enabled local attention from the beginning of training (a minimal sketch of this setting follows at the end of this section).
- The decoder has a vocabulary of 3,000 tokens constructed with a SentencePiece unigram tokenizer.
We trained this model on our ReazonSpeech v2.0 corpus, which provides
35,000 hours of diverse Japanese speech. 2
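Below is a minimal sketch of how Longformer-style local attention with a [128, 128] context can be enabled on a NeMo Fast Conformer model. This reflects our reading of the public NeMo API rather than the exact training script used for the model above, and the pretrained checkpoint name is only a placeholder.

```python
# Sketch: switching a Fast Conformer encoder to Longformer-style (local) attention
# with a left/right context of 128 frames each, matching the configuration above.
# The checkpoint name below is a placeholder NGC model, not our training setup.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

# NeMo exposes a helper to change the self-attention type on Conformer encoders.
model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[128, 128],
)
```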
Result (Model)
We released our model on Hugging Face under the Apache License 2.0.
https://huggingface.co/reazon-research/reazonspeech-nemo-v2
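For reference, here is a minimal sketch of loading the released checkpoint and running transcription. The .nemo filename inside the repository is an assumption on our part; please check the model card for the exact file and the recommended usage.

```python
# Sketch: download the released checkpoint from Hugging Face and transcribe a file.
# The filename "reazonspeech-nemo-v2.nemo" is assumed; see the model card for details.
from huggingface_hub import hf_hub_download
import nemo.collections.asr as nemo_asr

ckpt_path = hf_hub_download(
    repo_id="reazon-research/reazonspeech-nemo-v2",
    filename="reazonspeech-nemo-v2.nemo",  # assumed filename
)
model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(ckpt_path)
model.eval()

# "speech.wav" is a placeholder path to a 16 kHz mono recording.
print(model.transcribe(["speech.wav"]))
```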
Discussion
1) Inference Speed vs Accuracy
This graph compares the accuracy and inference speed of Japanese ASR models.
The x-axis represents inference speed, expressed as real-time factor.
The y-axis represents the character error rate on the JSUT-book dataset
(1 hour of Japanese read speech).
As you can see, the Fast Conformer model achieves better accuracy
than Whisper v1/2/3 while being as fast as Whisper Tiny.
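For clarity, real-time factor here follows the usual convention of processing time divided by audio duration, so lower means faster than real time. A minimal sketch of how a single-utterance RTF measurement could be done; the model name and audio path are placeholders:

```python
# Sketch: measuring real-time factor (RTF) for one utterance.
# RTF = processing_time / audio_duration; lower means faster than real time.
import time
import soundfile as sf
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")  # placeholder model
model.eval()

audio_path = "sample.wav"  # placeholder 16 kHz mono WAV
audio, sample_rate = sf.read(audio_path)
audio_duration = len(audio) / sample_rate

start = time.perf_counter()
model.transcribe([audio_path], batch_size=1)
elapsed = time.perf_counter() - start

print(f"RTF = {elapsed / audio_duration:.3f}")
```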
2) Robustness
Here is another graph that compares Japanese ASR models on public corpora:
As shown, the Fast Conformer model demonstrates robust performance across multiple datasets.
Footnotes
A research division of Reazon Holdings, a Tokyo-based infotech company. ↩
For more details about this dataset, refer to our paper from last year. ↩