Report: Good Japanese ASR results with Fast Conformer #8473
Replies: 3 comments · 2 replies
-
This is excellent news! Very glad Fast Conformer achieved strong results while being much more efficient than the largest Whisper models.
-
What batch size are you using for RTF? Is the batch size 1? For RNN-T greedy decoding inference, we have noticed that the majority of the time is spent in the greedy decoder. My PR #8191, which proposes a fix by reducing the time spent waiting on the CPU in each iteration, will go in shortly. At batch size 16, for a 600 million parameter model, I see a 3.125x speedup. There is also complementary work from @artbataev in #8286 to reduce the number of decoding iterations, though we have yet to combine the two efforts into a single implementation.
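For reference, a minimal sketch of batched greedy RNN-T inference through NeMo's transcribe API; the model name and audio paths below are placeholders, not taken from the experiments in this thread:

```python
# Sketch: batched RNN-T greedy decoding via NeMo's transcribe API.
# The checkpoint name and audio paths are placeholders.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
model.eval()

audio_files = ["utt_000.wav", "utt_001.wav"]  # placeholder 16 kHz mono WAVs
# batch_size controls how many utterances the greedy decoder advances per step.
hypotheses = model.transcribe(audio_files, batch_size=16)
print(hypotheses)
```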
-
@fujimotos I was wondering whether the above Whisper results are without any fine-tuning? Do we have WER results for Whisper after fine-tuning on the ReazonSpeech v2.0 corpus?
-
Hi, we are the speech research team at Reazon Human Interaction Lab. 1
Recently, we experimented with Fast Conformer on Japanese datasets and
confirmed that it delivers excellent performance, so we would like to
share our findings. We hope this post is of interest to the NeMo team.
Method
We trained a Fast Conformer model using the NeMo framework with the
following configuration:
- Subword-based RNN-T model with 619M parameters.
- The encoder uses Longformer-style attention with a limited context size of [128, 128]. We enabled local attention from the beginning of training (a minimal sketch of this setting follows at the end of this section).
- The decoder has a vocabulary of 3,000 tokens constructed with a SentencePiece unigram tokenizer.
We trained this model on our ReazonSpeech v2.0 corpus, which provides
35,000 hours of diverse Japanese speech. 2
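Below is a minimal sketch of how Longformer-style local attention with a [128, 128] context can be enabled on a NeMo Fast Conformer model. This reflects our reading of the public NeMo API rather than the exact training script used for the model above, and the pretrained checkpoint name is only a placeholder.

```python
# Sketch: switching a Fast Conformer encoder to Longformer-style (local) attention
# with a left/right context of 128 frames each, matching the configuration above.
# The checkpoint name below is a placeholder NGC model, not our training setup.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

# NeMo exposes a helper to change the self-attention type on Conformer encoders.
model.change_attention_model(
    self_attention_model="rel_pos_local_attn",
    att_context_size=[128, 128],
)
```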
Result (Model)
We released our model on Hugging Face under the Apache License 2.0.
https://huggingface.co/reazon-research/reazonspeech-nemo-v2
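For reference, here is a minimal sketch of loading the released checkpoint and running transcription. The .nemo filename inside the repository is an assumption on our part; please check the model card for the exact file and the recommended usage.

```python
# Sketch: download the released checkpoint from Hugging Face and transcribe a file.
# The filename "reazonspeech-nemo-v2.nemo" is assumed; see the model card for details.
from huggingface_hub import hf_hub_download
import nemo.collections.asr as nemo_asr

ckpt_path = hf_hub_download(
    repo_id="reazon-research/reazonspeech-nemo-v2",
    filename="reazonspeech-nemo-v2.nemo",  # assumed filename
)
model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(ckpt_path)
model.eval()

# "speech.wav" is a placeholder path to a 16 kHz mono recording.
print(model.transcribe(["speech.wav"]))
```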
Discussion
1) Inference Speed vs Accuracy
This graph compares the accuracy and inference speed of Japanese ASR models.
The x-axis represents inference speed, expressed as real-time factor.
The y-axis represents the character error rate on the JSUT-book dataset
(1 hour of Japanese read speech).
As you can see, the Fast Conformer model achieves better accuracy
than Whisper v1/2/3 while being as fast as Whisper Tiny.
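For clarity, real-time factor here follows the usual convention of processing time divided by audio duration, so lower means faster than real time. A minimal sketch of how a single-utterance RTF measurement could be done; the model name and audio path are placeholders:

```python
# Sketch: measuring real-time factor (RTF) for one utterance.
# RTF = processing_time / audio_duration; lower means faster than real time.
import time
import soundfile as sf
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")  # placeholder model
model.eval()

audio_path = "sample.wav"  # placeholder 16 kHz mono WAV
audio, sample_rate = sf.read(audio_path)
audio_duration = len(audio) / sample_rate

start = time.perf_counter()
model.transcribe([audio_path], batch_size=1)
elapsed = time.perf_counter() - start

print(f"RTF = {elapsed / audio_duration:.3f}")
```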
2) Robustness
Here is another graph that compares Japanese ASR models on public corpora:
As shown, the Fast Conformer model demonstrates robust performance across multiple datasets.
Footnotes
A research division of Reazon Holdings, a Tokyo-based infotech company. ↩
For more details about this dataset, refer to our paper from last year. ↩