Add Consistency-Regularized CTC #1766
Conversation
Results on the LibriSpeech dataset, compared with Zipformer, without using an external language model:
Could you update RESULTS.md to include the URLs for the checkpoints and training logs of your PR?
Sure. Will do it later.
```diff
@@ -950,7 +943,6 @@ def compute_loss(
     spec_augment=spec_augment,
     supervision_segments=supervision_segments,
     time_warp_factor=params.spec_aug_time_warp_factor,
```
Cannot find the definition of spec_aug_time_warp_factor.
It is defined in zipformer/asr_datamodule.py
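For context, here is a minimal sketch of how such an option is typically registered in asr_datamodule.py; the default value and help text below are assumptions, so check the actual file for the authoritative definition:

```python
import argparse

def add_spec_aug_arguments(parser: argparse.ArgumentParser) -> None:
    # Sketch only: the real definition lives in zipformer/asr_datamodule.py,
    # and its default and help text may differ.
    parser.add_argument(
        "--spec-aug-time-warp-factor",
        type=int,
        default=80,  # assumed default
        help="Time-warp factor for SpecAugment; a value < 1 disables time warping.",
    )
```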
An example training script using 4 × 32 GB V100 GPUs:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./zipformer/train.py \
  --world-size 4 \
  --num-epochs 50 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp-cr-loss-scale-0.2-time-mask-ratio-2.5 \
  --use-cr-ctc 1 \
  --use-ctc 1 \
  --use-transducer 0 \
  --use-attention-decoder 0 \
  --enable-spec-aug 0 \
  --cr-loss-scale 0.2 \
  --time-mask-ratio 2.5 \
  --full-libri 1 \
  --max-duration 700 \
  --master-port 12345
```
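A note on the flags (my reading of the recipe, so please verify against train.py and model.py): `--enable-spec-aug 0` disables the usual dataloader-side SpecAugment because, with `--use-cr-ctc 1`, each of the two views is augmented independently, using a heavier amount of time masking controlled by `--time-mask-ratio 2.5`; the consistency term is then weighted by `--cr-loss-scale 0.2` when added to the CTC loss.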
I have uploaded the checkpoints and updated RESULTS.md.
LGTM
I did some finetuning experiments.
Results on GigaSpeech:
Finetuned results on LibriSpeech:
The results show that CR-CTC could be a good choice for pretraining.
First of all, I would like to express my deepest gratitude for sharing your invaluable code and paper. They have been immensely helpful in my research. While reading through your paper and exploring the code, I encountered a question concerning the batch_size setting, and I would appreciate your insights. In your paper, you mention that "As CR-CTC requires two forward pass during training, we train CR-CTC models with half the batch size and half the number of epochs compared to CTC models, ensuring a fair comparison in terms of training cost". However, in the model.py file, I noticed that the forward function scales the ctc_loss and transducer_loss by 0.5. I wonder whether I still need to adjust the batch_size (max_duration) setting on top of that? Once again, thank you for your hard work and generous sharing!
For example, if you use a max-duration of 1400 for standard CTC, you could use a max-duration of 700 for CR-CTC. It will create two copies of the batch and then concatenate them along the batch dimension. The reason we scale the loss values by 0.5 is to keep the logged loss values comparable to other setups (without CR-CTC), since we get info["frames"] in train.py (before the batch is duplicated) and normalize the loss values by it before printing. You could refer to the script examples in RESULTS.md.
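To make the duplicate-then-halve logic concrete, here is a minimal, hypothetical sketch; the names `spec_augment`, `encode`, and `ctc_loss_fn` are placeholders, not the actual icefall API, and the real code lives in zipformer/model.py and train.py:

```python
import torch

def cr_ctc_style_loss(feats, feat_lens, targets, spec_augment, encode, ctc_loss_fn):
    # Two independently augmented views of the same utterances,
    # concatenated along the batch dimension (so the effective batch doubles).
    feats = torch.cat([spec_augment(feats.clone()), spec_augment(feats.clone())], dim=0)
    feat_lens = torch.cat([feat_lens, feat_lens], dim=0)
    targets = targets + targets  # label sequences repeated for the second view

    log_probs = encode(feats, feat_lens)  # frame-level CTC log-posteriors for both views
    ctc_loss = ctc_loss_fn(log_probs, feat_lens, targets)  # summed over 2x the frames

    # Scale by 0.5 so the logged value stays comparable to a non-CR-CTC run:
    # train.py records info["frames"] before duplication and normalizes by it.
    return 0.5 * ctc_loss
```

With this layout, using half the max-duration of a plain CTC run keeps per-step memory and compute roughly matched.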
Are there any results on streaming ASR? My experiments on streaming ASR using CTC do not seem to be working: the CTC loss gets worse while the CR loss gets better, and the WER gets worse.
This PR implements Consistency-Regularized CTC (CR-CTC) from https://arxiv.org/pdf/2410.05101,
which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram. It significantly improves CTC performance, and can also serve as an auxiliary loss to boost the performance of transducer or CTC/AED models. Please see the paper for more details.
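For readers skimming the thread, here is a minimal, hypothetical sketch of the consistency term itself. It assumes the doubled-batch layout described above (first half of the batch is view 1, second half is view 2) and a symmetric KL divergence between the two frame-level CTC posteriors with a stop-gradient on each target distribution; the authoritative implementation is in zipformer/model.py and may differ in details such as masking of padded frames:

```python
import torch
import torch.nn.functional as F

def consistency_loss(ctc_log_probs: torch.Tensor) -> torch.Tensor:
    """ctc_log_probs: (2*N, T, V) CTC log-posteriors for the two augmented views."""
    p1, p2 = ctc_log_probs.chunk(2, dim=0)

    # Symmetric KL between the per-frame distributions of the two views;
    # each direction detaches (stops gradients through) its target distribution.
    kl_1_to_2 = F.kl_div(p2, p1.detach(), log_target=True, reduction="sum")
    kl_2_to_1 = F.kl_div(p1, p2.detach(), log_target=True, reduction="sum")
    return 0.5 * (kl_1_to_2 + kl_2_to_1)
```

In the training command above, this term would then be scaled by --cr-loss-scale (0.2 in the example) and added to the CTC loss.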