alignment using ctc_segmentation fails with Hybrid RNNT-CTC models #8750

Jesteinbe · 2024-03-26T12:58:34Z

Following the steps outlined in CTC_Segmentation_Tutorial.ipynb, I'm trying to align text and audio. If I use a CTC-only model like stt_en_fastconformer_ctc_large, things work fine. However, if I try to use an EncDecHybridRNNTCTCBPEModel like stt_en_fastconformer_hybrid_large_pc, things break. As far as I can tell, there are at least two problems.

The first issue is that a hybrid model's vocabulary lives at mdl.cfg.aux_ctc.decoder.vocabulary, not at mdl.cfg.decoder.vocabulary, which is what prepare_data.py and run_ctc_segmentation.py use. Once I fixed this, prepare_data.py seems to work fine, but the segmentation still fails.
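A minimal sketch of the vocabulary lookup described above, for anyone patching the scripts locally. The helper name get_ctc_vocabulary is hypothetical, and the configs below are plain namespaces standing in for real model configs; the only assumption taken from this thread is that hybrid configs expose aux_ctc.decoder.vocabulary while CTC-only configs expose decoder.vocabulary.

```python
from types import SimpleNamespace


def get_ctc_vocabulary(cfg):
    """Return the CTC vocabulary from an attribute-style model config.

    Hypothetical helper: prefers cfg.aux_ctc.decoder.vocabulary (hybrid
    RNNT-CTC layout) and falls back to cfg.decoder.vocabulary (CTC-only).
    """
    aux_ctc = getattr(cfg, "aux_ctc", None)
    if aux_ctc is not None:
        return list(aux_ctc.decoder.vocabulary)
    return list(cfg.decoder.vocabulary)


# Stand-in configs for illustration only (not real NeMo configs).
ctc_cfg = SimpleNamespace(decoder=SimpleNamespace(vocabulary=["a", "b"]))
hybrid_cfg = SimpleNamespace(
    decoder=SimpleNamespace(),  # stand-in for the RNNT decoder section
    aux_ctc=SimpleNamespace(decoder=SimpleNamespace(vocabulary=["c", "d"])),
)

print(get_ctc_vocabulary(ctc_cfg))
print(get_ctc_vocabulary(hybrid_cfg))
```

In the scripts, the same check would replace the direct read of mdl.cfg.decoder.vocabulary.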

I'm not sure what the second issue is yet, but the alignment fails and I just get a generic error like:

INFO:root:Processing 1st_call.wav...
Transcribing: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.21s/it]
ERROR:root:list indices must be integers or slices, not tuple
ERROR:root:Skipping 1st_call.wav

I'm guessing the call to the forward pass of the ASR decoder is different for the hybrid RNNT-CTC models.
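For context, the message in the log is the TypeError Python raises when a plain list is indexed with a tuple, as happens with array-style indexing such as x[:, 0]. The snippet below only reproduces the message with a made-up nested list. It is an assumption, not something established in this thread, that the script hits the same situation by receiving a list where it expects an array.

```python
# Made-up nested list standing in for per-frame scores (illustration only).
log_probs = [[-0.1, -2.3], [-1.2, -0.4]]

try:
    # Array-style indexing passes the tuple (slice(None), 0) to list.__getitem__.
    log_probs[:, 0]
except TypeError as err:
    message = str(err)

print(message)  # list indices must be integers or slices, not tuple
```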

As for expected behavior, ctc_segmentation should work regardless of which CTC-based model you use. The bug shouldn't have anything to do with my environment, but I'm running on bare metal (Ubuntu 20.04) using a conda environment in which I installed NeMo v1.22.0 via pip.

nithinraok · 2024-04-04T16:17:38Z

@erastorgueva-nv can you look at this issue pls.

erastorgueva-nv · 2024-04-04T22:00:49Z

Hi @Jesteinbe, thanks for bringing this to our attention. You are correct: unfortunately, NeMo CTC Segmentation does not currently support Hybrid models.

Your change to use mdl.cfg.aux_ctc.decoder.vocabulary is a reasonable partial fix for this. The remaining issue, the "list indices must be integers or slices, not tuple" error, can be fixed if you add the following lines in run_ctc_segmentation.py after asr_model is instantiated:

    # Switch hybrid models to their CTC decoder before running segmentation.
    if isinstance(asr_model, nemo_asr.models.EncDecHybridRNNTCTCModel):
        asr_model.change_decoding_strategy(decoder_type="ctc")

Jesteinbe · 2024-04-05T15:01:17Z

Thanks @erastorgueva-nv ! I figured that out too.

I'm also seeing that I get the same results no matter what window size I use when calling run_ctc_segmentation.py. I've tried 10 (which I thought should result in bad alignment, if not break it), 8000, 16000, 32000, and 48000. I am using single-channel audio with limited amounts of silence, so I didn't expect a bigger window size to be necessary, but I was surprised to find identical results. I saw this behavior on two recordings, one English and the other Spanish, each about 5 minutes long. It seems like the window size isn't getting updated properly, even though the log/segment file names all show the expected values. I haven't tested on a larger amount of audio yet, so this could be an anomaly, but I thought it was worth mentioning.
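One way to confirm that runs with different window sizes really produce identical output is to group the segment files by content hash. This is a rough sketch using only the standard library; the function names and the file paths in the usage comment are hypothetical, and it assumes each run writes its segments to a separate file.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def group_identical_files(paths):
    """Group paths by content; a single group means all files are byte-identical."""
    groups = {}
    for path in paths:
        groups.setdefault(file_digest(path), []).append(str(path))
    return list(groups.values())


# Hypothetical usage, one segment file per window size:
# groups = group_identical_files([
#     "out_w8000/1st_call_segments.txt",
#     "out_w16000/1st_call_segments.txt",
# ])
# print(len(groups))  # 1 would mean the window size made no difference
```

If the segment files embed the window size or a timestamp in their contents, a byte-level comparison will report differences even when the alignments match, so comparing only the timing columns may be more reliable in that case.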

github-actions · 2024-05-06T01:45:46Z

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

nithinraok · 2024-05-08T18:15:38Z

@erastorgueva-nv could we close this issue?

Jesteinbe added the bug label Mar 26, 2024

nithinraok assigned erastorgueva-nv Apr 4, 2024

erastorgueva-nv mentioned this issue Apr 4, 2024

Enable using hybrid asr models in CTC Segmentation tool #8828

Merged

github-actions bot added the stale label May 6, 2024

erastorgueva-nv closed this as completed May 8, 2024
