
Adding alignment constraints to supervision #378

Closed
danpovey opened this issue Aug 19, 2021 · 5 comments

@danpovey
Collaborator

Did you end up uploading a version of the alignments for the LibriSpeech training data, and can you recommend
how one might go about preparing them for use in training?
I'm thinking that for Icefall we could start training with some alignment constraints for LibriSpeech.

For @csukuangfj and @pkufool: I think I figured out how we can apply alignment constraints without decoder
changes, using mechanisms that already exist in k2. (Of course, we can add a more efficient mechanism later
if we want.) Firstly: intersect_dense already supports the option frame_idx_name, where you can specify the
name of an attribute corresponding to the frame index, e.g. frame_idx_name='frame_idx'.

We could also attach an attribute to the original FSA containing the word sequences, the attribute being the time
of each word within the segment. Let's suppose that we convert this to frames after subsampling, and let the corresponding
filler be -1. (We can do this by setting, say, fsa.times = [tensor of times] and fsa.times_filler = -1, supposing
fsa.times contains int32.) This will mean that when we construct the graph, it will have -1 on arcs that
don't have a word label. (This might require expand_ragged_attributes().) Once we attach the frame_idx to the
lattice, we can modify the lattice's scores via a torch expression that compares the frame_idx to the time of the
word as recorded in the attribute, possibly with some kind of collar to avoid constraining it too tightly.
We can compare as signed for one of the directions of comparison, and as unsigned for the other direction,
which will give us the correct behavior for arcs that don't have a word label (i.e., like the ace in cards, the -1 can
be high or low).
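For concreteness, the conversion of a word's time within the segment to a post-subsampling frame index might look like this (the frame shift and subsampling factor below are assumed example values for illustration, not anything fixed by k2 or icefall):

```python
# Sketch: convert a word's start time (in seconds) to a frame index after
# encoder subsampling. Both constants are assumptions for illustration.
FRAME_SHIFT = 0.01        # 10 ms acoustic frames
SUBSAMPLING_FACTOR = 4    # e.g. an encoder that subsamples by 4

def time_to_frame(t_seconds: float) -> int:
    """Map a time within the segment to a post-subsampling frame index."""
    return int(round(t_seconds / (FRAME_SHIFT * SUBSAMPLING_FACTOR)))

# e.g. a word starting at 0.48 s lands on post-subsampling frame 12
frame = time_to_frame(0.48)
```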

We could add, say, -10 to arcs where the time is out of bounds. That should be more than enough to get training
started.
[Or we can do 2 separate comparisons with 2 separate times, one for the beginning and one for the end of the word,
again possibly with a small collar.]
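A standalone sketch of the signed/unsigned comparison trick in plain torch (the tensor values, the collar, and the penalty magnitude are made up for illustration; this is not the k2 API):

```python
import torch

# Per-arc frame index (as would be attached via frame_idx_name) and per-arc
# word time; -1 is the filler on arcs without a word label. All values here
# are made up for illustration.
frame_idx = torch.tensor([3, 10, 12], dtype=torch.int32)
times = torch.tensor([5, -1, 6], dtype=torch.int32)
collar = 2

# Lower bound: compare as signed, so the -1 filler is "low" and always passes.
lower_ok = frame_idx >= times - collar

# Upper bound: reinterpret times as unsigned, so the -1 filler wraps to a huge
# value and always passes (the "ace high or low" behavior).
times_unsigned = times.to(torch.int64) & 0xFFFFFFFF
upper_ok = frame_idx.to(torch.int64) <= times_unsigned + collar

# Arcs whose frame index is out of bounds get a -10 added to their scores;
# here only the third arc (frame 12 vs. time 6, collar 2) is penalized.
penalty = (~(lower_ok & upper_ok)).to(torch.float32) * -10.0
```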

@csukuangfj
Contributor

This repo
https://github.com/CorentinJ/librispeech-alignments

contains word alignment information: the timestamps for the beginning and end of each word
within an utterance. Silences are also included.

Shall we make it available in K2SpeechRecognitionDataset?

@csukuangfj
Contributor

We could add, say, -10 to arcs where the time is out of bounds

Does -10 mean some log-prob here, and is the purpose to make sure frames corresponding to silences have a lower probability?

@danpovey
Collaborator Author

Yes, that's what I mean about the -10.
Sure, I think using that repo is a good idea, if it seems easy.
I had a vague recollection that Piotr had done something about this, but I don't remember for certain. Certainly there is a feature in Lhotse to read CTMs, but I don't know how to expose the information in the dataset.

@pzelasko
Collaborator

Disclosure: I just skimmed this thread -- will read more carefully later.

SupervisionSegment has an "alignment" attribute and it can be read from CTM (thanks to @desh2608). We can look at the repo you guys shared and insert the alignments from there in the supervisions.

Yes, we'll need to modify K2SpeechRecognitionDataset to expose the information. Maybe it should be a new 2D tensor of shape (num_words_in_batch, 5) where each row represents (sequence_idx, supervision_idx, start_frame, num_frames, word_id)? We can auto-detect if supervisions have word alignments and only provide it then.
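A toy instance of that proposed layout, with made-up values, might look like:

```python
import torch

# Hypothetical example of the proposed (num_words_in_batch, 5) tensor;
# each row is (sequence_idx, supervision_idx, start_frame, num_frames, word_id).
# All numbers are made up for illustration.
word_alignments = torch.tensor(
    [
        [0, 0, 0, 12, 1051],   # first word of the first supervision in cut 0
        [0, 0, 12, 8, 207],    # second word starts where the first ends
        [1, 0, 3, 10, 96],     # first word of the supervision in cut 1
    ],
    dtype=torch.int32,
)

# E.g. select all words belonging to the first sequence in the batch:
first_seq_words = word_alignments[word_alignments[:, 0] == 0]
```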

On a separate note, from experience in another project, I see that it would be very convenient to support alignments as integer sequences read from some HDF5 file (or sth else). As well as posteriors, or multiple kinds of features. I might refactor Lhotse a little bit so that Cut has potentially multiple types of "features", which could be any type of tensor and have a corresponding name. I will try to draft a proposal sometime soon so you can review before I start.

@danpovey
Collaborator Author

danpovey commented Aug 19, 2021 via email

Development

No branches or pull requests

3 participants