
Adding alignment constraints to supervision #378

Closed
danpovey opened this issue Aug 19, 2021 · 5 comments

@danpovey
Collaborator

Did you end up uploading a version of the alignments for the LibriSpeech training data, and can you recommend
how one might go about preparing them for use in training?
I'm thinking that for Icefall we could start training with some alignment constraints for LibriSpeech.

For @csukuangfj and @pkufool: I think I figured out how we can apply alignment constraints without decoder
changes, using mechanisms that already exist in k2. (Of course, we can add a more efficient mechanism later
if we want.) Firstly: intersect_dense already supports the option frame_idx_name, where you can specify the
name of an attribute corresponding to the frame index, e.g. frame_idx_name='frame_idx'.

We could also attach an attribute to the original FSA containing the word sequences, the attribute being the time
of each word within the segment. Let's suppose that we convert this to frames after subsampling, and let the corresponding
filler be -1. (We can do this by setting, say, fsa.times = [tensor of times] and fsa.times_filler = -1, supposing
fsa.times contains int32.) This will mean that when we construct the graph, it will have -1 on arcs that
don't have a word label. (This might require expand_ragged_attributes().) Once we attach the frame_idx to the
lattice, we can modify the lattice's scores via a torch expression that compares the frame_idx to the time of the
word as recorded in the attribute, possibly with some kind of collar to avoid constraining it too tightly.
We can compare as signed for one of the directions of comparison, and as unsigned for the other direction,
which will give us the correct behavior for arcs that don't have a word label (i.e., like the ace in cards, the -1 can
be high or low).
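For concreteness, the conversion of a word's time within the segment to a post-subsampling frame index might look like this (the frame shift and subsampling factor below are assumed example values for illustration, not anything fixed by k2 or icefall):

```python
# Sketch: convert a word's start time (in seconds) to a frame index after
# encoder subsampling. Both constants are assumptions for illustration.
FRAME_SHIFT = 0.01        # 10 ms acoustic frames
SUBSAMPLING_FACTOR = 4    # e.g. an encoder that subsamples by 4

def time_to_frame(t_seconds: float) -> int:
    """Map a time within the segment to a post-subsampling frame index."""
    return int(round(t_seconds / (FRAME_SHIFT * SUBSAMPLING_FACTOR)))

# e.g. a word starting at 0.48 s lands on post-subsampling frame 12
frame = time_to_frame(0.48)
```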

We could add, say, -10 to arcs where the time is out of bounds. That should be more than enough to get training
started.
[Or we can do 2 separate comparisons with 2 separate times, one for the beginning and one for the end of the word,
again possibly with a small collar.]
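A standalone sketch of the signed/unsigned comparison trick in plain torch (the tensor values, the collar, and the penalty magnitude are made up for illustration; this is not the k2 API):

```python
import torch

# Per-arc frame index (as would be attached via frame_idx_name) and per-arc
# word time; -1 is the filler on arcs without a word label. All values here
# are made up for illustration.
frame_idx = torch.tensor([3, 10, 12], dtype=torch.int32)
times = torch.tensor([5, -1, 6], dtype=torch.int32)
collar = 2

# Lower bound: compare as signed, so the -1 filler is "low" and always passes.
lower_ok = frame_idx >= times - collar

# Upper bound: reinterpret times as unsigned, so the -1 filler wraps to a huge
# value and always passes (the "ace high or low" behavior).
times_unsigned = times.to(torch.int64) & 0xFFFFFFFF
upper_ok = frame_idx.to(torch.int64) <= times_unsigned + collar

# Arcs whose frame index is out of bounds get a -10 added to their scores;
# here only the third arc (frame 12 vs. time 6, collar 2) is penalized.
penalty = (~(lower_ok & upper_ok)).to(torch.float32) * -10.0
```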

@csukuangfj
Contributor

This repo
https://github.com/CorentinJ/librispeech-alignments

contains word alignment information: the timestamps for the beginning and end of each word
within an utterance. Silences are also included.

Shall we make it available in K2SpeechRecognitionDataset?

@csukuangfj
Contributor

We could add, say, -10 to arcs where the time is out of bounds

Does -10 mean some log-prob here, and is the purpose to make sure frames corresponding to silences have a lower probability?

@danpovey
Collaborator Author

Yes, that's what I mean about the -10.
Sure, I think using that repo is a good idea, if it seems easy.
I had a vague recollection that Piotr had done something about this, but I don't remember for certain. Certainly there is a feature in Lhotse to read CTMs, but I don't know how to expose the information in the dataset.

@pzelasko
Collaborator

Disclosure: I just skimmed this thread -- will read more carefully later.

SupervisionSegment has an "alignment" attribute and it can be read from CTM (thanks to @desh2608). We can look at the repo you guys shared and insert the alignments from there in the supervisions.

Yes, we'll need to modify K2SpeechRecognitionDataset to expose the information. Maybe it should be a new 2D tensor of shape (num_words_in_batch, 5) where each row represents (sequence_idx, supervision_idx, start_frame, num_frames, word_id)? We can auto-detect if supervisions have word alignments and only provide it then.
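A toy instance of that proposed layout, with made-up values, might look like:

```python
import torch

# Hypothetical example of the proposed (num_words_in_batch, 5) tensor;
# each row is (sequence_idx, supervision_idx, start_frame, num_frames, word_id).
# All numbers are made up for illustration.
word_alignments = torch.tensor(
    [
        [0, 0, 0, 12, 1051],   # first word of the first supervision in cut 0
        [0, 0, 12, 8, 207],    # second word starts where the first ends
        [1, 0, 3, 10, 96],     # first word of the supervision in cut 1
    ],
    dtype=torch.int32,
)

# E.g. select all words belonging to the first sequence in the batch:
first_seq_words = word_alignments[word_alignments[:, 0] == 0]
```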

On a separate note, from experience in another project, I see that it would be very convenient to support alignments as integer sequences read from some HDF5 file (or sth else). As well as posteriors, or multiple kinds of features. I might refactor Lhotse a little bit so that Cut has potentially multiple types of "features", which could be any type of tensor and have a corresponding name. I will try to draft a proposal sometime soon so you can review before I start.

@danpovey
Collaborator Author

danpovey commented Aug 19, 2021 via email

Development

No branches or pull requests

3 participants