Adding alignment constraints to supervision #378
This repo contains word alignment information: it has timestamps for the beginning and end of each word. Shall we make it available in …
Does -10 mean some …
Yes, that's what I mean about the -10.
Disclosure: I just skimmed this thread -- will read more carefully later. `SupervisionSegment` has an "alignment" attribute and it can be read from CTM (thanks to @desh2608). We can look at the repo you guys shared and insert the alignments from there into the supervisions.
Yes, we'll need to modify `K2SpeechRecognitionDataset` to expose the information. Maybe it should be a new 2D tensor of shape (num_words_in_batch, 5) where each row represents (sequence_idx, supervision_idx, start_frame, num_frames, word_id)? We can auto-detect whether supervisions have word alignments and only provide it then.
On a separate note, from experience in another project, I see that it would be very convenient to support alignments as integer sequences read from some HDF5 file (or something else). As well as posteriors, or multiple kinds of features. I might refactor Lhotse a little bit so that `Cut` has potentially multiple types of "features", which could be any type of tensor with a corresponding name. I will try to draft a proposal sometime soon so you can review before I start.
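A hypothetical sketch of how that tensor might be assembled from per-supervision word alignments (the `batch_cuts` and `word2id` inputs and the 0.01 s frame shift are assumptions for illustration, not actual Lhotse API):

```python
import torch

frame_shift = 0.01  # seconds per feature frame (assumed)

def make_word_alignment_tensor(batch_cuts, word2id):
    """Builds the proposed (num_words_in_batch, 5) tensor of
    (sequence_idx, supervision_idx, start_frame, num_frames, word_id)."""
    rows = []
    for seq_idx, cut in enumerate(batch_cuts):
        for sup_idx, sup in enumerate(cut.supervisions):
            ali = getattr(sup, "alignment", None)
            if not ali or "word" not in ali:
                continue  # auto-detect: emit rows only when word alignments exist
            for item in ali["word"]:  # alignment items carry symbol, start, duration
                rows.append([
                    seq_idx,
                    sup_idx,
                    int(item.start / frame_shift),
                    int(item.duration / frame_shift),
                    word2id[item.symbol],
                ])
    return torch.tensor(rows, dtype=torch.int32)
```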
It would be easiest to use, I think, if the words had a 'begin_frame' and 'end_frame' (or just a single frame index) and these were prepared with the same shape as the words themselves -- not sure if that becomes a list of lists of int at some point?
I assume they'd be floating-point times in seconds at the point we get them from Lhotse, as we need to set the frame rate. Of course we can easily change formats; I'm just saying what we will eventually need.
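For illustration only, the seconds-to-frames conversion this implies, with an assumed frame shift and subsampling factor:

```python
# Assumed values for the sketch: 10 ms features and 4x model subsampling.
frame_shift = 0.01
subsampling_factor = 4

def seconds_to_frames(t: float) -> int:
    # Floating-point time in seconds -> frame index after subsampling.
    return int(t / (frame_shift * subsampling_factor))

begin_frame = seconds_to_frames(0.48)  # -> 12
end_frame = seconds_to_frames(0.91)    # -> 22
```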
Did you end up uploading a version of the alignments for the Librispeech training data, and can you recommend how one might go about preparing them for use in training?
I'm thinking that for Icefall we can start training with some alignment constraints for Librispeech.
For @csukuangfj and @pkufool: I think I figured out how we can apply alignment constraints without decoder changes, using mechanisms that already exist in k2. (Of course we can add a more efficient mechanism later if we want.)
Firstly: `intersect_dense` already supports the option `frame_idx_name`, where you can specify the name of an attribute corresponding to the frame index, e.g. `frame_idx_name='frame_idx'`.
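A minimal sketch of how that option might be used (the graph, the `DenseFsaVec`, and the beam value are placeholders):

```python
import k2

# graph: the training/decoding graph (a k2.Fsa or FsaVec);
# dense_fsa_vec: a k2.DenseFsaVec built from the network's log-probs.
lattice = k2.intersect_dense(
    graph,
    dense_fsa_vec,
    output_beam=10.0,           # placeholder beam value
    frame_idx_name='frame_idx', # each lattice arc gets a lattice.frame_idx entry
)
```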
We could also attach an attribute to the original FSA containing the word sequences, the attribute being the time of the word within the segment. Let's suppose that we convert this to frames after subsampling, and let the corresponding filler be -1. (We can do this by setting, say, `fsa.times = [tensor of times]` and `fsa.times_filler = -1`, supposing `fsa.times` contains int32.) This means that when we construct the graph, it will have -1 on arcs that don't have a word on them. (Might require `expand_ragged_attributes()`.)
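For concreteness, a sketch of attaching such an attribute, assuming a linear word FSA and the filler convention described above (`word_ids` and `word_frames` are hypothetical inputs):

```python
import torch
import k2

# Hypothetical inputs for the sketch:
# word_ids: the word sequence as integer IDs;
# word_frames: each word's time within the segment, in subsampled frames.
word_ids = [15, 203, 87]
word_frames = [3, 12, 22]

fsa = k2.linear_fsa(word_ids)
# One attribute value per arc; linear_fsa adds a final arc beyond the word
# arcs, so pad with the filler value for that arc.
fsa.times = torch.tensor(word_frames + [-1], dtype=torch.int32)
fsa.times_filler = -1  # arcs without a word get -1 during graph construction
```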
Once we attach the `frame_idx` to the lattice, we can modify the lattice's scores via a torch expression that compares the `frame_idx` to the time of the word as recorded in the attribute, possibly with some kind of collar to avoid constraining it too tightly. We can compare as signed for one direction of the comparison and as unsigned for the other, which gives the correct behavior for arcs that don't have a word label (i.e., like the ace in cards, the -1 can be high or low).
We could add, say, -10 to arcs where the time is out of bounds. That should be more than enough to get training
started.
[Or we can do 2 separate comparisons with 2 separate times, one for the beginning and one for the end of the word,
again possibly with a small collar.]
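Under the assumptions above (a single per-word time with a collar, -1 as the filler, and `times`/`frame_idx` having been propagated onto the lattice), a rough torch sketch of the scoring step might look like this:

```python
import torch

collar = 2       # frames of slack; an arbitrary choice for the sketch
penalty = -10.0  # added to arcs whose time is out of bounds

times = lattice.times          # int32 per arc; -1 on arcs without a word
frame_idx = lattice.frame_idx  # int32 per arc, attached by intersect_dense

# Lower bound, compared as signed: word-less arcs have times == -1,
# so frame_idx >= times - collar always holds (the "ace" is low).
ok_low = frame_idx >= times - collar

# Upper bound, compared as unsigned: -1 reinterpreted as uint32 is
# 2**32 - 1, so the bound is vacuous for word-less arcs (the "ace" is high).
times_u = times.to(torch.int64) & 0xFFFFFFFF
frames_u = frame_idx.to(torch.int64) & 0xFFFFFFFF
ok_high = frames_u <= times_u + collar

out_of_bounds = ~(ok_low & ok_high)
lattice.scores = lattice.scores + penalty * out_of_bounds.to(lattice.scores.dtype)
```

For the two-time variant in the bracketed note, the same trick applies with separate begin and end attributes: a signed comparison against `begin - collar` and an unsigned one against `end + collar`.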