Adding LibriSpeech word alignments in supervisions #379

Merged (3 commits) on Aug 20, 2021

Conversation

pzelasko
Collaborator

@pzelasko pzelasko commented Aug 20, 2021

No description provided.

@pzelasko pzelasko added this to the v0.8 milestone Aug 20, 2021
@pzelasko
Collaborator Author

Note: silences are represented as an empty string

@pzelasko
Collaborator Author

Adding the alignments in the K2 dataset seems fairly easy; the "supervisions" dict has three more keys, "word", "word_start", and "word_end"; each is a list of lists of str/float.

@danpovey before I merge let me know if this format works for you guys (don't mind my transparent terminal with Spotify in the background).

[three terminal screenshots]

@csukuangfj
Contributor

it's list of lists of str/float.

Would it be easier for later use if it returns frames for word_start and word_end, i.e., use int32_t ?

@pzelasko
Collaborator Author

Up to you guys. I don’t have a good idea of how you want to use it atm. Do you prefer that to be in frames?

@csukuangfj
Contributor

Up to you guys. I don’t have a good idea of how you want to use it atm. Do you prefer that to be in frames?

From
#378 (comment)

Let's suppose that we convert this to frames after subsampling, and let the corresponding
filler be -1. (Can do this by setting, say, fsa.times = [tensor of times], and fsa.times_filler = -1, supposing
fsa.times contained int32)

It says we need start/end frames.

But from #378 (comment)

It would be easiest to use, I think, if the words had a 'begin_frame' and
'end_frame' (or just a single frame index) and these were prepared with
the same shape as the words themselves-- not sure if it becomes a list of
list of int at some point?

I assume that they'd be floating point times in seconds at the point we get
them from lhotse, as we need to set the frame rate.

It suggests using times in seconds.


I am not sure which one is better.

@danpovey
Collaborator

Seconds is OK, it's best if the calling code converts that to frames because the calling code knows the frame rate.
I think this should be OK.
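A rough sketch of what that conversion could look like on the caller's side, assuming the caller knows the frame shift and the model's subsampling factor. The function name, the 10 ms default, and the rounding choice are assumptions for illustration, not something lhotse or k2 prescribes.

```python
def seconds_to_frames(times, frame_shift=0.01, subsampling_factor=1):
    """Convert times in seconds to integer frame indices.

    Hypothetical helper: `frame_shift` (seconds per frame) and
    `subsampling_factor` must be supplied by the caller, since the
    alignment itself only stores seconds.
    """
    # Round to the nearest frame first, then map to the subsampled frame index.
    return [int(round(t / frame_shift)) // subsampling_factor for t in times]


word_start = [0.0, 0.32, 0.71, 1.18]  # made-up values, in seconds
frames = seconds_to_frames(word_start)
subsampled = seconds_to_frames(word_start, subsampling_factor=4)
```

The result is a list of plain integers, which matches the stated preference for attaching integer rather than floating-point attributes.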

@danpovey
Collaborator

danpovey commented Aug 20, 2021

... BTW, part of the reason I want this to be in integers when attached as an attribute is that k2 basically assumes that floating-point attributes are "score-like", so for instance they will be added together when integer attributes would be converted to ragged, such as when removing epsilons; and the default value can only be 0, never -1. Later we can change this behavior if it becomes a problem.

@pzelasko
Collaborator Author

I think the calling code doesn’t know the frame shift anymore (unless you are using precomputed features and use the dataset with return_cuts=True so you can query the cuts, but then it will fail with on-the-fly features). Also, we are already returning the start frame and num frames for each supervision from the dataset, so returning seconds here would be inconsistent. I’d suggest using frames here after all, unless you’re sure about seconds.

@danpovey
Collaborator

danpovey commented Aug 20, 2021 via email

@pzelasko
Collaborator Author

Let's see if this is better. If it's OK, I'm going to merge (I can't thoroughly test it right now, but it seems fine on isolated examples; I plan to clean it up and add some tests later).

@danpovey
Collaborator

danpovey commented Aug 20, 2021 via email

@csukuangfj
Contributor

+2


3 participants