Timestamps for Wav2Vec 2.0 models and/or ASR pipelines #15502
Comments
I very much agree - this would be a very welcome feature to add, and it actually shouldn't be too difficult for CTC. For CTC we know exactly which characters are predicted at which time, because we know the stride of the feature encoder. In terms of the implementation details, I think we could start with a standalone helper and then also integrate this into the tokenizer's decoding step. @iskaj would you be interested in opening a PR for this? I think we could start by adding a method along the following lines:

```python
def retrieve_time_stamps(token_ids, stride, sampling_rate):
    # 1. compute the total stride: `total_stride = reduce(stride, multiply)`
    # 2. time_frame_per_logit_in_s = total_stride / sampling_rate
    # 3. now we need to find the first non-`pad_token_id` in `token_ids`, which represents the
    #    first start_id. The first `word_delimiter_token` after it represents the first end_id.
    #    The next non-`pad_token_id` then represents the next start_id, the next
    #    `word_delimiter_token` the next end_id, and so on. This can be done in a simple for loop.
    # 4. that's pretty much it -> then we can return a list of tuples which correspond to the
    #    time stamps of the returned words.
```
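For illustration only, here is a minimal sketch of what such a helper could look like. The default `pad_token_id` / `word_delimiter_token_id` values and the example `stride` list are assumptions matching the common Wav2Vec 2.0 setup, not something specified in this thread:

```python
from functools import reduce
from operator import mul

def retrieve_time_stamps(token_ids, stride, sampling_rate,
                         pad_token_id=0, word_delimiter_token_id=4):
    # total downsampling factor of the feature encoder
    total_stride = reduce(mul, stride)
    # duration in seconds covered by one logit / one predicted token id
    time_per_logit = total_stride / sampling_rate

    time_stamps = []
    start_idx = None
    for idx, token_id in enumerate(token_ids):
        if token_id == word_delimiter_token_id:
            # a word just ended -> close the current (start, end) pair
            if start_idx is not None:
                time_stamps.append((start_idx * time_per_logit, idx * time_per_logit))
                start_idx = None
        elif token_id != pad_token_id and start_idx is None:
            # first non-pad, non-delimiter token of a new word
            start_idx = idx
    if start_idx is not None:
        # close the last word if the sequence does not end with a delimiter
        time_stamps.append((start_idx * time_per_logit, len(token_ids) * time_per_logit))
    return time_stamps

# e.g. retrieve_time_stamps([0, 5, 5, 0, 4, 0, 6, 0], [5, 2, 2, 2, 2, 2, 2], 16_000)
# -> [(0.02, 0.08), (0.12, 0.16)]
```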
I agree too, it's a very welcome feature. The main concern I have is the actual implementation of this. Ideally, it would be finely manageable by users, because I can see (at least) one use case for video, where you want to add subtitles and you need to put timestamps at definite boundaries (most likely sentence boundaries and/or length of text). The ideal way would be to be highly transparent and maybe quite noisy:

```python
pipe = pipeline(..., add_timestamps=True)  # crashes on non-CTC
out = pipe(...)
# out = {"text": "ABCD", "timestamps": [0.01, 0.03, 0.03, 0.04]}
```

Here I propose 1 float per character of the output. It's very noisy, but it still seems quite simple to use and gives everything needed for someone wanting fine control over timestamps. I imagine this float would correspond to the first CTC token using that letter. As an implementation, it might be tedious to add properly with chunking and striding. (Or not, I am not sure.)
I think what @Narsil proposed will work fine for what most people want indeed, so I agree with that sentiment. For me the interest lies in automatic subtitling, and the noise in this solution would be fine. It is also nicely interpretable. I think it should work with the pipeline approach (chunking and striding), otherwise the purpose would be kind of lost, right? I'm also not sure how that would work, though... In the future I might be interested in doing a pull request for this, but currently my priorities lie elsewhere. Hope I can help with this in the near future.
@anton-l - thoughts on this?
I think we can add an alternative output format, something like:

```
{
"text": "I AM HERE",
"tokens": [
{
"token": "I",
"time_start": <first_ctc_logit_index> * 0.02,
"time_end": <last_ctc_logit_index> * 0.02 + 0.025,
"probability": 0.7
},
{
"token": " ",
"time_start": <first_ctc_logit_index> * 0.02,
"time_end": <last_ctc_logit_index> * 0.02 + 0.025,
"probability": 0.4
},
{
"token": "A",
"time_start": <first_ctc_logit_index> * 0.02,
"time_end": <last_ctc_logit_index> * 0.02 + 0.025,
"probability": 0.6
},
....
]
}
```

where 0.02 is the frame stride and 0.025 is the frame width in seconds (both could be calculated from the model's feature-encoder configuration). Returning the word boundaries' offsets (whitespaces in this example) is also important for consistency IMO.
Since the whole step can be contained inside the tokenizer, it shouldn't be a problem to add it inside the pipeline as well.
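To make the proposal concrete, here is a rough sketch of how such per-token entries could be derived from a single example's CTC logits. The 0.02 s frame stride and 0.025 s frame width come from the example above; the blank id, the softmax-based "probability", and taking the max probability over repeated frames are assumptions made purely for illustration:

```python
import torch

def tokens_with_times(logits, id_to_token, blank_id=0,
                      frame_stride=0.02, frame_width=0.025):
    # logits: tensor of shape (num_frames, vocab_size) for a single example
    probs = logits.softmax(dim=-1)
    pred_ids = probs.argmax(dim=-1).tolist()

    entries = []
    prev_id = None
    for frame_idx, token_id in enumerate(pred_ids):
        if token_id == blank_id:
            # blank frame: nothing emitted, but it separates repeated tokens
            prev_id = None
            continue
        if token_id == prev_id:
            # same token repeated by CTC: extend the current entry in time
            entries[-1]["time_end"] = frame_idx * frame_stride + frame_width
            entries[-1]["probability"] = max(entries[-1]["probability"],
                                             probs[frame_idx, token_id].item())
        else:
            entries.append({
                "token": id_to_token[token_id],
                "time_start": frame_idx * frame_stride,
                "time_end": frame_idx * frame_stride + frame_width,
                "probability": probs[frame_idx, token_id].item(),
            })
        prev_id = token_id
    return entries
```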
@anton-l, fully agree on having both the time stamps and the probabilities. I just realised though, will that approach be possible with ...?
@Narsil it's possible for ...
Interesting idea @anton-l! I thought about it a bit - a couple of remarks: 1.) I don't think we should entangle the probabilities with the time stamps too much, to be honest, and would rather treat them as separate feature additions because:
Will do a draft PR and share it here.
Also, this could definitely be implemented directly in the pipeline if it's going to be the main/sole user. That might make the implementation easier.
🚀 Feature request
So the ASR pipeline (https://github.com/huggingface/transformers/blob/v4.16.2/src/transformers/pipelines/automatic_speech_recognition.py#L122) is great for leveraging Wav2Vec 2.0 on longer files. However, it does not expose the timestamps of the words, i.e. when each word was spoken.
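For context, a minimal usage sketch of the current pipeline on a long file (the model name, file name, and chunking parameters are illustrative, and chunked inference assumes a transformers version that supports `chunk_length_s`):

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",
)

# chunk_length_s / stride_length_s let the pipeline handle long audio files,
# but the output is plain text with no timing information.
out = pipe("long_audio.wav", chunk_length_s=10, stride_length_s=(4, 2))
print(out)  # {'text': '...'}
```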
Motivation
This is relevant for many applications of ASR, such as automatic subtitles or anything else requiring this timing information. Since this information should be available somewhere "under the hood", it might be beneficial to many users to include it in the output. This is not necessarily specific to the pipelines; it also applies to the general output of Wav2Vec 2.0 models.
Your contribution
I'm not yet that familiar with HF + Wav2Vec 2.0, but https://github.com/lumaku/ctc-segmentation is a useful GitHub page. I would be willing to help out though!