Make chunking smartly (long files) work on asr ctc_with_lm. #15219
@@ -66,14 +66,27 @@ def ffmpeg_read(bpayload: bytes, sampling_rate: int) -> np.array:
    return audio


def apply_stride(tokens, stride):
    max_token_n = tokens.shape[-1]
def audio_to_logits(tokens_or_logits, stride):
What exactly does the function do? Could we add a docstring? Also, I don't really understand the name audio_to_logits.
It just rescales the stride values to go from audio space (for instance, 10s at 16_000 Hz gives (160_000, 8_000, 8_000)) to logits space ((2333, 160, 160), for instance).
Can you think of a better name? A docstring could help a little.
Ah, I see. Maybe get_output_stride_from_input_stride(input_shape, stride) and directly pass the shape? Yeah, I think a little docstring can help here.
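To make the audio-space to logits-space conversion discussed above concrete, here is a minimal sketch. It is not the code from this PR: the function name `rescale_stride_sketch` and the choice to pass the number of output frames directly (rather than a tensor) are assumptions made for illustration, and the numbers in the usage lines are made up.

```python
def rescale_stride_sketch(max_token_n, stride):
    """Rescale (n, left, right) triples from audio samples to output frames.

    Hypothetical helper: `max_token_n` is the number of frames the model
    produced for the longest input in the batch, and `stride` is a list of
    (input_n, left, right) triples measured in audio samples.
    """
    # The longest input maps onto the full output length, so it fixes the ratio.
    max_input_n = max(input_n for input_n, _, _ in stride)
    ratio = max_token_n / max_input_n

    rescaled = []
    for input_n, left, right in stride:
        token_n = int(round(input_n * ratio))
        # Keep the same proportion of stride on each side of the chunk.
        token_left = int(round(left * token_n / input_n))
        token_right = int(round(right * token_n / input_n))
        rescaled.append((token_n, token_left, token_right))
    return rescaled


# Illustrative values only: 10s and 5s of audio at 16_000 Hz, 500 output frames.
audio_stride = [(160_000, 8_000, 8_000), (80_000, 8_000, 0)]
logits_stride = rescale_stride_sketch(500, audio_stride)
```

Passing the output length instead of the tensor keeps the sketch free of any framework dependency; the reviewer's suggestion of passing the shape directly goes in the same direction.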
# we need to reconstruct this information
# This won't work with left padding (which doesn't exist right now)
right_n = total_n - right
logits = logits[:, left:right_n]
Nice, yes I think that's the only approach that'll work right now
Padding is not possible here sadly
Padding works; it's handled by the batching mechanism. All I mentioned here is why we need so much information (the logits might get padded while the stride does not).
We can make it work relatively trivially for left padding:
if self.feature_extractor.padding_side == "left":
    left = logits.shape[1] - total_n + left_n
    right = logits.shape[1] - total_n + right_n
else:
    left = left_n
    right_n = total_n - right_n
I just thought this was overly complex, since left padding doesn't seem likely here.
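The padding discussion above can be sketched as a small helper that computes which frames to keep once the stride is dropped. This is an illustration under assumptions, not the PR's implementation: `kept_frame_bounds` and its parameter names are hypothetical, `total_n`, `left` and `right` are taken to be already expressed in logits frames, and plain lists stand in for the logits tensor.

```python
def kept_frame_bounds(padded_n, total_n, left, right, padding_side="right"):
    """Return (start, stop) of the frames to keep along the time axis.

    Hypothetical helper. `padded_n` is the padded length of the batch,
    `total_n` the real length of this item, and `left`/`right` the stride
    widths to discard on each side.
    """
    # With right padding the real frames start at 0; with left padding they
    # are shifted by the amount of padding that was prepended.
    offset = padded_n - total_n if padding_side == "left" else 0
    start = offset + left
    stop = offset + total_n - right
    return start, stop


# Ten real frames padded to twelve, dropping 2 frames on the left and 3 on the right.
right_padded = list(range(10)) + ["pad", "pad"]
left_padded = ["pad", "pad"] + list(range(10))

start, stop = kept_frame_bounds(12, 10, 2, 3, padding_side="right")
kept_right = right_padded[start:stop]

start, stop = kept_frame_bounds(12, 10, 2, 3, padding_side="left")
kept_left = left_padded[start:stop]
```

Both padding sides select the same real frames, which is why the stride alone is not enough: the real length of each item is also needed to locate the right-hand boundary.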
Ah I see - ok yeah
Super cool! Works well for the Swedish example of:
./eval.py --model_id hf-test/xls-r-300m-sv --dataset speech-recognition-community-internal/tedx_manual_dev_test --config sv --split validation --chunk_length_s 5.0 --stride_length_s 1.0
with the eval script here: https://github.com/huggingface/transformers/blob/master/examples/research_projects/robust-speech-event/eval.py
Made some small changes to make it work for batch size 1 - feel free to refactor those to make it cleaner @Narsil
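For readers unfamiliar with what `--chunk_length_s 5.0 --stride_length_s 1.0` implies, the following is a rough sketch of overlapping chunking. It is an assumption about the general technique, not the pipeline's actual code: `chunk_spans` is a hypothetical name, a 16_000 Hz sampling rate is assumed, and the stride is assumed to apply equally on both sides of each chunk except at the edges of the file.

```python
def chunk_spans(n_samples, chunk_len, stride_left, stride_right):
    """List (start, end, left, right) spans covering `n_samples` samples.

    Hypothetical helper: consecutive chunks overlap by the stride, and the
    first/last chunks have no stride on their outer edge.
    """
    step = chunk_len - stride_left - stride_right
    if step <= 0:
        raise ValueError("chunk_len must exceed the combined stride")

    spans = []
    start = 0
    while True:
        end = min(start + chunk_len, n_samples)
        is_last = end >= n_samples
        left = 0 if start == 0 else stride_left
        right = 0 if is_last else stride_right
        spans.append((start, end, left, right))
        if is_last:
            break
        start += step
    return spans


# 12.5s of audio, 5s chunks with 1s of stride on each side (assumed 16_000 Hz).
sampling_rate = 16_000
spans = chunk_spans(200_000, 5 * sampling_rate, 1 * sampling_rate, 1 * sampling_rate)
# The part of each chunk that survives once the stride is dropped.
kept = [(start + left, end - right) for start, end, left, right in spans]
```

The kept regions tile the file without gaps or overlap, so each sample is decoded from exactly one chunk while the model still sees context on both sides.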
- Renamed to `rescale_stride`.