- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 10.9k
[Bugfix] Relax lang pin for voxtral #21833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run  Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add  🚀 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request successfully relaxes the language requirement for Voxtral models, allowing None as a valid language token, and adds a warning for Whisper models when the language defaults to 'en'. The changes align well with the stated objectives. I've identified two high-severity issues related to code duplication and a confusing error message in voxtral.py that should be addressed to improve maintainability and user experience.
7db5c68    to
    063171f      
    Compare
  
    Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com>
063171f    to
    1df3084      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, thanks for addressing this TODO!
As I commented in the review, I would like to keep the FE as architecture agnostic as possible, so I would propose we lower your changes into the SupportsTranscription interface.
One easy thing we can do is to expand validate_language functionality to something like
def validate_language(cls, language: Optional[str]) -> Optional[str]:
or 
def get_preferred_language(cls, language: Optional[str]) -> Optional[str]:
where each model is free to overwrite to provide the preferred default language.
I would also have the default implementation of this function be the one Voxtral is currently using, namely:
- if lang is not provided return None
- else check if the language is supported and return it
I think this would fit most use-cases better imo. We must then have Whisper overwrite the default logic to return en instead of None.
| from transformers import (BatchFeature, WhisperConfig, WhisperFeatureExtractor, | ||
| WhisperProcessor) | ||
| from transformers.models.whisper.modeling_whisper import sinusoids | ||
| from transformers.models.whisper.tokenization_whisper import LANGUAGES | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I am not 100% sure this was the same list, I think I got mine from comparing https://platform.openai.com/docs/guides/speech-to-text and the whisper repo.
We might even move this to a centralized utils in vllm (future work)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one-to-one the same as the one in the Whisper repo. Think we should be using this as our ground-truth for Whisper, rather than the OAI API guide, as there may be languages supported in the API that are not supported by Whisper.
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking much better thanks! I left a few minor comments, other than those and pre-commit fixes, this lgtm.
@DarkLight1337 can you also take a quick look if you have the time?
| # TODO language should be optional and can be guessed. | ||
| # For now we default to en. See | ||
| # https://github.com/huggingface/transformers/blob/main/src/transformers/models/whisper/generation_whisper.py#L1520 | ||
| logger.warning( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking perhaps this should be a logger.warning_once case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty big assumption that's being made in creating the token ids, e.g. it completely nullifies the model's ability to do multilingual transcription.
Until the TODO is resolved, the best practice when using Whisper is to always specify the language. Otherwise, you end up with undefined behaviour (e.g. audio in Spanish, task set to "transcribe", lang token set to "en" → mis-match with how the model was trained!). This would happen silently if you passed a mix of transcription requests, some with and some without the language field.
Ideally, we would enforce this best practice by throwing an exception if the language is not specified. However, since that's not backwards compatible, a persistent warning is the best we can do. So I'd be in favour of keeping it as logger.warning!
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Head branch was pushed to by a user without write access
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: x22x22 <wadeking@qq.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Noam Gat <noamgat@gmail.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Sanchit Gandhi <sgandhi3141@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Purpose
Voxtral transcription works completely independently of whether the language token is specified. Currently, if the language token is not specified, it is forced to "en" silently, leading to unexpected results.
This PR:
Nonefor Voxtralenfor Whisper (existing behaviour), but with a warning