Add V3 Support #578

Merged
merged 3 commits on Nov 24, 2023
Conversation

Oscaarjs
Contributor

This is a continuation of #548 so credit to its contributors and author @stillmatic.

I've tried to address some of the comments received on that PR.

Potential TODOs:

  • Dynamic _LANGUAGE_CODES in tokenizer.py depending on whether V3 (which supports "yue") is loaded.
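The TODO above could be sketched roughly as follows. This is a hypothetical illustration, not faster-whisper's actual implementation: only the `_LANGUAGE_CODES` name and the new "yue" (Cantonese) code come from the project; the helper function and the shortened base tuple are made up for the example.

```python
# Hypothetical sketch of the "dynamic _LANGUAGE_CODES" TODO: large-v3 adds
# Cantonese ("yue"), so the advertised codes could depend on the loaded model.
# The helper name and the truncated base tuple are illustrative only.
_LANGUAGE_CODES = ("af", "am", "ar", "en", "zh")  # truncated for the example

def language_codes_for(model_name: str) -> tuple:
    """Return the language codes to expose for a given model."""
    if "large-v3" in model_name:
        # V3 checkpoints additionally support Cantonese.
        return _LANGUAGE_CODES + ("yue",)
    return _LANGUAGE_CODES
```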

@Oscaarjs
Contributor Author

@nguyendc-systran, it would be great if you could have a look.

This was referenced Nov 24, 2023
@@ -76,6 +77,7 @@ def download_model(

allow_patterns = [
"config.json",
"preprocessor_config.json",
Collaborator

@trungkienbkhn Nov 24, 2023


@Oscaarjs, hello. We have generated the Systran Whisper large-v3 conversion model, which includes the new preprocessor_config.json file, in the HF-to-ct2 script; could you please also update this info in the README.md file of your PR?
Example: ct2-transformers-converter --model openai/whisper-large-v3 --output_dir whisper-large-v3-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16

Contributor Author


@trungkienbkhn Ah yes! I've made the changes, please check that I made them correctly.

Collaborator


Sounds good to me. Hopefully there will be some benchmark tests for the faster whisper large-v3 soon.

@salahzoubi

@Oscaarjs im guessing this still doesn't allow for batch transcribe (which is built-in large-v3)?

@Oscaarjs
Contributor Author

@Oscaarjs im guessing this still doesn't allow for batch transcribe (which is built-in large-v3)?

What do you mean by built-in? AFAIK nothing changed between v3 and v2 that affects that part of the model. Or are you referring to HF's pipeline implementation of it (which does support batching)?

@funboarder13920

@Oscaarjs im guessing this still doesn't allow for batch transcribe (which is built-in large-v3)?

Your question was off topic in the other PR, and it is still off topic in this PR.
Batch inference has nothing to do with large-v3.

@nguyendc-systran nguyendc-systran merged commit 3084409 into SYSTRAN:master Nov 24, 2023
3 checks passed