Whisper in 🤗 Transformers #654
-
@sanchit-gandhi
-
These numbers don't look right. Tiny is at 8.6%, Small at 6.7%, and the others even less.
Something seems wrong here too.
-
Can you please explain how to output the avg_logprob?
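A minimal sketch of one way to get something like avg_logprob from the 🤗 implementation, assuming a 16 kHz mono numpy array `audio` (the variable name and checkpoint are illustrative, and this is an approximation rather than the exact quantity openai-whisper reports):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# `audio` is assumed to be a 16 kHz mono numpy array
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    out = model.generate(input_features, return_dict_in_generate=True, output_scores=True)

# log-probability of each generated token, averaged over the sequence
transition_scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
avg_logprob = transition_scores.mean(dim=-1)

print(processor.batch_decode(out.sequences, skip_special_tokens=True))
print(avg_logprob)
```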
-
It seems that it cannot run in batches?
-
Hey Sanchit! Thank you for the great Whisper write-ups. This example seems to run great for the first 30 seconds. Is it possible to run inference on audio files longer than 30 seconds without the pipeline? I've noticed that it is necessary to tune stride_length_s in the pipeline to get stable results, and was wondering whether it is possible to make outputs more stable without it. Thanks!
-
As far as I know, this is the only way to get Whisper to run on the GPU of an Apple Silicon Mac. I ran Whisper with the TF backend + Apple's Metal TensorFlow plugin and was able to see my GPU cores light up when transcribing an audio file. However, it's a pretty low-level API, and unlike the reference Whisper implementation in Python, it doesn't have any utility functions to produce .vtt files or output that includes timestamps. It also isn't (for me) producing any punctuation or sentences like the standard OpenAI implementation. And I had to manually pull audio from .wav into a numpy array, split it, and then send a list of small arrays to the feature extractor and model. Does anyone know of any modules or wrappers that use the TensorFlow/Hugging Face backend, but can just run on the command line, take an audio file, and spit out a .vtt file? One of the notebooks I found refers to a
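For what it's worth, a rough sketch of the manual chunking workflow described above, assuming a 16 kHz mono audio.wav and the TF Whisper classes in Transformers (file name, checkpoint and chunk length are illustrative):

```python
import soundfile as sf
from transformers import WhisperProcessor, TFWhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
model = TFWhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en")

audio, sr = sf.read("audio.wav")  # assumed 16 kHz mono; resample/convert first otherwise
chunk_len = 30 * sr               # Whisper operates on 30-second windows
chunks = [audio[i : i + chunk_len] for i in range(0, len(audio), chunk_len)]

texts = []
for chunk in chunks:
    input_features = processor(chunk, sampling_rate=sr, return_tensors="tf").input_features
    predicted_ids = model.generate(input_features)
    texts.append(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])

print(" ".join(texts))
```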
-
Hey @sanchit-gandhi, I have seen one implementation of Whisper that does a very good job of speeding up inference: using the chunk_length_s, stride_length_s and batch_size params it can process 30 minutes of audio in less than 1 minute. What I have noticed is that the timestamp prediction gets worse. I have tested this model using a VAD (pyannote.audio) before feeding audio to the OpenAI Whisper model and it gives very accurate predictions. I was wondering if I could get some guidance if I want to help implement chunking based on VAD and leverage batch inference. (Maybe this would need to go into https://github.com/huggingface/speechbox)
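For context, a short sketch of the kind of call being described, using the chunked ASR pipeline (the checkpoint, stride and batch size are illustrative, and the VAD pre-processing step is not shown):

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=0,
    chunk_length_s=30,   # split long audio into 30 s windows
    stride_length_s=5,   # overlap between windows
    batch_size=8,        # decode several windows in parallel
)
out = pipe("long_audio.mp3", return_timestamps=True)
print(out["text"])
```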
-
Can you share the fast implementation with us?
Many thanks.
-
@rjac-ml I see your colab is using WhisperForConditionalGeneration. Have you tried using TFWhisperForConditionalGeneration? It is failing to transcribe for Spanish and Japanese.
-
Please refer to the colab below: it is failing to transcribe, and is instead translating.
-
The return_timestamps=True option for the pipeline does not seem to work for any .en model, only the multilingual versions; the .en versions simply return an empty list for chunks. Is this expected behavior? Also, is there any way to get things like 'avg_logprob', 'compression_ratio' and 'no_speech_prob' that regular Whisper outputs from the pipeline? Thanks
-
Hey! Thank you!
-
Hi @sanchit-gandhi, how could I get word-level timestamps?
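A minimal sketch of how word-level timestamps can be requested through the pipeline in recent Transformers versions (checkpoint and file name are illustrative):

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small", chunk_length_s=30)
out = pipe("audio.mp3", return_timestamps="word")

for word in out["chunks"]:
    print(word["timestamp"], word["text"])
```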
-
Hello, when I load the large model with
-
@sanchit-gandhi How can I get the encoder embeddings (last layer) for my audio files? I need to use them for a downstream classification task.
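One possible sketch, assuming a 16 kHz mono numpy array `audio` (names are illustrative): the encoder's last hidden state can be taken from WhisperModel and pooled for classification.

```python
import torch
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base")

# `audio` is assumed to be a 16 kHz mono numpy array
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    encoder_outputs = model.get_encoder()(input_features)

embeddings = encoder_outputs.last_hidden_state  # (batch, frames, hidden_size)
pooled = embeddings.mean(dim=1)                 # one vector per file for downstream classification
```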
-
Just a quick example of how to use it:

```python
from transformers import pipeline
import torch

model = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device="cuda:0",
    torch_dtype=torch.float16,
    chunk_length_s=30,  # if not specified, only generates as much as `max_new_tokens`
    generate_kwargs={"num_beams": 5},  # same setting as `openai-whisper` default
)

result = model("audio.mp3", return_timestamps=True)
# result = model("audio.mp3", return_timestamps=True, generate_kwargs={"language": "es", "task": "translate"})
print(result["text"])
print(result["chunks"])
```

To export .srt:

```python
def convert_to_hms(seconds: float) -> str:
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    milliseconds = round((seconds % 1) * 1000)
    output = f"{int(hours):02}:{int(minutes):02}:{int(seconds):02},{milliseconds:03}"
    return output


def convert_chunk(chunk: dict) -> str:
    start = convert_to_hms(chunk["timestamp"][0])
    end = convert_to_hms(chunk["timestamp"][1])
    text = chunk["text"].strip()
    return f"{start} --> {end}\n{text}\n\n"


with open("file.srt", "w", encoding="utf-8") as f:
    for i, chunk in enumerate(result["chunks"], start=1):
        f.write(f"{i}\n{convert_chunk(chunk)}")
```

For batch inference:

```python
from datasets import Dataset, Audio
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

files = [...]  # a list of audio paths
dataset = Dataset.from_dict({"audio": files}).cast_column("audio", Audio(sampling_rate=16000))

# reuses `model` and `torch` from the first snippet
transcriptions = []
with torch.no_grad():
    for tt in tqdm(model(KeyDataset(dataset, "audio"), batch_size=10, truncation=True)):
        transcriptions.append(tt)
```
-
In my situation, when I use the first solution the result is different than when I use the code you provided in the colab with the same datasets. Can you explain why, @sanchit-gandhi? Thank you in advance!
-
Hi @sanchit-gandhi, how could I enable the profanity filter in Whisper? Is it supported in Transformers?
-
Hi friends, I want to use this Whisper model for language identification, but the language I want it to recognize is a low-resource one, so I have to fine-tune it on a custom data source. Can you help me? I look forward to hearing from you, and thank you. @sanchit-gandhi
-
Based on this guide https://huggingface.co/blog/fine-tune-whisper, I tried to fine-tune the "small" and "large-v3" models.
-
Whisper in 🤗 Transformers
Whisper is available in the Hugging Face Transformers library from version 4.23.1, with both PyTorch and TensorFlow implementations. All the official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.
Fine-Tuning
Using the 🤗 Trainer, Whisper can be fine-tuned for speech recognition and speech translation tasks, boosting the performance of the model, especially on low-resource languages. Refer to the blog post for a complete guide on fine-tuning Whisper. If you're interested in fine-tuning Whisper in your language, join us for our two-week Whisper fine-tuning event!
Evaluation
See the following example for evaluating Whisper on the LibriSpeech ASR dataset.
First, install the relevant Hugging Face packages:
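A likely equivalent of the install step, assuming the usual evaluation stack (datasets, evaluate and jiwer for the WER metric, plus librosa/soundfile for audio decoding):

```bash
pip install --upgrade transformers datasets evaluate jiwer librosa soundfile
```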
Next, run the Python code cell to evaluate on the "test-clean" subset of LibriSpeech. You can change the model checkpoint to any one of the official checkpoints on the Hugging Face Hub.
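A sketch of such an evaluation cell, assuming the whisper-tiny.en checkpoint and Whisper's English text normaliser (the checkpoint and normalisation details are illustrative):

```python
import torch
from datasets import load_dataset
from evaluate import load
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en").to(device)

dataset = load_dataset("librispeech_asr", "clean", split="test")
wer_metric = load("wer")

predictions, references = [], []
for sample in dataset:
    input_features = processor(
        sample["audio"]["array"], sampling_rate=16000, return_tensors="pt"
    ).input_features.to(device)
    with torch.no_grad():
        predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    # normalise both sides so casing/punctuation differences don't inflate the WER
    predictions.append(processor.tokenizer.normalize(transcription))
    references.append(processor.tokenizer.normalize(sample["text"]))

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f} %")
```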
Print output: the word error rate (WER) of the chosen checkpoint on the "test-clean" subset.
Multi-Dataset Evaluation
We provide a Google Colab for evaluating Whisper on eight English speech recognition datasets in one script. This serves as a template for performing multi-dataset evaluation in a style similar to the official Whisper paper.