Whisper in 🤗 Transformers #654
-
@sanchit-gandhi
-
These numbers don't look right. Tiny is at 8.6%, Small at 6.7%, and the others even less.
Something seems wrong here too.
-
Can you please explain how to output the avg_logprob?
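A minimal sketch of one way to get something like avg_logprob from the 🤗 implementation, assuming a 16 kHz mono numpy array `audio` (the variable name and checkpoint are illustrative, and this is an approximation rather than the exact quantity openai-whisper reports):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# `audio` is assumed to be a 16 kHz mono numpy array
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    out = model.generate(input_features, return_dict_in_generate=True, output_scores=True)

# log-probability of each generated token, averaged over the sequence
transition_scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
avg_logprob = transition_scores.mean(dim=-1)

print(processor.batch_decode(out.sequences, skip_special_tokens=True))
print(avg_logprob)
```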
-
It seems that it cannot run in batches?
-
Hey Sanchit! Thank you for the great Whisper write-ups. This example seems to run great for the first 30 seconds. Is it possible to run inference on audio files longer than 30 seconds without the pipeline? I've noticed that it is necessary to tune stride_length_s in the pipeline to get stable results, and was wondering whether it is possible to make outputs more stable without it. Thanks!
-
As far as I know, this is the only way to get Whisper to run on the GPU of an Apple Silicon Mac. I ran Whisper with the TF backend + Apple's Metal TensorFlow plugin and was able to see my GPU cores light up when transcribing an audio file. However, it's a pretty low-level API, and unlike the reference Whisper implementation in Python, it doesn't have any utility functions to produce .vtt files or output that includes timestamps. It also isn't (for me) producing any punctuation or sentences like the standard OpenAI implementation. And I had to manually pull audio from .wav into a numpy array, split it, and then send a list of small arrays to the feature extractor and model. Does anyone know of any modules or wrappers that use the TensorFlow/Hugging Face backend, but can just run on the command line, take an audio file, and spit out a .vtt file? One of the notebooks I found refers to a
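For what it's worth, a rough sketch of the manual chunking workflow described above, assuming a 16 kHz mono audio.wav and the TF Whisper classes in Transformers (file name, checkpoint and chunk length are illustrative):

```python
import soundfile as sf
from transformers import WhisperProcessor, TFWhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-base.en")
model = TFWhisperForConditionalGeneration.from_pretrained("openai/whisper-base.en")

audio, sr = sf.read("audio.wav")  # assumed 16 kHz mono; resample/convert first otherwise
chunk_len = 30 * sr               # Whisper operates on 30-second windows
chunks = [audio[i : i + chunk_len] for i in range(0, len(audio), chunk_len)]

texts = []
for chunk in chunks:
    input_features = processor(chunk, sampling_rate=sr, return_tensors="tf").input_features
    predicted_ids = model.generate(input_features)
    texts.append(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])

print(" ".join(texts))
```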
-
Hey @sanchit-gandhi, I have seen one implementation of Whisper that does a very good job of speeding up inference: using the chunk_length_s, stride_length_s and batch_size params it can process 30 minutes of audio in less than 1 minute. What I have noticed is that the timestamp prediction gets worse. I have tested this model using a VAD (pyannote.audio) before feeding audio to the OpenAI Whisper model and it gives very accurate predictions. I was wondering if I could get some guidance if I want to help implement chunking based on VAD and leverage batch inference. (Maybe this would need to go into https://github.com/huggingface/speechbox)
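For context, a short sketch of the kind of call being described, using the chunked ASR pipeline (the checkpoint, stride and batch size are illustrative, and the VAD pre-processing step is not shown):

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    device=0,
    chunk_length_s=30,   # split long audio into 30 s windows
    stride_length_s=5,   # overlap between windows
    batch_size=8,        # decode several windows in parallel
)
out = pipe("long_audio.mp3", return_timestamps=True)
print(out["text"])
```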
-
Can you share the fast implementation with us?
Many thanks.
-
@rjac-ml I see your colab is using WhisperForConditionalGeneration. Have you tried using TFWhisperForConditionalGeneration? It is failing to transcribe for Spanish and Japanese.
-
Please refer to the colab below: it is failing to transcribe, and is instead translating.
-
The return_timestamps=True option for the pipeline does not seem to work for any .en model, only the multilingual versions; the .en versions simply return an empty list for chunks. Is this expected behavior? Also, is there any way to get things like 'avg_logprob', 'compression_ratio' and 'no_speech_prob' that regular Whisper outputs from the pipeline? Thanks
-
Hey! Thank you!
-
Hi @sanchit-gandhi, how could I get word-level timestamps?
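A minimal sketch of how word-level timestamps can be requested through the pipeline in recent Transformers versions (checkpoint and file name are illustrative):

```python
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small", chunk_length_s=30)
out = pipe("audio.mp3", return_timestamps="word")

for word in out["chunks"]:
    print(word["timestamp"], word["text"])
```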
-
Hello, when I load the large model with
-
@sanchit-gandhi How can I get the encoder embeddings (last layer) for my audio files? I need to use them for a downstream classification task.
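One possible sketch, assuming a 16 kHz mono numpy array `audio` (names are illustrative): the encoder's last hidden state can be taken from WhisperModel and pooled for classification.

```python
import torch
from transformers import WhisperProcessor, WhisperModel

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base")

# `audio` is assumed to be a 16 kHz mono numpy array
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    encoder_outputs = model.get_encoder()(input_features)

embeddings = encoder_outputs.last_hidden_state  # (batch, frames, hidden_size)
pooled = embeddings.mean(dim=1)                 # one vector per file for downstream classification
```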
-
Just a quick example of how to use it:

```python
from transformers import pipeline
import torch

model = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device="cuda:0",
    torch_dtype=torch.float16,
    chunk_length_s=30,  # if not specified, only generates as much as `max_new_tokens`
    generate_kwargs={"num_beams": 5},  # same setting as `openai-whisper` default
)

result = model("audio.mp3", return_timestamps=True)
# result = model("audio.mp3", return_timestamps=True, generate_kwargs={"language": "es", "task": "translate"})
print(result["text"])
print(result["chunks"])
```

To export .srt:

```python
def convert_to_hms(seconds: float) -> str:
    hours, remainder = divmod(seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    milliseconds = round((seconds % 1) * 1000)
    output = f"{int(hours):02}:{int(minutes):02}:{int(seconds):02},{milliseconds:03}"
    return output


def convert_chunk(chunk: dict) -> str:
    start = convert_to_hms(chunk["timestamp"][0])
    end = convert_to_hms(chunk["timestamp"][1])
    text = chunk["text"].strip()
    return f"{start} --> {end}\n{text}\n\n"


with open("file.srt", "w", encoding="utf-8") as f:
    for i, chunk in enumerate(result["chunks"], start=1):
        f.write(f"{i}\n{convert_chunk(chunk)}")
```

For batch inference:

```python
from datasets import Dataset, Audio
from tqdm import tqdm
from transformers.pipelines.pt_utils import KeyDataset

files = [...]  # a list of audio paths
dataset = Dataset.from_dict({"audio": files}).cast_column("audio", Audio(sampling_rate=16000))

# reuses `model` and `torch` from the first snippet
transcriptions = []
with torch.no_grad():
    for tt in tqdm(model(KeyDataset(dataset, "audio"), batch_size=10, truncation=True)):
        transcriptions.append(tt)
```
-
In my situation, when I use the first solution the result is different than when I use the code you provided in the colab with the same datasets. Can you explain why, @sanchit-gandhi? Thank you in advance!
-
Hi @sanchit-gandhi, how could I enable the profanity filter in Whisper? Is it supported in Transformers?
-
Hi friends, I want to use this Whisper model for language identification, but the language I want it to recognize is a low-resource one, so I have to fine-tune it on a custom data source. Can you help me? I look forward to hearing from you, and thank you. @sanchit-gandhi
-
Based on this guide https://huggingface.co/blog/fine-tune-whisper, I tried to fine-tune the "small" and "large-v3" models.
-
Whisper in 🤗 Transformers
Whisper is available in the Hugging Face Transformers library from version 4.23.1, with both PyTorch and TensorFlow implementations. All the official checkpoints can be found on the Hugging Face Hub, alongside documentation and example scripts.
Fine-Tuning
Using the 🤗 Trainer, Whisper can be fine-tuned for speech recognition and speech translation tasks, boosting the performance of the model, especially on low-resource languages. Refer to the blog post for a complete guide on fine-tuning Whisper. If you're interested in fine-tuning Whisper in your language, join us for our two-week Whisper fine-tuning event!
Evaluation
See the following example for evaluating Whisper on the LibriSpeech ASR dataset.
First, install the relevant Hugging Face packages:
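A likely equivalent of the install step, assuming the usual evaluation stack (datasets, evaluate and jiwer for the WER metric, plus librosa/soundfile for audio decoding):

```bash
pip install --upgrade transformers datasets evaluate jiwer librosa soundfile
```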
Next, run the Python code cell to evaluate on the "test-clean" subset of LibriSpeech. You can change the model checkpoint to any one of the official checkpoints on the Hugging Face Hub.
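A sketch of such an evaluation cell, assuming the whisper-tiny.en checkpoint and Whisper's English text normaliser (the checkpoint and normalisation details are illustrative):

```python
import torch
from datasets import load_dataset
from evaluate import load
from transformers import WhisperForConditionalGeneration, WhisperProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny.en")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny.en").to(device)

dataset = load_dataset("librispeech_asr", "clean", split="test")
wer_metric = load("wer")

predictions, references = [], []
for sample in dataset:
    input_features = processor(
        sample["audio"]["array"], sampling_rate=16000, return_tensors="pt"
    ).input_features.to(device)
    with torch.no_grad():
        predicted_ids = model.generate(input_features)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    # normalise both sides so casing/punctuation differences don't inflate the WER
    predictions.append(processor.tokenizer.normalize(transcription))
    references.append(processor.tokenizer.normalize(sample["text"]))

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2f} %")
```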
Print output: the word error rate (WER) of the chosen checkpoint on the "test-clean" subset.
Multi-Dataset Evaluation
We provide a Google Colab for evaluating Whisper on eight English speech recognition datasets in one script. This serves as a template for performing multi-dataset evaluation in a style similar to the official Whisper paper.