
New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements #856

Merged
merged 145 commits
Jul 18, 2024
Commits
145 commits
fc54cb9
seed, multilingual and fixes
Jiltseb Jun 9, 2023
84d58fa
added languages in tokenizer
Jiltseb Jun 14, 2023
63bea66
multilingual fixes
Jiltseb Jun 21, 2023
b95d694
vocabulary extension fix for downloads
Jiltseb Jun 21, 2023
a8626bb
code fixes for multilingual
Jiltseb Jun 28, 2023
c2ca8d4
Squash long words at window and sentence boundaries
Jiltseb Jul 4, 2023
9edf960
added commits specifying changes to original package
Jiltseb Jul 26, 2023
d008650
seed, multilingual and fixes
Jiltseb Jun 9, 2023
2573982
added languages in tokenizer
Jiltseb Jun 14, 2023
8add326
multilingual fixes
Jiltseb Jun 21, 2023
afc3f5c
vocabulary extension fix for downloads
Jiltseb Jun 21, 2023
dd55c03
code fixes for multilingual
Jiltseb Jun 28, 2023
d34780e
Squash long words at window and sentence boundaries
Jiltseb Jul 4, 2023
9fab8d9
added commits specifying changes to original package
Jiltseb Jul 26, 2023
162fbf0
modifications based on review
Jiltseb Jul 28, 2023
ca6a2ba
removed LANGUAGES from tokenizer and added numpy requirements
Jiltseb Oct 6, 2023
0df6953
Merge remote-tracking branch 'upstream/master'
Jiltseb Oct 9, 2023
988c528
Merge local master to 'updated_js_v2.1'
Jiltseb Oct 9, 2023
443eb86
Merge pull request #1 from mobiusml/js_asr_v2.1_pr
Jiltseb Oct 9, 2023
6a51407
Update requirements.txt
Jiltseb Oct 9, 2023
4138e16
Merge pull request #2 from SYSTRAN/master
Jiltseb Dec 12, 2023
b906a98
changes to README.md
Jiltseb Dec 13, 2023
0464122
Added BatchedInferencePipeline
Jiltseb Dec 13, 2023
78b5cd7
Added language detection from multiple segments and batched inference…
Jiltseb Dec 13, 2023
f397e37
added additional packages
Jiltseb Dec 13, 2023
83895ac
changes to batched inference based on the review
Jiltseb Dec 20, 2023
e1c1699
change in silence detection
Jiltseb Dec 21, 2023
b516bc8
Merge pull request #3 from mobiusml/batched_asr
Jiltseb Dec 22, 2023
3477d86
Merge pull request #4 from SYSTRAN/master
Jiltseb Jan 22, 2024
95df9eb
added logic for torchaudio based feature extraction
Jiltseb Jan 23, 2024
0cc2d1d
added requirements
Jiltseb Jan 23, 2024
d6624ff
added feature extraction in README
Jiltseb Jan 23, 2024
fa69694
Merge pull request #5 from mobiusml/add_new_feat_extract
Jiltseb Jan 23, 2024
6698a9a
removing unwanted dataclasses and non-generator transcribe function, …
Jiltseb Mar 19, 2024
1b6376f
Merge remote-tracking branch systran/faster_whisper 'upstream/master'…
Jiltseb Mar 19, 2024
92867e3
uses same type annotation as faster_whisper for batched transcribe, c…
Jiltseb Mar 25, 2024
8452cf2
added jsons for dict conversion
Jiltseb Mar 25, 2024
4535963
made vad_segments as optional parameter, modified docstring
Jiltseb Mar 25, 2024
95671d2
made default batched asr options optional as this can be taken care d…
Jiltseb Mar 25, 2024
5fa21b8
Merge pull request #7 from mobiusml/fixes_and_update
Jiltseb Mar 26, 2024
b421086
Update requirements.txt
Jiltseb Mar 26, 2024
16d54e5
Update requirements.txt
Jiltseb Mar 26, 2024
827df36
Update requirements.txt
Jiltseb Mar 27, 2024
911c62d
Update requirements.txt
Jiltseb Mar 27, 2024
fcf8519
merging with systran fw
Jiltseb Apr 8, 2024
e288337
adding vad model and defaults for language detection
Jiltseb Apr 8, 2024
9c85222
adding utility functions for vad model
Jiltseb Apr 8, 2024
21f4640
add pyannote dependency
Jiltseb Apr 8, 2024
eff5e23
adding VAD model, tests and update README
Jiltseb Apr 9, 2024
caaa593
update requirements
Jiltseb Apr 10, 2024
538366b
Merge pull request #8 from mobiusml/fw_pr
Jiltseb Apr 11, 2024
c41e4f2
added 'use_vad_model' to better handle vad segments
Jiltseb Apr 12, 2024
0e8fa00
Update error message
Jiltseb Apr 12, 2024
0d6c62e
Merge pull request #9 from mobiusml/fw_pr
Jiltseb Apr 12, 2024
56d68a1
added gpu implementation for vad by default
Jiltseb Apr 28, 2024
2812d99
adding a vad_device, modifying vad_url
Jiltseb Apr 29, 2024
1cd3c60
adding get_device function
Jiltseb Apr 29, 2024
3f27636
Merge pull request #10 from mobiusml/fw_pr_compliance
Jiltseb Apr 29, 2024
93c327d
updating the fork
Jiltseb May 17, 2024
2152d11
Merge remote-tracking branch 'upstream/master' into pr_expt
Jiltseb May 22, 2024
10242fc
updated version, credits to whisper-x, model made optional
Jiltseb May 22, 2024
2dde3c9
Merge branch 'master' into fw_compliance
Jiltseb May 22, 2024
8fd2ec0
Merge pull request #11 from mobiusml/fw_compliance
Jiltseb May 24, 2024
0fd5003
added compatibility for python 3.8
Jiltseb May 24, 2024
9d70f0f
Reformatted the code
Jiltseb May 24, 2024
d263cbd
Merge pull request #12 from mobiusml/fw_compliance
Jiltseb May 24, 2024
c9e5f3b
making default vad_device same as asr model device
Jiltseb May 24, 2024
883be4d
added docstring
Jiltseb May 24, 2024
18bdaa8
added docstring
Jiltseb May 24, 2024
b10b8cb
Merge pull request #13 from mobiusml/fw_compliance
Jiltseb May 24, 2024
ce21fc7
Merge remote-tracking branch 'upstream/master'
Jiltseb Jun 11, 2024
afcc0f6
changes after review suggestions: remove redundant info, add vad mode…
Jiltseb Jun 11, 2024
0b63e22
modified timings for edge padding
Jiltseb Jun 12, 2024
e3dc61d
adding word_timestamps for batched version
Jiltseb Jun 12, 2024
c694174
remove the input dictionary in place modification
Jiltseb Jun 13, 2024
a0d3891
adding model file
Jiltseb Jun 13, 2024
3c22842
Merge pull request #14 from mobiusml/fw_changes
Jiltseb Jun 13, 2024
d30b377
removing clip_timestamps and redundant info, minor typos
Jiltseb Jun 17, 2024
9937ab7
Merge pull request #15 from mobiusml/fw_changes
Jiltseb Jun 17, 2024
5c3e6f2
test scripts for word level timestamps, audios less than chunk_length…
Jiltseb Jun 18, 2024
d1f4a7e
added code validation
Jiltseb Jun 18, 2024
46310af
Merge pull request #16 from mobiusml/fw_changes
Jiltseb Jun 18, 2024
7498451
Update MANIFEST.in to include pyannote asset
hargunmujral Jun 20, 2024
307de38
Merge pull request #17 from hargunmujral/patch-1
Jiltseb Jun 20, 2024
17e30a4
.
MahmoudAshraf97 Jun 20, 2024
46532fc
Merge branch 'mobiusml:master' into master
MahmoudAshraf97 Jun 20, 2024
ad2379b
remove tokenizer reinitialization
MahmoudAshraf97 Jun 20, 2024
abcbedd
remove the need for a separate `encode_batched` function
MahmoudAshraf97 Jun 21, 2024
f584a6c
fix flake8 error
MahmoudAshraf97 Jun 21, 2024
1bd1bf7
Added punctuation changes in word_timestamps, removed jsons requirement
Jiltseb Jun 21, 2024
ebf7b65
enable word timestamps using original functions
MahmoudAshraf97 Jun 21, 2024
7f84e34
* remove `PyAV` and use `torchaudio` instead, this fixes the memory l…
MahmoudAshraf97 Jun 22, 2024
b54d828
added back `np.ndarray` support for `transcribe`
MahmoudAshraf97 Jun 24, 2024
2c617c2
fix wrong padding scheme leading to very high WER
MahmoudAshraf97 Jun 24, 2024
99d61e0
remove `num_workers` argument from batched `transcribe`
MahmoudAshraf97 Jun 24, 2024
aef4b97
generalized word timestamps function
MahmoudAshraf97 Jun 24, 2024
5fc5fca
remove redundant parameters related to `num_workers`
MahmoudAshraf97 Jun 25, 2024
389da33
fix word timestamps for non-batched inference
MahmoudAshraf97 Jun 25, 2024
2b0a252
support `without_timestamps` in batched mode
MahmoudAshraf97 Jun 25, 2024
f03d8ca
adjust tests
MahmoudAshraf97 Jun 25, 2024
7c38429
fix typing hints for older python versions
MahmoudAshraf97 Jun 25, 2024
579da0e
correct timestamps
MahmoudAshraf97 Jun 26, 2024
8642f1d
use original `Segment` instead of `BatchedSegment`
MahmoudAshraf97 Jun 27, 2024
6e47bd3
* added `duration_after_vad`, `all_language_probs` to `info`
MahmoudAshraf97 Jun 27, 2024
537317f
formatting changes
MahmoudAshraf97 Jun 27, 2024
74db8be
.
MahmoudAshraf97 Jun 27, 2024
fcf0e82
remove `float16` conversion in feature extractor as it led to halluci…
MahmoudAshraf97 Jun 27, 2024
9f78b36
enable running benchmark from anywhere
MahmoudAshraf97 Jun 29, 2024
d95c7a6
review feature extraction implementation
MahmoudAshraf97 Jun 29, 2024
968057e
formatting fixes
MahmoudAshraf97 Jun 29, 2024
eff81f5
Merge pull request #18 from MahmoudAshraf97/master
Jiltseb Jul 1, 2024
71fca47
Merge remote-tracking branch 'origin/master' into final_changes
Jiltseb Jul 1, 2024
369f297
black tool reformats
Jiltseb Jul 1, 2024
248d517
Merge remote-tracking branch 'upstream/master' into final_changes
Jiltseb Jul 1, 2024
647c092
revert silero change to master
Jiltseb Jul 1, 2024
923c5d9
moving language_id functions to WhisperModel class and removing other…
Jiltseb Jul 1, 2024
70346ca
evaluate lang_detect to a false boolean if not found
Jiltseb Jul 1, 2024
3235640
review changes
MahmoudAshraf97 Jul 1, 2024
781c051
Merge branch 'mobiusml:master' into master
MahmoudAshraf97 Jul 1, 2024
aea77b1
Merge pull request #21 from MahmoudAshraf97/master
Jiltseb Jul 1, 2024
c26e4e2
Merge remote-tracking branch 'origin/master' into final_changes
Jiltseb Jul 1, 2024
059d849
rename detect_language to detect_langauge_function in WhisperModel
Jiltseb Jul 1, 2024
5c6f6b5
Merge pull request #20 from mobiusml/fw_changes
Jiltseb Jul 1, 2024
3a63df0
fix conflicts with systran master
MahmoudAshraf97 Jul 5, 2024
3271a4a
Merge pull request #23 from MahmoudAshraf97/master
Jiltseb Jul 5, 2024
e57b5ca
Merge remote-tracking branch 'systran_master/master'
MahmoudAshraf97 Jul 5, 2024
2fc6c50
.
MahmoudAshraf97 Jul 5, 2024
8bdbca0
rename `chunk_size` to `chunk_length` for consistency
MahmoudAshraf97 Jul 5, 2024
b94bd93
Merge branch 'master' into master
Jiltseb Jul 5, 2024
fec8c4e
Merge pull request #24 from MahmoudAshraf97/master
Jiltseb Jul 5, 2024
ad080cd
review comments
MahmoudAshraf97 Jul 5, 2024
aef5869
.
MahmoudAshraf97 Jul 5, 2024
9b39b73
fixing docstring
MahmoudAshraf97 Jul 5, 2024
1dcf0c9
Merge pull request #25 from MahmoudAshraf97/master
Jiltseb Jul 5, 2024
e988ac6
fix usage with english-only models
MahmoudAshraf97 Jul 6, 2024
b3c1ace
Merge pull request #26 from MahmoudAshraf97/master
Jiltseb Jul 8, 2024
c51b877
added licensing comments in the doc and the code
Jiltseb Jul 10, 2024
7a90ab8
Merge pull request #27 from mobiusml/fw_changes
Jiltseb Jul 10, 2024
3fd6f7c
added formatting checks
Jiltseb Jul 10, 2024
6a87d85
Merge pull request #28 from mobiusml/fw_changes
Jiltseb Jul 10, 2024
4681caa
update license info
Jiltseb Jul 11, 2024
62bb5f0
Merge pull request #29 from mobiusml/fw_changes
Jiltseb Jul 11, 2024
bb6696b
.
MahmoudAshraf97 Oct 2, 2024
5e6a426
remove duplicate `detect_language` function
MahmoudAshraf97 Oct 2, 2024
3ffb18f
Merge pull request #22 from MahmoudAshraf97/master
Jiltseb Jul 2, 2024
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -1,3 +1,4 @@
include faster_whisper/assets/silero_vad.onnx
include requirements.txt
include requirements.conversion.txt
include faster_whisper/assets/pyannote_vad_model.bin
30 changes: 29 additions & 1 deletion README.md
@@ -69,7 +69,6 @@ segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")

* Python 3.8 or greater

Unlike openai-whisper, FFmpeg does **not** need to be installed on the system. The audio is decoded with the Python library [PyAV](https://github.com/PyAV-Org/PyAV) which bundles the FFmpeg libraries in its package.

### GPU

@@ -166,6 +165,35 @@
segments, _ = model.transcribe("audio.mp3")
segments = list(segments) # The transcription will actually run here.
```

### Multi-segment language detection

To use the model directly for improved language detection, run the following code snippet:

```python
from faster_whisper import WhisperModel
model = WhisperModel("medium", device="cuda", compute_type="float16")
language_info = model.detect_language_multi_segment("audio.mp3")
```

### Batched faster-whisper


The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX), licensed under the BSD-2-Clause license, and integrates its VAD model into this library. We modified this implementation and replaced the feature extraction with a faster torch-based implementation. The batched version improves speed by up to 10-12x compared to the OpenAI implementation and 3-4x compared to the sequential faster_whisper version. It works by transcribing semantically meaningful audio chunks as batches, leading to faster inference.

The following code snippet illustrates how to run inference with the batched version on an example audio file. Please also refer to the test scripts for batched faster-whisper.

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
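
Per the commits in this PR, the batched pipeline also gained word-level timestamp support. Assuming it uses the same `word_timestamps` flag as the sequential API (an assumption based on the commit log, not documented above), usage would look like:

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("medium", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# word_timestamps=True is assumed to be supported in batched mode,
# based on the "adding word_timestamps for batched version" commit.
segments, info = batched_model.transcribe(
    "audio.mp3", batch_size=16, word_timestamps=True
)

for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
```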

### Faster Distil-Whisper

The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
5 changes: 4 additions & 1 deletion benchmark/wer_benchmark.py
@@ -1,5 +1,6 @@
import argparse
import json
import os

from datasets import load_dataset
from evaluate import load
@@ -26,7 +27,9 @@

# define the evaluation metric
wer_metric = load("wer")
normalizer = EnglishTextNormalizer(json.load(open("normalizer.json")))

with open(os.path.join(os.path.dirname(__file__), "normalizer.json"), "r") as f:
normalizer = EnglishTextNormalizer(json.load(f))


def inference(batch):
Expand Down
3 changes: 2 additions & 1 deletion faster_whisper/__init__.py
@@ -1,12 +1,13 @@
from faster_whisper.audio import decode_audio
from faster_whisper.transcribe import WhisperModel
from faster_whisper.transcribe import BatchedInferencePipeline, WhisperModel
from faster_whisper.utils import available_models, download_model, format_timestamp
from faster_whisper.version import __version__

__all__ = [
"available_models",
"decode_audio",
"WhisperModel",
"BatchedInferencePipeline",
"download_model",
"format_timestamp",
"__version__",
Binary file added faster_whisper/assets/pyannote_vad_model.bin
Binary file not shown.
105 changes: 22 additions & 83 deletions faster_whisper/audio.py
@@ -1,19 +1,7 @@
"""We use the PyAV library to decode the audio: https://github.com/PyAV-Org/PyAV

The advantage of PyAV is that it bundles the FFmpeg libraries so there is no additional
system dependencies. FFmpeg does not need to be installed on the system.

However, the API is quite low-level so we need to manipulate audio frames directly.
"""

import gc
import io
import itertools

from typing import BinaryIO, Union

import av
import numpy as np
import torch
import torchaudio


def decode_audio(
@@ -29,91 +17,42 @@ def decode_audio(
split_stereo: Return separate left and right channels.

Returns:
A float32 Numpy array.
A float32 Torch Tensor.

If `split_stereo` is enabled, the function returns a 2-tuple with the
separated left and right channels.
"""
resampler = av.audio.resampler.AudioResampler(
format="s16",
layout="mono" if not split_stereo else "stereo",
rate=sampling_rate,
)

raw_buffer = io.BytesIO()
dtype = None

with av.open(input_file, mode="r", metadata_errors="ignore") as container:
frames = container.decode(audio=0)
frames = _ignore_invalid_frames(frames)
frames = _group_frames(frames, 500000)
frames = _resample_frames(frames, resampler)

for frame in frames:
array = frame.to_ndarray()
dtype = array.dtype
raw_buffer.write(array)

# It appears that some objects related to the resampler are not freed
# unless the garbage collector is manually run.
del resampler
gc.collect()

audio = np.frombuffer(raw_buffer.getbuffer(), dtype=dtype)

# Convert s16 back to f32.
audio = audio.astype(np.float32) / 32768.0
waveform, audio_sf = torchaudio.load(input_file) # waveform: channels X T

if audio_sf != sampling_rate:
waveform = torchaudio.functional.resample(
waveform, orig_freq=audio_sf, new_freq=sampling_rate
)
if split_stereo:
left_channel = audio[0::2]
right_channel = audio[1::2]
return left_channel, right_channel

return audio


def _ignore_invalid_frames(frames):
iterator = iter(frames)

while True:
try:
yield next(iterator)
except StopIteration:
break
except av.error.InvalidDataError:
continue


def _group_frames(frames, num_samples=None):
fifo = av.audio.fifo.AudioFifo()

for frame in frames:
frame.pts = None # Ignore timestamp check.
fifo.write(frame)

if num_samples is not None and fifo.samples >= num_samples:
yield fifo.read()

if fifo.samples > 0:
yield fifo.read()

return waveform[0], waveform[1]

def _resample_frames(frames, resampler):
# Add None to flush the resampler.
for frame in itertools.chain(frames, [None]):
yield from resampler.resample(frame)
return waveform.mean(0)


def pad_or_trim(array, length: int, *, axis: int = -1):
"""
Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
"""
axis = axis % array.ndim
if array.shape[axis] > length:
array = array.take(indices=range(length), axis=axis)
idx = [Ellipsis] * axis + [slice(length)] + [Ellipsis] * (array.ndim - axis - 1)
return array[idx]

if array.shape[axis] < length:
pad_widths = [(0, 0)] * array.ndim
pad_widths[axis] = (0, length - array.shape[axis])
array = np.pad(array, pad_widths)
pad_widths = (
[
0,
]
* array.ndim
* 2
)
pad_widths[2 * axis] = length - array.shape[axis]
array = torch.nn.functional.pad(array, tuple(pad_widths[::-1]))

return array
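
For reference, the torch-based `pad_or_trim` in the hunk above can be exercised standalone. This sketch reproduces its trim/pad logic (using an explicit tuple index, a minor safety tweak on our part):

```python
import torch


def pad_or_trim(array: torch.Tensor, length: int, *, axis: int = -1) -> torch.Tensor:
    # Trim along `axis` when the tensor is too long; zero-pad on the
    # right of that axis when it is too short.
    axis = axis % array.ndim
    if array.shape[axis] > length:
        idx = (
            [Ellipsis] * axis
            + [slice(length)]
            + [Ellipsis] * (array.ndim - axis - 1)
        )
        return array[tuple(idx)]
    if array.shape[axis] < length:
        # F.pad takes (last_dim_left, last_dim_right, ...), hence the reversal.
        pad_widths = [0] * (array.ndim * 2)
        pad_widths[2 * axis] = length - array.shape[axis]
        array = torch.nn.functional.pad(array, tuple(pad_widths[::-1]))
    return array


x = torch.ones(2, 5)
print(pad_or_trim(x, 8).shape)  # torch.Size([2, 8])
print(pad_or_trim(x, 3).shape)  # torch.Size([2, 3])
```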