Merge pull request #8 from mobiusml/fw_pr

Updates to mobius version to comply with SYSTRAN version
mobiusml · Apr 11, 2024 · 538366b
2 parents 911c62d + caaa593

Showing 14 changed files with 860 additions and 306 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -7,7 +7,7 @@ Contributions are welcome! Here are some pointers to help you install the librar
 We recommend installing the module in editable mode with the `dev` extra requirements:
 
 ```bash
-git clone https://github.com/guillaumekln/faster-whisper.git
+git clone https://github.com/SYSTRAN/faster-whisper.git
 cd faster-whisper/
 pip install -e .[dev]
 ```

diff --git a/LICENSE b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright (c) 2023 Guillaume Klein
+Copyright (c) 2023 SYSTRAN
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

diff --git a/README.md b/README.md
@@ -1,21 +1,11 @@
-[![CI](https://github.com/guillaumekln/faster-whisper/workflows/CI/badge.svg)](https://github.com/guillaumekln/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)
+[![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)
 
 # Mobius Faster Whisper transcription with CTranslate2
 
 **faster-whisper** is a reimplementation of OpenAI's Whisper model using [CTranslate2](https://github.com/OpenNMT/CTranslate2/), which is a fast inference engine for Transformer models.
 
 This implementation is up to 4 times faster than [openai/whisper](https://github.com/openai/whisper) for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.
 
-Mobius faster-whisper builds on top of faster-whisper v0.10.0 (latest stable version) and support additional functionalities:
-
-- Handling multilingual videos.
-- Seed fixing for consistency across runs.
-- Use `log_prob_low_threshold` to skip ambiguous segments from transcription.
-- Better language prediction using multiple audio segments.
-- Batched inference for faster transcription: Around 100x real time speed.
-- Streaming (segment-level) or non-streaming options for Batched inference.
-- Option for faster feature extraction with torchaudio.
-
 ## Benchmark
 
 ### Whisper
@@ -24,7 +14,7 @@ For reference, here's the time and memory usage that are required to transcribe
 
 * [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
 * [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[3b010f9](https://github.com/ggerganov/whisper.cpp/commit/3b010f9bed9a6068609e9faf52383aea792b0362)
-* [faster-whisper](https://github.com/guillaumekln/faster-whisper)@[cce6b53e](https://github.com/guillaumekln/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
+* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[cce6b53e](https://github.com/SYSTRAN/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
 
 ### Large-v2 model on GPU
 
@@ -127,13 +117,13 @@ pip install faster-whisper
 ### Install the master branch
 
 ```bash
-pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
+pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
 ```
 
 ### Install a specific commit
 
 ```bash
-pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
+pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
 ```
 
 </details>
@@ -169,18 +159,53 @@ for segment in segments:
 segments, _ = model.transcribe("audio.mp3")
 segments = list(segments)  # The transcription will actually run here.
 ```
-### Faster-distil-whisper
-For usage of `faster-distil-whisper`, please refer to: https://github.com/guillaumekln/faster-whisper/issues/533
+
+### Multi-segment language detection
+
+To use the model directly for improved language detection, use the following code snippet:
 
 ```python
-model_size = "distil-large-v2"
-# model_size = "distil-medium.en"
+from faster_whisper import WhisperModel
+model = WhisperModel("medium", device="cuda", compute_type="float16")
+language_info = model.detect_language_multi_segment("audio.mp3")
+```
+
+### Batched faster-whisper
+
+The batched version of faster-whisper is inspired by [whisper-x](https://github.com/m-bain/whisperX), which is licensed under the BSD-4 Clause license, and by kaldi-based feature extraction. It improves speed by up to 10x compared to the OpenAI implementation. It works by transcribing semantically meaningful audio chunks in batches, which leads to faster inference.
+
+The following code snippet illustrates how to run inference with the batched version on a specified audio file. Please also refer to the test scripts of batched faster whisper.
+
+```python
+from faster_whisper import BatchedInferencePipeline, WhisperModel
+
+model = WhisperModel("medium", device="cuda", compute_type="float16")
+batched_model = BatchedInferencePipeline(model=model)
+result = batched_model.transcribe("audio.mp3", batch_size=16)
+
+for segment, info in result:
+    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
+```
+
+### Faster Distil-Whisper
+
+The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
+checkpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. The following code snippet 
+demonstrates how to run inference with distil-large-v3 on a specified audio file:
+
+```python
+from faster_whisper import WhisperModel
+
+model_size = "distil-large-v3"
+
 model = WhisperModel(model_size, device="cuda", compute_type="float16")
-segments, info = model.transcribe("audio.mp3", beam_size=5, 
-    language="en", max_new_tokens=128, condition_on_previous_text=False)
+segments, info = model.transcribe("audio.mp3", beam_size=5, language="en", condition_on_previous_text=False)
 
+for segment in segments:
+    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
 ```
-NOTE: Empirically, `condition_on_previous_text=True` will degrade the performance of `faster-distil-whisper` for long audio. Degradation on the first chunk was observed with `initial_prompt` too.
+
+For more information about the distil-large-v3 model, refer to the original [model card](https://huggingface.co/distil-whisper/distil-large-v3).
 
 ### Word-level timestamps
 
@@ -200,7 +225,7 @@ The library integrates the [Silero VAD](https://github.com/snakers4/silero-vad)
 segments, _ = model.transcribe("audio.mp3", vad_filter=True)
 ```
 
-The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
+The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
 
 ```python
 segments, _ = model.transcribe(
@@ -223,7 +248,7 @@ logging.getLogger("faster_whisper").setLevel(logging.DEBUG)
 
 ### Going further
 
-See more model and transcription options in the [`WhisperModel`](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
+See more model and transcription options in the [`WhisperModel`](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
 
 ## Community integrations
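
The "Batched faster-whisper" section added in this README diff describes transcribing audio chunks in batches. As a rough illustration of the grouping step only, here is a sketch in plain Python. `make_batches` and the `(start, end)` chunk tuples are hypothetical and do not reflect how `BatchedInferencePipeline` is actually implemented:

```python
from typing import Iterator, List, Tuple

# Hypothetical chunk representation: (start_seconds, end_seconds),
# for example as produced by a voice activity detection pass.
Chunk = Tuple[float, float]


def make_batches(chunks: List[Chunk], batch_size: int) -> Iterator[List[Chunk]]:
    """Yield consecutive groups of at most `batch_size` chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for i in range(0, len(chunks), batch_size):
        yield chunks[i : i + batch_size]


chunks = [(0.0, 4.2), (4.2, 9.8), (11.0, 15.5), (15.5, 21.0), (23.1, 27.4)]
batches = list(make_batches(chunks, batch_size=2))
for batch in batches:
    # Print the batch size and the time span it covers.
    print(len(batch), batch[0][0], batch[-1][1])
```

In a real pipeline each group would presumably be decoded in a single forward pass, so a larger `batch_size` trades memory for throughput.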
 

diff --git a/faster_whisper/__init__.py b/faster_whisper/__init__.py
@@ -1,5 +1,5 @@
 from faster_whisper.audio import decode_audio
-from faster_whisper.transcribe import WhisperModel, BatchedInferencePipeline
+from faster_whisper.transcribe import BatchedInferencePipeline, WhisperModel
 from faster_whisper.utils import available_models, download_model, format_timestamp
 from faster_whisper.version import __version__
 

diff --git a/faster_whisper/assets/__init__.py b/faster_whisper/assets/__init__.py
diff --git a/faster_whisper/feature_extractor.py b/faster_whisper/feature_extractor.py
@@ -23,7 +23,7 @@ def __init__(
         self.mel_filters = self.get_mel_filters(
             sampling_rate, n_fft, n_mels=feature_size
         )
-        self.n_mels=feature_size
+        self.n_mels = feature_size
 
     def get_mel_filters(self, sr, n_fft, n_mels=128, dtype=np.float32):
         # Initialize the weights
@@ -145,16 +145,16 @@ def stft(self, frames, window):
             data[f] = np.fft.fft(fft_signal, axis=0)[:num_fft_bins]
         return data.T
 
-    def __call__(self, waveform, enable_ta = False, padding=True, chunk_length=None):
+    def __call__(self, waveform, enable_ta=False, padding=True, chunk_length=None):
         """
         Compute the log-Mel spectrogram of the provided audio, gives similar results
-        whisper's original torch implementation with 1e-5 tolerance. Additionally, faster 
+        whisper's original torch implementation with 1e-5 tolerance. Additionally, faster
         feature extraction option using kaldi fbank features are available if torchaudio is
         available.
         """
         if enable_ta:
             waveform = waveform.astype(np.float32)
-        
+
         if chunk_length is not None:
             self.n_samples = chunk_length * self.sampling_rate
             self.nb_max_frames = self.n_samples // self.hop_length
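
The `chunk_length` override above recomputes two derived sizes. A small sketch of that arithmetic follows, assuming Whisper's usual 16 kHz sampling rate and a hop length of 160 samples. Neither value appears in this hunk, and `frame_budget` is a hypothetical helper:

```python
def frame_budget(chunk_length: int, sampling_rate: int = 16000, hop_length: int = 160):
    """Return (n_samples, nb_max_frames) for a chunk of `chunk_length` seconds."""
    # Same two statements as in the hunk above, without the instance state.
    n_samples = chunk_length * sampling_rate
    nb_max_frames = n_samples // hop_length
    return n_samples, nb_max_frames


print(frame_budget(30))  # (480000, 3000)
print(frame_budget(10))  # (160000, 1000)
```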
@@ -165,16 +165,16 @@ def __call__(self, waveform, enable_ta = False, padding=True, chunk_length=None)
         if enable_ta:
             audio = torch.from_numpy(waveform).unsqueeze(0)
             fbank = ta_kaldi.fbank(
-                    audio,
-                    sample_frequency=self.sampling_rate,
-                    window_type="hanning",
-                    num_mel_bins=self.n_mels,
-                )
-            log_spec = fbank.numpy().T.astype(np.float32) #ctranslate does not take 64
-        
-            #normalize
-            
-            #Audioset values as default mean and std for audio
+                audio,
+                sample_frequency=self.sampling_rate,
+                window_type="hanning",
+                num_mel_bins=self.n_mels,
+            )
+            log_spec = fbank.numpy().T.astype(np.float32)  # ctranslate does not take 64
+
+            # normalize
+
+            # Audioset values as default mean and std for audio
             mean_val = -4.2677393
             std_val = 4.5689974
             scaled_features = (log_spec - (mean_val)) / (std_val * 2)
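
The normalization step at the end of this hunk can be reproduced in isolation. The following is a minimal sketch using NumPy only, assuming a random array stands in for the kaldi fbank output. The function name `normalize_fbank` and the input shape are illustrative and not part of the library:

```python
import numpy as np

# Constants copied from the diff above (described there as AudioSet defaults).
MEAN_VAL = -4.2677393
STD_VAL = 4.5689974


def normalize_fbank(log_spec: np.ndarray) -> np.ndarray:
    """Shift by the mean and scale by twice the std, as in the diff above."""
    scaled = (log_spec - MEAN_VAL) / (STD_VAL * 2)
    # Keep float32; the diff's comment notes that ctranslate does not take 64.
    return scaled.astype(np.float32)


# Stand-in for `fbank.numpy().T`: 80 mel bins by 300 frames (assumed shape).
rng = np.random.default_rng(0)
log_spec = rng.normal(loc=MEAN_VAL, scale=STD_VAL, size=(80, 300)).astype(np.float32)
features = normalize_fbank(log_spec)
print(features.shape, features.dtype)
```

Dividing by twice the standard deviation means inputs that match the assumed statistics come out roughly zero-centered with a spread near 0.5.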