whisper module

The whisper module contains the WhisperLearner class, which inherits from the abstract class Learner.

Class WhisperLearner

Bases: engine.learners.Learner

The WhisperLearner class is a wrapper around the Whisper library [1]. The integration focuses on the speech transcription task. Below are the names of the available models with their approximate memory requirements and relative speeds [1].

Size    Parameters  English-only model  Multilingual model  Required VRAM  Relative speed
tiny    39 M        tiny.en             tiny                ~1 GB          ~32x
base    74 M        base.en             base                ~1 GB          ~16x
small   244 M       small.en            small               ~2 GB          ~6x
medium  769 M       medium.en           medium              ~5 GB          ~2x
large   1550 M      N/A                 large               ~10 GB         1x

The .en models for English-only applications tend to perform better, especially for the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.

The WhisperLearner class has the following public methods:

WhisperLearner constructor

WhisperLearner(self, verbose, temperature, compression_ratio_threshold, logprob_threshold,
               no_speech_threshold, condition_on_previous_text, word_timestamps, prepend_punctuations,
               append_punctuations, language, sample_len, best_of, beam_size, patience, length_penalty,
               prompt, prefix, suppress_tokens, suppress_blank, without_timestamps, max_initial_timestamp, fp16, device)

Constructor parameters:

  • verbose: bool
    Whether to display the text being decoded to the console. If True, displays all the details. If False, displays minimal details. If None, does not display anything.
  • temperature: Union[float, Tuple[float, ...]], default=0.0
    Temperature for sampling. It can be a tuple of temperatures, which will be successively used upon failures according to either compression_ratio_threshold or logprob_threshold.
  • compression_ratio_threshold: Optional[float], default=2.4
    If the gzip compression ratio is above this value, treat as failed.
  • logprob_threshold: Optional[float], default=-0.8
    If the average log probability over sampled tokens is below this value, treat as failed.
  • no_speech_threshold: Optional[float], default=0.6
    If the no_speech probability is higher than this value AND the average log probability over sampled tokens is below logprob_threshold, consider the segment as silent.
  • condition_on_previous_text: bool, default=False
    If True, the previous output of the model is provided as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
  • word_timestamps: bool, default=False
    Extract word-level timestamps using the cross-attention pattern and dynamic time warping, and include the timestamps for each word in each segment.
  • prepend_punctuations: str, default="\"'“¿([{-"
    If word_timestamps is True, merge these punctuation symbols with the next word.
  • append_punctuations: str, default="\"'.。,,!!??::”)]}、"
    If word_timestamps is True, merge these punctuation symbols with the previous word.
  • language: Optional[str], default='en'
    Language spoken in the audio, specify None to perform language detection.
  • sample_len: Optional[int], default=None
    Maximum number of tokens to sample.
  • best_of: Optional[int], default=None
    Number of independent samples to collect when sampling with non-zero temperature.
  • beam_size: Optional[int], default=None
    Number of beams in beam search, only applicable when temperature is zero.
  • patience: Optional[float], default=None
    Optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424; the default (1.0) is equivalent to conventional beam search.
  • length_penalty: Optional[float], default=None
    Optional token length penalty coefficient (alpha), as in https://arxiv.org/abs/1609.08144; uses simple length normalization by default.
  • prompt: Optional[Union[str, List[int]]], default=None
    Text or tokens to feed as the prompt; for more info: openai/whisper#117 (comment)
  • prefix: Optional[Union[str, List[int]]], default=None
    Text or tokens to feed as the prefix; for more info: openai/whisper#117 (comment)
  • suppress_tokens: Optional[Union[str, Iterable[int]]], default=-1
    Comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations.
  • suppress_blank: bool, default=True
    Suppress blank outputs.
  • without_timestamps: bool, default=False
    Use <|notimestamps|> to sample text tokens only; the timestamps will be multiples of 30 seconds if the audio file is longer than 30 seconds.
  • max_initial_timestamp: Optional[float], default=1
    Limit the range of timestamp tokens that can be generated at the beginning of a sequence.
  • fp16: bool, default=True
    Whether to perform inference in fp16. fp16 is not available on CPU.
  • device: str, default="cuda"
    Device to use for PyTorch inference, either "cpu" or "cuda".

See tokenizer.py in [1] for all available languages. If the model name already includes .en, the language is set to English.
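For instance, a minimal construction sketch, using only parameters documented above (greedy decoding, English audio, CPU inference):

from opendr.perception.speech_transcription import WhisperLearner

# Greedy decoding (temperature=0.0) for English audio.
# fp16 is disabled because half precision is not available on CPU.
learner = WhisperLearner(
    language="en",
    temperature=0.0,
    fp16=False,
    device="cpu",
)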

WhisperLearner.eval

WhisperLearner.eval(self, dataset, save_path_csv)

This method is used to evaluate the Whisper model on the given dataset.

Returns a dictionary containing evaluation metrics such as word error rate.

Parameters:

  • dataset: DatasetIterator
    A speech dataset.
  • save_path_csv: Optional[str], default=None
    The path to save the evaluation results.
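A usage sketch, assuming speech_dataset is a DatasetIterator over speech samples prepared elsewhere (a hypothetical placeholder, not part of this module):

from opendr.perception.speech_transcription import WhisperLearner

learner = WhisperLearner(language="en")
learner.load(name="tiny.en")

# speech_dataset is a placeholder for a DatasetIterator of speech samples.
metrics = learner.eval(dataset=speech_dataset, save_path_csv="whisper_eval.csv")
print(metrics)  # dictionary of evaluation metrics such as word error rate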

WhisperLearner.infer

WhisperLearner.infer(self, audio, initial_prompt)

This method runs inference on an audio sample. Please call the load() method before calling this method. initial_prompt is a string that can be used to suggest the context of the transcription text. For example: the name of a person that will appear in the transcription.

Returns the transcription as a WhisperTranscription object that contains the transcription text and other side information.

Parameters:

  • audio: Union[Timeseries, np.ndarray, torch.Tensor, str]
    The audio sample as a Timeseries, torch.Tensor, or np.ndarray, or a file path as str.
  • initial_prompt: Optional[str]
    Optional text to provide as a prompt for the first window. This can be used to provide, or "prompt-engineer", a context for transcription, e.g. custom vocabularies or proper nouns, to make it more likely to predict those words correctly.
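A usage sketch; the file name audio.wav and the name "Anna" are hypothetical placeholders:

from opendr.perception.speech_transcription import WhisperLearner

learner = WhisperLearner(language="en")
learner.load(name="tiny.en")

# "audio.wav" and "Anna" are placeholders for your own recording and context.
# The initial prompt hints that this proper noun may occur in the speech.
result = learner.infer(audio="audio.wav", initial_prompt="Anna")
print(result)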

WhisperLearner.load

WhisperLearner.load(self, name, model_path, download_dir, in_memory)

This method loads the Whisper model using Whisper's built-in model loading function, and downloads the model first if necessary.

Parameters:

  • name: Optional[str], default=None
    Name of Whisper model. Could be: tiny.en, tiny, base, base.en, etc.

  • model_path: Optional[str], default=None
    Path to model checkpoint.

  • download_dir: Optional[str], default=None
    Directory to save the downloaded model.

  • in_memory: Optional[bool], default=False
    Whether to load the model in memory.
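Two loading sketches, by name and by local checkpoint path; the directory and file paths below are placeholders:

from opendr.perception.speech_transcription import WhisperLearner

learner = WhisperLearner(language="en")

# Load by name; the checkpoint is downloaded into download_dir if not cached.
learner.load(name="tiny.en", download_dir="./whisper_models")

# Alternatively, load from a local checkpoint file (placeholder path):
# learner.load(model_path="./whisper_models/tiny.en.pt")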

WhisperLearner.download

WhisperLearner.download(self, name, download_dir)

This method downloads the Whisper model.

Parameters:

  • name: Optional[str]
    Name or path of model.
  • download_dir: Optional[str], default=None
    Directory to save the downloaded model.
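For example, a sketch that fetches a checkpoint ahead of time without loading it (the directory is a placeholder):

from opendr.perception.speech_transcription import WhisperLearner

learner = WhisperLearner(language="en")
learner.download(name="tiny.en", download_dir="./whisper_models")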

WhisperLearner.reset

WhisperLearner.reset(self)

This method sets the Whisper model and model name attributes to None. Use it before loading a new model.
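A sketch of switching to a different checkpoint:

from opendr.perception.speech_transcription import WhisperLearner

learner = WhisperLearner(language="en")
learner.load(name="tiny.en")

# Clear the current model and its name before loading another checkpoint.
learner.reset()
learner.load(name="base.en")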

Examples

  • Download and load a model by its name and infer a sample from an existing file.
import librosa
import numpy as np

from opendr.engine.data import Timeseries
from opendr.perception.speech_transcription import WhisperLearner

learner = WhisperLearner(language="en")
learner.load(name="tiny.en")

# Assuming you have recorded your own voice sample in audio.wav in the current directory
signal, sampling_rate = librosa.load("audio.wav", sr=learner.sample_rate)
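# Wrap the 1-D signal in a 2-D array, as expected by Timeseries.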
signal = np.expand_dims(signal, axis=0)
timeseries = Timeseries(signal)
result = learner.infer(timeseries)
print(result)

References

[1] GitHub: openai/whisper, https://github.com/openai/whisper.