Options
The documentation below details how to customize the app. If you are just getting started, check the getting started guide first.
You can choose the following actions:
- Stay on top
- Hide
- Exit
You can choose the following actions:
- Settings (Shortcut: `F2`) - Open the settings menu
- Log (Shortcut: `Ctrl + F1`) - Open the log window
- Export Directory - Open the export directory
- Log Directory - Open the log directory
- Model Directory - Open the model directory
You can choose the following actions:
- Transcribed speech subtitle window (Shortcut: `F3`) - Shows the result of the transcription in the recording session, but in a detached window, just like a subtitle box.
- Translated speech subtitle window (Shortcut: `F4`) - Shows the result of the translation in the recording session, but in a detached window, just like a subtitle box.

Preview:

Windows users can further customize it to remove the background by right-clicking the window and choosing the clickthrough/transparent option.
You can choose the following actions:
- About (Shortcut: `F1`)
- Open Documentation / Wiki
- Visit Repository
Select the model for transcription. You can choose between the following:
- Tiny
- Base
- Small
- Medium
- Large

Each model has different requirements and produces different results. For more information, you can check the whisper repository directly.
Select the method for translation.
- Whisper (to English only, from the 99 available languages)
- Google Translate (133 target languages, 94 of which are compatible with whisper as the source language)
- LibreTranslate v1.5.1 (45 target languages, 43 of which are compatible with whisper as the source language)
- MyMemoryTranslator (127 target languages, 93 of which are compatible with whisper as the source language)
Set the language to translate from. The languages available in this option will differ depending on the method selected in the Translate option.
Set the language to translate to. The languages available in this option will differ depending on the method selected in the Translate option.
Swap the languages in the From and To options. This will also swap the textbox results.
Clear the textbox result.
Set the device Host API for recording.
Set the mic device for recording. The available devices will differ depending on the Host API selected in the HostAPI option.
Set the speaker device for recording. The available devices will differ depending on the Host API selected in the HostAPI option. (Only on Windows 8 and above)
Set the task to do for recording. The available tasks are:
- Transcribe
- Translate
Set the input for recording. The available inputs are:
- Microphone
- Speaker
Copy the textbox result.
Open the tool dropdown menu. The available tools are:
- Export recorded results
- Align results
- Refine results
- Translate results
Start recording. The button will change to Stop
when recording.
Import a file to transcribe; this will open its own modal window.
Whether to check for a new update on every app startup. (Default checked)
Whether to suppress the notification showing that the app is now hidden to the tray. (Default unchecked)
Whether to suppress any device-related warnings that might show up. (Default unchecked)
Set the log folder location by pressing the button on the right. Available actions:
- Open folder
- Change log folder
- Set back to default
- Empty log folder
Whether to log the record session verbosely. (Default unchecked)
Whether to keep the log files. If unchecked, the log files will be deleted every time the app runs. (Default unchecked)
Set the log level. (Default DEBUG)
Whether to show the debug log for the record session. Enabling this might slow down the app. (Default unchecked)
Whether to save the audio recorded during the record session into the debug folder. The audio will be saved as .wav in the debug folder, and if unchecked it will be deleted automatically on every run. Enabling this might slow down the app. (Default unchecked)
Whether to show the debug log for the translate session. (Default unchecked)
Set the model folder location by pressing the button on the right. Available actions:
- Open folder
- Change model folder
- Set back to default
- Empty model folder
- Download model
You can download a model by pressing the download button. Each model has different requirements and produces different results. You can read more about it here.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |
The Whisper model has 3 large versions (v1, v2, and v3), but for now faster-whisper only supports v1 and v2. You can read more about it here.
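As a rough illustration of the table above, a helper could pick the largest model that fits the available VRAM. This is a sketch only: `pick_model` is a hypothetical helper, not part of the app, and the thresholds simply mirror the table.

```python
# Approximate VRAM requirements (in GB), taken from the table above.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
# Ordered largest-first so we prefer the most capable model that fits.
PREFERENCE = ["large", "medium", "small", "base", "tiny"]

def pick_model(available_vram_gb: float) -> str:
    """Return the largest model that fits in the given VRAM budget."""
    for name in PREFERENCE:
        if MODEL_VRAM_GB[name] <= available_vram_gb:
            return name
    return "tiny"  # fall back to the smallest model
```

For example, a GPU with 6 GB of VRAM would get `medium`, while anything with 10 GB or more can run `large`.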
Note
Speaker input only works on Windows 8 and above.
Set the sample rate for the input device. (Default 16000)
Set the channels for the input device. (Default 1)
Set the chunk size for the input device. (Default 1024)
Whether to automatically set the sample rate based on the input device. (Default unchecked for microphone and checked for speaker)
Whether to automatically set the channels based on the input device. (Default unchecked for microphone and checked for speaker)
Set the rate for transcribing the audio in milliseconds. (Default 300)
Conversion method used to feed audio to the whisper model. (Default is Numpy Array)
Use a numpy array to feed the model. This method is faster because there is no need to write the audio out to a wav file.
Use a temporary wav file to feed the model. This might slow down the process because of the file I/O operation, but it might help fix device-related errors (which rarely happen). When both VAD and Demucs are enabled in a record session, this option will be used automatically.
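As a sketch of the numpy-array approach (not the app's exact code): 16-bit PCM bytes from the recording device can be converted directly to the float32 array in the range [-1.0, 1.0] that whisper expects, with no intermediate wav file.

```python
import numpy as np

def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Convert 16-bit little-endian PCM bytes to float32 samples in [-1.0, 1.0]."""
    samples = np.frombuffer(raw, dtype=np.int16)
    # int16 range is [-32768, 32767]; dividing by 32768 normalizes it.
    return samples.astype(np.float32) / 32768.0
```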
Set the maximum buffer size for the audio in seconds. (Default 10)
Set the maximum number of sentences. One sentence equals one buffer, so if the max buffer is 10 seconds, the words within those 10 seconds form one sentence. (Default 5)
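The max-sentences behavior can be sketched with a bounded deque (an illustration of the setting, not the app's actual data structure): once the limit is reached, the oldest sentence is dropped.

```python
from collections import deque

# With max sentences = 3, only the 3 most recent sentences are kept.
sentences = deque(maxlen=3)
for s in ["one", "two", "three", "four"]:
    sentences.append(s)  # the oldest entry is evicted automatically

print(list(sentences))  # → ['two', 'three', 'four']
```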
Whether to enable the threshold. If enabled, the app will only transcribe the audio if it is above the threshold. (Default checked)
If set to auto, VAD (voice activity detection) will be used for the threshold. The VAD uses WebRTC VAD through py-webrtcvad. (Default checked)
If set to auto, the user needs to select the VAD sensitivity to filter out noise; the higher the sensitivity, the more noise is filtered out. If not set to auto, the user needs to set the threshold manually.
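When the threshold is set manually (the non-auto mode), the check amounts to comparing a chunk's loudness against a cutoff. A minimal sketch using RMS energy in dBFS follows; `is_above_threshold` is a hypothetical helper, not the app's actual implementation.

```python
import math

def is_above_threshold(samples: list[float], threshold_db: float = -40.0) -> bool:
    """Return True if the chunk's RMS level (in dBFS) exceeds the threshold.

    `samples` are float samples in [-1.0, 1.0]; pure silence maps to -inf dBFS.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    return level_db > threshold_db
```

A loud chunk (around -6 dBFS) passes the default -40 dBFS cutoff, while near-silence (around -60 dBFS) is skipped.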
Set the separator for the text result. (Default `\n`)
Whether to use faster-whisper. (Default checked)
Set the decoding preset. (Default Beam Search). You can choose between the following:
- Greedy: sets the temperature parameter to 0.0, with both best of and beam size set to none
- Beam Search: sets the temperature parameter with a fallback of 0.2, so the temperatures are 0.0, 0.2, 0.4, 0.6, 0.8, 1.0; both best of and beam size are set to 5
- Custom: set your own decoding options
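The presets above can be sketched as a mapping onto whisper-style decoding parameters (the parameter names follow openai-whisper's transcribe options; the helper itself is hypothetical):

```python
def decoding_options(preset: str) -> dict:
    """Map a preset name to whisper-style decoding parameters."""
    if preset == "Greedy":
        return {"temperature": (0.0,), "best_of": None, "beam_size": None}
    if preset == "Beam Search":
        # Temperature fallback in 0.2 steps: decoding is retried at higher
        # temperatures when it fails the compression-ratio / logprob checks.
        return {"temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                "best_of": 5, "beam_size": 5}
    raise ValueError("Custom preset: supply your own options")
```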
Temperature to use for sampling
Number of candidates when sampling with non-zero temperature
Number of beams in beam search, only applicable when temperature is zero
If the gzip compression ratio is higher than this value, treat the decoding as failed
If the average log probability is lower than this value, treat the decoding as failed
If the probability of the `<|nospeech|>` token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence
Optional text to provide as a prompt for the first window
Comma-separated list of token ids to suppress during sampling. '-1' will suppress most special characters except common punctuations
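The comma-separated token list above is just a string of integers; a parsing sketch (hypothetical helper, not the app's code):

```python
def parse_suppress_tokens(value: str) -> list[int]:
    """Parse a comma-separated token-id string such as "-1" or "50256, 50257"."""
    # int() tolerates surrounding whitespace; empty entries are skipped.
    return [int(tok) for tok in value.split(",") if tok.strip()]
```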
If True, provide the previous output of the model as a prompt for the next window; disabling this may make the text inconsistent across windows.
Command line arguments / parameters to be used. It has the same options as using stable-ts with the CLI, but with some parameters removed because they are set in the app / GUI. All of the parameters are:
# [device]
* description: device to use for PyTorch inference (A Cuda compatible GPU and PyTorch with CUDA support are still required for GPU / CUDA)
* type: str, default cuda
* usage: --device cpu
# [cpu_preload]
* description: load model into CPU memory first then move model to specified device; this reduces GPU memory usage when loading model.
* type: bool, default True
* usage: --cpu_preload True
# [dynamic_quantization]
* description: whether to apply Dynamic Quantization to model to reduce memory usage (~half less) and increase inference speed at cost of slight decrease in accuracy; Only for CPU; NOTE: overhead might make inference slower for models smaller than 'large'
* type: bool, default False
* usage: --dynamic_quantization
# [prepend_punctuations]
* description: Punctuations to prepend to the next word
* type: str, default "'“¿([{-"
* usage: --prepend_punctuations "<punctuation>"
# [append_punctuations]
* description: Punctuations to append to the previous word
* type: str, default "\"'.。,,!!??::”)]}、"
* usage: --append_punctuations "<punctuation>"
# [gap_padding]
* description: padding to prepend to each segment for word timing alignment; used to reduce the probability of the model predicting timestamps earlier than the first utterance
* type: str, default " ..."
* usage: --gap_padding "padding"
# [word_timestamps]
* description: extract word-level timestamps using the cross-attention pattern and dynamic time warping, and include the timestamps for each word in each segment; disabling this will prevent segments from splitting/merging properly.
* type: bool, default True
* usage: --word_timestamps True
# [regroup]
* description: whether to regroup all words into segments with more natural boundaries; specify a string for customizing the regrouping algorithm; ignored if [word_timestamps]=False.
* type: str, default "True"
* usage: --regroup "regroup_option"
# [ts_num]
* description: number of extra inferences to perform to find the mean timestamps
* type: int, default 0
* usage: --ts_num <number>
# [ts_noise]
* description: percentage of noise to add to audio_features to perform inferences for [ts_num]
* type: float, default 0.1
* usage: --ts_noise 0.1
# [suppress_silence]
* description: whether to suppress timestamps where audio is silent at segment-level and word-level if [suppress_word_ts]=True
* type: bool, default True
* usage: --suppress_silence True
# [suppress_word_ts]
* description: whether to suppress timestamps where audio is silent at word-level; ignored if [suppress_silence]=False
* type: bool, default True
* usage: --suppress_word_ts True
# [suppress_ts_tokens]
* description: whether to use silence mask to suppress silent timestamp tokens during inference; increases word accuracy in some cases, but tends to reduce 'verbatimness' of the transcript; ignored if [suppress_silence]=False
* type: bool, default False
* usage: --suppress_ts_tokens True
# [q_levels]
* description: quantization levels for generating timestamp suppression mask; acts as a threshold to marking sound as silent; fewer levels will increase the threshold of volume at which to mark a sound as silent
* type: int, default 20
* usage: --q_levels <number>
# [k_size]
* description: Kernel size for average pooling waveform to generate suppression mask; recommend 5 or 3; higher sizes will reduce detection of silence
* type: int, default 5
* usage: --k_size 5
# [time_scale]
* description: factor for scaling audio duration for inference; greater than 1.0 'slows down' the audio; less than 1.0 'speeds up' the audio; 1.0 is no scaling
* type: float
* usage: --time_scale <value>
# [vad]
* description: whether to use Silero VAD to generate timestamp suppression mask; Silero VAD requires PyTorch 1.12.0+; Official repo: https://github.com/snakers4/silero-vad
* type: bool, default False
* usage: --vad True
# [vad_threshold]
* description: threshold for detecting speech with Silero VAD. (Default: 0.35); low threshold reduces false positives for silence detection
* type: float, default 0.35
* usage: --vad_threshold 0.35
# [vad_onnx]
* description: whether to use ONNX for Silero VAD
* type: bool, default False
* usage: --vad_onnx True
# [min_word_dur]
* description: only allow suppressing timestamps that result in word durations greater than this value
* type: float, default 0.1
* usage: --min_word_dur 0.1
# [max_chars]
* description: maximum number of characters allowed in each segment
* type: int
* usage: --max_chars <value>
# [max_words]
* description: maximum number of words allowed in each segment
* type: int
* usage: --max_words <value>
# [demucs]
* description: whether to reprocess the audio track with Demucs to isolate vocals/remove noise; Demucs official repo: https://github.com/facebookresearch/demucs
* type: bool, default False
* usage: --demucs True
# [only_voice_freq]
* description: whether to only use sound between 200 - 5000 Hz, where the majority of human speech is.
* type: bool
* usage: --only_voice_freq True
# [strip]
* description: whether to remove spaces before and after text on each segment for output
* type: bool, default True
* usage: --strip True
# [tag]
* description: a pair of tags used to change the properties of a word at its predicted time; SRT Default: '<font color=\"#00ff00\">', '</font>'; VTT Default: '<u>', '</u>'; ASS Default: '{\\1c&HFF00&}', '{\\r}'
* type: str
* usage: --tag "<start_tag> <end_tag>"
# [reverse_text]
* description: whether to reverse the order of words for each segment of text output
* type: bool, default False
* usage: --reverse_text True
# [font]
* description: word font for ASS output(s)
* type: str, default 'Arial'
* usage: --font "<font_name>"
# [font_size]
* description: word font size for ASS output(s)
* type: int, default 48
* usage: --font_size 48
# [karaoke]
* description: whether to use progressive filling highlights for karaoke effect (only for ASS outputs)
* type: bool, default False
* usage: --karaoke True
# [temperature]
* description: temperature to use for sampling
* type: float, default 0
* usage: --temperature <value>
# [best_of]
* description: number of candidates when sampling with non-zero temperature
* type: int
* usage: --best_of <value>
# [beam_size]
* description: number of beams in beam search, only applicable when temperature is zero
* type: int
* usage: --beam_size <value>
# [patience]
* description: optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search
* type: float
* usage: --patience <value>
# [length_penalty]
* description: optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default
* type: float
* usage: --length_penalty <value>
# [fp16]
* description: whether to perform inference in fp16; True by default
* type: bool, default True
* usage: --fp16
# [compression_ratio_threshold]
* description: if the gzip compression ratio is higher than this value, treat the decoding as failed
* type: float
* usage: --compression_ratio_threshold <value>
# [logprob_threshold]
* description: if the average log probability is lower than this value, treat the decoding as failed
* type: float
* usage: --logprob_threshold <value>
# [no_speech_threshold]
* description: if the probability of the `<|nospeech|>` token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence
* type: float, default 0.6
* usage: --no_speech_threshold 0.6
# [threads]
* description: number of threads used by torch for CPU inference; supersedes MKL_NUM_THREADS/OMP_NUM_THREADS
* type: int
* usage: --threads <value>
# [mel_first]
* description: process the entire audio track into a log-Mel spectrogram first instead of in chunks
* type: bool
* usage: --mel_first
# [demucs_option]
* description: Extra option(s) to use for Demucs; Replace True/False with 1/0; E.g. --demucs_option "shifts=3" --demucs_option "overlap=0.5"
* type: str
* usage: --demucs_option "<option>"
# [refine_option]
* description: Extra option(s) to use for refining timestamps; Replace True/False with 1/0; E.g. --refine_option "steps=sese" --refine_option "rel_prob_decrease=0.05"
* type: str
* usage: --refine_option "<option>"
# [model_option]
* description: Extra option(s) to use for loading the model; Replace True/False with 1/0; E.g. --model_option "in_memory=1" --model_option "cpu_threads=4"
* type: str
* usage: --model_option "<option>"
# [transcribe_option]
* description: Extra option(s) to use for transcribing/alignment; Replace True/False with 1/0; E.g. --transcribe_option "ignore_compatibility=1"
* type: str
* usage: --transcribe_option "<option>"
# [save_option]
* description: Extra option(s) to use for text outputs; Replace True/False with 1/0; E.g. --save_option "highlight_color=ffffff"
* type: str
* usage: --save_option "<option>"
Set the mode for export. You can choose between the following:
- Segment level
- Word level
segment_level=True + word_level=True

```
00:00:07.760 --> 00:00:09.900
But<00:00:07.860> when<00:00:08.040> you<00:00:08.280> arrived<00:00:08.580> at<00:00:08.800> that<00:00:09.000> distant<00:00:09.400> world,
```

segment_level=True + word_level=False

```
00:00:07.760 --> 00:00:09.900
But when you arrived at that distant world,
```

segment_level=False + word_level=True

```
00:00:07.760 --> 00:00:07.860
But
00:00:07.860 --> 00:00:08.040
when
00:00:08.040 --> 00:00:08.280
you
00:00:08.280 --> 00:00:08.580
arrived
...
```
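The `HH:MM:SS.mmm` cue times shown in the examples above can be produced from a float number of seconds; a small sketch (hypothetical helper, not the exporter's actual code):

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, as used in VTT-style cue times."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"
```

For example, `format_timestamp(7.76)` yields `00:00:07.760`, matching the first cue above.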
You can choose between the following:
- Text
- Json
- SRT
- ASS
- VTT
- TSV
- CSV
It is recommended to keep the json output always enabled, just in case you want to further modify the results with the tool menu in the main menu.
Set the export folder location
Whether to auto open the export folder after a file import.
Amount to slice the filename from the start
Amount to slice the filename from the end
Set the filename export format. The following are the options for it:
Default value: %Y-%m-%d %H_%M {file}_{task}
Available parameters:
- `{file}` - Will be replaced with the file name
- `{task}` - Will be replaced with the task name (transcribe or translate)
- `{task-short}` - Will be replaced with the shortened task name (tc or tl)
- `{lang-source}` - Will be replaced with the source language
- `{lang-target}` - Will be replaced with the target language
- `{model}` - Will be replaced with the model name
- `{engine}` - Will be replaced with the translation engine name
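A sketch of how the default format could resolve, using Python's `strftime` for the date codes and simple substitution for the parameters (the helper is hypothetical; only `{file}` and `{task}` are shown):

```python
from datetime import datetime

def build_export_name(fmt: str, file: str, task: str, now: datetime) -> str:
    """Expand strftime codes, then the {file}/{task} parameters."""
    name = now.strftime(fmt)  # %Y, %m, %d, %H, %M are strftime codes
    return name.replace("{file}", file).replace("{task}", task)

# Default format: "%Y-%m-%d %H_%M {file}_{task}"
```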
Set the proxy list for HTTPS. Each proxy is separated by a new line, tab, or space.
Set the proxy list for HTTP. Each proxy is separated by a new line, tab, or space.
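Since proxies may be separated by newlines, tabs, or spaces, parsing the list reduces to splitting on any whitespace (a minimal sketch, not the app's code):

```python
import re

def parse_proxies(text: str) -> list[str]:
    """Split a proxy list on any run of whitespace (newline, tab, or space)."""
    return [p for p in re.split(r"\s+", text) if p]
```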
Set the API key for LibreTranslate.
Set the host for LibreTranslate. If you are hosting it locally, you can set it to localhost. If you are using the official instance, for example https://libretranslate.com, you need to input only the host name, so it will be libretranslate.com.
Set the port for LibreTranslate.
Whether to use HTTPS.
Whether to suppress the warning if the API key is empty.
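Putting the host, port, and HTTPS options together, the request URL for LibreTranslate's `/translate` endpoint could be built like this (the endpoint path comes from the LibreTranslate API; the helper itself is a hypothetical sketch):

```python
def libre_translate_url(host: str, port: int, use_https: bool) -> str:
    """Build the LibreTranslate /translate endpoint URL from the settings."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}/translate"
```

For a local instance this gives something like `http://localhost:5000/translate`.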
Set the maximum number of characters shown in the textbox.
Set the maximum number of characters shown per line in the textbox.
Set the font for the textbox.
Whether to colorize the text based on the confidence value when available. (Default checked)
Set the color for low confidence values. (Default #ff0000)
Set the color for high confidence values. (Default #00ff00)
Set what to colorize per. You can choose between the following:
- Segment
- Word

You can only choose one of the options.
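The low- and high-confidence colors can be blended by linear interpolation on the confidence value; a sketch using the default colors above (`confidence_color` is a hypothetical helper, not the app's exact code):

```python
def confidence_color(conf: float, low: str = "#ff0000", high: str = "#00ff00") -> str:
    """Linearly interpolate between the low- and high-confidence hex colors."""
    conf = max(0.0, min(1.0, conf))  # clamp to [0, 1]
    lo = [int(low[i:i + 2], 16) for i in (1, 3, 5)]   # parse RGB channels
    hi = [int(high[i:i + 2], 16) for i in (1, 3, 5)]
    mixed = [round(l + (h - l) * conf) for l, h in zip(lo, hi)]
    return "#" + "".join(f"{c:02x}" for c in mixed)
```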
Whether to parse Arabic characters to unicode characters so tkinter can show them properly. (Default checked)