Options
The documentation below details how to customize the app. If you are just getting started, check the getting started guide first.
You can choose the following actions:
- Stay on top
- Hide
- Exit
You can choose the following actions:
- Settings (Shortcut: `F2`) - Open the settings menu
- Log (Shortcut: `Ctrl + F1`) - Open the log window
- Export Directory - Open the export directory
- Log Directory - Open the log directory
- Model Directory - Open the model directory
You can choose the following actions:
- Transcribed speech subtitle window (Shortcut: `F3`) - Shows the result of the transcription in the recording session, but in a detached window, just like a subtitle box.
- Translated speech subtitle window (Shortcut: `F4`) - Shows the result of the translation in the recording session, but in a detached window, just like a subtitle box.

Preview:

Windows users can further customize it to remove the background by right-clicking the window and choosing the clickthrough/transparent option.
You can choose the following actions:
- About (Shortcut: `F1`)
- Open Documentation / Wiki
- Visit Repository
Select the model for transcription. You can choose between the following:
- Tiny
- Base
- Small
- Medium
- Large

Each model has different requirements and produces different results. For more information, you can check the whisper repository directly.
Select the method for translation.
- Whisper (to English only, from the 99 available languages)
- Google Translate (133 target languages, 94 of which are compatible with whisper as the source language)
- LibreTranslate v1.5.1 (45 target languages, 43 of which are compatible with whisper as the source language)
- MyMemoryTranslator (127 target languages, 93 of which are compatible with whisper as the source language)
Set the language to translate from. The languages available in this option will differ depending on the method selected in the Translate option.
Set the language to translate to. The languages available in this option will differ depending on the method selected in the Translate option.
Swap the languages in the From and To options. This will also swap the textbox results.
Clear the textbox result.
Set the device Host API for recording.
Set the mic device for recording. The available devices will differ depending on the Host API selected in the HostAPI option.
Set the speaker device for recording. The available devices will differ depending on the Host API selected in the HostAPI option. (Only on Windows 8 and above)
Set the task to do for recording. The available tasks are:
- Transcribe
- Translate
Set the input for recording. The available inputs are:
- Microphone
- Speaker
Copy the textbox result.
Open the tool dropdown menu. The available tools are:
- Export recorded results
- Align results
- Refine results
- Translate results
Start recording. The button will change to Stop
when recording.
Import a file to transcribe; this will open its own modal window.
Whether to check for a new update on every app startup. (Default checked)
Whether to suppress the notification showing that the app is now hidden to the tray. (Default unchecked)
Whether to suppress any device-related warnings that might show up. (Default unchecked)
Set the log folder location by pressing the button on the right. Available actions:
- Open folder
- Change log folder
- Set back to default
- Empty log folder
Whether to log the record session verbosely. (Default unchecked)
Whether to keep the log files. If unchecked, the log files will be deleted every time the app runs. (Default unchecked)
Set the log level. (Default DEBUG)
Whether to show the debug log for the record session. Enabling this might slow down the app. (Default unchecked)
Whether to save the audio recorded during the record session into the debug folder. The audio will be saved as .wav in the debug folder, and if unchecked it will be deleted automatically on every run. Enabling this might slow down the app. (Default unchecked)
Whether to show the debug log for the translate session. (Default unchecked)
Set the model folder location by pressing the button on the right. Available actions:
- Open folder
- Change model folder
- Set back to default
- Empty model folder
- Download model
You can download a model by pressing the download button. Each model has different requirements and produces different results. You can read more about it here.
Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
---|---|---|---|---|---|
tiny | 39 M | tiny.en | tiny | ~1 GB | ~32x |
base | 74 M | base.en | base | ~1 GB | ~16x |
small | 244 M | small.en | small | ~2 GB | ~6x |
medium | 769 M | medium.en | medium | ~5 GB | ~2x |
large | 1550 M | N/A | large | ~10 GB | 1x |
The Whisper model has 3 large versions (v1, v2, and v3), but for now faster-whisper only supports v1 and v2. You can read more about it here.
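As a rough illustration of the table above, a helper could pick the largest model that fits the available VRAM. This is a sketch only: `pick_model` is a hypothetical helper, not part of the app, and the thresholds simply mirror the table.

```python
# Approximate VRAM requirements (in GB), taken from the table above.
MODEL_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}
# Ordered largest-first so we prefer the most capable model that fits.
PREFERENCE = ["large", "medium", "small", "base", "tiny"]

def pick_model(available_vram_gb: float) -> str:
    """Return the largest model that fits in the given VRAM budget."""
    for name in PREFERENCE:
        if MODEL_VRAM_GB[name] <= available_vram_gb:
            return name
    return "tiny"  # fall back to the smallest model
```

For example, a GPU with 6 GB of VRAM would get `medium`, while anything with 10 GB or more can run `large`.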
Note
Speaker input only works on Windows 8 and above.
Set the sample rate for the input device. (Default 16000)
Set the channels for the input device. (Default 1)
Set the chunk size for the input device. (Default 1024)
Whether to automatically set the sample rate based on the input device. (Default unchecked for microphone and checked for speaker)
Whether to automatically set the channels based on the input device. (Default unchecked for microphone and checked for speaker)
Set the rate for transcribing the audio in milliseconds. (Default 300)
Conversion method used to feed audio to the whisper model. (Default is Numpy Array)
Use a numpy array to feed the model. This method is faster because there is no need to write the audio out to a wav file.
Use a temporary wav file to feed the model. This might slow down the process because of the file I/O operation, but it might help fix device-related errors (which rarely happen). When both VAD and Demucs are enabled in a record session, this option will be used automatically.
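As a sketch of the numpy-array approach (not the app's exact code): 16-bit PCM bytes from the recording device can be converted directly to the float32 array in the range [-1.0, 1.0] that whisper expects, with no intermediate wav file.

```python
import numpy as np

def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Convert 16-bit little-endian PCM bytes to float32 samples in [-1.0, 1.0]."""
    samples = np.frombuffer(raw, dtype=np.int16)
    # int16 range is [-32768, 32767]; dividing by 32768 normalizes it.
    return samples.astype(np.float32) / 32768.0
```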
Set the maximum buffer size for the audio in seconds. (Default 10)
Set the maximum number of sentences. One sentence equals one buffer, so if the max buffer is 10 seconds, the words within those 10 seconds form one sentence. (Default 5)
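The max-sentences behavior can be sketched with a bounded deque (an illustration of the setting, not the app's actual data structure): once the limit is reached, the oldest sentence is dropped.

```python
from collections import deque

# With max sentences = 3, only the 3 most recent sentences are kept.
sentences = deque(maxlen=3)
for s in ["one", "two", "three", "four"]:
    sentences.append(s)  # the oldest entry is evicted automatically

print(list(sentences))  # → ['two', 'three', 'four']
```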
Whether to enable the threshold. If enabled, the app will only transcribe the audio if it is above the threshold. (Default checked)
If set to auto, VAD (voice activity detection) will be used for the threshold. The VAD uses WebRTC VAD through py-webrtcvad. (Default checked)
If set to auto, the user needs to select the VAD sensitivity to filter out noise; the higher the sensitivity, the more noise is filtered out. If not set to auto, the user needs to set the threshold manually.
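When the threshold is set manually (the non-auto mode), the check amounts to comparing a chunk's loudness against a cutoff. A minimal sketch using RMS energy in dBFS follows; `is_above_threshold` is a hypothetical helper, not the app's actual implementation.

```python
import math

def is_above_threshold(samples: list[float], threshold_db: float = -40.0) -> bool:
    """Return True if the chunk's RMS level (in dBFS) exceeds the threshold.

    `samples` are float samples in [-1.0, 1.0]; pure silence maps to -inf dBFS.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    return level_db > threshold_db
```

A loud chunk (around -6 dBFS) passes the default -40 dBFS cutoff, while near-silence (around -60 dBFS) is skipped.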
Set the separator for the text result. (Default `\n`)
Whether to use faster-whisper. (Default checked)
Set the decoding preset. (Default Beam Search). You can choose between the following:
- Greedy: sets the temperature parameter to 0.0, with both best of and beam size set to none
- Beam Search: sets the temperature parameter with a fallback of 0.2, so the temperatures are 0.0, 0.2, 0.4, 0.6, 0.8, 1.0; both best of and beam size are set to 5
- Custom: set your own decoding options
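The presets above can be sketched as a mapping onto whisper-style decoding parameters (the parameter names follow openai-whisper's transcribe options; the helper itself is hypothetical):

```python
def decoding_options(preset: str) -> dict:
    """Map a preset name to whisper-style decoding parameters."""
    if preset == "Greedy":
        return {"temperature": (0.0,), "best_of": None, "beam_size": None}
    if preset == "Beam Search":
        # Temperature fallback in 0.2 steps: decoding is retried at higher
        # temperatures when it fails the compression-ratio / logprob checks.
        return {"temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                "best_of": 5, "beam_size": 5}
    raise ValueError("Custom preset: supply your own options")
```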
Temperature to use for sampling
Number of candidates when sampling with non-zero temperature
Number of beams in beam search, only applicable when temperature is zero
If the gzip compression ratio is higher than this value, treat the decoding as failed
If the average log probability is lower than this value, treat the decoding as failed
If the probability of the `<|nospeech|>` token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence
Optional text to provide as a prompt for the first window
Comma-separated list of token ids to suppress during sampling. '-1' will suppress most special characters except common punctuations
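The comma-separated token list above is just a string of integers; a parsing sketch (hypothetical helper, not the app's code):

```python
def parse_suppress_tokens(value: str) -> list[int]:
    """Parse a comma-separated token-id string such as "-1" or "50256, 50257"."""
    # int() tolerates surrounding whitespace; empty entries are skipped.
    return [int(tok) for tok in value.split(",") if tok.strip()]
```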
If True, provide the previous output of the model as a prompt for the next window; disabling this may make the text inconsistent across windows.
Command line arguments / parameters to be used. It has the same options as using stable-ts with the CLI, but with some parameters removed because they are set in the app / GUI. All of the parameters are:
# [device]
* description: device to use for PyTorch inference (A Cuda compatible GPU and PyTorch with CUDA support are still required for GPU / CUDA)
* type: str, default cuda
* usage: --device cpu
# [cpu_preload]
* description: load model into CPU memory first then move model to specified device; this reduces GPU memory usage when loading model.
* type: bool, default True
* usage: --cpu_preload True
# [dynamic_quantization]
* description: whether to apply Dynamic Quantization to model to reduce memory usage (~half less) and increase inference speed at cost of slight decrease in accuracy; Only for CPU; NOTE: overhead might make inference slower for models smaller than 'large'
* type: bool, default False
* usage: --dynamic_quantization
# [prepend_punctuations]
* description: Punctuations to prepend to the next word
* type: str, default "'“¿([{-"
* usage: --prepend_punctuations "<punctuation>"
# [append_punctuations]
* description: Punctuations to append to the previous word
* type: str, default "\"'.。,,!!??::”)]}、"
* usage: --append_punctuations "<punctuation>"
# [gap_padding]
* description: padding to prepend to each segment for word timing alignment; used to reduce the probability of the model predicting timestamps earlier than the first utterance
* type: str, default " ..."
* usage: --gap_padding "padding"
# [word_timestamps]
* description: extract word-level timestamps using the cross-attention pattern and dynamic time warping, and include the timestamps for each word in each segment; disabling this will prevent segments from splitting/merging properly.
* type: bool, default True
* usage: --word_timestamps True
# [regroup]
* description: whether to regroup all words into segments with more natural boundaries; specify a string for customizing the regrouping algorithm; ignored if [word_timestamps]=False.
* type: str, default "True"
* usage: --regroup "regroup_option"
# [ts_num]
* description: number of extra inferences to perform to find the mean timestamps
* type: int, default 0
* usage: --ts_num <number>
# [ts_noise]
* description: percentage of noise to add to audio_features to perform inferences for [ts_num]
* type: float, default 0.1
* usage: --ts_noise 0.1
# [suppress_silence]
* description: whether to suppress timestamps where audio is silent at segment-level and word-level if [suppress_word_ts]=True
* type: bool, default True
* usage: --suppress_silence True
# [suppress_word_ts]
* description: whether to suppress timestamps where audio is silent at word-level; ignored if [suppress_silence]=False
* type: bool, default True
* usage: --suppress_word_ts True
# [suppress_ts_tokens]
* description: whether to use silence mask to suppress silent timestamp tokens during inference; increases word accuracy in some cases, but tends to reduce 'verbatimness' of the transcript; ignored if [suppress_silence]=False
* type: bool, default False
* usage: --suppress_ts_tokens True
# [q_levels]
* description: quantization levels for generating timestamp suppression mask; acts as a threshold to marking sound as silent; fewer levels will increase the threshold of volume at which to mark a sound as silent
* type: int, default 20
* usage: --q_levels <number>
# [k_size]
* description: Kernel size for average pooling waveform to generate suppression mask; recommend 5 or 3; higher sizes will reduce detection of silence
* type: int, default 5
* usage: --k_size 5
# [time_scale]
* description: factor for scaling audio duration for inference; greater than 1.0 'slows down' the audio; less than 1.0 'speeds up' the audio; 1.0 is no scaling
* type: float
* usage: --time_scale <value>
# [vad]
* description: whether to use Silero VAD to generate timestamp suppression mask; Silero VAD requires PyTorch 1.12.0+; Official repo: https://github.com/snakers4/silero-vad
* type: bool, default False
* usage: --vad True
# [vad_threshold]
* description: threshold for detecting speech with Silero VAD. (Default: 0.35); low threshold reduces false positives for silence detection
* type: float, default 0.35
* usage: --vad_threshold 0.35
# [vad_onnx]
* description: whether to use ONNX for Silero VAD
* type: bool, default False
* usage: --vad_onnx True
# [min_word_dur]
* description: only allow suppressing timestamps that result in word durations greater than this value
* type: float, default 0.1
* usage: --min_word_dur 0.1
# [max_chars]
* description: maximum number of characters allowed in each segment
* type: int
* usage: --max_chars <value>
# [max_words]
* description: maximum number of words allowed in each segment
* type: int
* usage: --max_words <value>
# [demucs]
* description: whether to reprocess the audio track with Demucs to isolate vocals/remove noise; Demucs official repo: https://github.com/facebookresearch/demucs
* type: bool, default False
* usage: --demucs True
# [only_voice_freq]
* description: whether to only use sound between 200 - 5000 Hz, where the majority of human speech is.
* type: bool
* usage: --only_voice_freq True
# [strip]
* description: whether to remove spaces before and after text on each segment for output
* type: bool, default True
* usage: --strip True
# [tag]
* description: a pair of tags used to change the properties of a word at its predicted time; SRT Default: '<font color=\"#00ff00\">', '</font>'; VTT Default: '<u>', '</u>'; ASS Default: '{\\1c&HFF00&}', '{\\r}'
* type: str
* usage: --tag "<start_tag> <end_tag>"
# [reverse_text]
* description: whether to reverse the order of words for each segment of text output
* type: bool, default False
* usage: --reverse_text True
# [font]
* description: word font for ASS output(s)
* type: str, default 'Arial'
* usage: --font "<font_name>"
# [font_size]
* description: word font size for ASS output(s)
* type: int, default 48
* usage: --font_size 48
# [karaoke]
* description: whether to use progressive filling highlights for karaoke effect (only for ASS outputs)
* type: bool, default False
* usage: --karaoke True
# [temperature]
* description: temperature to use for sampling
* type: float, default 0
* usage: --temperature <value>
# [best_of]
* description: number of candidates when sampling with non-zero temperature
* type: int
* usage: --best_of <value>
# [beam_size]
* description: number of beams in beam search, only applicable when temperature is zero
* type: int
* usage: --beam_size <value>
# [patience]
* description: optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search
* type: float
* usage: --patience <value>
# [length_penalty]
* description: optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default
* type: float
* usage: --length_penalty <value>
# [fp16]
* description: whether to perform inference in fp16; True by default
* type: bool, default True
* usage: --fp16
# [compression_ratio_threshold]
* description: if the gzip compression ratio is higher than this value, treat the decoding as failed
* type: float
* usage: --compression_ratio_threshold <value>
# [logprob_threshold]
* description: if the average log probability is lower than this value, treat the decoding as failed
* type: float
* usage: --logprob_threshold <value>
# [no_speech_threshold]
* description: if the probability of the `<|nospeech|>` token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence
* type: float, default 0.6
* usage: --no_speech_threshold 0.6
# [threads]
* description: number of threads used by torch for CPU inference; supersedes MKL_NUM_THREADS/OMP_NUM_THREADS
* type: int
* usage: --threads <value>
# [mel_first]
* description: process the entire audio track into a log-Mel spectrogram first instead of in chunks
* type: bool
* usage: --mel_first
# [demucs_option]
* description: Extra option(s) to use for Demucs; Replace True/False with 1/0; E.g. --demucs_option "shifts=3" --demucs_option "overlap=0.5"
* type: str
* usage: --demucs_option "<option>"
# [refine_option]
* description: Extra option(s) to use for refining timestamps; Replace True/False with 1/0; E.g. --refine_option "steps=sese" --refine_option "rel_prob_decrease=0.05"
* type: str
* usage: --refine_option "<option>"
# [model_option]
* description: Extra option(s) to use for loading the model; Replace True/False with 1/0; E.g. --model_option "in_memory=1" --model_option "cpu_threads=4"
* type: str
* usage: --model_option "<option>"
# [transcribe_option]
* description: Extra option(s) to use for transcribing/alignment; Replace True/False with 1/0; E.g. --transcribe_option "ignore_compatibility=1"
* type: str
* usage: --transcribe_option "<option>"
# [save_option]
* description: Extra option(s) to use for text outputs; Replace True/False with 1/0; E.g. --save_option "highlight_color=ffffff"
* type: str
* usage: --save_option "<option>"
Set the mode for export. You can choose between the following:
- Segment level
- Word level
segment_level=True + word_level=True

```
00:00:07.760 --> 00:00:09.900
But<00:00:07.860> when<00:00:08.040> you<00:00:08.280> arrived<00:00:08.580> at<00:00:08.800> that<00:00:09.000> distant<00:00:09.400> world,
```

segment_level=True + word_level=False

```
00:00:07.760 --> 00:00:09.900
But when you arrived at that distant world,
```

segment_level=False + word_level=True

```
00:00:07.760 --> 00:00:07.860
But
00:00:07.860 --> 00:00:08.040
when
00:00:08.040 --> 00:00:08.280
you
00:00:08.280 --> 00:00:08.580
arrived
...
```
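The `HH:MM:SS.mmm` cue times shown in the examples above can be produced from a float number of seconds; a small sketch (hypothetical helper, not the exporter's actual code):

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, as used in VTT-style cue times."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"
```

For example, `format_timestamp(7.76)` yields `00:00:07.760`, matching the first cue above.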
You can choose between the following:
- Text
- Json
- SRT
- ASS
- VTT
- TSV
- CSV
It is recommended to keep the json output always enabled, just in case you want to further modify the results with the tool menu in the main menu.
Set the export folder location
Whether to auto open the export folder after a file import.
Amount to slice the filename from the start
Amount to slice the filename from the end
Set the filename export format. The following are the options for it:
Default value: %Y-%m-%d %H_%M {file}_{task}
Available parameters:
- `{file}` - Will be replaced with the file name
- `{task}` - Will be replaced with the task name (transcribe or translate)
- `{task-short}` - Will be replaced with the shortened task name (tc or tl)
- `{lang-source}` - Will be replaced with the source language
- `{lang-target}` - Will be replaced with the target language
- `{model}` - Will be replaced with the model name
- `{engine}` - Will be replaced with the translation engine name
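A sketch of how the default format could resolve, using Python's `strftime` for the date codes and simple substitution for the parameters (the helper is hypothetical; only `{file}` and `{task}` are shown):

```python
from datetime import datetime

def build_export_name(fmt: str, file: str, task: str, now: datetime) -> str:
    """Expand strftime codes, then the {file}/{task} parameters."""
    name = now.strftime(fmt)  # %Y, %m, %d, %H, %M are strftime codes
    return name.replace("{file}", file).replace("{task}", task)

# Default format: "%Y-%m-%d %H_%M {file}_{task}"
```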
Set the proxy list for HTTPS. Each proxy is separated by a new line, tab, or space.
Set the proxy list for HTTP. Each proxy is separated by a new line, tab, or space.
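Since proxies may be separated by newlines, tabs, or spaces, parsing the list reduces to splitting on any whitespace (a minimal sketch, not the app's code):

```python
import re

def parse_proxies(text: str) -> list[str]:
    """Split a proxy list on any run of whitespace (newline, tab, or space)."""
    return [p for p in re.split(r"\s+", text) if p]
```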
Set the API key for LibreTranslate.
Set the host for LibreTranslate. If you are hosting it locally, you can set it to localhost. If you are using the official instance, for example https://libretranslate.com, you need to input only the host name, so it will be libretranslate.com.
Set the port for LibreTranslate.
Whether to use HTTPS.
Whether to suppress the warning if the API key is empty.
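Putting the host, port, and HTTPS options together, the request URL for LibreTranslate's `/translate` endpoint could be built like this (the endpoint path comes from the LibreTranslate API; the helper itself is a hypothetical sketch):

```python
def libre_translate_url(host: str, port: int, use_https: bool) -> str:
    """Build the LibreTranslate /translate endpoint URL from the settings."""
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}/translate"
```

For a local instance this gives something like `http://localhost:5000/translate`.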
Set the maximum number of characters shown in the textbox.
Set the maximum number of characters shown per line in the textbox.
Set the font for the textbox.
Whether to colorize the text based on the confidence value when available. (Default checked)
Set the color for low confidence values. (Default #ff0000)
Set the color for high confidence values. (Default #00ff00)
Set what to colorize per. You can choose between the following:
- Segment
- Word

You can only choose one of the options.
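The low- and high-confidence colors can be blended by linear interpolation on the confidence value; a sketch using the default colors above (`confidence_color` is a hypothetical helper, not the app's exact code):

```python
def confidence_color(conf: float, low: str = "#ff0000", high: str = "#00ff00") -> str:
    """Linearly interpolate between the low- and high-confidence hex colors."""
    conf = max(0.0, min(1.0, conf))  # clamp to [0, 1]
    lo = [int(low[i:i + 2], 16) for i in (1, 3, 5)]   # parse RGB channels
    hi = [int(high[i:i + 2], 16) for i in (1, 3, 5)]
    mixed = [round(l + (h - l) * conf) for l, h in zip(lo, hi)]
    return "#" + "".join(f"{c:02x}" for c in mixed)
```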
Whether to parse Arabic characters to unicode characters so tkinter can show them properly. (Default checked)