Speech-to-text GUI for different Whisper models and backends, including whisper.cpp, mlx-whisper, faster-whisper, and ctranslate2

CrispStrobe/Susurrus

Susurrus: Whisper Audio Transcription GUI

Susurrus is a graphical frontend for audio transcription. It converts speech to text using a choice of AI models, mostly based on OpenAI Whisper, and a choice of backends. It can transcribe local audio files as well as online content, using whichever of the supported models and pipelines you select.

Features

  • Support for multiple transcription backends (mlx-whisper, OpenAI Whisper, faster-whisper, transformers, whisper.cpp, ctranslate2, whisper-jax, insanely-fast-whisper)
  • Audio file upload and URL input support
  • YouTube audio extraction and transcription
  • Proxy support for network requests
  • Language selection for targeted transcription
  • Transcription metrics and progress tracking
  • Graphical user interface
  • Advanced options including start/end time for transcription, max chunk length, and output format selection for whisper.cpp (enabling subtitle export)
  • Audio trimming functionality
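
How Susurrus trims audio internally is not described in this README. As a rough illustration of the idea only, the sketch below cuts a start/end window out of a file by calling FFmpeg, which is already a prerequisite. The function names `build_trim_command` and `trim_audio` are hypothetical and not part of the project.

```python
import subprocess


def build_trim_command(src, dst, start, end):
    """Build an ffmpeg command that keeps only the audio between start and end (seconds)."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg", "-y",            # overwrite dst if it already exists
        "-i", src,
        "-ss", str(start),         # skip to the start offset
        "-t", str(end - start),    # keep this many seconds
        dst,
    ]


def trim_audio(src, dst, start, end):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_trim_command(src, dst, start, end), check=True)
```

For example, `trim_audio("input.wav", "clip.wav", 10, 60)` would write a 50-second clip, assuming `ffmpeg` is on the PATH.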

Screenshot

Susurrus Interface

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)
  • Git
  • C++ compiler (for whisper.cpp)
  • CMake (for whisper.cpp)
  • FFmpeg
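
Before installing, it can help to confirm that the tools above are reachable from the command line. The following is a minimal sketch, not part of Susurrus: the helper name `check_prerequisites` is hypothetical, and the executable names are assumptions (a C++ compiler, for instance, may be installed as `g++`, `clang++`, or `cl`, so it is not checked here).

```python
import shutil
import sys

# Tools named in the prerequisites list; executable names are assumptions.
REQUIRED_TOOLS = ["git", "cmake", "ffmpeg"]
MIN_PYTHON = (3, 8)


def check_prerequisites(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a dict mapping each check name to True (found) or False (missing)."""
    results = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns the executable's path, or None if it is not on PATH.
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:8} {'ok' if ok else 'MISSING'}")
```

Anything reported as missing should be installed before moving on to the steps below.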

Common Steps (macOS, Linux, and Windows)

  1. Clone the repository:

    git clone https://github.com/CrispStrobe/susurrus.git
    cd susurrus
    
  2. Create and activate a virtual environment:

    • macOS/Linux:
      python3 -m venv venv
      source venv/bin/activate
      
    • Windows:
      python -m venv venv
      venv\Scripts\activate
      
  3. Install the required packages:

    pip install -r requirements.txt
    
  4. Install additional backend-specific packages:

    pip install openai-whisper faster-whisper transformers ctranslate2 whisper-jax soundfile insanely-fast-whisper
    
  5. Install whisper.cpp:

    git clone https://github.com/ggerganov/whisper.cpp.git
    cd whisper.cpp
    mkdir build && cd build
    cmake ..
    cmake --build . --config Release
    cd ../..
    

    On Windows, use these commands instead:

    git clone https://github.com/ggerganov/whisper.cpp.git
    cd whisper.cpp
    mkdir build && cd build

    # Configure with UTF-8 support
    cmake -B . -DCMAKE_CXX_FLAGS="/utf-8" -DCMAKE_BUILD_TYPE=Release ..

    # Build
    cmake --build . --config Release
    cd ../..
    
  6. Install FFmpeg:

    • macOS:
      brew install ffmpeg
      
    • Linux (Ubuntu/Debian):
      sudo apt-get update
      sudo apt-get install ffmpeg
      
    • Windows:
Additional Steps for Windows

  • Ensure you have a C++ compiler installed. You can use Visual Studio with C++ support or MinGW-w64.
  • Install CMake from https://cmake.org/download/ and add it to your system PATH.
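
After installation, a quick check can show whether the virtual environment is active and which backend packages Python can find. This is a sketch under assumptions, not a script shipped with Susurrus: the function names are hypothetical, and the import names in `BACKEND_MODULES` are assumed from each package's usual convention (a PyPI name and its import name can differ, as with `openai-whisper` and `whisper`).

```python
import importlib.util
import sys

# Assumed import names for the packages installed in step 4.
BACKEND_MODULES = {
    "openai-whisper": "whisper",
    "faster-whisper": "faster_whisper",
    "transformers": "transformers",
    "ctranslate2": "ctranslate2",
    "whisper-jax": "whisper_jax",
    "soundfile": "soundfile",
}


def in_virtualenv():
    """True when the interpreter runs inside a venv (prefix differs from base prefix)."""
    return sys.prefix != sys.base_prefix


def available_backends(modules=BACKEND_MODULES):
    """Map each package name to whether its module can be located, without importing it."""
    return {pkg: importlib.util.find_spec(mod) is not None
            for pkg, mod in modules.items()}


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    for pkg, found in available_backends().items():
        print(f"{pkg:16} {'installed' if found else 'not found'}")
```

A backend reported as not found only matters if you intend to select it in the interface.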

Usage

  1. Activate the virtual environment (if not already activated):

    • macOS/Linux: source venv/bin/activate
    • Windows: venv\Scripts\activate
  2. Run the main application:

    python susurrus.py
    
  3. Use the graphical interface to:

    • Upload an audio file or provide a URL
    • Select the desired transcription backend and model
    • Configure advanced options if needed
    • Start the transcription process
  4. View the transcription results and metrics in the application window

  5. Save the transcription to a text file using the "Save" button

Running the Transcription Worker Script

The transcription worker script can be run separately for debugging or advanced usage:

python transcribe_worker.py --audio-input <audio_file> --audio-url <url> --model-id <model_id> --word-timestamps --language <lang> --backend <backend> --device <device> --pipeline-type <type> --max-chunk-length <length> --output-format <format> --quantization <quant_type> --batch-size <size> --preprocessor-path <path> --original-model-id <orig_id> --start-time <start> --end-time <end>

Example:

python transcribe_worker.py --audio-input input.wav --model-id mlx-community/whisper-large-v3-mlx --word-timestamps --language en --backend mlx-whisper --device auto --pipeline-type default --start-time 10 --end-time 60
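
The same invocation can be scripted from Python, for instance to run the worker over several files. The sketch below only assembles flags that appear in the command line above; the helper names `build_worker_command` and `run_worker` are hypothetical, and what the worker writes to standard output is not assumed.

```python
import subprocess
import sys


def build_worker_command(audio_input, model_id, backend, language=None,
                         start_time=None, end_time=None, word_timestamps=False):
    """Assemble the argument list for transcribe_worker.py."""
    cmd = [sys.executable, "transcribe_worker.py",
           "--audio-input", audio_input,
           "--model-id", model_id,
           "--backend", backend]
    if language:
        cmd += ["--language", language]
    if word_timestamps:
        cmd.append("--word-timestamps")
    if start_time is not None:
        cmd += ["--start-time", str(start_time)]
    if end_time is not None:
        cmd += ["--end-time", str(end_time)]
    return cmd


def run_worker(cmd):
    """Run the worker and return whatever it prints to stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Run it from the repository root so that `transcribe_worker.py` is found.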

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Acknowledgements
