Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper

LaurentEsingle/whisper-diarization

 
 

Speaker Diarization Using OpenAI Whisper

NOTE

This is a fork of MahmoudAshraf97/whisper-diarization

It creates a distributable package and a developer interface that can be used in a Python script.

Original README content below, slightly edited to add details about the extra features of this fork

Speaker diarization pipeline based on OpenAI Whisper. I'd like to thank @m-bain for Wav2Vec2 forced alignment and @mu4farooqi for the punctuation realignment algorithm.

What is it

This repository combines Whisper ASR with Voice Activity Detection (VAD) and speaker embedding to identify the speaker of each sentence in the transcription generated by Whisper. First, the vocals are extracted from the audio to increase speaker embedding accuracy. The transcription is then generated with Whisper, and its timestamps are corrected and aligned using WhisperX to minimize diarization errors caused by time shift. Next, the audio is passed to MarbleNet for VAD and segmentation to exclude silences, and TitaNet extracts speaker embeddings to identify the speaker of each segment. The result is matched against the timestamps generated by WhisperX to assign a speaker to each word, and is finally realigned using punctuation models to compensate for minor time shifts.
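The word-level speaker assignment described above can be illustrated with a small, self-contained sketch. This is not the code used in diarize.py or helpers.py; the function name `assign_speakers`, the tuple layouts, and the choice of the word midpoint as the matching point are all assumptions made for illustration.

```python
from bisect import bisect_right


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn covers the word's midpoint.

    words: list of (text, start_s, end_s), sorted by start time
    turns: list of (start_s, end_s, speaker), sorted by start, non-overlapping
    (Both layouts are hypothetical, chosen only for this sketch.)
    """
    turn_starts = [turn[0] for turn in turns]
    labeled = []
    for text, start, end in words:
        midpoint = (start + end) / 2
        # Index of the last turn that starts at or before the midpoint.
        # A word that falls in a gap between turns goes to the preceding turn;
        # a word before the first turn goes to the first turn.
        index = max(bisect_right(turn_starts, midpoint) - 1, 0)
        labeled.append((text, turns[index][2]))
    return labeled


words = [("Hello", 0.0, 0.4), ("there.", 0.5, 0.9), ("Hi!", 1.2, 1.5)]
turns = [(0.0, 1.0, "Speaker 0"), (1.0, 2.0, "Speaker 1")]
labeled = assign_speakers(words, turns)
for text, speaker in labeled:
    print(f"{speaker}: {text}")
```

Matching on a single point per word keeps the lookup simple, but it is sensitive to small timestamp errors near speaker changes, which is the kind of drift the punctuation-based realignment step is meant to compensate for.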

Whisper, WhisperX and NeMo parameters are hard-coded in diarize.py and helpers.py. I will add CLI arguments to change them later.

Installation

FFmpeg and Cython are prerequisites for installing the requirements. Installing wheel is also recommended to avoid pip warnings when using Python 3.10.

pip install wheel

and

pip install cython

or

sudo apt update && sudo apt install cython3

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

pip install -r requirements.txt
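Before installing the requirements, it can help to confirm that the prerequisites are actually visible from Python. The following is a minimal sketch using only the standard library; the helper names `missing_tools` and `missing_modules` are hypothetical and are not part of this repository.

```python
import importlib.util
import shutil


def missing_tools(tools=("ffmpeg",)):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=("Cython",)):
    """Return the importable module names from `modules` that are not installed."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


problems = missing_tools() + missing_modules()
if problems:
    print("Missing prerequisites:", ", ".join(problems))
else:
    print("FFmpeg and Cython were found.")
```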

Usage

cd whisperdiarization
python diarize.py -a AUDIO_FILE_NAME

If your system has enough VRAM (>=10GB), you can use diarize_parallel.py instead. It runs NeMo in parallel with Whisper, which can be beneficial in some cases; the result is the same because the two models do not depend on each other. This is still experimental, so expect errors and rough edges. Your feedback is welcome.
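The choice between the two scripts can be made programmatically. The sketch below assumes PyTorch is installed when VRAM detection is wanted and treats the 10 GB figure above as an approximate threshold; `pick_script` and `detect_vram_gib` are hypothetical helper names, not functions provided by this project.

```python
def pick_script(vram_gib, threshold_gib=10.0):
    """Choose the parallel script only when the GPU has enough memory."""
    return "diarize_parallel.py" if vram_gib >= threshold_gib else "diarize.py"


def detect_vram_gib():
    """Return total memory of CUDA device 0 in GiB, or 0.0 if unavailable."""
    try:
        import torch  # optional here; only needed for detection
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


if __name__ == "__main__":
    print(pick_script(detect_vram_gib()))
```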

Command Line Options

  • -a AUDIO_FILE_NAME: The name of the audio file to be processed
  • --no-stem: Disables source separation
  • --whisper-model: The model to be used for ASR; the default is medium.en
  • --suppress_numerals: Transcribes numbers as spelled-out words instead of digits, which improves alignment accuracy
  • --device: The device to use; defaults to "cuda" if available
  • --language: Manually selects the language; useful if language detection fails
  • --batch-size: Batch size for batched inference; reduce it if you run out of memory, or set it to 0 for non-batched inference
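When calling the script from another program, the options above can be assembled into an argument list. This is a sketch under two assumptions that the option list does not state explicitly: that `--no-stem` and `--suppress_numerals` are boolean switches taking no value, and that the remaining options take a single value. The helper `build_command` is hypothetical.

```python
import sys


def build_command(audio, whisper_model=None, language=None, device=None,
                  batch_size=None, no_stem=False, suppress_numerals=False):
    """Build an argument list for diarize.py from the documented options."""
    cmd = [sys.executable, "diarize.py", "-a", audio]
    if no_stem:
        cmd.append("--no-stem")  # assumed to be a switch without a value
    if suppress_numerals:
        cmd.append("--suppress_numerals")  # assumed to be a switch without a value
    if whisper_model is not None:
        cmd += ["--whisper-model", whisper_model]
    if language is not None:
        cmd += ["--language", language]
    if device is not None:
        cmd += ["--device", device]
    if batch_size is not None:  # 0 is meaningful: non-batched inference
        cmd += ["--batch-size", str(batch_size)]
    return cmd


cmd = build_command("interview.wav", whisper_model="medium.en", batch_size=0)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.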

Using a Python script

Example for an audio file named debate.mp4

from whisperdiarization import Diarization

# Point the pipeline at the file to process
dr = Diarization(audiofilename='/home/lucas/audio/debate.mp4')
result = dr.Start()

# Inspect the fields of the returned result
print(result.ReturnCode)
print(result.Output)
print(result.Error)
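In a longer script you will probably want to branch on the result rather than print every field. The sketch below assumes that a `ReturnCode` of 0 means success, in the style of process exit codes; the README does not confirm this, so check the package source before relying on it. The `summarize` helper is hypothetical, and `SimpleNamespace` objects stand in for real results so the example runs without the package installed.

```python
from types import SimpleNamespace


def summarize(result):
    """Return a one-line status for an object with ReturnCode, Output and Error.

    Assumption: ReturnCode == 0 signals success.
    """
    if result.ReturnCode == 0:
        return f"ok: {result.Output}"
    return f"failed ({result.ReturnCode}): {result.Error}"


# Stand-ins for the object returned by dr.Start()
fake_ok = SimpleNamespace(ReturnCode=0, Output="Speaker 0: Hello there.", Error="")
fake_bad = SimpleNamespace(ReturnCode=1, Output="", Error="file not found")

print(summarize(fake_ok))
print(summarize(fake_bad))
```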

Known Limitations

  • Overlapping speakers are not yet handled. A possible approach would be to separate the audio and isolate one speaker at a time before feeding it into the pipeline, but this would require much more computation
  • There might be some errors, please raise an issue if you encounter any.

Future Improvements

  • Implement a maximum length per sentence for SRT
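One possible way to approach a maximum sentence length for SRT is to group word-level timestamps into cues that stay under a character limit. This is only a sketch of the idea, not a planned implementation; the function names, the `(text, start_s, end_s)` word layout, and the 40-character default are assumptions.

```python
def srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"


def split_cues(words, max_chars=40):
    """Group (text, start_s, end_s) words into cues of at most max_chars.

    A single word longer than max_chars still gets its own cue.
    Returns a list of (start_s, end_s, text).
    """
    cues, current = [], []

    def flush():
        text = " ".join(word[0] for word in current)
        cues.append((current[0][1], current[-1][2], text))

    for word in words:
        candidate = " ".join(w[0] for w in current + [word])
        if current and len(candidate) > max_chars:
            flush()
            current = []
        current.append(word)
    if current:
        flush()
    return cues


words = [("one", 0.0, 0.5), ("two", 0.5, 1.0), ("three", 1.0, 1.5)]
for number, (start, end, text) in enumerate(split_cues(words, max_chars=7), 1):
    print(f"{number}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
```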

Acknowledgements

Special thanks to @adamjonas for supporting this project. This work is based on OpenAI's Whisper, Faster Whisper, Nvidia NeMo, and Facebook's Demucs.
