
Backend / Web Client: Set Speaker Audio Files for Primary Speaker Diarization #44

Open
ninjaa opened this issue Feb 28, 2024 · 0 comments

ninjaa commented Feb 28, 2024

I'm running the web client in the background on my laptop in my office, using Deepgram as the transcription provider.

My cofounder and I are routinely identified as "Bob" in the summaries, which is annoying. We're the only two people in the office, so at a minimum I want our two voices identified.

When I use diarization in Deepgram myself, it identifies people as Speaker 0, Speaker 1, etc., so it has some of this built in and I know it can do it.
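
For context, this is roughly how I exercise diarization against Deepgram directly (a minimal sketch with their Python SDK; the API key, file name, and model choice are placeholders, so double-check against their docs):

```python
# Sketch: prerecorded transcription with diarization enabled via the Deepgram Python SDK (v3).
# The API key, file name, and model choice are placeholders.
from deepgram import DeepgramClient, PrerecordedOptions, FileSource

deepgram = DeepgramClient("YOUR_DEEPGRAM_API_KEY")

with open("office-meeting.wav", "rb") as f:
    payload: FileSource = {"buffer": f.read()}

options = PrerecordedOptions(model="nova-2", smart_format=True, diarize=True)
response = deepgram.listen.prerecorded.v("1").transcribe_file(payload, options)

# With diarize=True, each word carries a generic speaker index (0, 1, ...),
# which is what ends up as "Speaker 0" / "Speaker 1" in the transcript.
for word in response.results.channels[0].alternatives[0].words:
    print(word.speaker, word.word)
```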

Original conversation

NinjaA — Today at 1:39 PM
alright. Let me know how to upload voice embeddings so it recognizes me and a couple other key speakers
or even just me

etown — Today at 1:40 PM
You can put an audio file and then set the voice sample configuration

NinjaA — Today at 4:50 PM
I do not see instructions for setting this speaker_verification_audio path except in test code and inside a file called async_whisper_transcription_server. I'm not sure whether the latter is used when the transcription service provider is Deepgram. If you give me high-level feedback, adding this functionality to the docs/code could be my first contribution. I can also open a GitHub issue, let me know.

etown — Today at 4:52 PM
That would be amazing! It was not integrated with everything, but it should be.
Right now you can only specify one sample, and verification only happens on the final transcription, and only if you're using Whisper.
At a high level:
We have two types of abstract STT services: streaming and async.

Streaming is done in real time and is mainly for the upcoming assistant/agent stuff.
But when a conversation ends it goes to async transcription.
Both of these services can be configured to use different providers, such as Whisper or Deepgram.
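
As an aside, a rough sketch of the shape of those two abstractions as described above (class and method names here are hypothetical, not Owl's actual interfaces):

```python
# Illustrative only: the two kinds of STT service abstractions described above.
# Names and signatures are hypothetical, not copied from the Owl codebase.
from abc import ABC, abstractmethod


class StreamingTranscriptionService(ABC):
    """Real-time transcription, mainly for the upcoming assistant/agent features."""

    @abstractmethod
    async def send_audio(self, chunk: bytes) -> None:
        """Feed raw audio as it is captured."""

    @abstractmethod
    async def receive_utterance(self) -> str:
        """Return transcribed utterances as they become available."""


class AsyncTranscriptionService(ABC):
    """Batch transcription that runs after a conversation ends."""

    @abstractmethod
    async def transcribe(self, audio_path: str) -> list[str]:
        """Transcribe a completed recording and return its utterances."""


# Either abstraction can be backed by a Whisper- or Deepgram-based implementation.
```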
Here is where verification occurs:

https://github.com/OwlAIProject/Owl/blob/main/owl/services/stt/asynchronous/async_whisper/async_whisper_transcription_server.py#L103
What it does is take each utterance, compute its embedding, and compare it against the sample; if the similarity reaches a threshold, it overrides the generic speaker name from diarization.
It uses SpeechBrain, but it could use any voice embedding model.
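
For reference, a minimal sketch of that comparison using SpeechBrain's pretrained speaker-recognition model (the threshold and file names are illustrative, not the actual Owl code):

```python
# Sketch of the current approach: compare each diarized utterance against a single
# enrolled voice sample and override the generic label when the score clears a threshold.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

SIMILARITY_THRESHOLD = 0.25  # illustrative; would need tuning on real recordings


def label_utterance(utterance_wav: str, sample_wav: str,
                    enrolled_name: str, diarized_name: str) -> str:
    """Return the enrolled speaker's name if the utterance matches the voice sample."""
    score, _ = verifier.verify_files(sample_wav, utterance_wav)
    return enrolled_name if score.item() >= SIMILARITY_THRESHOLD else diarized_name
```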
I think ideally we would have a separate service for speaker identification.
It would probably take the audio file, the transcript, and a list of known speakers (name, embedding), run the similarity comparison, and return the transcript with the updated speaker names.
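
Something like this, maybe (a hypothetical interface sketch; the transcript and speaker types are placeholders, not Owl's actual data models):

```python
# Hypothetical interface for the proposed speaker identification service.
# All names and types here are placeholders, not Owl's actual data models.
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class KnownSpeaker:
    name: str
    embedding: np.ndarray  # precomputed voice embedding for this person


@dataclass
class Utterance:
    speaker: str   # generic label from diarization, e.g. "Speaker 0"
    start: float   # seconds
    end: float     # seconds
    text: str


def identify_speakers(
    audio_path: str,
    utterances: list[Utterance],
    known_speakers: list[KnownSpeaker],
    embed: Callable[[str, float, float], np.ndarray],  # (audio_path, start, end) -> embedding
    threshold: float = 0.25,
) -> list[Utterance]:
    """Replace generic diarization labels with known speaker names via cosine similarity."""

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    for utt in utterances:
        emb = embed(audio_path, utt.start, utt.end)
        scores = [(cosine(emb, spk.embedding), spk) for spk in known_speakers]
        if scores:
            best_score, best_speaker = max(scores, key=lambda s: s[0])
            if best_score >= threshold:
                utt.speaker = best_speaker.name
    return utterances
```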

etown — Today at 4:59 PM
This way it could work for any provider (Whisper/Deepgram/etc.) and for both streaming and async transcription.
It should not be hard to move it, but even a step in that direction would be amazing. Created an ⁠stt channel; we can discuss more there.

https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/pipelines/speaker_verification.py

There are also a lot of other embedding models besides SpeechBrain; it would be cool to test more of them for accuracy.
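
For anyone experimenting, the linked file exposes a PretrainedSpeakerEmbedding wrapper that can load different backends; a minimal sketch (the checkpoint name and file paths are just examples):

```python
# Sketch: computing an utterance embedding via pyannote.audio's PretrainedSpeakerEmbedding
# wrapper from the file linked above. The checkpoint and file names are examples only.
import torch
from pyannote.audio import Audio
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.core import Segment

embedding_model = PretrainedSpeakerEmbedding(
    "speechbrain/spkrec-ecapa-voxceleb",  # other checkpoints could be swapped in here
    device=torch.device("cpu"),
)

# Load a mono 16 kHz crop of the utterance and compute its embedding.
audio = Audio(sample_rate=16000, mono="downmix")
waveform, _ = audio.crop("utterance.wav", Segment(0.0, 5.0))
embedding = embedding_model(waveform[None])  # shape: (1, embedding_dim) numpy array
```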