
Speaker Identification #672

Open · wants to merge 26 commits into main
Conversation

@EzraEllette (Contributor) commented Nov 13, 2024

description

This PR adds speaker identification to screenpipe. Audio is segmented by speaker and then transcribed. Transcriptions now have a speaker_id column, a new speakers table was added with name and metadata columns, and a speaker_embeddings table was created with a one-to-many relationship between a speaker and its embeddings.

related issue: #

type of change

  • new feature

how to test

Run the speaker_identification test and the tests in screenpipe-server/src/db.rs.
Use screenpipe.
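The schema changes described above could be sketched as below. The table and column names are inferred from the PR description, not copied from the actual migrations, and the transcriptions table name in particular is a guess:

```rust
// Hypothetical sketch of the schema this PR describes; names and types
// are inferred from the PR text, not from screenpipe's real migrations.
pub const CREATE_SPEAKERS: &str = "
CREATE TABLE IF NOT EXISTS speakers (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL DEFAULT '',
    metadata TEXT NOT NULL DEFAULT ''
);";

// one speaker -> many embeddings
pub const CREATE_SPEAKER_EMBEDDINGS: &str = "
CREATE TABLE IF NOT EXISTS speaker_embeddings (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    speaker_id INTEGER NOT NULL REFERENCES speakers(id),
    embedding BLOB NOT NULL
);";

// transcriptions gain a nullable speaker_id (table name is illustrative)
pub const ADD_SPEAKER_ID_COLUMN: &str =
    "ALTER TABLE audio_transcriptions ADD COLUMN speaker_id INTEGER REFERENCES speakers(id);";
```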

vercel bot commented Nov 13, 2024

screenpipe deployment: ✅ Ready (updated Nov 13, 2024 5:47am UTC)

@louis030195 (Collaborator) commented Nov 13, 2024

[Screenshot 2024-11-13 at 9:51:05 AM]

might be unrelated to this PR, keep testing

(running two screenpipe at once might be related)

@louis030195 (Collaborator) commented
@EzraEllette did you test on windows btw?

would be good if someone tested on windows

.join("pyannote")
.join("segmentation-3.0.onnx");

let embedding_extractor = EmbeddingExtractor::new(
@louis030195 (Collaborator) commented on this diff

any downside to load / unload the model at every chunk?
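The main downside of per-chunk load/unload is paying ONNX deserialization and allocation cost on every chunk. One way to load once and reuse, sketched here with std's OnceLock and a stand-in model type (names and the model path are hypothetical, not screenpipe's actual API):

```rust
use std::sync::OnceLock;

// Stand-in for the real ONNX session; constructing it is the expensive part.
struct SegmentationModel {
    path: String,
}

impl SegmentationModel {
    fn load(path: &str) -> Self {
        // In screenpipe this would deserialize segmentation-3.0.onnx once.
        SegmentationModel { path: path.to_string() }
    }
}

static MODEL: OnceLock<SegmentationModel> = OnceLock::new();

// Load on first use, then hand every chunk the same &'static model,
// avoiding per-chunk load/unload entirely.
fn segmentation_model() -> &'static SegmentationModel {
    MODEL.get_or_init(|| SegmentationModel::load("models/pyannote/segmentation-3.0.onnx"))
}
```

The trade-off is that the model stays resident for the process lifetime, which matters if memory pressure is the reason for unloading.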

@louis030195 (Collaborator) commented
1 GB / 37 GB), Total CPU: 17%, NPU: N/A
2024-11-13T19:00:44.580527Z  INFO screenpipe_audio::stt: Preparing segments
2024-11-13T19:00:44.580547Z  INFO screenpipe_audio::stt: device: MacBook Pro Microphone (input), resampling from 48000 Hz to 16000 Hz
2024-11-13T19:00:45.288077Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288101Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288174Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288177Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.330388Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.330408Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embeddi
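The empty-frames errors look like embedding is being attempted on segments too short to yield any feature frames. A minimal pre-check along these lines could skip them up front (the 16 kHz rate comes from the log above; the duration floor is purely illustrative):

```rust
// The log shows audio resampled to 16 kHz before STT.
const SAMPLE_RATE: usize = 16_000;
// Hypothetical floor: the embedder needs enough audio to produce
// at least one feature frame, so very short segments are skipped.
const MIN_EMBEDDING_SECS: f32 = 0.5;

// Returns true only if the segment is long enough to attempt a
// speaker embedding, avoiding "frames array is empty" failures.
fn is_embeddable(samples: &[f32]) -> bool {
    samples.len() as f32 / SAMPLE_RATE as f32 >= MIN_EMBEDDING_SECS
}
```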

@louis030195 (Collaborator) commented
 INFO screenpipe_server::resource_monitor: Runtime: 4028s, Total Memory: 2% (1 GB / 37 GB), Total CPU: 16%, NPU: N/A
2024-11-13T19:00:14.578918Z  INFO screenpipe_audio::stt: Preparing segments    
2024-11-13T19:00:14.578965Z  INFO screenpipe_audio::stt: device: MacBook Pro Microphone (input), resampling from 48000 Hz to 16000 Hz    
2024-11-13T19:00:14.998152Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:14.998174Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.040711Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.040735Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.128138Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.128158Z ERROR scr

@louis030195 (Collaborator) commented
unrelated but fun:


2024-11-13T18:58:16.221108Z  INFO screenpipe_audio::multilingual: detected language: "fr"    
2024-11-13T18:58:16.532145Z  INFO screenpipe_server::resource_monitor: Runtime: 3917s, Total Memory: 2% (1 GB / 37 GB), Total CPU: 31%, NPU: N/A
2024-11-13T18:58:18.758847Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T18:58:18.758879Z  INFO screenpipe_audio::whisper:   0.0s-6.0s:  Parce que c'est tellement abstrait pour les gens dans la tech.    
2024-11-13T18:58:18.758891Z  INFO screenpipe_audio::whisper:   6.0s-8.0s:  Dans la startup.    
2024-11-13T18:58:18.758911Z  INFO screenpipe_audio::whisper:   10.0s-14.0s:  Si on se retrouve dans YC, qu'est-ce qui va passer avec l'évaluation ?    

whisper detected my voice as french (i spoke english)

@louis030195 (Collaborator) commented
a/MacBook Pro Microphone (input)_2024-11-13_19-16-47.mp4"    
2024-11-13T19:16:48.483339Z  INFO screenpipe_audio::multilingual: detected language: "en"    
2024-11-13T19:16:49.955738Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T19:16:49.955759Z  INFO screenpipe_audio::whisper:   0.0s-1.8s:  Well, like we were previously    
2024-11-13T19:16:49.977495Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) received transcription Some(" Well, like we were previously\n")    
2024-11-13T19:16:49.978565Z  INFO screenpipe_server::core: Detected speaker: Speaker { id: 90, name: "", metadata: "" }    
2024-11-13T19:16:49.978582Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) inserting audio chunk: "/tmp/spp/data/MacBook Pro Microphone (input)_2024-11-13_19-16-49.mp4"    
2024-11-13T19:16:50.616236Z  INFO screenpipe_audio::multilingual: detected language: "en"    
2024-11-13T19:16:51.325953Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T19:16:51.325974Z  INFO screenpipe_audio::whisper:   0.0s-30.0s:  Thank you.    
2024-11-13T19:16:51.402875Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) received transcription Some(" Thank you.\n")    
2024-11-13T19:16:51.403004Z ERROR screenpipe_server::core: Error processing audio result: error returned from database: (code: 1) zero-length vectors are not supported. 

there are a few "Thank you" transcriptions (something with VAD i suppose) but maybe not more than on main
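The "zero-length vectors are not supported" error above comes from handing the database an empty embedding. A cheap pre-insert guard could reject these before they reach storage (the function name is illustrative, not screenpipe's actual API):

```rust
// Guard against persisting an empty embedding, which the database's
// vector column rejects with "zero-length vectors are not supported".
// Name and signature are illustrative, not screenpipe's real API.
fn validate_embedding(embedding: &[f32]) -> Result<(), String> {
    if embedding.is_empty() {
        return Err("refusing to store zero-length speaker embedding".to_string());
    }
    Ok(())
}
```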

@EzraEllette (Contributor, Author) commented
Okay, there are some bug fixes to make.
