Speaker Identification #672

Merged
merged 31 commits into from
Nov 18, 2024

Conversation

EzraEllette
Contributor

@EzraEllette EzraEllette commented Nov 13, 2024

description

This PR adds speaker identification to screenpipe. Audio is segmented by speaker and then transcribed. Transcriptions now have a speaker_id column. A new speakers table was added with name and metadata columns, and a speaker_embeddings table was created with a one-to-many relationship between a speaker and its embeddings.
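
A minimal sketch of the data model described above, as plain in-memory Rust types rather than the actual database code. The table and column names (speakers, speaker_embeddings, speaker_id, name, metadata) come from the description; the field types, the nullable `speaker_id` on transcriptions, and the helper `embeddings_by_speaker` are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory mirror of a row in the `speakers` table.
#[derive(Debug, Clone, PartialEq)]
pub struct Speaker {
    pub id: i64,
    pub name: String,
    pub metadata: String,
}

/// One row of `speaker_embeddings`; many rows may share one `speaker_id`.
#[derive(Debug, Clone, PartialEq)]
pub struct SpeakerEmbedding {
    pub speaker_id: i64,
    pub embedding: Vec<f32>,
}

/// A transcription row carrying the new `speaker_id` column.
/// Assumption: the column may be empty when no speaker was identified.
#[derive(Debug, Clone, PartialEq)]
pub struct Transcription {
    pub text: String,
    pub speaker_id: Option<i64>,
}

/// Group embedding rows by owning speaker (the "many" side of one-to-many).
pub fn embeddings_by_speaker(
    rows: &[SpeakerEmbedding],
) -> HashMap<i64, Vec<&SpeakerEmbedding>> {
    let mut grouped: HashMap<i64, Vec<&SpeakerEmbedding>> = HashMap::new();
    for row in rows {
        grouped.entry(row.speaker_id).or_default().push(row);
    }
    grouped
}
```

Keeping embeddings in their own rows, instead of a single vector on the speaker, lets one speaker accumulate several voice samples over time.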

related issue: /claim #306

type of change

  • new feature

how to test

Run the speaker_identification test and the tests in screenpipe-server/src/db.rs.
Use screenpipe.


vercel bot commented Nov 13, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
screenpipe ✅ Ready (Inspect) Visit Preview 💬 Add feedback Nov 18, 2024 9:02pm

@louis030195
Collaborator

louis030195 commented Nov 13, 2024

Screenshot 2024-11-13 at 9 51 05 AM

might be unrelated to this PR, keep testing

(running two screenpipe at once might be related)

@louis030195
Collaborator

1 GB / 37 GB), Total CPU: 17%, NPU: N/A
2024-11-13T19:00:44.580527Z INFO screenpipe_audio::stt: Preparing segments
2024-11-13T19:00:44.580547Z INFO screenpipe_audio::stt: device: MacBook Pro Microphone (input), resampling from 48000 Hz to 16000 Hz
2024-11-13T19:00:45.288077Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288101Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288174Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.288177Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.330388Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:45.330408Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embeddi

@louis030195
Collaborator

 INFO screenpipe_server::resource_monitor: Runtime: 4028s, Total Memory: 2% (1 GB / 37 GB), Total CPU: 16%, NPU: N/A
2024-11-13T19:00:14.578918Z  INFO screenpipe_audio::stt: Preparing segments    
2024-11-13T19:00:14.578965Z  INFO screenpipe_audio::stt: device: MacBook Pro Microphone (input), resampling from 48000 Hz to 16000 Hz    
2024-11-13T19:00:14.998152Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:14.998174Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.040711Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.040735Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.128138Z ERROR screenpipe_audio::pyannote::segment: Failed to compute speaker embedding: The frames array is empty. No features to compute.
2024-11-13T19:00:15.128158Z ERROR scr

@louis030195
Collaborator

unrelated but fun:


2024-11-13T18:58:16.221108Z  INFO screenpipe_audio::multilingual: detected language: "fr"    
2024-11-13T18:58:16.532145Z  INFO screenpipe_server::resource_monitor: Runtime: 3917s, Total Memory: 2% (1 GB / 37 GB), Total CPU: 31%, NPU: N/A
2024-11-13T18:58:18.758847Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T18:58:18.758879Z  INFO screenpipe_audio::whisper:   0.0s-6.0s:  Parce que c'est tellement abstrait pour les gens dans la tech.    
2024-11-13T18:58:18.758891Z  INFO screenpipe_audio::whisper:   6.0s-8.0s:  Dans la startup.    
2024-11-13T18:58:18.758911Z  INFO screenpipe_audio::whisper:   10.0s-14.0s:  Si on se retrouve dans YC, qu'est-ce qui va passer avec l'évaluation ?    

whisper detected my voice as french (i spoke english)

@louis030195
Collaborator

a/MacBook Pro Microphone (input)_2024-11-13_19-16-47.mp4"    
2024-11-13T19:16:48.483339Z  INFO screenpipe_audio::multilingual: detected language: "en"    
2024-11-13T19:16:49.955738Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T19:16:49.955759Z  INFO screenpipe_audio::whisper:   0.0s-1.8s:  Well, like we were previously    
2024-11-13T19:16:49.977495Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) received transcription Some(" Well, like we were previously\n")    
2024-11-13T19:16:49.978565Z  INFO screenpipe_server::core: Detected speaker: Speaker { id: 90, name: "", metadata: "" }    
2024-11-13T19:16:49.978582Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) inserting audio chunk: "/tmp/spp/data/MacBook Pro Microphone (input)_2024-11-13_19-16-49.mp4"    
2024-11-13T19:16:50.616236Z  INFO screenpipe_audio::multilingual: detected language: "en"    
2024-11-13T19:16:51.325953Z  INFO screenpipe_audio::whisper:   0.0s-0.0s:     
2024-11-13T19:16:51.325974Z  INFO screenpipe_audio::whisper:   0.0s-30.0s:  Thank you.    
2024-11-13T19:16:51.402875Z  INFO screenpipe_server::core: device MacBook Pro Microphone (input) received transcription Some(" Thank you.\n")    
2024-11-13T19:16:51.403004Z ERROR screenpipe_server::core: Error processing audio result: error returned from database: (code: 1) zero-length vectors are not supported. 

there are a few spurious "Thank you" transcriptions (something with VAD i suppose) but maybe not more than on main
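
The two errors in the logs above ("The frames array is empty" and "zero-length vectors are not supported") suggest that an empty embedding can reach the database insert. A minimal sketch of a guard that could sit in front of that insert follows. The names `EmbeddingError` and `validate_embedding` are hypothetical and not taken from the screenpipe code base. Because "zero-length" could mean either an empty vector or a vector with zero norm, the sketch rejects both; that reading is an assumption.

```rust
/// Reasons an embedding is unusable for storage or similarity search.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, PartialEq)]
pub enum EmbeddingError {
    /// The vector has no elements (e.g. no audio frames were available).
    Empty,
    /// The vector contains NaN or infinite values.
    NonFinite,
    /// Every element is zero, so the vector has no direction to compare.
    ZeroNorm,
}

/// Validate an embedding before it is written or used for speaker search.
pub fn validate_embedding(embedding: &[f32]) -> Result<(), EmbeddingError> {
    if embedding.is_empty() {
        return Err(EmbeddingError::Empty);
    }
    if embedding.iter().any(|v| !v.is_finite()) {
        return Err(EmbeddingError::NonFinite);
    }
    // Squared norm is enough to detect an all-zero vector.
    let norm_sq: f32 = embedding.iter().map(|v| v * v).sum();
    if norm_sq == 0.0 {
        return Err(EmbeddingError::ZeroNorm);
    }
    Ok(())
}
```

A caller could skip speaker assignment for a segment that fails this check while still storing the transcription, instead of failing the whole audio result.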

@EzraEllette
Contributor Author

Okay there are some bug fixes to make.

@EzraEllette
Contributor Author

@louis030195 I was able to identify the source of the bug and fix it.

@louis030195
Collaborator

looks great! @EzraEllette

i want to merge this ASAP. i think some things may have changed that we don't know about yet, so it makes sense to merge and ask a few people to test it out and see if it works roughly as before

one last thing to fix before merging though:

https://youtu.be/vk711s6h8W4

there is an issue with the audio data encoded to disk: for some reason the speed or something has changed, check the video

@EzraEllette
Contributor Author

Okay, I don't have my computer with me right now, but I have experienced something similar before. If you want to take a look, the sample rate passed to the stt function is probably wrong: we have to use a 16000 Hz rate for segmentation, and I'm probably not reflecting that change when STT is called. Sent from my phone at a concert, so pls forgive the grammar.
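
To illustrate why a mismatched sample rate would change playback speed, here is a small sketch using the 48000 Hz and 16000 Hz rates from the logs. It only demonstrates the arithmetic behind the hypothesis above and does not describe the fix that was eventually made. The function names are hypothetical.

```rust
/// Duration in seconds of `num_samples` mono samples played at `sample_rate` Hz.
pub fn duration_secs(num_samples: usize, sample_rate: u32) -> f64 {
    num_samples as f64 / sample_rate as f64
}

/// Speed factor heard when samples captured at `actual_rate` Hz are
/// stored or played back as if they were `declared_rate` Hz.
/// A result above 1.0 means the audio sounds sped up.
pub fn playback_speed_factor(actual_rate: u32, declared_rate: u32) -> f64 {
    declared_rate as f64 / actual_rate as f64
}
```

For example, 10 seconds of audio resampled to 16000 Hz holds 160000 samples. `duration_secs(160_000, 16_000)` gives 10.0, but labeling the same buffer as 48000 Hz gives about 3.33 seconds, which matches `playback_speed_factor(16_000, 48_000)` returning 3.0.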

@louis030195 louis030195 mentioned this pull request Nov 17, 2024
4 tasks
@louis030195
Collaborator

any news?

@EzraEllette
Contributor Author

any news?

@louis030195 Making the UI today

@EzraEllette
Contributor Author

This should be safe to merge once tested again. UI can come soon.

@EzraEllette
Contributor Author

I fixed the audio storage issue.

@louis030195
Collaborator

amazing

/approve


algora-pbc bot commented Nov 18, 2024

@louis030195: The claim has been successfully added to reward-all. You can visit your dashboard to complete the payment.

@louis030195 louis030195 merged commit 478fb05 into mediar-ai:main Nov 18, 2024
3 of 7 checks passed
@louis030195
Collaborator

@EzraEllette any suggestion next steps?

@EzraEllette
Contributor Author

@EzraEllette any suggestion next steps?
Some ideas:

  • Update the UI to include speakers (meetings, search, etc.)
  • Add a UI to manually identify speakers
    • Provide the audio recording in the UI and allow updating the name, or searching for and selecting a previously identified speaker.
    • Since one speaker can have multiple embeddings in the database, when a previously identified speaker is selected, we should update the speaker embeddings to point to the selected speaker and remove the old speaker from the database. (This allows us to use lower thresholds for speaker search.)
  • Attempt to use an LLM for identification through meeting context
  • Utilize speaker metadata
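
The merge step from the list above (re-point a duplicate speaker's embeddings to the selected speaker, then remove the duplicate) could be sketched as follows. This is an in-memory illustration under stated assumptions, not the database code: `Speaker`, `SpeakerEmbedding`, and `merge_speakers` are hypothetical, and a real implementation would presumably run the equivalent updates inside a single transaction.

```rust
/// Hypothetical in-memory speaker row.
#[derive(Debug, Clone, PartialEq)]
pub struct Speaker {
    pub id: i64,
    pub name: String,
}

/// Hypothetical in-memory embedding row owned by one speaker.
#[derive(Debug, Clone, PartialEq)]
pub struct SpeakerEmbedding {
    pub speaker_id: i64,
    pub embedding: Vec<f32>,
}

/// Merge `duplicate_id` into `keep_id`: re-point every embedding of the
/// duplicate to the kept speaker, then delete the duplicate speaker.
/// Returns the number of embeddings that were moved.
pub fn merge_speakers(
    speakers: &mut Vec<Speaker>,
    embeddings: &mut [SpeakerEmbedding],
    keep_id: i64,
    duplicate_id: i64,
) -> usize {
    // Refuse to merge a speaker into itself or into a missing speaker,
    // which would otherwise delete data or orphan embeddings.
    if keep_id == duplicate_id || !speakers.iter().any(|s| s.id == keep_id) {
        return 0;
    }
    let mut moved = 0;
    for row in embeddings.iter_mut() {
        if row.speaker_id == duplicate_id {
            row.speaker_id = keep_id;
            moved += 1;
        }
    }
    speakers.retain(|s| s.id != duplicate_id);
    moved
}
```

After a merge, the kept speaker owns more embeddings, so a later search has more reference samples to match against. Transcriptions that reference the removed speaker would need the same re-pointing; that step is left out of this sketch.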

@louis030195
Collaborator

let's continue here @EzraEllette

#695
