In the voice agent I'm building, I see phantom numeric user_transcript inputs come through when using the Google STT library.
An example is below: you can see "1" being emitted as a final transcript outside of the user_started_speaking / user_stopped_speaking VAD window.
11-18 11:03:25,816 - DEBUG livekit.agents.pipeline - user_started_speaking {"pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:31,607 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "Uh, let me think about that. Yes.", "type": "final_transcript", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 0
2024-11-18 11:03:32,436 - DEBUG livekit.agents.pipeline - user_stopped_speaking {"pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:32,813 - DEBUG livekit.agents.pipeline - validated agent reply {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:35,517 - DEBUG livekit.agents.pipeline - speech playout started {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 2
2024-11-18 11:03:38,305 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "1", "type": "final_transcript", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:39,807 - DEBUG livekit.agents.pipeline - skipping validation, agent is speaking and does not allow interruptions {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 0
2024-11-18 11:03:55,241 - DEBUG livekit.agents.pipeline - speech playout finished {"speech_id": "67b1b92a6de2", "interrupted": false, "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:55,242 - DEBUG livekit.agents.pipeline - committed agent speech {"agent_transcript": " Take your time., "interrupted": false, "speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
Here is another instance, where a series of numbers (1 2 3 7 0 9 1 2 0 0 0 0 0) is emitted even though they were not part of the transcript. I've added logging for interim transcripts, which shows the additional "final_transcript" emitted just prior to the actual final input.
2024-11-18 11:16:08,133 - DEBUG livekit.agents.pipeline - user_started_speaking {"pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:08,869 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "uh", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,063 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry.", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,370 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry can", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,487 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,651 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,864 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,948 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,047 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,164 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry, can you repeat that?", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,436 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry, can you repeat that?", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
sending state 0
2024-11-18 11:16:16,307 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "1 2 3 7 0 9 1 2 0 0 0 0 0", "type": "final_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:17,203 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "Sorry, can you repeat that?", "type": "final_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
I've captured the output from the Google speech response in some cases:
In most cases the confidence is below 0.6, so these could probably be ignored, but in other cases there are numeric inputs with high confidence and a duration of 0 (i.e. the start and end offsets are equal).
This is seen when using the Google STT library with the chirp_2 model. I've tested with the default long model and it doesn't happen at all, or at least not often.
So I think a fix specific to Google is probably required, but I also wonder whether there should be some thresholding on the minimum confidence / duration of the text input returned, as these phantom inputs are typically either low confidence (<0.65) or very short (<0.05 seconds).
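
To make the thresholding idea concrete, here is a minimal sketch of the kind of filter I have in mind. It is not the livekit or Google API: the `Transcript` dataclass, its field names, and the `is_phantom` helper are hypothetical stand-ins for whatever the STT plugin exposes, and the thresholds are just the rough values mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical container for one final STT result (not a real livekit type)."""
    text: str
    confidence: float    # 0.0 .. 1.0, as reported by the STT service
    start_offset: float  # seconds
    end_offset: float    # seconds


# Rough thresholds taken from the observations above; these are assumptions to tune.
MIN_CONFIDENCE = 0.65
MIN_DURATION = 0.05  # seconds


def is_phantom(
    t: Transcript,
    min_confidence: float = MIN_CONFIDENCE,
    min_duration: float = MIN_DURATION,
) -> bool:
    """Return True if the transcript looks like a phantom input.

    A result is flagged when its confidence is low or when it is
    (near) zero-length, i.e. the start and end offsets are almost equal.
    """
    duration = t.end_offset - t.start_offset
    return t.confidence < min_confidence or duration < min_duration


def filter_transcripts(transcripts: list[Transcript]) -> list[Transcript]:
    """Keep only the transcripts that pass both thresholds."""
    return [t for t in transcripts if not is_phantom(t)]
```

With something like this, a high-confidence "1" whose start and end offsets are equal would be dropped, while a normal utterance such as "Sorry, can you repeat that?" would pass through. Whether the check belongs in the Google plugin or in the pipeline is an open question.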