
bug: phantom inputs from google STT library for low confidence / short transcripts #1103

Open
brightsparc opened this issue Nov 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@brightsparc (Contributor)

In the voice agent I'm building, I see phantom numeric user_transcript inputs come through when using the Google STT library.

In the example below, you can see a "1" being emitted, detected outside of the user_started_speaking / user_stopped_speaking VAD window.

2024-11-18 11:03:25,816 - DEBUG livekit.agents.pipeline - user_started_speaking {"pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:31,607 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "Uh, let me think about that. Yes.", "type": "final_transcript", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 0
2024-11-18 11:03:32,436 - DEBUG livekit.agents.pipeline - user_stopped_speaking {"pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:32,813 - DEBUG livekit.agents.pipeline - validated agent reply {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:35,517 - DEBUG livekit.agents.pipeline - speech playout started {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 2
2024-11-18 11:03:38,305 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "1", "type": "final_transcript", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:39,807 - DEBUG livekit.agents.pipeline - skipping validation, agent is speaking and does not allow interruptions {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 0
2024-11-18 11:03:55,241 - DEBUG livekit.agents.pipeline - speech playout finished {"speech_id": "67b1b92a6de2", "interrupted": false, "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:55,242 - DEBUG livekit.agents.pipeline - committed agent speech {"agent_transcript": " Take your time.", "interrupted": false, "speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}

You can see another instance where a series of numbers, 1 2 3 7 0 9 1 2 0 0 0 0 0, is emitted that was not part of the transcript. (I've added logging for interim transcripts, which shows the additional "final_transcript" emitted just prior to the actual final input.)

2024-11-18 11:16:08,133 - DEBUG livekit.agents.pipeline - user_started_speaking {"pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:08,869 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "uh", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,063 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry.", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,370 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry can", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,487 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,651 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,864 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,948 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,047 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,164 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry, can you repeat that?", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,436 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry, can you repeat that?", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
sending state 0
2024-11-18 11:16:16,307 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "1 2 3 7 0 9 1 2 0 0 0 0 0", "type": "final_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:17,203 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "Sorry, can you repeat that?", "type": "final_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}

I've captured the output from the Google speech response in some cases:

SPEECH RESP metadata {
  total_billed_duration {
    seconds: 176
  }
  request_id: "673b2846-0000-20ff-acb0-089e082cdb00"
}
results {
  alternatives {
    transcript: "3 9 2 0 8 0 8 0 0 0 0"
    confidence: 0.585391164
    words {
      start_offset {
        seconds: 24
        nanos: 470000000
      }
      end_offset {
        seconds: 24
        nanos: 550000000
      }
      word: "3"
      confidence: 0.33627677
    }
    words {
      start_offset {
        seconds: 24
        nanos: 550000000
      }
      end_offset {
        seconds: 24
        nanos: 670000000
      }
      word: "9"
      confidence: 0.626038194
    }
    words {
      start_offset {
        seconds: 24
        nanos: 670000000
      }
      end_offset {
        seconds: 24
        nanos: 750000000
      }
      word: "2"
      confidence: 0.79240793
    }
    words {
      start_offset {
        seconds: 24
        nanos: 750000000
      }
      end_offset {
        seconds: 24
        nanos: 870000000
      }
      word: "0"
      confidence: 0.633162379
    }
    words {
      start_offset {
        seconds: 24
        nanos: 870000000
      }
      end_offset {
        seconds: 24
        nanos: 990000000
      }
      word: "8"
      confidence: 0.604495645
    }
    words {
      start_offset {
        seconds: 24
        nanos: 990000000
      }
      end_offset {
        seconds: 25
        nanos: 110000000
      }
      word: "0"
      confidence: 0.695137799
    }
    words {
      start_offset {
        seconds: 25
        nanos: 110000000
      }
      end_offset {
        seconds: 25
        nanos: 190000000
      }
      word: "8"
      confidence: 0.635257244
    }
    words {
      start_offset {
        seconds: 25
        nanos: 190000000
      }
      end_offset {
        seconds: 25
        nanos: 270000000
      }
      word: "0"
      confidence: 0.615414381
    }
    words {
      start_offset {
        seconds: 25
        nanos: 270000000
      }
      end_offset {
        seconds: 25
        nanos: 390000000
      }
      word: "0"
      confidence: 0.609516263
    }
    words {
      start_offset {
        seconds: 25
        nanos: 390000000
      }
      end_offset {
        seconds: 25
        nanos: 510000000
      }
      word: "0"
      confidence: 0.546232462
    }
    words {
      start_offset {
        seconds: 25
        nanos: 510000000
      }
      end_offset {
        seconds: 25
        nanos: 590000000
      }
      word: "0"
      confidence: 0.591834068
    }
  }
  is_final: true
  result_end_offset {
    seconds: 10
    nanos: 630000114
  }
  language_code: "en-AU"
}

In most cases the confidence is < 0.6, so these could probably be ignored, but in other cases there are numeric inputs with high confidence that have a duration of 0 (i.e. equal start/end offsets).

SPEECH RESP metadata {
  total_billed_duration {
    seconds: 29
  }
  request_id: "674eec46-0000-2f17-b8b0-ac3eb15a0510"
}
results {
  alternatives {
    transcript: "3688866886886886868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686"
    confidence: 0.956959367
    words {
      start_offset {
        seconds: 34
        nanos: 330000000
      }
      end_offset {
        seconds: 34
        nanos: 330000000
      }
      word: "3688866886886886868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686"
      confidence: 0.960090339
    }
  }
  is_final: true
  result_end_offset {
    seconds: 21
    nanos: 370000839
  }
  language_code: "en-AU"
}
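As a minimal sketch of how both signals could be checked, the snippet below computes per-word durations from the start_offset/end_offset pairs in the dumps above and flags an alternative as suspect on either low overall confidence or a near-zero-length word. The dict shapes mirror the protobuf dumps in this issue; the function names and thresholds are illustrative, not the google-cloud-speech Python API.

```python
# Heuristic phantom-transcript detection based on the two patterns observed
# above: low alternative confidence, or words with zero/near-zero duration.
# Field names follow the protobuf dumps in this issue; dicts stand in for
# the real response objects.

def word_duration(word: dict) -> float:
    """Duration in seconds from {seconds, nanos} start/end offsets."""
    start = word["start_offset"]["seconds"] + word["start_offset"]["nanos"] / 1e9
    end = word["end_offset"]["seconds"] + word["end_offset"]["nanos"] / 1e9
    return end - start

def is_phantom(alternative: dict,
               min_confidence: float = 0.65,
               min_duration: float = 0.05) -> bool:
    """True if the alternative looks like a phantom input."""
    if alternative["confidence"] < min_confidence:
        return True
    return any(word_duration(w) < min_duration
               for w in alternative.get("words", []))

# The second response above: high confidence, but a zero-duration word.
alt = {
    "confidence": 0.956959367,
    "words": [{
        "start_offset": {"seconds": 34, "nanos": 330000000},
        "end_offset": {"seconds": 34, "nanos": 330000000},
        "confidence": 0.960090339,
    }],
}
print(is_phantom(alt))  # → True (zero-duration word despite 0.96 confidence)
```

Note that the word-level durations catch the second case (0.96 confidence) that a confidence threshold alone would pass.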
@brightsparc brightsparc added the bug Something isn't working label Nov 18, 2024
@brightsparc (Contributor, Author)

brightsparc commented Nov 18, 2024

This is seen when using the Google STT library with the chirp_2 model. I've tested with the default long model and it doesn't happen at all, or happens far less often.

So I think there is probably a fix required specific to Google, but I also wonder whether there should be some thresholding on the minimum confidence / duration of the text input returned, as typically these phantom inputs are either low confidence (< 0.65) or very short (< 0.05 seconds).
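To make the thresholding suggestion concrete, here is a sketch of a gate applied to final transcript events before they reach the pipeline. `SpeechEvent` and its fields are hypothetical stand-ins for whatever the STT stream yields, not the actual livekit.agents API, and the thresholds are the values observed in this issue.

```python
# Sketch: drop final transcripts below a confidence/duration floor before
# they are committed to the chat context. SpeechEvent is a hypothetical
# stand-in, not the real livekit.agents event type.

from dataclasses import dataclass

@dataclass
class SpeechEvent:
    transcript: str
    confidence: float  # alternative-level confidence from the STT response
    duration: float    # seconds spanned by the words in the alternative

MIN_CONFIDENCE = 0.65  # phantom inputs observed below this
MIN_DURATION = 0.05    # phantom inputs observed shorter than this

def accept_final_transcript(ev: SpeechEvent) -> bool:
    """Gate a final transcript on the two phantom-input signals."""
    return ev.confidence >= MIN_CONFIDENCE and ev.duration >= MIN_DURATION
```

With this, the "1" at 0.59 confidence and the long digit string with zero duration would both be dropped, while a normal utterance passes through. The thresholds would likely need tuning per model (e.g. chirp_2 vs. long).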
