
bug: phantom inputs from google STT library for low confidence / short transcripts #1103

Open
brightsparc opened this issue Nov 18, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@brightsparc (Contributor)

In the voice agent I'm building, I see phantom numeric user_transcript inputs come through when using the Google STT library.

In the example below, you can see a "1" being emitted, detected outside of the user_started_speaking / user_stopped_speaking VAD window.

2024-11-18 11:03:25,816 - DEBUG livekit.agents.pipeline - user_started_speaking {"pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:31,607 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "Uh, let me think about that. Yes.", "type": "final_transcript", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 0
2024-11-18 11:03:32,436 - DEBUG livekit.agents.pipeline - user_stopped_speaking {"pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:32,813 - DEBUG livekit.agents.pipeline - validated agent reply {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:35,517 - DEBUG livekit.agents.pipeline - speech playout started {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 2
2024-11-18 11:03:38,305 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "1", "type": "final_transcript", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:39,807 - DEBUG livekit.agents.pipeline - skipping validation, agent is speaking and does not allow interruptions {"speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
sending state 0
2024-11-18 11:03:55,241 - DEBUG livekit.agents.pipeline - speech playout finished {"speech_id": "67b1b92a6de2", "interrupted": false, "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}
2024-11-18 11:03:55,242 - DEBUG livekit.agents.pipeline - committed agent speech {"agent_transcript": " Take your time.", "interrupted": false, "speech_id": "67b1b92a6de2", "pid": 70217, "job_id": "AJ_LBHQeFbo38VK"}

You can see another instance where a series of numbers, 1 2 3 7 0 9 1 2 0 0 0 0 0, is emitted that was not part of the transcript. (I've added logging for interim transcripts, which shows the additional "final_transcript" emitted just prior to the actual final input.)

2024-11-18 11:16:08,133 - DEBUG livekit.agents.pipeline - user_started_speaking {"pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:08,869 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "uh", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,063 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry.", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,370 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry can", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,487 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,651 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,864 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:09,948 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,047 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "sorry, can you repeat", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,164 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry, can you repeat that?", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:10,436 - DEBUG livekit.agents.pipeline - received interim transcript {"user_transcript": "Sorry, can you repeat that?", "type": "interim_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
sending state 0
2024-11-18 11:16:16,307 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "1 2 3 7 0 9 1 2 0 0 0 0 0", "type": "final_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}
2024-11-18 11:16:17,203 - DEBUG livekit.agents.pipeline - received user transcript {"user_transcript": "Sorry, can you repeat that?", "type": "final_transcript", "pid": 70716, "job_id": "AJ_Kr4GyxgomYyM"}

I've captured the output from the Google speech response in some cases:

SPEECH RESP metadata {
  total_billed_duration {
    seconds: 176
  }
  request_id: "673b2846-0000-20ff-acb0-089e082cdb00"
}
results {
  alternatives {
    transcript: "3 9 2 0 8 0 8 0 0 0 0"
    confidence: 0.585391164
    words {
      start_offset {
        seconds: 24
        nanos: 470000000
      }
      end_offset {
        seconds: 24
        nanos: 550000000
      }
      word: "3"
      confidence: 0.33627677
    }
    words {
      start_offset {
        seconds: 24
        nanos: 550000000
      }
      end_offset {
        seconds: 24
        nanos: 670000000
      }
      word: "9"
      confidence: 0.626038194
    }
    words {
      start_offset {
        seconds: 24
        nanos: 670000000
      }
      end_offset {
        seconds: 24
        nanos: 750000000
      }
      word: "2"
      confidence: 0.79240793
    }
    words {
      start_offset {
        seconds: 24
        nanos: 750000000
      }
      end_offset {
        seconds: 24
        nanos: 870000000
      }
      word: "0"
      confidence: 0.633162379
    }
    words {
      start_offset {
        seconds: 24
        nanos: 870000000
      }
      end_offset {
        seconds: 24
        nanos: 990000000
      }
      word: "8"
      confidence: 0.604495645
    }
    words {
      start_offset {
        seconds: 24
        nanos: 990000000
      }
      end_offset {
        seconds: 25
        nanos: 110000000
      }
      word: "0"
      confidence: 0.695137799
    }
    words {
      start_offset {
        seconds: 25
        nanos: 110000000
      }
      end_offset {
        seconds: 25
        nanos: 190000000
      }
      word: "8"
      confidence: 0.635257244
    }
    words {
      start_offset {
        seconds: 25
        nanos: 190000000
      }
      end_offset {
        seconds: 25
        nanos: 270000000
      }
      word: "0"
      confidence: 0.615414381
    }
    words {
      start_offset {
        seconds: 25
        nanos: 270000000
      }
      end_offset {
        seconds: 25
        nanos: 390000000
      }
      word: "0"
      confidence: 0.609516263
    }
    words {
      start_offset {
        seconds: 25
        nanos: 390000000
      }
      end_offset {
        seconds: 25
        nanos: 510000000
      }
      word: "0"
      confidence: 0.546232462
    }
    words {
      start_offset {
        seconds: 25
        nanos: 510000000
      }
      end_offset {
        seconds: 25
        nanos: 590000000
      }
      word: "0"
      confidence: 0.591834068
    }
  }
  is_final: true
  result_end_offset {
    seconds: 10
    nanos: 630000114
  }
  language_code: "en-AU"
}

In most cases the confidence is < 0.6, so these could probably be ignored, but in other cases there are numeric inputs with high confidence that have a duration of 0 (i.e. equal start/end offsets).

SPEECH RESP metadata {
  total_billed_duration {
    seconds: 29
  }
  request_id: "674eec46-0000-2f17-b8b0-ac3eb15a0510"
}
results {
  alternatives {
    transcript: "3688866886886886868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686"
    confidence: 0.956959367
    words {
      start_offset {
        seconds: 34
        nanos: 330000000
      }
      end_offset {
        seconds: 34
        nanos: 330000000
      }
      word: "3688866886886886868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686868686"
      confidence: 0.960090339
    }
  }
  is_final: true
  result_end_offset {
    seconds: 21
    nanos: 370000839
  }
  language_code: "en-AU"
}
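As a minimal sketch of how both signals could be checked, the snippet below computes per-word durations from the start_offset/end_offset pairs in the dumps above and flags an alternative as suspect on either low overall confidence or a near-zero-length word. The dict shapes mirror the protobuf dumps in this issue; the function names and thresholds are illustrative, not the google-cloud-speech Python API.

```python
# Heuristic phantom-transcript detection based on the two patterns observed
# above: low alternative confidence, or words with zero/near-zero duration.
# Field names follow the protobuf dumps in this issue; dicts stand in for
# the real response objects.

def word_duration(word: dict) -> float:
    """Duration in seconds from {seconds, nanos} start/end offsets."""
    start = word["start_offset"]["seconds"] + word["start_offset"]["nanos"] / 1e9
    end = word["end_offset"]["seconds"] + word["end_offset"]["nanos"] / 1e9
    return end - start

def is_phantom(alternative: dict,
               min_confidence: float = 0.65,
               min_duration: float = 0.05) -> bool:
    """True if the alternative looks like a phantom input."""
    if alternative["confidence"] < min_confidence:
        return True
    return any(word_duration(w) < min_duration
               for w in alternative.get("words", []))

# The second response above: high confidence, but a zero-duration word.
alt = {
    "confidence": 0.956959367,
    "words": [{
        "start_offset": {"seconds": 34, "nanos": 330000000},
        "end_offset": {"seconds": 34, "nanos": 330000000},
        "confidence": 0.960090339,
    }],
}
print(is_phantom(alt))  # → True (zero-duration word despite 0.96 confidence)
```

Note that the word-level durations catch the second case (0.96 confidence) that a confidence threshold alone would pass.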
@brightsparc brightsparc added the bug Something isn't working label Nov 18, 2024
@brightsparc (Contributor, Author)

brightsparc commented Nov 18, 2024

This is seen when using the Google STT library with the chirp_2 model. I've tested with the default long model and it doesn't happen at all, or happens far less often.

So I think there is probably a fix required specific to Google, but I also wonder whether there should be some thresholding on the minimum confidence / duration of the text input returned, as typically these phantom inputs are either low confidence (< 0.65) or very short (< 0.05 seconds).
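To make the thresholding suggestion concrete, here is a sketch of a gate applied to final transcript events before they reach the pipeline. `SpeechEvent` and its fields are hypothetical stand-ins for whatever the STT stream yields, not the actual livekit.agents API, and the thresholds are the values observed in this issue.

```python
# Sketch: drop final transcripts below a confidence/duration floor before
# they are committed to the chat context. SpeechEvent is a hypothetical
# stand-in, not the real livekit.agents event type.

from dataclasses import dataclass

@dataclass
class SpeechEvent:
    transcript: str
    confidence: float  # alternative-level confidence from the STT response
    duration: float    # seconds spanned by the words in the alternative

MIN_CONFIDENCE = 0.65  # phantom inputs observed below this
MIN_DURATION = 0.05    # phantom inputs observed shorter than this

def accept_final_transcript(ev: SpeechEvent) -> bool:
    """Gate a final transcript on the two phantom-input signals."""
    return ev.confidence >= MIN_CONFIDENCE and ev.duration >= MIN_DURATION
```

With this, the "1" at 0.59 confidence and the long digit string with zero duration would both be dropped, while a normal utterance passes through. The thresholds would likely need tuning per model (e.g. chirp_2 vs. long).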
