Decoding with time slots and with a confidence #713

Closed
AlexandderGorodetski opened this issue Nov 28, 2022 · 5 comments

@AlexandderGorodetski

Hello guys,

Is it possible in k2 to extract the start time and end time of every decoded (hypothesized) word, along with a recognition confidence for every word and for the full utterance?

Thanks,
AlexG.

@marcoyang1998
Collaborator

We already have the option to return timestamps during decoding.
I think it is also possible to get the recognition confidence of every word. What do you mean by utterance-level confidence?

@AlexandderGorodetski
Author

What option should I use to print timestamps and the confidence of every word?

Utterance-level confidence is less important; it would be something like the average of all word confidences in the utterance, excluding silences and UNKs.
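Something like this (just a sketch; the struct, the function name, and the values are made up):

```cpp
#include <iostream>
#include <string>
#include <vector>

// A decoded word with its confidence (illustrative names only).
struct WordConfidence {
  std::string word;
  float confidence;
};

// Utterance-level confidence: average of all word confidences,
// skipping silences and <unk> tokens.
float UtteranceConfidence(const std::vector<WordConfidence> &words) {
  double sum = 0.0;
  int count = 0;
  for (const auto &w : words) {
    if (w.word == "<sil>" || w.word == "<unk>") continue;  // excluded
    sum += w.confidence;
    ++count;
  }
  return count > 0 ? static_cast<float>(sum / count) : 0.0f;
}

int main() {
  std::vector<WordConfidence> words = {
      {"when", 0.99f}, {"<sil>", 0.50f}, {"it", 0.94f}, {"<unk>", 0.30f}};
  std::cout << UtteranceConfidence(words) << "\n";  // (0.99 + 0.94) / 2
  return 0;
}
```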

@marcoyang1998
Collaborator

Please see PR #598; it returns the timestamp for every word.
It does not support showing confidences yet, but that should be easy to implement.

@EmreOzkose
Contributor

Hi, I tried to obtain confidences in Sherpa.

  1. In this line, probs are accumulated. I tried to obtain single probs instead of accumulated ones:
    a. Define a variable log_probs_single which stores only ys_log_probs.
    b. Define values_single which stores probs that are not accumulated.
    c. Add new_hyp.probs.push_back(exp(values_single[j])), where new_hyp.probs is a float vector.

  2. I also used the accumulated probs, where I only added new_hyp.probs.push_back(exp(values_acc[j])) (see the sketch after this list for the relationship between the two variants).
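
This is not the actual sherpa code, just a self-contained sketch with made-up numbers of the relationship between the two variants: case 1 keeps the probability of each token on its own, case 2 keeps the probability of the whole prefix.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // Accumulated log-probs of one hypothesis after each emitted token
  // (hypothetical values, standing in for something like ys_log_probs).
  std::vector<double> log_probs_acc = {-0.05, -0.15, -0.40, -0.45};

  std::vector<double> probs_single;  // case 1: per-token probabilities
  std::vector<double> probs_acc;     // case 2: accumulated probabilities

  double prev = 0.0;
  for (double lp : log_probs_acc) {
    probs_single.push_back(std::exp(lp - prev));  // prob of this token only
    probs_acc.push_back(std::exp(lp));            // prob of the whole prefix
    prev = lp;
  }

  for (size_t i = 0; i < log_probs_acc.size(); ++i)
    std::cout << "token " << i << ": single=" << probs_single[i]
              << " accumulated=" << probs_acc[i] << "\n";
  return 0;
}
```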

Observations:

  • In case 1, I obtained good results. Example (a sample from LibriSpeech):
token | timestamps | probs
[0] Tokens: 
 when [0]  [0.999854]
 it [7]  [0.940455]
 was [11]  [0.579752]
 over [18]  [0.556515]
 the [25]  [0.732816]
 me [31]  [0.809726]
n [33]  [0.908284]
 are [37]  [0.539135]
 s [41]  [0.98156]
ti [42]  [0.992244]
ck [45]  [0.994111]
y [48]  [0.995734]
 if [51]  [0.999497]
 he [54]  [0.997224]
 could [57]  [0.988206]
 w [62]  [0.997363]
al [64]  [0.998087]
k [66]  [0.99982]
 a [69]  [0.972448]
 little [72]  [0.9689]
 way [80]  [0.979477]
 and [86]  [0.560109]
 when [92]  [0.941646]
 di [97]  [0.862076]
ck [100]  [0.828308]
y [103]  [0.638713]
 said [107]  [0.912162]
 he [116]  [0.999628]
 could [120]  [0.811822]
 they [130]  [0.927167]
 se [138]  [0.933728]
t [141]  [0.99964]
 out [145]  [0.993249]
 in [151]  [0.994451]
 the [154]  [0.690574]
 most [157]  [0.869845]
 friend [164]  [0.979762]
ly [169]  [0.982396]
 way [174]  [0.807062]
  [181]  [0.993485]
side [182]  [0.999703]
 by [189]  [0.802229]
  [195]  [0.630429]
side [196]  [0.999677]

[0] Words: 
when [0 - 40]  [0.999854]
it [280 - 320]  [0.940455]
was [440 - 480]  [0.579752]
over [720 - 760]  [0.556515]
the [1000 - 1040]  [0.732816]
men [1240 - 1360]  [0.859005]
are [1480 - 1520]  [0.539135]
sticky [1640 - 1960]  [0.990912]
if [2040 - 2080]  [0.999497]
he [2160 - 2200]  [0.997224]
could [2280 - 2320]  [0.988206]
walk [2480 - 2680]  [0.998423]
a [2760 - 2800]  [0.972448]
little [2880 - 2920]  [0.9689]
way [3200 - 3240]  [0.979477]
and [3440 - 3480]  [0.560109]
when [3680 - 3720]  [0.941646]
dicky [3880 - 4160]  [0.776366]
said [4280 - 4320]  [0.912162]
he [4640 - 4680]  [0.999628]
could [4800 - 4840]  [0.811822]
they [5200 - 5240]  [0.927167]
set [5520 - 5680]  [0.966684]
out [5800 - 5840]  [0.993249]
in [6040 - 6080]  [0.994451]
the [6160 - 6200]  [0.690574]
most [6280 - 6320]  [0.869845]
friendly [6560 - 6800]  [0.981079]
way  [6960 - 7280]  [0.900273]
side [7280 - 7320]  [0.999703]
by  [7560 - 7840]  [0.716329]
side [7840 - 7880]  [0.999677]

[0] Text : when it was over the men are sticky if he could walk a little way and when dicky said he could they set out in the most friendly way  side by  side
[0] Confidence : 0.879481

But when I use a model that is trained for another language:

[0] Tokens: 
  [0]  [0.763373]
n [1]  [0.54338]
or [8]  [0.872327]
ve [11]  [0.291157]
s [14]  [0.546196]
 a [16]  [0.406919]
u [18]  [0.503432]
va [22]  [0.565951]
 de [28]  [0.244049]
me [31]  [0.339609]
s [39]  [0.47594]
ta [43]  [0.480307]
ki [46]  [0.987201]
 fi [51]  [0.847064]
k [54]  [0.409512]
ır [56]  [0.556805]
t [58]  [0.715683]
  [60]  [0.361864]
v [61]  [0.805749]
o [62]  [0.874659]
k [65]  [0.70021]
er [67]  [0.319827]
  [70]  [0.843885]
l [71]  [0.277047]
o [73]  [0.7871]
ba [75]  [0.844455]
y [80]  [0.641547]
 he [88]  [0.464648]
me [90]  [0.635691]
n [94]  [0.700285]
t [98]  [0.502162]
 ise [103]  [0.651219]

[0] Words: 
norves [0 - 600]  [0.563265]
auva [640 - 920]  [0.492101]
demestaki [1120 - 1880]  [0.505421]
fikırt  [2040 - 2440]  [0.578186]
voker  [2440 - 2840]  [0.708866]
lobay [2840 - 3240]  [0.637537]
hement [3520 - 3960]  [0.575696]
ise [4120 - 4160]  [0.651219]

[0] Text : norves auva demestaki fikırt  voker  lobay hement ise
[0] Confidence : 0.589036

I would expect lower probs. The problem is that the wrong model cannot capture the true tokens, but it captures some tokens that are acoustically present in the audio, and these captured tokens may come with high probs.

For simplicity, a simple example:
ref: a a a b c d b b d
hyp: a b d

The prob of the ref comes out at ~0.8 and the prob of the hyp comes out at ~0.5. I think missing tokens have to be punished in some way (see the sketch after this list).

  • In case 2, probs become very low when a word that is not in the language appears; it can be a name from another language. That is actually what I expect to obtain. However, since the probs are accumulated, the probs of the following words also come out much lower.
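
To illustrate what I mean by punishing missing tokens, here is a toy heuristic (my own illustration only, not implemented anywhere): scale the average token probability by a coverage term based on how many tokens were expected. The expected count is itself just a guess, e.g. derived from the audio length and an assumed speaking rate.

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

// Illustration only: average token probability scaled by a coverage term.
// expected_tokens is an assumed value; there is no ground truth at
// decoding time, so it can only be a rough estimate.
double PenalizedConfidence(const std::vector<double> &token_probs,
                           double expected_tokens) {
  if (token_probs.empty()) return 0.0;
  double avg = std::accumulate(token_probs.begin(), token_probs.end(), 0.0) /
               token_probs.size();
  double coverage =
      std::min(1.0, static_cast<double>(token_probs.size()) / expected_tokens);
  return avg * coverage;  // emitting fewer tokens than expected lowers the score
}

int main() {
  std::vector<double> ref = {0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8};
  std::vector<double> hyp = {0.5, 0.5, 0.5};  // only 3 of ~9 tokens emitted
  std::cout << PenalizedConfidence(ref, 9) << "\n";  // ~0.8
  std::cout << PenalizedConfidence(hyp, 9) << "\n";  // ~0.5 * 3/9 = ~0.17
  return 0;
}
```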

@csukuangfj
Collaborator

I think missing tokens have to be punished in some way.

Since we don't have the ground truth available during decoding, it is hard to know when and where we have missed some tokens.

@JinZr closed this as completed on Feb 20, 2024