Decoding with time slots and with a confidence #713

Closed
AlexandderGorodetski opened this issue Nov 28, 2022 · 5 comments

@AlexandderGorodetski

Hello guys,

Is it possible in k2 to extract the start time and end time of every decoded (hypothesized) word, along with a recognition confidence for every word and for the full utterance?

Thanks,
AlexG.

@marcoyang1998
Collaborator

We already have the option to return timestamps during decoding.
I think it is also possible to get the recognition confidence of every word. What do you mean by utterance-level confidence?

@AlexandderGorodetski
Author

What option should I use to print timestamps and the confidence of every word?

Utterance-level confidence is less important; it would be something like the average of all word confidences in the utterance, excluding silences and UNKs.
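Something like this (just a sketch; the struct, the function name, and the values are made up):

```cpp
#include <iostream>
#include <string>
#include <vector>

// A decoded word with its confidence (illustrative names only).
struct WordConfidence {
  std::string word;
  float confidence;
};

// Utterance-level confidence: average of all word confidences,
// skipping silences and <unk> tokens.
float UtteranceConfidence(const std::vector<WordConfidence> &words) {
  double sum = 0.0;
  int count = 0;
  for (const auto &w : words) {
    if (w.word == "<sil>" || w.word == "<unk>") continue;  // excluded
    sum += w.confidence;
    ++count;
  }
  return count > 0 ? static_cast<float>(sum / count) : 0.0f;
}

int main() {
  std::vector<WordConfidence> words = {
      {"when", 0.99f}, {"<sil>", 0.50f}, {"it", 0.94f}, {"<unk>", 0.30f}};
  std::cout << UtteranceConfidence(words) << "\n";  // (0.99 + 0.94) / 2
  return 0;
}
```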

@marcoyang1998
Collaborator

Please see PR #598; it returns the timestamp for every word.
It does not support showing confidences yet, but that should be easy to implement.

@EmreOzkose
Contributor

Hi, I tried to obtain confidences in Sherpa.

  1. In this line, probs are accumulated. I tried to obtain single probs instead of accumulated ones:
    a. Define a variable log_probs_single which stores only ys_log_probs.
    b. Define values_single which stores probs that are not accumulated.
    c. Add new_hyp.probs.push_back(exp(values_single[j])), where new_hyp.probs is a float vector.

  2. I also used the accumulated probs, where I only added new_hyp.probs.push_back(exp(values_acc[j])) (see the sketch after this list for the relationship between the two variants).
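
This is not the actual sherpa code, just a self-contained sketch with made-up numbers of the relationship between the two variants: case 1 keeps the probability of each token on its own, case 2 keeps the probability of the whole prefix.

```cpp
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  // Accumulated log-probs of one hypothesis after each emitted token
  // (hypothetical values, standing in for something like ys_log_probs).
  std::vector<double> log_probs_acc = {-0.05, -0.15, -0.40, -0.45};

  std::vector<double> probs_single;  // case 1: per-token probabilities
  std::vector<double> probs_acc;     // case 2: accumulated probabilities

  double prev = 0.0;
  for (double lp : log_probs_acc) {
    probs_single.push_back(std::exp(lp - prev));  // prob of this token only
    probs_acc.push_back(std::exp(lp));            // prob of the whole prefix
    prev = lp;
  }

  for (size_t i = 0; i < log_probs_acc.size(); ++i)
    std::cout << "token " << i << ": single=" << probs_single[i]
              << " accumulated=" << probs_acc[i] << "\n";
  return 0;
}
```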

Observations:

  • In case 1, I obtained good results. Example (a sample from LibriSpeech):
token | timestamps | probs
[0] Tokens: 
 when [0]  [0.999854]
 it [7]  [0.940455]
 was [11]  [0.579752]
 over [18]  [0.556515]
 the [25]  [0.732816]
 me [31]  [0.809726]
n [33]  [0.908284]
 are [37]  [0.539135]
 s [41]  [0.98156]
ti [42]  [0.992244]
ck [45]  [0.994111]
y [48]  [0.995734]
 if [51]  [0.999497]
 he [54]  [0.997224]
 could [57]  [0.988206]
 w [62]  [0.997363]
al [64]  [0.998087]
k [66]  [0.99982]
 a [69]  [0.972448]
 little [72]  [0.9689]
 way [80]  [0.979477]
 and [86]  [0.560109]
 when [92]  [0.941646]
 di [97]  [0.862076]
ck [100]  [0.828308]
y [103]  [0.638713]
 said [107]  [0.912162]
 he [116]  [0.999628]
 could [120]  [0.811822]
 they [130]  [0.927167]
 se [138]  [0.933728]
t [141]  [0.99964]
 out [145]  [0.993249]
 in [151]  [0.994451]
 the [154]  [0.690574]
 most [157]  [0.869845]
 friend [164]  [0.979762]
ly [169]  [0.982396]
 way [174]  [0.807062]
  [181]  [0.993485]
side [182]  [0.999703]
 by [189]  [0.802229]
  [195]  [0.630429]
side [196]  [0.999677]

[0] Words: 
when [0 - 40]  [0.999854]
it [280 - 320]  [0.940455]
was [440 - 480]  [0.579752]
over [720 - 760]  [0.556515]
the [1000 - 1040]  [0.732816]
men [1240 - 1360]  [0.859005]
are [1480 - 1520]  [0.539135]
sticky [1640 - 1960]  [0.990912]
if [2040 - 2080]  [0.999497]
he [2160 - 2200]  [0.997224]
could [2280 - 2320]  [0.988206]
walk [2480 - 2680]  [0.998423]
a [2760 - 2800]  [0.972448]
little [2880 - 2920]  [0.9689]
way [3200 - 3240]  [0.979477]
and [3440 - 3480]  [0.560109]
when [3680 - 3720]  [0.941646]
dicky [3880 - 4160]  [0.776366]
said [4280 - 4320]  [0.912162]
he [4640 - 4680]  [0.999628]
could [4800 - 4840]  [0.811822]
they [5200 - 5240]  [0.927167]
set [5520 - 5680]  [0.966684]
out [5800 - 5840]  [0.993249]
in [6040 - 6080]  [0.994451]
the [6160 - 6200]  [0.690574]
most [6280 - 6320]  [0.869845]
friendly [6560 - 6800]  [0.981079]
way  [6960 - 7280]  [0.900273]
side [7280 - 7320]  [0.999703]
by  [7560 - 7840]  [0.716329]
side [7840 - 7880]  [0.999677]

[0] Text : when it was over the men are sticky if he could walk a little way and when dicky said he could they set out in the most friendly way  side by  side
[0] Confidence : 0.879481

But when I use a model that is trained for another language:

[0] Tokens: 
  [0]  [0.763373]
n [1]  [0.54338]
or [8]  [0.872327]
ve [11]  [0.291157]
s [14]  [0.546196]
 a [16]  [0.406919]
u [18]  [0.503432]
va [22]  [0.565951]
 de [28]  [0.244049]
me [31]  [0.339609]
s [39]  [0.47594]
ta [43]  [0.480307]
ki [46]  [0.987201]
 fi [51]  [0.847064]
k [54]  [0.409512]
ır [56]  [0.556805]
t [58]  [0.715683]
  [60]  [0.361864]
v [61]  [0.805749]
o [62]  [0.874659]
k [65]  [0.70021]
er [67]  [0.319827]
  [70]  [0.843885]
l [71]  [0.277047]
o [73]  [0.7871]
ba [75]  [0.844455]
y [80]  [0.641547]
 he [88]  [0.464648]
me [90]  [0.635691]
n [94]  [0.700285]
t [98]  [0.502162]
 ise [103]  [0.651219]

[0] Words: 
norves [0 - 600]  [0.563265]
auva [640 - 920]  [0.492101]
demestaki [1120 - 1880]  [0.505421]
fikırt  [2040 - 2440]  [0.578186]
voker  [2440 - 2840]  [0.708866]
lobay [2840 - 3240]  [0.637537]
hement [3520 - 3960]  [0.575696]
ise [4120 - 4160]  [0.651219]

[0] Text : norves auva demestaki fikırt  voker  lobay hement ise
[0] Confidence : 0.589036

I would expect lower probs. The problem is that the wrong model cannot capture the true tokens, but it captures some tokens that are acoustically present in the audio, and these captured tokens may come with high probs.

For simplicity, a simple example:
ref: a a a b c d b b d
hyp: a b d

The prob of the ref comes out at ~0.8 and the prob of the hyp comes out at ~0.5. I think missing tokens have to be punished in some way (see the sketch after this list).

  • In case 2, probs become very low when a word that is not in the language appears; it can be a name from another language. That is actually what I expect to obtain. However, since the probs are accumulated, the probs of the following words also come out much lower.
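
To illustrate what I mean by punishing missing tokens, here is a toy heuristic (my own illustration only, not implemented anywhere): scale the average token probability by a coverage term based on how many tokens were expected. The expected count is itself just a guess, e.g. derived from the audio length and an assumed speaking rate.

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

// Illustration only: average token probability scaled by a coverage term.
// expected_tokens is an assumed value; there is no ground truth at
// decoding time, so it can only be a rough estimate.
double PenalizedConfidence(const std::vector<double> &token_probs,
                           double expected_tokens) {
  if (token_probs.empty()) return 0.0;
  double avg = std::accumulate(token_probs.begin(), token_probs.end(), 0.0) /
               token_probs.size();
  double coverage =
      std::min(1.0, static_cast<double>(token_probs.size()) / expected_tokens);
  return avg * coverage;  // emitting fewer tokens than expected lowers the score
}

int main() {
  std::vector<double> ref = {0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8};
  std::vector<double> hyp = {0.5, 0.5, 0.5};  // only 3 of ~9 tokens emitted
  std::cout << PenalizedConfidence(ref, 9) << "\n";  // ~0.8
  std::cout << PenalizedConfidence(hyp, 9) << "\n";  // ~0.5 * 3/9 = ~0.17
  return 0;
}
```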

@csukuangfj
Collaborator

I think missing tokens have to be punished in some way.

Since we don't have the ground truth available during decoding, it is hard to know when and where we have missed some tokens.

@JinZr closed this as completed on Feb 20, 2024