
p(blank symbol) >> p(non-blank symbol) during NN-CTC training #3

Open
lifelongeek opened this issue Jun 24, 2015 · 9 comments

@lifelongeek

Hi all

I want to discuss an issue I ran into while training DNN/CNN-CTC for speech recognition (Wall Street Journal corpus). I model the output units as characters.

I observed that the CTC objective function increased and eventually converged during training.
[figure: CTC objective vs. training iterations]

However, I also observed that the final NN outputs show a clear tendency, p(blank symbol) >> p(non-blank symbol), at every speech frame, as in the following figure.

[figure: per-frame output probabilities; the blank symbol dominates every frame]

In Alex Graves' paper, the trained RNN shows high p(non-blank) at certain points, as in the following figure.
[figure: framewise output activations from Graves' paper, with clear non-blank spikes]

Do you see the same behavior when you train NN-CTC for a sequence labeling problem? I suspect the reason is that I use an MLP/CNN instead of an RNN, but I can't clearly explain why that would cause this.
Any ideas about this result?

Thank you for reading my question.

@zxie
Collaborator

zxie commented Jun 24, 2015

Hmmm, we definitely observe character probabilities spiking above blank probabilities at many time steps. There is an imbalance issue, though: blanks are much more frequent than all other characters.

It's hard to say why MLP+CNN wouldn't do as well without more details (are you providing sufficient temporal context?). That said, at convergence your negative log-likelihood cost looks too high; we get < 50 using about 20 million parameters.
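
A quick way to quantify this on your trained net is to count how often the blank wins the per-frame argmax. This is only a diagnostic sketch, assuming you can dump a `(num_labels, T)` posterior matrix for an utterance and that the blank is row 0 (adjust to your label ordering):

```python
import numpy as np

def blank_fraction(probs, blank_index=0):
    """probs: (num_labels, T) per-frame posteriors for one utterance.
    Returns the fraction of frames whose most likely label is the blank."""
    best = probs.argmax(axis=0)            # most likely label at each frame
    return float(np.mean(best == blank_index))
```

Even with the blank dominating overall, a healthy CTC model should still show frames where a character row wins; if `blank_fraction` is essentially 1.0 everywhere, the net has collapsed onto the blank.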

@amaas
Owner

amaas commented Jun 24, 2015

I suspect this is due to underfitting. The network always learns first that blank has high probability all the time. It's not until later in training, with a sufficiently expressive network, that non-blank characters start showing strong spikes.


@lifelongeek
Author

Thanks for your comments.

@zxie, @amaas I use 21 frames as the context window (frame length: 25 ms, frame shift: 10 ms). The MLP architecture is 840 (40 FBANK x 21-frame context window) - 1024 - 1024 - 1024 - 31, about 3M parameters. I train with standard settings such as momentum (0.9) and weight decay (0.0005). A rough sketch of this setup is below.
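
Concretely, the per-frame forward pass looks roughly like the following. This is only a minimal numpy sketch of the architecture above (placeholder random weights, ReLU hidden units assumed), not my actual training code:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

# 840 inputs (40 FBANK x 21-frame context window) -> 3 hidden layers of 1024
# -> 31 outputs (characters plus the CTC blank), ~3M parameters.
sizes = [840, 1024, 1024, 1024, 31]
Ws = [0.01 * np.random.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

def forward(frames):
    """frames: (840, T) spliced features -> (31, T) per-frame label posteriors."""
    h = frames
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(0.0, W @ h + b)    # ReLU hidden layer (assumed activation)
    return softmax(Ws[-1] @ h + bs[-1])   # per-frame distribution over labels
```

The (31, T) output matrix is what goes into the CTC forward-backward computation.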

From your comments, it seems the trained network underfits (the average log-likelihood is not high enough). What would you suggest trying next? An MLP with more parameters? An RNN, which is more expressive for sequential data? More training iterations?

@zxie
Collaborator

zxie commented Jun 25, 2015

Your MLP gives framewise predictions, correct? Could you detail how your cost is computed w.r.t. the desired character sequence? Are you just using (T - CW) CNN-MLPs (w/ shared parameters), where T denotes the number of input frames?

@lifelongeek
Author

Yes. MLP gives framewise character predictions.

I am basically using MLP-CTC (MLP: 840 (40 FBANK x 21 CW) - 1024 - 1024 - 1024 - 31). I also tried a CNN instead of an MLP to produce the framewise predictions.

The objective function (to be maximized) is the log-likelihood of the transcription given the input, per utterance. Details are given by the following formula:

[figure: CTC objective formula]
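
For reference, this should be the standard per-utterance CTC objective from Graves et al. (I believe it matches the formula in the image):

```latex
\log p(\mathbf{l} \mid \mathbf{x})
  = \log \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} \prod_{t=1}^{T} y^{t}_{\pi_t}
```

where \mathbf{x} is the input utterance, \mathbf{l} its character transcription, y^t_k is the network's softmax output for label k at frame t, and \mathcal{B} is the many-to-one map that removes repeated labels and blanks. The training objective is the sum of this log-likelihood over all utterances.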

@amaas
Owner

amaas commented Jun 25, 2015

As a sanity check I would try increasing layer sizes to 2048 and training for longer, maybe twice as many iterations. If that network isn't much better in terms of log-likelihood and observed probabilities, it may be a problem with your optimization settings (e.g. step size or momentum too small).


@zxie
Collaborator

zxie commented Jun 25, 2015

If I'm understanding correctly, not having recurrent connections could also be an issue... it's a big ask to have each MLP produce the right prediction independently of the others, without any sequential reasoning.

@saseptim

saseptim commented Jun 8, 2016

Did you ever solve this issue? I have the same problem at the moment: the network is outputting all blanks.
Interestingly, the example given by Baidu in their warp-ctc also outputs all blanks at the beginning, and only then starts to learn.

@BancoLin

Do you scale your outputs by the label prior probabilities? The blank symbol occurs far more often than any other label. A rough sketch of what I mean is below.
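
For example, something along these lines — only a sketch, assuming `log_probs` is a `(num_labels, T)` matrix of log-posteriors from the network and `log_priors` holds log label frequencies estimated from the training alignments (all names here are illustrative):

```python
import numpy as np

def scale_by_prior(log_probs, log_priors, scale=1.0):
    """Convert posteriors to scaled likelihoods before decoding:
    log p(x_t | k) ~ log p(k | x_t) - scale * log p(k)."""
    return log_probs - scale * log_priors[:, None]
```

Dividing by the prior (here in log space) stops the very frequent blank from swamping the rarer character labels at decode time.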
