
p(blank symbol) >> p(non-blank symbol) during NN-CTC training #3

Open
lifelongeek opened this issue Jun 24, 2015 · 9 comments

@lifelongeek

Hi all

I want to discuss an issue I ran into while training DNN/CNN-CTC for speech recognition (Wall Street Journal corpus). I model the output units as characters.

I observed that the CTC objective function increased and eventually converged during training.
[figure: CTC objective vs. training iterations]

However, I also observed that the final NN outputs show a clear tendency, p(blank symbol) >> p(non-blank symbol), at every speech frame, as in the following figure.

[figure: per-frame output probabilities; the blank symbol dominates every frame]

In Alex Graves' paper, the trained RNN shows high p(non-blank) at certain points, as in the following figure.
[figure: framewise output activations from Graves' paper, with clear non-blank spikes]

Do you see the same behavior when you train NN-CTC for a sequence labeling problem? I suspect the reason is that I use an MLP/CNN instead of an RNN, but I can't clearly explain why that would cause this.
Any ideas about this result?

Thank you for reading my question.

@zxie
Collaborator

zxie commented Jun 24, 2015

Hmmm, we definitely observe character probabilities spiking above blank probabilities at many time steps. There is an imbalance issue, though: blanks are much more frequent than all other characters.

It's hard to say why MLP+CNN wouldn't do as well without more details (are you providing sufficient temporal context?). That said, at convergence your negative log-likelihood cost looks too high; we get < 50 using about 20 million parameters.
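
A quick way to quantify this on your trained net is to count how often the blank wins the per-frame argmax. This is only a diagnostic sketch, assuming you can dump a `(num_labels, T)` posterior matrix for an utterance and that the blank is row 0 (adjust to your label ordering):

```python
import numpy as np

def blank_fraction(probs, blank_index=0):
    """probs: (num_labels, T) per-frame posteriors for one utterance.
    Returns the fraction of frames whose most likely label is the blank."""
    best = probs.argmax(axis=0)            # most likely label at each frame
    return float(np.mean(best == blank_index))
```

Even with the blank dominating overall, a healthy CTC model should still show frames where a character row wins; if `blank_fraction` is essentially 1.0 everywhere, the net has collapsed onto the blank.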

@amaas
Owner

amaas commented Jun 24, 2015

I suspect this is due to underfitting. The network always learns first that blank has high probability all the time. It's not until later in training, with a sufficiently expressive network, that non-blank characters start showing strong spikes.


@lifelongeek
Author

Thanks for your comments.

@zxie, @amaas I use 21 frames as the context window (frame length: 25 ms, frame shift: 10 ms). The MLP architecture is 840 (40 FBANK x 21-frame context window) - 1024 - 1024 - 1024 - 31, about 3M parameters. I train with standard settings such as momentum (0.9) and weight decay (0.0005). A rough sketch of this setup is below.
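
Concretely, the per-frame forward pass looks roughly like the following. This is only a minimal numpy sketch of the architecture above (placeholder random weights, ReLU hidden units assumed), not my actual training code:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

# 840 inputs (40 FBANK x 21-frame context window) -> 3 hidden layers of 1024
# -> 31 outputs (characters plus the CTC blank), ~3M parameters.
sizes = [840, 1024, 1024, 1024, 31]
Ws = [0.01 * np.random.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((n_out, 1)) for n_out in sizes[1:]]

def forward(frames):
    """frames: (840, T) spliced features -> (31, T) per-frame label posteriors."""
    h = frames
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(0.0, W @ h + b)    # ReLU hidden layer (assumed activation)
    return softmax(Ws[-1] @ h + bs[-1])   # per-frame distribution over labels
```

The (31, T) output matrix is what goes into the CTC forward-backward computation.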

From your comments, it seems the trained network underfits (the average log-likelihood is not high enough). What would you suggest trying next? An MLP with more parameters? An RNN, which is more expressive for sequential data? More training iterations?

@zxie
Collaborator

zxie commented Jun 25, 2015

Your MLP gives framewise predictions, correct? Could you detail how your cost is computed w.r.t. the desired character sequence? Are you just using (T - CW) CNN-MLPs (w/ shared parameters), where T denotes the number of input frames?

@lifelongeek
Author

Yes. MLP gives framewise character predictions.

I am basically using MLP-CTC (MLP: 840 (40 FBANK x 21 CW) - 1024 - 1024 - 1024 - 31). I also tried a CNN instead of an MLP to produce the framewise predictions.

The objective function (to be maximized) is the log-likelihood of the transcription given the input, per utterance. Details are given by the following formula:

[figure: CTC objective formula]
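
For reference, this should be the standard per-utterance CTC objective from Graves et al. (I believe it matches the formula in the image):

```latex
\log p(\mathbf{l} \mid \mathbf{x})
  = \log \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{l})} \prod_{t=1}^{T} y^{t}_{\pi_t}
```

where \mathbf{x} is the input utterance, \mathbf{l} its character transcription, y^t_k is the network's softmax output for label k at frame t, and \mathcal{B} is the many-to-one map that removes repeated labels and blanks. The training objective is the sum of this log-likelihood over all utterances.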

@amaas
Owner

amaas commented Jun 25, 2015

As a sanity check I would try increasing layer sizes to 2048 and training for longer, maybe twice as many iterations. If that network isn't much better in terms of log-likelihood and observed probabilities, it may be a problem with your optimization settings (e.g. step size or momentum too small).


@zxie
Collaborator

zxie commented Jun 25, 2015

If I'm understanding correctly, not having recurrent connections could also be an issue... it's a big ask to have each MLP produce the right prediction independently of the others, without any sequential reasoning.

@saseptim

saseptim commented Jun 8, 2016

Did you ever solve this issue? I have the same problem at the moment: the network is outputting all blanks.
Interestingly, the example given by Baidu in their warp-ctc also outputs all blanks at the beginning, and only then starts to learn.

@BancoLin

Do you scale your outputs by the label prior probabilities? The blank symbol occurs far more often than any other label. A rough sketch of what I mean is below.
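
For example, something along these lines — only a sketch, assuming `log_probs` is a `(num_labels, T)` matrix of log-posteriors from the network and `log_priors` holds log label frequencies estimated from the training alignments (all names here are illustrative):

```python
import numpy as np

def scale_by_prior(log_probs, log_priors, scale=1.0):
    """Convert posteriors to scaled likelihoods before decoding:
    log p(x_t | k) ~ log p(k | x_t) - scale * log p(k)."""
    return log_probs - scale * log_priors[:, None]
```

Dividing by the prior (here in log space) stops the very frequent blank from swamping the rarer character labels at decode time.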
