performance of unregularized LSTM and its configuration #1

Open · Sunnydreamrain opened this issue Feb 17, 2017 · 2 comments

@Sunnydreamrain

Hi,

I have been trying to reproduce the word-level PTB experiment with the unregularized LSTM. As reported in your paper, it reaches a test perplexity of 114.5. However, I have tried many times and only get results around 138, so I just want to check the configuration you used.

Hidden units: 200, 256, or 1500?
Layers: 1 or 2?
Training method: SGD (lr=1) or Adam (lr=1)?
Weight initialization: uniform in [-0.04, 0.04] or in [-0.1, 0.1]?
Dropout: none of any kind?
seq_len: 20 or 35?
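
For concreteness, the configuration I have been assuming is sketched below. It follows Zaremba et al.'s small, unregularized setup as I understand it; the exact values are my assumption, not something taken from your paper or code.

```python
# My assumed unregularized word-level PTB configuration (not confirmed by the paper):
config = {
    "hidden_units": 200,              # per layer
    "num_layers": 2,
    "optimizer": "sgd",               # plain SGD
    "learning_rate": 1.0,
    "lr_decay": 0.5,                  # decay the learning rate after the first few epochs
    "init": ("uniform", -0.1, 0.1),   # uniform weight initialization
    "dropout": None,                  # no dropout for the unregularized baseline
    "seq_len": 20,
    "batch_size": 20,
    "max_grad_norm": 5,               # gradient-norm clipping
}
```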

It has been driving me crazy not being able to reproduce this baseline result. I would appreciate your help.
Thanks.

@teganmaharaj
Owner

teganmaharaj commented Feb 18, 2017 via email

@Sunnydreamrain
Author

Sunnydreamrain commented Feb 19, 2017

Thanks for the reply.
In your paper, did you use this configuration for the unregularized LSTM? I ask because in Zaremba et al.'s paper, the unregularized LSTM used 2 layers with 200 units in each layer.

I just noticed something weird in your code.
in https://github.com/teganmaharaj/zoneout/blob/master/zoneout_word_ptb.py#L327
and https://github.com/teganmaharaj/zoneout/blob/master/zoneout_word_ptb.py#L354
Before every LSTM there is a linear layer. I understand that the first LSTM layer should be preceded by an embedding layer, but there should be no linear layer between LSTM layers. Is this right?
Also, the embedding layer should have no bias parameters.
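
For reference, this is a minimal Lasagne sketch of the stacking I would expect (names and sizes here are illustrative; they are not taken from zoneout_word_ptb.py):

```python
import theano.tensor as T
import lasagne
from lasagne.layers import (InputLayer, EmbeddingLayer, LSTMLayer,
                            ReshapeLayer, DenseLayer)

vocab_size, embed_dim, hidden_units = 10000, 200, 200
batch_size, seq_len = 20, 20

# Integer word ids, shape (batch, seq_len)
l_in = InputLayer(shape=(batch_size, seq_len), input_var=T.imatrix("words"))

# Embedding lookup: a plain table, with no bias term
l_emb = EmbeddingLayer(l_in, input_size=vocab_size, output_size=embed_dim)

# Stacked LSTMs feed directly into each other -- no linear layer in between
l_lstm1 = LSTMLayer(l_emb, num_units=hidden_units)
l_lstm2 = LSTMLayer(l_lstm1, num_units=hidden_units)

# Per-time-step softmax over the vocabulary
l_flat = ReshapeLayer(l_lstm2, (-1, hidden_units))
l_out = DenseLayer(l_flat, num_units=vocab_size,
                   nonlinearity=lasagne.nonlinearities.softmax)
```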

"We based our code on a tensorflow implementation of Zaremba's original results. What code are you using?"
The data loading is based on the TensorFlow implementation; the LSTM models are built with Lasagne.

"Are you doing the lr schedule?"
Yes

"Are you doing truncated BPTT?"
Since the sequence is split into batches of fixed length (20 or 35), explicit truncated BPTT should not matter: gradients only propagate through one batch anyway.
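
To be explicit about what I mean, the data iterator follows the usual PTB batching scheme, roughly as below (a sketch with my own variable names), so gradients are implicitly truncated at the chunk boundary:

```python
import numpy as np

def ptb_batches(word_ids, batch_size=20, seq_len=20):
    """Reshape the corpus into batch_size parallel streams and yield
    consecutive (inputs, targets) chunks of length seq_len."""
    data = np.asarray(word_ids, dtype="int32")
    stream_len = len(data) // batch_size
    data = data[:batch_size * stream_len].reshape(batch_size, stream_len)
    for i in range(0, stream_len - seq_len, seq_len):
        x = data[:, i:i + seq_len]
        y = data[:, i + 1:i + 1 + seq_len]  # targets = inputs shifted by one word
        yield x, y
```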

"They/we do (copying final hidden states to initial hidden states) - when we first implemented in theano we didn't do this, and it seems to make a big difference."
Yes. I keep the final hidden and cell states and use them as the initial states for the next batch.
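
Concretely, my training loop looks roughly like this, where `train_step` is a hypothetical compiled function that returns the batch loss and the final states:

```python
import numpy as np

def run_epoch(train_step, batches, num_layers=2, hidden_units=200, batch_size=20):
    # Zero initial states at the start of the epoch
    h = np.zeros((num_layers, batch_size, hidden_units), dtype="float32")
    c = np.zeros((num_layers, batch_size, hidden_units), dtype="float32")
    losses = []
    for x, y in batches:
        # train_step (hypothetical) returns the mean per-word cross-entropy for
        # this chunk plus the final hidden/cell states, which seed the next chunk.
        loss, h, c = train_step(x, y, h, c)
        losses.append(loss)
    return np.exp(np.mean(losses))  # perplexity (all chunks have equal length)
```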
