performance of unregularized LSTM and its configuration #1

Open · Sunnydreamrain opened this issue Feb 17, 2017 · 2 comments

@Sunnydreamrain

Hi,

I have been trying to reproduce the word-level PTB experiment with the unregularized LSTM. As reported in your paper, it reaches a test perplexity of 114.5. However, I have tried many times and only get results around 138, so I just want to check the configuration you used.

Hidden units: 200, 256, or 1500?
Layers: 1 or 2?
Training method: SGD (lr=1) or Adam (lr=1)?
Weight initialization: uniform in [-0.04, 0.04] or in [-0.1, 0.1]?
Dropout: none of any kind?
seq_len: 20 or 35?
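
For concreteness, the configuration I have been assuming is sketched below. It follows Zaremba et al.'s small, unregularized setup as I understand it; the exact values are my assumption, not something taken from your paper or code.

```python
# My assumed unregularized word-level PTB configuration (not confirmed by the paper):
config = {
    "hidden_units": 200,              # per layer
    "num_layers": 2,
    "optimizer": "sgd",               # plain SGD
    "learning_rate": 1.0,
    "lr_decay": 0.5,                  # decay the learning rate after the first few epochs
    "init": ("uniform", -0.1, 0.1),   # uniform weight initialization
    "dropout": None,                  # no dropout for the unregularized baseline
    "seq_len": 20,
    "batch_size": 20,
    "max_grad_norm": 5,               # gradient-norm clipping
}
```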

It has been driving me crazy not being able to reproduce this baseline result. I would appreciate your help.
Thanks.

@teganmaharaj
Owner

teganmaharaj commented Feb 18, 2017 via email

@Sunnydreamrain
Author

Sunnydreamrain commented Feb 19, 2017

Thanks for the reply.
In your paper, did you use this configuration for the unregularized LSTM? I ask because in Zaremba et al.'s paper, the unregularized LSTM used 2 layers with 200 units in each layer.

I just noticed something weird in your code.
in https://github.com/teganmaharaj/zoneout/blob/master/zoneout_word_ptb.py#L327
and https://github.com/teganmaharaj/zoneout/blob/master/zoneout_word_ptb.py#L354
Before every LSTM there is a linear layer. I understand that the first LSTM layer should be preceded by an embedding layer, but there should be no linear layer between LSTM layers. Is this right?
Also, the embedding layer should have no bias parameters.
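
For reference, this is a minimal Lasagne sketch of the stacking I would expect (names and sizes here are illustrative; they are not taken from zoneout_word_ptb.py):

```python
import theano.tensor as T
import lasagne
from lasagne.layers import (InputLayer, EmbeddingLayer, LSTMLayer,
                            ReshapeLayer, DenseLayer)

vocab_size, embed_dim, hidden_units = 10000, 200, 200
batch_size, seq_len = 20, 20

# Integer word ids, shape (batch, seq_len)
l_in = InputLayer(shape=(batch_size, seq_len), input_var=T.imatrix("words"))

# Embedding lookup: a plain table, with no bias term
l_emb = EmbeddingLayer(l_in, input_size=vocab_size, output_size=embed_dim)

# Stacked LSTMs feed directly into each other -- no linear layer in between
l_lstm1 = LSTMLayer(l_emb, num_units=hidden_units)
l_lstm2 = LSTMLayer(l_lstm1, num_units=hidden_units)

# Per-time-step softmax over the vocabulary
l_flat = ReshapeLayer(l_lstm2, (-1, hidden_units))
l_out = DenseLayer(l_flat, num_units=vocab_size,
                   nonlinearity=lasagne.nonlinearities.softmax)
```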

"We based our code on a tensorflow implementation of Zaremba's original results. What code are you using?"
The data loading is based on the TensorFlow implementation; the LSTM models are built with Lasagne.

"Are you doing the lr schedule?"
Yes

"Are you doing truncated BPTT?"
Since the sequence is split into batches of fixed length (20 or 35), explicit truncated BPTT should not matter: gradients only propagate through one batch anyway.
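
To be explicit about what I mean, the data iterator follows the usual PTB batching scheme, roughly as below (a sketch with my own variable names), so gradients are implicitly truncated at the chunk boundary:

```python
import numpy as np

def ptb_batches(word_ids, batch_size=20, seq_len=20):
    """Reshape the corpus into batch_size parallel streams and yield
    consecutive (inputs, targets) chunks of length seq_len."""
    data = np.asarray(word_ids, dtype="int32")
    stream_len = len(data) // batch_size
    data = data[:batch_size * stream_len].reshape(batch_size, stream_len)
    for i in range(0, stream_len - seq_len, seq_len):
        x = data[:, i:i + seq_len]
        y = data[:, i + 1:i + 1 + seq_len]  # targets = inputs shifted by one word
        yield x, y
```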

"They/we do (copying final hidden states to initial hidden states) - when we first implemented in theano we didn't do this, and it seems to make a big difference."
Yes. I keep the final hidden and cell states and use them as the initial states for the next batch.
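
Concretely, my training loop looks roughly like this, where `train_step` is a hypothetical compiled function that returns the batch loss and the final states:

```python
import numpy as np

def run_epoch(train_step, batches, num_layers=2, hidden_units=200, batch_size=20):
    # Zero initial states at the start of the epoch
    h = np.zeros((num_layers, batch_size, hidden_units), dtype="float32")
    c = np.zeros((num_layers, batch_size, hidden_units), dtype="float32")
    losses = []
    for x, y in batches:
        # train_step (hypothetical) returns the mean per-word cross-entropy for
        # this chunk plus the final hidden/cell states, which seed the next chunk.
        loss, h, c = train_step(x, y, h, c)
        losses.append(loss)
    return np.exp(np.mean(losses))  # perplexity (all chunks have equal length)
```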
