performance of unregularized LSTM and its configuration #1
Hey, here is the config for the large model (zoneout probs are 0, so it's just an LSTM):
class LargeConfig(object):
    """Large config."""
    init_scale = 0.04
    learning_rate = 1.0
    max_grad_norm = 10
    num_layers = 2
    num_steps = 35
    hidden_size = 1500
    max_epoch = 14
    max_max_epoch = 55
    keep_prob = 0.35
    lr_decay = 1 / 1.15
    batch_size = 20
    vocab_size = 10000
    zoneout_c_keep_prob = 0.0
    zoneout_h_keep_prob = 0.0
    weight_decay = 1e-7
    optimizer = "sgd"
The settings are replicated from Zaremba et al.'s paper (https://arxiv.org/pdf/1409.2329.pdf) and described in ours (https://arxiv.org/pdf/1606.01305.pdf).
We based our code on a TensorFlow implementation of Zaremba's original results. What code are you using? Are you doing the lr schedule? Are you doing truncated BPTT? They/we do (copying the final hidden states into the initial hidden states of the next batch) - when we first implemented this in Theano we didn't do it, and it seems to make a big difference.
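In case it's useful, here is a rough sketch of that state carry-over, in the style of the TensorFlow PTB tutorial (a fragment of a run_epoch-style loop; the attribute and reader names are illustrative and may not match this repo exactly):

# Assumes `session`, `model`, `reader`, `train_data`, and `config` already
# exist; `model.initial_state`, `model.final_state`, `model.cost`,
# `model.train_op`, and `reader.ptb_iterator` follow the TF PTB tutorial's
# naming and are approximations here.
state = session.run(model.initial_state)  # zero state at the start of the epoch
for x, y in reader.ptb_iterator(train_data, config.batch_size, config.num_steps):
    cost, state, _ = session.run(
        [model.cost, model.final_state, model.train_op],
        feed_dict={model.input_data: x,
                   model.targets: y,
                   model.initial_state: state})  # feed last segment's final state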
Let me know if you have any other questions!
On Thu, Feb 16, 2017 at 10:19 PM, Sunnydreamrain wrote:
Hi,
I have been trying to repeat the experiment on word-level PTB using an unregularized LSTM. As reported in your paper, it gets to 114.5 on the test set. However, I have tried many times and only get results around 138. I just want to check with you on the configuration you used.
Hidden units: 200, 256, or 1500?
Layers: 1 or 2?
Training method: SGD (lr=1) or Adam (lr=1)?
Weight initialization: uniform in [-0.04, 0.04] or [-0.1, 0.1]?
Dropout: no dropout of any kind used?
seq_len: 20 or 35?
It has been driving me crazy not being able to get these basic results. I would appreciate your help.
Thanks.
Thanks for the reply. I just noticed something weird in your code.
"We based our code on a TensorFlow implementation of Zaremba's original results. What code are you using?"
"Are you doing the lr schedule?"
"Are you doing truncated BPTT?"
"They/we do (copying the final hidden states into the initial hidden states of the next batch) - when we first implemented this in Theano we didn't do it, and it seems to make a big difference."