
compare with the code in the original paper #13

Closed
wq343580510 opened this issue Sep 15, 2017 · 4 comments

Comments

@wq343580510

> The difference might be because of differences in the way the word embeddings are updated.

Can you make it more specific?

@ryh95

ryh95 commented Oct 24, 2017

I'm confused about why updating the word embeddings would cause a performance decrease; maybe a comparison experiment should be run against the original Lua code.

@dasguptar
Owner

Hi @wq343580510 and @ryh95

I got stuck with some other work and did not have time to reply to you both!

So one of the major differences is that in the original implementation, the embeddings are updated using plain SGD, while the rest of the model is updated using the optimizer. In my implementation, however, I use the same optimizer to update the entire model, embeddings included. This can lead to slight differences across batches and epochs, and hence to different performance.

In the original implementation, this line updates word embeddings, while this line updates the remaining parameters.
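
To make the distinction concrete, here is a minimal sketch of the two schemes. The Adagrad optimizer, the learning rates, and the `nn.Linear` standing in for the Tree-LSTM are illustrative assumptions, not the actual code from either repository:

```python
# Minimal sketch (not the repo's actual code) contrasting the two update
# schemes; the optimizer choice and learning rates are illustrative.
import torch
import torch.nn as nn

emb = nn.Embedding(100, 30)    # word embedding table
model = nn.Linear(30, 1)       # stand-in for the Tree-LSTM + classifier

# Scheme 1 (original Lua/Torch code): the optimizer only sees the model
# parameters; the embeddings get a plain SGD update applied by hand.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)
emb_lr = 0.1                   # illustrative value

ids = torch.tensor([1, 2, 3])
loss = model(emb(ids)).sum()

optimizer.zero_grad()
emb.zero_grad()
loss.backward()
optimizer.step()               # optimizer step for the model parameters
with torch.no_grad():
    emb.weight -= emb_lr * emb.weight.grad  # plain SGD step for embeddings

# Scheme 2 (this repo): one optimizer updates everything, embeddings
# included, so the embeddings also receive the optimizer's adaptive scaling.
optimizer_all = torch.optim.Adagrad(
    list(model.parameters()) + list(emb.parameters()), lr=0.05)
```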

Of course, other minor differences might still be present, such as differences in weight initialization. I have tried to use the hyperparameters as specified in the paper, but it is quite possible I missed something! I was trying to implement this model to learn more about dynamic neural networks in PyTorch, and the minor gap in results seemed acceptable to me at the time, so I never investigated further.

@dasguptar
Owner

dasguptar commented Nov 28, 2017

Hi @wq343580510 and @ryh95 ,

The original implementation deals with word embeddings by updating them separately using plain SGD, as demonstrated by the lines linked in my earlier comment.

However, as pointed out in issue #17, the text of the original paper mentions freezing the word embeddings, which I have since incorporated: this commit adds the option of freezing the word embeddings during training. This results in a slight improvement in the metrics, and we can now reach a Pearson's coefficient of 0.8674 and an MSE of 0.2536.
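
For illustration, freezing might look something like the following minimal sketch (an assumption about the approach, not the commit's exact code):

```python
# Hypothetical sketch of freezing the embedding table so that only the
# rest of the model is trained; names and values are illustrative.
import torch
import torch.nn as nn

emb = nn.Embedding(100, 30)
emb.weight.requires_grad_(False)  # freeze: no gradients flow into the table

model = nn.Linear(30, 1)          # stand-in for the Tree-LSTM

# Hand the optimizer only the parameters that still require gradients.
optimizer = torch.optim.Adagrad(
    [p for p in model.parameters() if p.requires_grad], lr=0.05)
```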

Since we are now within ~0.0005 of the original paper, albeit with a different learning rate, I don't think I will spend any more time optimizing this. However, contributions (PRs/bug fixes) are always welcome and will be merged. 🙂

@jeenakk

jeenakk commented Jun 23, 2019

Can you please share the arguments namespace with which you got a Pearson's coefficient of 0.8674 and an MSE of 0.2536? In how many epochs did you achieve this? Was the random seed '123'? Sorry, I am unable to reproduce your exact result with this code. Any help would be appreciated! Thank you in advance.
