
Support pretrained word2vec model when training doc2vec #2703

Closed
maohbao wants to merge 27 commits

Conversation

@maohbao commented Dec 17, 2019

The default doc2vec model in gensim doesn't support initializing word vectors from a pretrained word2vec model. But according to Jey Han Lau and Timothy Baldwin's paper, An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation (2016), using a pretrained word2vec model usually gives better results on NLP tasks. The authors also released a forked gensim version that supports pretrained embeddings, but it is based on a very old gensim release and can't be used with gensim 3.8.

I have edited two files to support pretrained word embeddings for doc2vec.
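A sketch of the proposed usage, adapted from this PR's README snippet (the pretrained_emb keyword is the option this PR adds; the file name and toy corpus are placeholders):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus in the TaggedDocument format doc2vec expects.
    corpus_train = [
        TaggedDocument(words=['machine', 'learning', 'with', 'gensim'], tags=['doc0']),
        TaggedDocument(words=['pretrained', 'word', 'vectors', 'often', 'help'], tags=['doc1']),
    ]

    # 'word2vec_pretrained.txt' stands in for a word2vec model saved in the
    # C text format, e.g. via wv.save_word2vec_format(..., binary=False).
    model = Doc2Vec(
        corpus_train,
        vector_size=100,  # must match the pretrained vectors' dimensionality
        min_count=1,
        pretrained_emb='word2vec_pretrained.txt',  # new keyword from this PR
    )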

@mpenkov (Collaborator) left a comment

Thank you for your contribution. Left you some comments.

Please also add unit tests for your new functionality.

Let me know when you're ready for another review.
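A hypothetical sketch of such a unit test (the pretrained_emb keyword and the checked behaviour are assumptions about this PR, not established gensim API; written against gensim 3.8):

    import unittest
    from gensim.models import Word2Vec
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    class TestPretrainedEmbeddings(unittest.TestCase):
        def test_doc2vec_accepts_pretrained_word_vectors(self):
            texts = [['human', 'interface', 'computer', 'survey']] * 20
            # Build a tiny word2vec model and dump it in C text format.
            base = Word2Vec(texts, size=10, min_count=1, iter=1)
            base.wv.save_word2vec_format('pretrained.txt', binary=False)
            corpus = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]
            model = Doc2Vec(corpus, vector_size=10, min_count=1, epochs=1,
                            pretrained_emb='pretrained.txt')
            # At minimum, shared words must keep the pretrained dimensionality.
            self.assertEqual(model.wv['human'].shape, base.wv['human'].shape)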

@@ -1,177 +1,61 @@
gensim – Topic Modelling in Python
doc2vec in gensim – support pretrained word2vec
@mpenkov (Collaborator):

I think instead of replacing the top-level README.md file, you should put this documentation somewhere else. Ideally, it should be in a tutorial or a howto.

See https://radimrehurek.com/gensim/auto_examples/howtos/run_doc.html#sphx-glr-auto-examples-howtos-run-doc-py for more info.

@maohbao (Author):

This is my first time making a PR on GitHub. I have already followed your advice on word2vec.py and doc2vec.py. I also understand that I should not change the top-level README.md file, but I really don't know how or where to write the documentation. Thank you for any further advice!

[OpenBLAS]: http://xianyi.github.io/OpenBLAS/
[source tar.gz]: http://pypi.python.org/pypi/gensim
[documentation]: http://radimrehurek.com/gensim/install.html
This is a forked gensim version, which edits the default doc2vec model to support pretrained word2vec during training doc2vec. It forked from gensim 3.8.
@mpenkov (Collaborator):

Write your documentation so that it's useful from the point of view of the reader.

"This is a forked gensim version" is not relevant to the user. Furthermore, it becomes misleading the moment we actually merge this PR.

[documentation]: http://radimrehurek.com/gensim/install.html
This is a forked gensim version, which edits the default doc2vec model to support pretrained word2vec during training doc2vec. It forked from gensim 3.8.

The default doc2vec model in gensim does't support pretrained word2vec model. But according to Jey Han Lau and Timothy Baldwin's paper, [An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation(2016)](https://arxiv.org/abs/1607.05368), using pretrained word2vec model usually gets better results in NLP tasks. The author also released a [forked gensim verstion](https://github.com/jhlau/gensim) to perform pretrained embeddings, but it is from a very old gensim version, which can't be used in gensim 3.8(the latest gensim version when I release this fork).
@mpenkov (Collaborator):

This is also irrelevant. This kind of information is good inside the PR, as motivation and background (it may already be there).

> pretrained_emb = "word2vec_pretrained.txt" # This is a pretrained word2vec model of C text format
>
> model = gensim.models.doc2vec.Doc2Vec(
corpus_train, # This is the documents corpus to be trained which should meet gensim's format
@mpenkov (Collaborator):

Hanging indent please.
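For example, the quoted snippet with a hanging indent (a sketch; the closing line completes the truncated fragment):

    pretrained_emb = "word2vec_pretrained.txt"  # a pretrained word2vec model in C text format

    model = gensim.models.doc2vec.Doc2Vec(
        corpus_train,  # the training corpus, in the format gensim expects
        pretrained_emb=pretrained_emb,
    )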

Further line comments (since resolved, some outdated) were left on gensim/models/doc2vec.py and gensim/models/word2vec.py.
@mpenkov (Collaborator) commented Dec 21, 2019

@gojomo Just saw your comments here: #1338

What do you think about this PR?

@gojomo (Collaborator) commented Dec 24, 2019

I don't think an added __init__() option is the best way to achieve this.

I also think people who've read the Lau & Baldwin paper will have unrealistic expectations about the benefits of this approach, because the paper is a bit confused & contradictory in its analysis. (My comments on a prior issue outline some of the reasons.)

Unfortunately, the deeply misguided #1777 refactoring made experiments in this direction more difficult, by breaking two previous ways users could hook into and modify a model's vocabulary-initialization. (The intersect_word2vec_format() method stopped working in some models, and the prior decomposition of build_vocab() into 3 substeps – scan, prepare, finalize – was muddled.)

I also hope that a side-effect of re-factoring (& #1777-rollback) work I'm exploring in #2698 will eventually be to serve these kinds of needs in a more flexible & logical way. So, I'd not want this integrated until that work is further along. (Some of the classes this PR modifies, like the '_Trainables', should be going away.)
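For reference, the pre-#1777 workaround gojomo alludes to looked roughly like this (a sketch against the gensim 3.x Word2Vec API; the corpus and file name are placeholders, and, as noted above, the method no longer works in every model):

    from gensim.models import Word2Vec

    corpus = [['human', 'interface', 'computer'], ['survey', 'of', 'user', 'interfaces']]

    model = Word2Vec(size=300, min_count=1)  # size must match the pretrained vectors
    model.build_vocab(corpus)
    # Overwrite vectors of in-vocabulary words with pretrained ones;
    # lockf=1.0 lets them keep training, lockf=0.0 freezes them.
    model.intersect_word2vec_format('pretrained.txt', binary=False, lockf=1.0)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)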

@mpenkov (Collaborator) commented Dec 29, 2019

OK, so it sounds like the best action at the moment is to close this PR and wait for a more appropriate time to work on it. @gojomo Do you agree?

@maohbao (Author) commented Jan 8, 2020 via email

@maohbao closed this Jan 8, 2020