-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrapper for Varembed Models #1067
Conversation
I've added wrapper to load varembed model into gensim and allowing keyedvectors functionalities. |
5b78e5d
to
6034e68
Compare
There seems to be a small issue specifically in the morfessor package. I've submitted a PR aalto-speech/morfessor#6 for the same. We'll probably have to wait to get it merged or find some other fix to get the code running in Python 2.6. Any suggestions? |
@tmylk Apart from the issue in Python 2.6, which I've already discussed. The code right now fails in other versions as well, which seems to me a problem in circular dependency in imports in morfessor. Though the tests run fine on my personal machine, it fails on Travis, which is something I don't understand. Any ideas, what's going wrong? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More tests, KeyedVectors and other small improvements needed.
gensim/models/wrappers/varembed.py
Outdated
""" | ||
Load the input-hidden weight matrix from the fast text output files. | ||
|
||
Note that due to limitations in the FastText API, you cannot continue training |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please correct Docstring to be about varembed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
gensim/models/wrappers/varembed.py
Outdated
logger = logging.getLogger(__name__) | ||
|
||
|
||
class VarEmbed(Word2Vec): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please subclass KeyedVectors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
gensim/test/test_varembed_wrapper.py
Outdated
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | ||
|
||
""" | ||
Automated tests for checking transformation algorithms (the models package). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change to VarEmbed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Had missed this. Thanks. Done.
gensim/test/test_varembed_wrapper.py
Outdated
"""Test ensembling of Morhpeme Embeddings""" | ||
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file, | ||
morfessor_model=varembed_model_morfessor_file, use_morphemes=True) | ||
self.model_sanity(model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please test that syn0 is different compared to non-morpheme model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, added now.
gensim/models/wrappers/varembed.py
Outdated
if use_morphemes: | ||
try: | ||
import morfessor | ||
morfessor_model = morfessor.MorfessorIO().read_binary_model_file(morfessor_model) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you raise an issue in varembed github as heads up that read_binary_model_file
will be deprecated by morfessor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Sounds good. I've put up an issue rguthrie3/MorphologicalPriorsForWordEmbeddings#3 to notify them about the new release on morfessor as well.
…sor version in travis script
…support is only provided in python 2.7 and above. Also added more comments
c0ce65b
to
ef17a12
Compare
ef17a12
to
bf57058
Compare
gensim/models/wrappers/varembed.py
Outdated
word_embeddings = D['word_embeddings'] | ||
morpho_embeddings = D['morpheme_embeddings'] | ||
result.load_word_embeddings(word_embeddings, word_to_ix) | ||
if use_morphemes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just if morfessor_model
is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sounds good. Done.
gensim/models/wrappers/varembed.py
Outdated
logger.info("Loaded matrix of %d size and %d dimensions", self.vocab_size, self.vector_size) | ||
|
||
|
||
def ensemble_morpheme_embeddings(self, morfessor_model, morpho_embeddings, morpho_to_ix): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add_morphemes_to_word_embeddings
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed method name now.
4f5e359
to
3095620
Compare
Will there be a tutorial ipynb? |
Yes, I'll just add one. |
@tmylk I've now added a tutorial on varembed model as well. Please review once, and I feel we could merge it then. |
gensim/test/test_varembed_wrapper.py
Outdated
Test only in Python 2.7 and above. Add Morphemes is not supported in earlier versions. | ||
""" | ||
model = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file) | ||
model_with_morphemes = varembed.VarEmbed.load_varembed_format(vectors=varembed_model_vector_file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style: hanging indent please (not vertical indent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
gensim/test/test_varembed_wrapper.py
Outdated
|
||
@unittest.skipUnless(sys.version_info < (2, 7), 'Test to check throwing exception in Python 2.6 and earlier') | ||
def testAddMorphemesThrowsExceptionInPython26(self): | ||
self.assertRaises(Exception, varembed.VarEmbed.load_varembed_format, vectors=varembed_model_vector_file, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging indent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed now.
gensim/models/wrappers/varembed.py
Outdated
|
||
This module allows ability to obtain word vectors for out-of-vocabulary words, for the Varembed model[2]. | ||
|
||
The wrapped model can NOT be updated with new documents for online training -- use gensim's `Word2Vec` for that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean that VarEmbed gensim wrapper doesn't support it? Also, someone might be confused that you are suggesting to load varembed and then train it as Word2Vec on new words which is incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, you are right. It could be bit confusing earlier. Updated it now.
Thanks for adding the wrapper! It will be part of this week's release. Let's add a benchmark notebook and a blog in another PR. Will publicize when that is ready. |
Cool! Great! That's Awesome! :D |
This is an effort to integrate Varembed word-embedding Model into gensim.
To train varembed models, one can use the code put up by @rguthrie3 here.
Presently, the motive is to provide wrapper for loading varembed models with gensim and not training the word embeddings.
The first draft is based on Wordrank wrapper by @parulsethi
TODOs