-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make word2vec2tensor
script compatible with python3
#2147
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for PR @vsocrates, please continue 👍 you are on right way!
gensim/scripts/word2vec2tensor.py
Outdated
@@ -63,16 +67,15 @@ def word2vec2tensor(word2vec_model_path, tensor_filename, binary=False): | |||
True if input file in binary format. | |||
|
|||
""" | |||
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary) | |||
model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_model_path, binary=binary, datatype=np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why np.float64
instead of np.float32
(that's default parameter)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason, when the word2vec model is loaded, if we leave the default, using the test data file word2vec_pre_kv_c
I run into parsing issues. For instance, for the word the
, I get array([-0.56110603, -1.97569799, 1.66395497, -1.23224604, 0.75475103, 0.98576403, 2.26144099, -0.59829003, -0.47433099, -1.41610503], dtype=float32)
instead of -0.561106 -1.975698 1.663955 -1.232246 0.754751 0.985764 2.261441 -0.598290 -0.474331 -1.416105
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And what's a problem here? I still don't catch, sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, should have been more clear. In the first dimension, as an example, the original word2vec model has -0.561106
but for some reason, without the np.float64
when it is read in, it displays -0.56110603
, so the test fails equality. I'm not sure on the reason, though I assume something to do with the load_word2vec_format
internals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's assertion you mean? can you link concrete line of code?
Also, you can fix assertion (slightly relax "almost equal" condition), but anyway, you shouldn't change dtype here.
gensim/test/test_scripts.py
Outdated
class TestWord2Vec2Tensor(unittest.TestCase): | ||
def setUp(self): | ||
self.datapath = datapath('word2vec_pre_kv_c') | ||
self.output_folder = get_tmpfile('') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to add an explicit name (for example w2v2t_test
)
gensim/test/test_scripts.py
Outdated
def testConversion(self): | ||
word2vec2tensor(word2vec_model_path=self.datapath, tensor_filename=self.output_folder) | ||
|
||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need try/except in current test. if exception raised - test failed, this is expected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without try/except, should we give the developers more feedback and which part of test went wrong? If so, how would I do that without the except block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in the current case, stack trace (if something goes wrong) have enough information in current tests.
gensim/test/test_scripts.py
Outdated
first_line = f.readline().strip() | ||
|
||
number_words, vector_size = map(int, first_line.split(b' ')) | ||
if not len(metadata) == len(vectors) == number_words: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert <CONDITION>, <MESSAGE>
gensim/test/test_scripts.py
Outdated
number_words, vector_size = map(int, first_line.split(b' ')) | ||
if not len(metadata) == len(vectors) == number_words: | ||
self.fail( | ||
'Metadata file %s and tensor file %s \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we using 120 character limit, no reasons to make short lines
gensim/test/test_scripts.py
Outdated
imply different number of rows.' % (self.metadata_file, self.tensor_file) | ||
) | ||
|
||
# write word2vec to file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The strange part of the test, you copy part of the code from script to test this script, why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm, sorry, could you be a bit more specific? Do you mean the lines from 154-157? If so, the idea was to take the metadata and tensor components that were separated and put the back together in w2v style without any gensim functions and make sure they are the same as what is created by the word2vec2tensor.py
script. I guess it is basically just tested to_utf8
though. Is it fine to remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to remove it, yes
word2vec2tensor
script compatible with python3
@menshikh-iv The changes you mentioned have been made. With regards to the |
@vsocrates ok, I see, simply "relax" |
Thank you @vsocrates, congrats on the first contribution 🥇 ! |
Fixes #1958. The word2vec2tensor script didn't support Python 3 due to unicode encoding. This was fixed and tests were added to ensure that the script functions correctly in the future. The usage of the script on the user side remains the same.