-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix datatype parameter for KeyedVectors.load_word2vec_format
. Fix #1682
#1819
Conversation
Hi @pushpankar, thanks for your PR, what's about "other cases" from #1682? |
Ping @pushpankar, how is going? You also need to write additional tests to check this behavior and prevent regression + please merge develop from upstream. |
Sorry. I thought my work is done. Should I write a test script in |
@pushpankar for two types already enough, also, use example from original issue |
gensim/test/test.kv.txt
Outdated
@@ -0,0 +1,3 @@ | |||
2 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be in gensim/test/test_data
gensim/test/test_datatype.py
Outdated
def test_datatype(self): | ||
path = os.path.join(os.path.dirname(__file__), 'test.kv.txt') | ||
kv = KeyedVectors.load_word2vec_format(path, datatype=np.float64) | ||
self.assertEqual(kv['horse.n.01'][0], -0.0008546282343595379) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- You should use assertAlmostEqual instead of
assertEqual
for float comparison - You also need to check datatype
gensim/test/test_datatype.py
Outdated
|
||
class TestDataType(unittest.TestCase): | ||
def test_datatype(self): | ||
path = os.path.join(os.path.dirname(__file__), 'test.kv.txt') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you move test.kv.txt
construction should look like
from gensim.test.utils import datapath
path = datapath('test.kv.txt')
KeyedVectors.load_word2vec_format
. Fix #1682
Hi @pushpankar thanks for the PR! Apart from the suggestions @menshikh-iv mentioned, it'd also be good to have a test for word2vec files in binary format. |
@@ -0,0 +1,18 @@ | |||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a file header, like in the other test files.
Saving model(loaded with |
Saving model also causes loss in precision in both text and binary format. |
I see many failed tests, @pushpankar please have a look into. |
@menshikh-iv I am working on it. Changing the way binary files are loaded is causing it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv @jayantj I need some help.
gensim/models/keyedvectors.py
Outdated
@@ -220,7 +220,7 @@ def add_word(word, weights): | |||
result.index2word.append(word) | |||
|
|||
if binary: | |||
binary_len = dtype(REAL).itemsize * vector_size | |||
binary_len = dtype(datatype).itemsize * vector_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having problem detecting binary_len
of vector saved with custom datatype. The only clue is that the next vector starts after a " "
but before the space comes a string(also converted in python bytes) which can be of any length. @menshikh-iv any suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried adding \n
at the end of each vector during saving in binary but that broke many other tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pushpankar how it works if datatype=REAL
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since earlier float were only saved with only 32 bit precision, knowing the size of each vector in binary format was easy. Casting to lower precision is only done after loading vectors.
Please also note that in develop branch too, casting vector to lower precision and saving it in binary and then loading leads to some errors. This is because while loading float32 is being assumed but during saving it was saved with lower precision like float16. I am adding some code to make it more clear.
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('./test_data/test.kv.txt', datatype=np.float16)
print(model['horse.n.01'][0])
model.save_word2vec_format('./test_data/test.kv.bin', binary=True)
model2 = KeyedVectors.load_word2vec_format('./test_data/test.kv.bin', datatype=np.float32, binary=True)
print(model2['horse.n.01'][0])
Gives
Traceback (most recent call last):
File "convert2binary.py", line 7, in <module>
print(model2['horse.n.01'][0])
File "/home/pushpankar/gensim/gensim/models/keyedvectors.py", line 326, in __getitem__
return self.word_vec(words)
File "/home/pushpankar/gensim/gensim/models/keyedvectors.py", line 453, in word_vec
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'horse.n.01' not in vocabulary"
This is because float32 was assumed while reading binary vector but originally it was saved with float16. Thus more than necessary bytes was read for every vector.
Let me know if I am not clear enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, probably the easiest solution for this case is read/write with REAL
type & cast it before the end of "load" process, wdyt @jayantj?
This fixes:
|
gensim/test/test_datatype.py
Outdated
def test_text(self): | ||
path = datapath('test.kv.txt') | ||
kv = KeyedVectors.load_word2vec_format(path, binary=False, | ||
datatype=np.float64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's about different datatypes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will np.float16, np.float32, and np.float64 be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
gensim/models/keyedvectors.py
Outdated
@@ -147,9 +147,10 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None) | |||
for word, vocab in sorted(iteritems(self.vocab), key=lambda item: -item[1].count): | |||
row = self.syn0[vocab.index] | |||
if binary: | |||
row = row.astype(REAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('./test_data/test.kv.txt', datatype=np.float16)
print(model['horse.n.01'][0])
model.save_word2vec_format('./test_data/test.kv.bin', binary=True)
model2 = KeyedVectors.load_word2vec_format('./test_data/test.kv.bin', datatype=np.float32, binary=True)
this causes crash.
This is another bug that exists in develop branch.
Steps to reproduce error:
- load model with low precision.
- save that model in binary
- then finally try to load the binary model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for investigating that and reporting it. Do you have any ideas why that might be occurring? The fix seems a little hack-ish and might mask other genuine problems.
#1777 should be merged first |
Hello @pushpankar, can you resolve merge-conflict? |
69a4b2e
to
4d0661f
Compare
4d0661f
to
991bcb6
Compare
Good work @pushpankar, LGTM, wdyt @jayantj? |
gensim/test/test_datatype.py
Outdated
|
||
class TestDataType(unittest.TestCase): | ||
def load_model(self, datatype): | ||
path = datapath('test.kv.txt') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A slightly more descriptive name would be helpful, there's already a lot of test data and it can easily get confusing.
gensim/test/test_datatype.py
Outdated
model1.save_word2vec_format(binary_path, binary=True) | ||
model2 = KeyedVectors.load_word2vec_format(binary_path, datatype=np.float64, binary=True) | ||
self.assertAlmostEqual(model1["horse.n.01"][0], np.float16(model2["horse.n.01"][0])) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another test to verify that the type
of a value in the loaded array matches the datatype passed to load_word2vec_format
would explicitly confirm the new behaviour.
LGTM to me too (left some minor comments which should be taken care of, IMO), thanks for the fix @pushpankar |
@pushpankar good work, congratz with the first contribution! 👍 |
…iskvorky#1682 (piskvorky#1819) * load vector with high precision * Test changes * Fix flake8 error * Fix path error * Reformat code * Fix precision loss issue for binary word2vec * Fix precision loss during saving model in text format * Fix binary file loading issue * Test other datatypes as well. * Test type conversion * Fix build error * Use better names * Test type after conversion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor code style request.
class TestDataType(unittest.TestCase): | ||
def load_model(self, datatype): | ||
path = datapath('high_precision.kv.txt') | ||
kv = KeyedVectors.load_word2vec_format(path, binary=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hanging indent please (not vertical).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you are talking about line 22-23. I have merged them.
No description provided.