-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Loading and Saving LDA Models across Python 2 and 3. #913
Conversation
Thanks. Could you please add tests for these methods? Maybe we need to add 2 pickled models to the test_data folder, one for Python 2 and one for Python 3? |
Yes sure, I was gonna ask for suggestions on testing this actually. |
So, I added two saved LDA models, one in Python 2.7 and other in Python 3.5 environments in test_data folder. The method I used to create these models is in test_ldamodel.py. |
So I got this working. The LDA model loads fine now across both in Python 2 and 3. |
How different are they? Can you add a comparison with an epsilon of each other? |
Well, the epsilon is big for some indexes, I'm adding the exact arrays I printed. (Python 3.5 expElogbeta: [[ 0.00683436, 0.00685224, 0.01325946, 0.00895329, 0.09220773, 0.0764139, 0.09233712, 0.09220773, 0.006823, 0.14213993, 0.14239321, 0.09537042] Also, the id2word dictionary saved in both Python formats is not in the same order. |
if (isinstance(self.eta, six.string_types) and self.eta == 'auto') or len(self.eta.shape) != 1: | ||
separately_explicit.append('eta') | ||
# Merge separately_explicit with separately. | ||
if separately is not None and separately: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if separately
is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made changes as suggested.
@@ -1037,6 +1072,18 @@ def load(cls, fname, *args, **kwargs): | |||
""" | |||
kwargs['mmap'] = kwargs.get('mmap', None) | |||
result = super(LdaModel, cls).load(fname, *args, **kwargs) | |||
# Load the separately stored id2word dictionary saved in json. | |||
id2word_fname = utils.smart_extension(fname, '.json') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please make all files for one model in a special folder, so it is easy to keep track
This solution looks really complicated. Let's try something simpler. Let's save the Python version with the pickle. If the versions differ on load than raise an exception to use Dill. (assuming dill works across versions) |
I created the LDAModels with the same seed now and the tests pass. |
@tmylk Actually, I think no more changes need to be made for compatibility in word2vec or doc2vec models. The load/save methods work fine now for them. We could probably use the present pickle methods itself. |
Add an epsilon for equality comparison. |
c9d5be7
to
a15ef90
Compare
…ving LDA models across Pythong verions
a15ef90
to
63963c0
Compare
@@ -474,11 +487,11 @@ def testRNG(self): | |||
|
|||
def models_equal(self, model, model2): | |||
self.assertEqual(len(model.vocab), len(model2.vocab)) | |||
self.assertTrue(numpy.allclose(model.syn0, model2.syn0)) | |||
self.assertTrue(numpy.allclose(model.syn0, model2.syn0, atol=1e-4)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how big is the difference on your machine?
# logging.warning("Word2Vec model saved") | ||
|
||
def testModelCompatibilityWithPythonVersions(self): | ||
fname_model_2_7 = os.path.join(os.path.dirname(__file__), 'word2vecmodel_python_2_7') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use module_path
and datapath
defined above.
# Load the separately stored id2word dictionary saved in json. | ||
id2word_fname = utils.smart_extension(fname, '.json') | ||
try: | ||
with utils.smart_open(id2word_fname, 'r') as fin: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open file as binary, decode as necessary (if necessary).
# If id2word is not already in ignore, then saving it separately in json. | ||
id2word = None | ||
if self.id2word is not None and 'id2word' not in ignore: | ||
id2word = dict((k,v) for k,v in self.id2word.iteritems()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: space after comma.
id2word_fname = utils.smart_extension(fname, '.json') | ||
try: | ||
with utils.smart_open(id2word_fname, 'w', encoding='utf-8') as fout: | ||
json.dump(id2word, fout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better open the output as binary and write encoded utf8 to it.
Actually, the json
module already produces binary strings in dump
AFAIK, so what is this even for?
# Because of loading from S3 load can't be used (missing readline in smart_open) | ||
return _pickle.loads(f.read()) | ||
|
||
if sys.version_info > (3,0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: space after comma.
ce6d8b0
to
fbd5d6d
Compare
Merged in #1039 |
Modified load and save methods in ldamodel.py to manage compatibility issues when loading and saving models across Python 2 and 3.
This PR tackles Isssue #853