LDA Diff and Convergence metrics: inefficient memory consumption and file size #2265

Closed
andifunke opened this issue Nov 14, 2018 · 2 comments
Labels
bug (Issue described a bug), difficulty easy (Easy issue: required small fix)

Comments

@andifunke

Description

The current implementation of the DiffMetric and ConvergenceMetric callbacks has some severe drawbacks with respect to memory consumption and file size when persisting an LdaModel.

Details: Both callback metrics call the LdaModel.diff() method, which requires a second model as an argument. To provide it, the callbacks make a deep copy of the model when they are initialized and again every time they are invoked (see Callback.set_model() and Callback.on_epoch_end()). A deep copy duplicates not only the model's dictionary but also every other callback the model points to. Running both DiffMetric and ConvergenceMetric over several training epochs therefore accumulates a great deal of recursion and redundancy.
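For context, this is roughly how the metrics are wired up in the tutorial notebook (a minimal sketch; corpus, dictionary and the hyperparameters are placeholders, not part of this report):

from gensim.models import LdaModel
from gensim.models.callbacks import ConvergenceMetric, DiffMetric

# placeholders: corpus and dictionary are assumed to be prepared as in the notebook
diff_metric = DiffMetric(distance='jaccard', logger='shell')
convergence_metric = ConvergenceMetric(distance='jaccard', logger='shell')

# Each metric deep-copies the model once in Callback.set_model() and again
# after every pass in Callback.on_epoch_end(); each copy drags the other
# callbacks (and the copies they hold) along with it.
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=35, passes=50,
                 callbacks=[diff_metric, convergence_metric])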

Steps/Code/Corpus to Reproduce

The issue is easy to observe by watching memory consumption grow during training, and also by serializing the model after training. Run the following tutorial notebook:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Training_visualizations.ipynb
After training, save the model with and without callbacks and compare the file sizes:

model.save('model_with_cb')
# file size: 2.9 GB <- not ok

model.callbacks = None
model.save('model_without_cb')
# file size: 595.2 kB <- ok

Possible solutions

A workaround is to set ldamodel.callbacks to None before saving the model. However, if you plan to update the model later, you have to remember to initialize new callbacks. Even then, memory consumption remains an issue during training, so it would be better to avoid making a deep copy of the model in the first place.
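A minimal sketch of that workaround (assuming the model was trained with callbacks as above; more_corpus is a placeholder):

# detach the callbacks so they (and the model copies they hold) are not pickled
model.callbacks = None
model.save('model_without_cb')

# when updating later, fresh callbacks have to be attached again
model = LdaModel.load('model_without_cb')
model.callbacks = [DiffMetric(distance='jaccard', logger='shell'),
                   ConvergenceMetric(distance='jaccard', logger='shell')]
model.update(more_corpus)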

Internally, the LdaModel.diff() method only needs access to the previous topics, not to the entire model. It would therefore be sufficient to back up just the previous topics instead of the whole model. The diff method might need some rewriting for that, though.
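A rough sketch of that idea (not the current gensim API; topic_diff and previous_topics are hypothetical names): instead of a deep copy, the callback would keep only the topic-term matrix from the previous pass and compare against it.

import numpy as np
from gensim.matutils import hellinger

def topic_diff(model, previous_topics, distance=hellinger):
    """Per-topic distance between the current topics and the stored previous ones."""
    current = model.get_topics()  # dense (num_topics, num_terms) matrix, rows sum to 1
    if previous_topics is None:
        return None, current
    diffs = np.array([distance(cur, prev) for cur, prev in zip(current, previous_topics)])
    return diffs, current

After each pass the callback would store the returned current matrix as the new previous_topics, so only a single (num_topics, num_terms) array is kept in memory instead of a full model copy.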

Versions

Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
NumPy 1.14.3
SciPy 1.0.1
gensim 3.5.0
FAST_VERSION 1

@menshikh-iv
Contributor

Hello @andifunke, thanks for the report. I agree we should always set callbacks to None before saving the model (and the same for the *2vec models, of course).

@menshikh-iv added the bug and difficulty easy labels on Dec 13, 2018
@menshikh-iv
Contributor

I'm closing this as an "almost duplicate" of #2136. We'll fix it, don't worry @andifunke.
