LDA Diff and Convergence metrics: inefficient memory consumption and file size #2265

Closed
andifunke opened this issue Nov 14, 2018 · 2 comments
Labels
bug (Issue described a bug), difficulty easy (Easy issue: required small fix)

Comments

@andifunke

Description

The current implementation of the DiffMetric and ConvergenceMetric callbacks has some severe drawbacks with respect to memory consumption and file size when persisting an LdaModel.

Details: Both callback metrics call the LdaModel.diff() method, which requires a second model as an argument. To provide it, the callbacks make a deep copy of the model when they are initialized and again every time they are invoked (see Callback.set_model() and Callback.on_epoch_end()). A deep copy duplicates not only the model's dictionary but also every other callback the model points to. Running both DiffMetric and ConvergenceMetric over several training epochs therefore accumulates a great deal of recursion and redundancy.
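For context, this is roughly how the metrics are wired up in the tutorial notebook (a minimal sketch; corpus, dictionary and the hyperparameters are placeholders, not part of this report):

from gensim.models import LdaModel
from gensim.models.callbacks import ConvergenceMetric, DiffMetric

# placeholders: corpus and dictionary are assumed to be prepared as in the notebook
diff_metric = DiffMetric(distance='jaccard', logger='shell')
convergence_metric = ConvergenceMetric(distance='jaccard', logger='shell')

# Each metric deep-copies the model once in Callback.set_model() and again
# after every pass in Callback.on_epoch_end(); each copy drags the other
# callbacks (and the copies they hold) along with it.
model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=35, passes=50,
                 callbacks=[diff_metric, convergence_metric])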

Steps/Code/Corpus to Reproduce

The issue is easy to observe by watching memory consumption grow during training, and also by serializing the model after training. Run the following tutorial notebook:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Training_visualizations.ipynb
After training, save the model with and without callbacks and compare the file sizes:

model.save('model_with_cb')
# file size: 2.9 GB <- not ok

model.callbacks = None
model.save('model_without_cb')
# file size: 595.2 kB <- ok

Possible solutions

A workaround is to set ldamodel.callbacks to None before saving the model. However, if you plan to update the model later, you have to remember to initialize new callbacks. Even then, memory consumption remains an issue during training, so it would be better to avoid making a deep copy of the model in the first place.
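A minimal sketch of that workaround (assuming the model was trained with callbacks as above; more_corpus is a placeholder):

# detach the callbacks so they (and the model copies they hold) are not pickled
model.callbacks = None
model.save('model_without_cb')

# when updating later, fresh callbacks have to be attached again
model = LdaModel.load('model_without_cb')
model.callbacks = [DiffMetric(distance='jaccard', logger='shell'),
                   ConvergenceMetric(distance='jaccard', logger='shell')]
model.update(more_corpus)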

Internally, the LdaModel.diff() method only needs access to the previous topics, not to the entire model. It would therefore be sufficient to back up just the previous topics instead of the whole model. The diff method might need some rewriting for that, though.
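A rough sketch of that idea (not the current gensim API; topic_diff and previous_topics are hypothetical names): instead of a deep copy, the callback would keep only the topic-term matrix from the previous pass and compare against it.

import numpy as np
from gensim.matutils import hellinger

def topic_diff(model, previous_topics, distance=hellinger):
    """Per-topic distance between the current topics and the stored previous ones."""
    current = model.get_topics()  # dense (num_topics, num_terms) matrix, rows sum to 1
    if previous_topics is None:
        return None, current
    diffs = np.array([distance(cur, prev) for cur, prev in zip(current, previous_topics)])
    return diffs, current

After each pass the callback would store the returned current matrix as the new previous_topics, so only a single (num_topics, num_terms) array is kept in memory instead of a full model copy.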

Versions

Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
NumPy 1.14.3
SciPy 1.0.1
gensim 3.5.0
FAST_VERSION 1

@menshikh-iv
Contributor

Hello @andifunke, thanks for the report. I agree we should always set callbacks to None before saving the model (and the same for the *2vec models, of course).

@menshikh-iv added the bug and difficulty easy labels on Dec 13, 2018
@menshikh-iv
Contributor

I'm closing this as an "almost duplicate" of #2136. We'll fix it, don't worry @andifunke.
