Description
The current implementation of the DiffMetric and ConvergenceMetric callbacks has severe drawbacks with respect to memory consumption during training and file size when persisting an LdaModel.
Details: both callback metrics call the LdaModel.diff() method, which requires a second model as an argument. The callbacks therefore make a deep copy of the model when they are initialized and again each time they are called (refer to Callback.set_model() and Callback.on_epoch_end()). A deep copy duplicates not only the model's dictionary but also every other callback the model points to, so running both DiffMetric and ConvergenceMetric over several training epochs accumulates considerable recursion and redundancy.
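For illustration, here is a minimal sketch of the copying behaviour described above (simplified and paraphrased; attribute names and details are approximate, not a verbatim excerpt of gensim/models/callbacks.py):

```python
import copy


class Callback(object):
    """Simplified sketch of the callback handler, not the actual gensim code."""

    def __init__(self, metrics):
        self.metrics = metrics

    def set_model(self, model):
        # The handler keeps a reference to the live model *and* a full deep
        # copy of it as the "previous" state for DiffMetric/ConvergenceMetric.
        self.model = model
        self.previous = copy.deepcopy(model)

    def on_epoch_end(self, epoch, topics=None):
        for metric in self.metrics:
            # Diff-based metrics compare the current model against the stored copy.
            metric.get_value(model=self.model, other_model=self.previous)
        # Another deep copy per epoch. Because the model points back at its
        # callbacks, and those callbacks hold the earlier copies, each new
        # copy also duplicates the older ones -- the nesting grows every epoch.
        self.previous = copy.deepcopy(self.model)
```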
Steps/Code/Corpus to Reproduce
The issue is easy to observe by watching memory consumption grow over time during training, and again when serializing the model afterwards. Run the following tutorial notebook:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Training_visualizations.ipynb
After training, save the model with and without callbacks and compare the file sizes:
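```python
model.save('model_with_cb')
# file size: 2.9 GB <- not ok

model.callbacks = None
model.save('model_without_cb')
# file size: 595.2 kB <- ok
```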
Possible solutions
A workaround is to set ldamodel.callbacks to None before saving the model. However, if you plan to update the model later, you have to remember to attach fresh callbacks first, and the memory consumption during training remains an issue either way. It would be better to avoid the deep copy of the model in the first place.
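For example (a sketch of the workaround; `lda` and `other_corpus` are placeholder names, and it assumes LdaModel.update() picks up the freshly attached callbacks from the callbacks attribute):

```python
from gensim.models import LdaModel
from gensim.models.callbacks import ConvergenceMetric

# Drop the callbacks (and the nested model copies they hold) before persisting.
lda.callbacks = None
lda.save('lda_model')

# Later: reload and re-attach fresh callbacks before updating the model.
lda = LdaModel.load('lda_model')
lda.callbacks = [ConvergenceMetric(distance='jaccard', logger='shell')]
lda.update(other_corpus)
```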
Internally, the LdaModel.diff() method only needs access to the previous topics, not to the entire model. It would therefore be sufficient to back up just the previous topics instead of the whole model, although the diff method might need some rewriting to support this.
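As a rough illustration of that direction (topic_diff is a hypothetical helper, not an existing gensim API), the callback could store only the matrix returned by model.get_topics() and compare topic-term matrices directly, e.g. with one of the distances diff() already supports, such as Hellinger:

```python
import numpy as np
from gensim.matutils import hellinger


def topic_diff(previous_topics, current_topics):
    """Per-topic Hellinger distance between two topic-term matrices of
    shape (num_topics, vocab_size), as returned by LdaModel.get_topics()."""
    return np.array([
        hellinger(prev, curr)
        for prev, curr in zip(previous_topics, current_topics)
    ])


# Inside the callback, instead of previous = copy.deepcopy(model):
#     previous_topics = model.get_topics().copy()
# and on the next epoch:
#     convergence = topic_diff(previous_topics, model.get_topics()).sum()
```

Storing only the (num_topics, vocab_size) array would keep memory bounded during training and keep the saved model free of nested model copies.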
Versions
Linux-4.13.0-38-generic-x86_64-with-debian-stretch-sid
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0]
NumPy 1.14.3
SciPy 1.0.1
gensim 3.5.0
FAST_VERSION 1