Improvements to Dynamic Topic Models #840

Open
5 tasks
bhargavvader opened this issue Aug 25, 2016 · 5 comments
Labels: difficulty hard, feature, wishlist

Comments

@bhargavvader
Contributor

bhargavvader commented Aug 25, 2016

Dynamic Topic Models (DTM) is a variation of LDA by Blei et al. that takes time-stamped data and models how topics evolve across time periods. I have described it in more detail in a series of blog posts here, and the recently merged PR #739 implements it.

While the code is functionally correct, it could use more work.
Some things that would be very useful:

  • Include a Document Influence Model (DIM) mode. Most of the infrastructure for this is already in place.
  • See whether LdaPost can be replaced by LdaModel completely without breaking anything.
    - in particular, a lot of DIM depends on LdaPost being in place.
  • The sslm class does the heavy lifting, so its mathematical methods are candidates for cythonising.
    - in particular, update_obs and the optimization take a lot of time.
  • Try to make it distributed, especially around the E and M steps.
  • Remove any C/C++ coding style that was left behind.
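
Before cythonising anything, it may help to confirm where the time actually goes. The following is a minimal sketch using only the standard library profiler; `profile_call` and `slow_sum` are hypothetical names for illustration, and `slow_sum` merely stands in for whatever training call is being measured (for example, fitting the model on a small corpus).

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report).

    Hypothetical helper: the report lists the ten entries with the
    highest cumulative time, which is where methods such as update_obs
    would be expected to show up if they dominate the runtime.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def slow_sum(n):
    """Stand-in workload; replace with the real training call."""
    total = 0
    for i in range(n):
        total += i * i
    return total


result, report = profile_call(slow_sum, 1000)
print(report)
```

Sorting by cumulative time keeps callers and their callees together, which makes it easier to judge whether a whole method or only an inner loop is worth moving to Cython.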

It would also be very useful to suggest how the code can be made more user-friendly, to propose alternate ways of taking data as input (for example, a dict or tuple such as {data/document : time-stamp}), and to post examples and results from training DTM on datasets.
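
As a rough illustration of the alternate input idea, the sketch below converts (document, time-stamp) pairs into a chronologically ordered document list plus a list of per-period document counts. It assumes the model is fed an ordered corpus together with such a `time_slice` list of counts; the helper name `to_time_slices` and the sample data are hypothetical. Pairs are used rather than a dict because tokenised documents (lists) cannot be dict keys.

```python
from collections import Counter


def to_time_slices(dated_docs):
    """Convert (document, timestamp) pairs to (documents, time_slice).

    Hypothetical helper. Documents are ordered by timestamp, and
    time_slice holds the number of documents in each distinct
    timestamp, in chronological order.
    """
    # sorted() is stable, so documents sharing a timestamp keep
    # their original relative order.
    ordered = sorted(dated_docs, key=lambda pair: pair[1])
    documents = [doc for doc, _ in ordered]
    counts = Counter(stamp for _, stamp in ordered)
    time_slice = [counts[stamp] for stamp in sorted(counts)]
    return documents, time_slice


# Made-up sample data: tokenised documents tagged with a year.
dated_docs = [
    (["bank", "loan"], 2015),
    (["river", "bank"], 2014),
    (["loan", "rate"], 2015),
]
documents, time_slice = to_time_slices(dated_docs)
print(documents)
print(time_slice)
```

The ordered documents would still need to go through the usual dictionary and bag-of-words conversion before training; this only handles the time-stamp bookkeeping.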

PRs to implement any of the suggestions or issues on improving performance would be particularly useful.

@piskvorky, @tmylk, could you add the feature label and whatever else would be appropriate, so this is easier to find when someone wishes to help?

@tmylk added the wishlist label Aug 25, 2016
@tmylk changed the title from "Helping with Dynamic Topic Models" to "Improvements to Dynamic Topic Models" Sep 28, 2016
@mjawa

mjawa commented Oct 5, 2016

I have been going through the code of ldaseqmodel.py and have the following questions:

  1. The LdaSeqModel class doesn't have the inferDIMseq method implemented yet, and as far as I can see no code path in the current implementation can invoke it. Is this assumption right?
  2. In that case, the E-step pretty much reduces to the inferDTMseq method, right?

@mjawa

mjawa commented Oct 5, 2016

#903 answers my question.

@bhargavvader
Contributor Author

bhargavvader commented Oct 5, 2016

Yes, @mjawa, you're absolutely right on both points 1 and 2.
The way forward here would be to replace the lda_post class with our existing ldamodel without breaking anything, while still giving similar results.
However, I have concerns, because some parts of lda_post are needed for implementing DIM later.

I think the steps ahead, in order, would be to first implement DIM, and then work on making things faster by integrating ldamodel and making it distributed.

@mjawa

mjawa commented Oct 5, 2016

@bhargavvader: I see. I can start implementing DIM. I am new to gensim; can you please tell me some of the reasons why lda_post was used instead of ldamodel to begin with?

@bhargavvader
Contributor Author

Mainly because I wanted to replicate the C code as closely as possible, both to make testing easier and to keep the option of including DIM. DIM uses methods in the ldapost class.

It'll need some investigation to see to what extent ldamodel can be used instead of ldapost.

@menshikh-iv added the feature and difficulty hard labels Oct 2, 2017
4 participants