Distributed lda options #782

Merged
merged 5 commits into from
Jul 13, 2016

menshikh-iv
Contributor

Updates distributed LDA support. Workers and the dispatcher can now run in different network segments (ones that are not reachable by network broadcast). The broadcast-based variant is still supported.

If you want to use broadcast, read the tutorial at https://radimrehurek.com/gensim/dist_lsi.html on the official site.

If you want to use the new feature, add some arguments when you run the code, for example:

  1. Execute on all machines
    export PYRO_SERIALIZERS_ACCEPTED=pickle
    export PYRO_SERIALIZER=pickle
  2. On NS server
    python -m Pyro4.naming --host 0.0.0.0 --port <NS_PORT> -x
  3. On workers
    python -m gensim.models.lda_worker --host <NS_HOSTNAME> --port <NS_PORT> --no-broadcast -v
  4. On dispatcher
    python -m gensim.models.lda_dispatcher --host <NS_HOSTNAME> --port <NS_PORT> --no-broadcast -v
  5. Create LdaModel
    lda = LdaModel(..., ns_conf={"host": NS_HOST, "port": NS_PORT, "broadcast": False})
  6. Train it!
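
The steps above can be sketched in Python as follows. This is only an illustration: `make_ns_conf`, `launch_commands` and `train_distributed` are hypothetical helper names (not part of gensim), `ns.example.org` and `9100` are placeholder values, and passing `distributed=True` alongside `ns_conf` is an assumption about how step 5 is meant to be used.

```python
# Sketch only: the helper names below are hypothetical, not part of gensim.

def make_ns_conf(host, port):
    """Build the ns_conf dict from step 5 (keys as shown in the PR description)."""
    return {"host": host, "port": int(port), "broadcast": False}


def launch_commands(host, port):
    """Return the commands from steps 2-4 with the placeholders filled in."""
    target = "--host {0} --port {1}".format(host, port)
    return {
        "nameserver": "python -m Pyro4.naming --host 0.0.0.0 --port {0} -x".format(port),
        "worker": "python -m gensim.models.lda_worker {0} --no-broadcast -v".format(target),
        "dispatcher": "python -m gensim.models.lda_dispatcher {0} --no-broadcast -v".format(target),
    }


def train_distributed(corpus, id2word, num_topics, host, port):
    """Steps 5-6. Needs gensim installed and the name server, workers and
    dispatcher already running; not exercised in this sketch."""
    # Imported lazily so the two helpers above work without gensim.
    from gensim.models import LdaModel
    # Assumption: distributed=True is required together with ns_conf.
    return LdaModel(corpus, id2word=id2word, num_topics=num_topics,
                    distributed=True, ns_conf=make_ns_conf(host, port))


if __name__ == "__main__":
    # Print the command to run for each role, using placeholder values.
    for role, command in sorted(launch_commands("ns.example.org", 9100).items()):
        print("{0}: {1}".format(role, command))
```

Keeping the host and port in one place makes it harder for the workers, the dispatcher and the `LdaModel` call to end up pointing at different name servers.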

@@ -15,14 +15,21 @@


from __future__ import with_statement
import os, sys, logging, threading, time
import argparse
Owner

@piskvorky piskvorky Jul 12, 2016

This is py2.7 only. @tmylk I don't think we can drop support for py2.6 yet... is this import safe?

If it's triggered only on importing lda_dispatcher.py, it's probably fine... but we don't want py2.7+ imports in "core" gensim (at import gensim).

Contributor Author

I checked: this is triggered only when importing lda_dispatcher.py or lda_worker.py.
There is also an argparse backport in setup.py for Python < 2.7 (proof).
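
As a side note on the concern above, one way to keep a Python 2.7+ import out of a package's import path is to defer it into the function that needs it. The sketch below is a hypothetical command-line parser that only mirrors the flags used in the PR description; it is not the parser from lda_worker.py or lda_dispatcher.py.

```python
def parse_args(argv):
    """Parse worker-style options; hypothetical, mirrors the flags in the PR description."""
    # Deferred import: argparse is loaded only when the script entry point
    # runs, not when the surrounding package is imported.
    import argparse

    parser = argparse.ArgumentParser(description="hypothetical worker CLI")
    parser.add_argument("--host", default=None, help="name server hostname")
    parser.add_argument("--port", type=int, default=None, help="name server port")
    parser.add_argument("--no-broadcast", dest="no_broadcast", action="store_true",
                        help="do not locate the name server via broadcast")
    parser.add_argument("-v", "--verbose", action="store_true", help="verbose output")
    return parser.parse_args(argv)


if __name__ == "__main__":
    import sys
    print(parse_args(sys.argv[1:]))
```

With this layout, importing the module does not touch argparse at all; only calling `parse_args` does.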

@piskvorky
Owner

Awesome! This is a great update, and nicely done too.

If you don't mind me asking, how do you use this distributed LDA @menshikh-iv? What is your usecase/goal?

@menshikh-iv
Contributor Author

menshikh-iv commented Jul 12, 2016

@piskvorky, I have two use cases:

  1. Content classification
  2. Similarity search

I need to train LDA on a large corpus of web page content and vectorize all the pages. Training LDA takes a very long time. I can use several dedicated servers for training, but they are not on the same local network, so I modified distributed LDA for my case.

@piskvorky
Owner

piskvorky commented Jul 12, 2016

Thanks, interesting! Is this a personal project, academic research or a commercial project? (We keep a list of gensim adopters.)

@menshikh-iv
Contributor Author

@piskvorky personal research for now

@tmylk tmylk merged commit 6a289fe into piskvorky:develop Jul 13, 2016
@tmylk
Contributor

tmylk commented Jul 13, 2016

@menshikh-iv Thanks for the PR! Could you add a short notebook-style tutorial for this feature and a note in the changelog?

@menshikh-iv
Contributor Author

menshikh-iv commented Jul 13, 2016

@tmylk, unfortunately a notebook-style tutorial would not be useful for this feature, because I can't demonstrate it in a notebook. Maybe I could update this page in the documentation with small examples (like this message)?

About the changelog: should I add an entry under 0.3.12 in CHANGELOG.md?

And should I create a new PR for these changes?

@tmylk
Contributor

tmylk commented Jul 14, 2016

Hi @menshikh-iv, 0.3.12 is the right version to use. A new small PR would be good.

Updating this page with instructions would be great:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/src/distributed.rst

@manojpandey
Contributor

Documentation changed from rst to markdown here: #859

@menshikh-iv menshikh-iv deleted the distributed-lda-options branch February 19, 2018 04:40