lda-frost - Distributed Latent Dirichlet Allocation with Python and Pyro, using token passing

Note

This project is currently used to benchmark the LDA inference process. Further work is needed before the approximated posterior can be used.

This project is an experiment in implementing text-based LDA with modified token passing as the synchronization technique that keeps the word-topic matrix consistent across any number of worker processes. Worker processes iteratively sample from a subset of the entire corpus when they receive a "token". A token is a 2-tuple: the first entry is the token's ID and the second entry is a row in the distributed word-topic matrix. The token ID is the position of a word in the vocabulary, and the row is the word-topic row for that word. A dispatcher routes token requests among workers, and a worker can only compute updates to the word-topic rows of the tokens it holds. Because the model is "owner computes", it has the potential to be highly parallelizable.
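
As a rough illustration of the token-passing scheme described above, here is a single-process sketch in plain Python. It does not use Pyro or the Cython samplers from this repo. The names Worker, Dispatcher, build_tokens, process and sweep are made up for this example, the rotation schedule is an assumption, and the sampling step is a placeholder that weights topics by smoothed word-topic counts only, not the real LDA conditional.

```python
import random


class Worker(object):
    """Holds a shard of the corpus and one topic per local word occurrence."""

    def __init__(self, docs, num_topics, seed):
        self.num_topics = num_topics
        self.rng = random.Random(seed)
        self.assignments = {}  # word_id -> list of topic ids (local occurrences)
        for doc in docs:
            for word_id in doc:
                topic = self.rng.randrange(num_topics)
                self.assignments.setdefault(word_id, []).append(topic)

    def local_counts(self, word_id):
        """Count this worker's topic assignments for one word."""
        counts = [0] * self.num_topics
        for topic in self.assignments.get(word_id, []):
            counts[topic] += 1
        return counts

    def process(self, token, beta=0.1):
        """Resample local occurrences of the token's word; update its row in place."""
        word_id, row = token
        topics = self.assignments.get(word_id, [])
        for i, old_topic in enumerate(topics):
            row[old_topic] -= 1
            # Placeholder conditional (assumption): smoothed word-topic counts only.
            weights = [count + beta for count in row]
            threshold = self.rng.random() * sum(weights)
            new_topic = 0
            cumulative = weights[0]
            while cumulative < threshold and new_topic < self.num_topics - 1:
                new_topic += 1
                cumulative += weights[new_topic]
            row[new_topic] += 1
            topics[i] = new_topic


class Dispatcher(object):
    """Routes tokens so each token is held by exactly one worker per round."""

    def __init__(self, workers):
        self.workers = workers

    def sweep(self, tokens):
        n = len(self.workers)
        for round_index in range(n):
            for worker_index, worker in enumerate(self.workers):
                for token in tokens:
                    # Token for word w goes to worker (w + round) mod n this round.
                    if (token[0] + round_index) % n == worker_index:
                        worker.process(token)


def build_tokens(workers, vocab_size, num_topics):
    """Build one (word_id, row) token per vocabulary word from initial assignments."""
    tokens = []
    for word_id in range(vocab_size):
        row = [0] * num_topics
        for worker in workers:
            for topic, count in enumerate(worker.local_counts(word_id)):
                row[topic] += count
        tokens.append((word_id, row))
    return tokens


# Three workers, each with one tiny document of word IDs.
shards = [[[0, 1, 1, 2]], [[2, 2, 3]], [[0, 3, 3, 1]]]
workers = [Worker(docs, num_topics=2, seed=i) for i, docs in enumerate(shards)]
tokens = build_tokens(workers, vocab_size=4, num_topics=2)
Dispatcher(workers).sweep(tokens)
```

In this sketch each round gives every token to exactly one worker, so no two workers update the same row at once, and each row's total stays equal to the number of occurrences of its word in the corpus.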

Note: If this is the first time you're setting up the cluster, setup will take a while, since each node needs to clone this repo, install all Python requirements, etc. Currently the project only runs on Python 2, but that's mostly just because the print statements use Python 2 syntax.

Directions:

1. Create a config.cfg, with each line containing the path to a node you want in the cluster (hopefully nodes will be addressable by IP later on)
2. Create a virtual environment, with 'venv' as your directory
3. Run source to activate your virtual environment
4. Install all Python requirements
5. Run setup_cluster.sh

If you want to rebuild any Cython modules (the samplers are written in Cython), include the '--rebuild-cython' flag when running setup_cluster.sh

When you run setup_cluster.sh, this is what happens:

1. Every node gets the same setup you just did in steps 2-4 above
2. A Pyro nameserver is started on the current node
3. A dispatcher is started on the current node (to handle token request routing amongst workers)
4. A worker is started on every node listed in your config.cfg file

Example (from step 2 on):

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
bash setup_cluster.sh

Example config.cfg file:

/users/Dave/
/users/Charlie/
/users/Mike/
/users/Charlotte/
/users/Betty/
/users/Susie/
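
For reference, a file in this format can be read with a few lines of Python. This is only a sketch: read_node_paths is a hypothetical helper, not a function from this repo, and it assumes one node path per line with blank lines ignored.

```python
import os
import tempfile


def read_node_paths(config_path):
    """Return the non-empty lines of a config file, one node path per line."""
    with open(config_path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Write a small sample config to a temporary file and read it back.
sample = "/users/Dave/\n/users/Charlie/\n\n/users/Mike/\n"
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".cfg", delete=False)
tmp.write(sample)
tmp.close()
node_paths = read_node_paths(tmp.name)
os.remove(tmp.name)
```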
