
WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation

Introduction

WarpLDA is a cache-efficient implementation of Latent Dirichlet Allocation (LDA) that samples each token in O(1) time.

Installation

Prerequisites:

  • GCC (>=4.8.5)
  • CMake (>=2.8.12)
  • git
  • libnuma
    • CentOS: yum install libnuma-devel
    • Ubuntu: apt-get install libnuma-dev

Clone this project

git clone https://github.com/thu-ml/warplda
cd warplda

Install the third-party dependency (gflags)

./get_gflags.sh

Download some data and split it into a training set and a testing set

cd data
wget https://raw.githubusercontent.com/sudar/Yahoo_LDA/master/test/ydir_1k.txt
head -n 900 ydir_1k.txt > ydir_train.txt
tail -n 100 ydir_1k.txt > ydir_test.txt
cd ..

Compile the project

./build.sh
cd release/src
make -j

Quick-start

Format the data

./format -input ../../data/ydir_train.txt -prefix train
./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test

Train the model

./warplda --prefix train --k 100 --niter 300

Check the result. Each line is a topic: its id, the number of tokens assigned to it, and the ten most frequent words with their probabilities.

vim train.info.full.txt

Infer the latent topics of the testing data.

./warplda --prefix test --model train.model --inference --niter 40 --perplexity 10

Data format

The data format is identical to that of Yahoo! LDA. The input is a text file with a number of lines, where each line is a document. The format of each line is

id1 id2 word1 word2 word3 ...

id1 and id2 are two string document identifiers, each word is a string, and all fields are separated by whitespace.
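
For illustration, a single document line with made-up identifiers and words could look like

doc42 doc42 apple banana apple cherry banana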

Output format

WarpLDA generates a number of files:

.vocab (generated by format)

Each line is a word in the vocabulary.

.info.full.txt (generated by warplda -estimate)

The most frequent words for each topic. Each line is a topic, with its topic id, the number of tokens assigned to it, and a number of the most frequent words in the format (probability, word). The number of most frequent words is controlled by -ntop. .info.words.txt is a simpler version which only contains the words.
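
For example, a hypothetical line (topic id 3, 1024 assigned tokens, made-up words and probabilities, only the first three of the -ntop entries shown) could look like

3 1024 (0.120000, network) (0.095000, model) (0.071000, learning) ...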

.model (generated by warplda -estimate)

The word-topic count matrix. The first line contains four numbers

<size of vocabulary> <number of topics> <alpha> <beta>

Each of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format,

<number of elements> index:count index:count ...

For example, 0:2 on the first of these lines means that the first word in the vocabulary is assigned to topic 0 two times.
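
As an illustrative sketch (all numbers made up), a model with a 3-word vocabulary, 100 topics, alpha = 0.5 and beta = 0.01 could start with

3 100 0.5 0.01
2 0:2 7:1
1 4:3
3 2:5 41:1 99:2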

.z.estimate (generated by warplda -estimate)

The topic assignments of each token in the libsvm format. Each line is a document,

<number of tokens> <word id>:<topic id> <word id>:<topic id> ...
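
For instance, a document containing three tokens could be stored as the following line (the word and topic ids are made up):

3 15:4 7:4 102:63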

.z.inference (generated by warplda -inference)

The format is the same as .z.estimate.

Other features

  • Use a custom prefix for output: -prefix myprefix

  • Output the perplexity every 10 iterations: -perplexity 10

  • Tune the Dirichlet hyperparameters: -alpha 10 -beta 0.1

  • Use data from the UCI Machine Learning Repository

      wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt
      wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz
      gunzip docword.nips.txt.gz
      ./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt
      head -n 1400 nips.txt > nips_train.txt
      tail -n 100 nips.txt > nips_test.txt
    

License

MIT

Reference

Please cite WarpLDA if you find it useful!

@inproceedings{chen2016warplda,
  title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},
  author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},
  booktitle={VLDB},
  year={2016}
}