The code for the paper "Inter and Intra Topic Structure Learning with Word Embeddings" (ICML 2018).
Key features:
- WEDTM is a deep topic model that discovers topic hierarchies.
- WEDTM is also able to discover "sub-topics" with the help of word embeddings.
- Excellent performance on perplexity, document classification, and topic coherence.
- The code has been tested on macOS and Linux (Ubuntu). To run it on Windows, you need to re-compile `GNBP_mex_collapsed_deep_WEDTM.c` with MEX and a C/C++ compiler.
- Requirements: MATLAB R2016b (or later) and the code of GBN.
- Make sure GBN runs properly on your machine.
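
The setup steps above could be scripted roughly as follows. This is only a sketch: the `gbn_dir` location is a hypothetical placeholder (the actual path depends on where you put the GBN code), and the script assumes it is run from the folder that contains the WEDTM sources. It adds GBN to the MATLAB path and rebuilds the MEX file only when no binary for the current platform is found.

```matlab
% Hypothetical location of the GBN code; change this to your own path
gbn_dir = fullfile(pwd, 'GBN');
if exist(gbn_dir, 'dir') == 7
    addpath(genpath(gbn_dir));   % make the GBN functions visible to MATLAB
else
    warning('GBN code not found at %s', gbn_dir);
end

% Re-compile the MEX source only if no binary exists for this platform
mex_name = 'GNBP_mex_collapsed_deep_WEDTM';
mex_src  = [mex_name '.c'];
mex_bin  = [mex_name '.' mexext];   % mexext gives the platform's MEX extension
needs_build = exist(mex_src, 'file') == 2 && exist(mex_bin, 'file') == 0;
if needs_build
    mex(mex_src);                % needs a compiler configured via "mex -setup"
end
```

If `mex` reports that no compiler is configured, running `mex -setup` first lets MATLAB pick one.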
- We provide the WS dataset used in the paper, stored in MAT format with the following contents:
  - doc: a V-by-N sparse count matrix for N documents with V words in the vocabulary
  - embeddings: a V-by-L matrix of L-dimensional word embeddings for the V words
  - vocabulary: the words in the vocabulary
  - labels: the label matrix for the documents (only for document classification)
  - label_names: the label names (only for document classification)
  - train_idx: the indices of the training documents (only for document classification)
  - test_idx: the indices of the test documents (only for document classification)

  Please prepare your own documents in the above format. If you use this dataset, please cite the original papers, which are cited in our paper.
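
As an illustration of the format described above, the following sketch builds a tiny toy dataset and saves it as a MAT file. The variable names follow the list above; everything else is an assumption made for the example: the file name `my_dataset.mat`, the toy sizes, the random embeddings, the use of cell arrays for `vocabulary` and `label_names`, and the shape of `labels` (one class id per document). Check the provided WS file for the exact types the demo expects.

```matlab
% Toy sizes (hypothetical): V words, N documents, L embedding dimensions
V = 6; N = 4; L = 3;

% doc: V-by-N sparse count matrix (rows = words, columns = documents)
counts = [2 0 1 0;
          0 3 0 1;
          1 0 0 2;
          0 1 4 0;
          0 0 2 1;
          3 1 0 0];
doc = sparse(counts);

% embeddings: V-by-L matrix, one row per vocabulary word.
% Random values stand in for real pre-trained embeddings.
embeddings = randn(V, L);

% vocabulary: assumed here to be a V-by-1 cell array of strings
vocabulary = {'topic'; 'model'; 'word'; 'embedding'; 'deep'; 'gamma'};

% Classification-only fields; the shapes below are assumptions
labels      = [1; 2; 1; 2];           % one class id per document
label_names = {'class_a'; 'class_b'};
train_idx   = [1; 2; 3];
test_idx    = 4;

% Basic consistency checks before saving
assert(isequal(size(doc), [V, N]));
assert(isequal(size(embeddings), [V, L]));
assert(numel(vocabulary) == V);
assert(isempty(intersect(train_idx, test_idx)));

save('my_dataset.mat', 'doc', 'embeddings', 'vocabulary', ...
     'labels', 'label_names', 'train_idx', 'test_idx');
```

The row order of `doc`, `embeddings`, and `vocabulary` must agree, since row v in each refers to the same word.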
- Run `demo_WEDTM.m`:
  - Specify where the GBN code is installed and some model parameters.
  - Follow the comments and run it.
  - The code should yield the results reported in the paper.
  - I've found that with more MCMC iterations, the model performs better than reported in the paper. 😂
- As WEDTM adapts GBN for part of its model structure, the code relies heavily on GBN and largely follows its code structure.
- For the Polya-Gamma sampler (`PolyaGamRnd_Gam.m`), I used Mingyuan Zhou's implementation, described in "Parsimonious Bayesian deep networks". If you use the sampler, please cite that paper.
- For the sampling of W, I partly referred to the implementation of DPFA by Gan Zhe.