

Familia is an open-source project that implements three popular topic models based on large-scale industrial data: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE). In addition, Familia offers several tools, including lda-infer and lda-query-doc-sim. Familia can be easily applied to many tasks, such as document classification, document clustering, and personalized recommendation. Because model training is costly, we will continue to release well-trained topic models based on various types of large-scale data.

Introduction

For details of the topic models implemented by Familia, see the papers on topic models.

Generally, applications of topic models fall into two categories: Semantic Representation and Semantic Matching.

  • Semantic Representation

    Topic models can mine hidden dimensions (topics) from a document collection and generate semantic representations of documents. These representations can be used as features for document classification, document content analysis, and CTR prediction.

  • Semantic Matching

    We offer two methods to compute semantic similarity between documents:

    • Semantic similarity between short-long documents, which can be applied to keyword extraction and computing query-document semantic similarity.
    • Semantic similarity between long-long documents, which can be applied to computing semantic similarity between a user profile and a news article.
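
The two matching modes above can be sketched with toy numbers. The block below is a minimal Python illustration, not Familia's API: the function names, the three-topic distributions, and the tiny vocabulary are all hypothetical. It assumes that short-long similarity is scored as the average probability of the short text's words under the long document's topic distribution, and that long-long similarity is one minus the Jensen-Shannon divergence (base-2 logarithms) between two topic distributions. Both are common choices for topic-model matching and are assumed here rather than taken from this README.

```python
import math

def short_long_similarity(query_words, doc_topics, topic_word):
    """Average over query words of sum_k P(word | topic k) * P(topic k | doc)."""
    if not query_words:
        return 0.0
    total = 0.0
    for word in query_words:
        total += sum(p_topic * topic_word[k].get(word, 0.0)
                     for k, p_topic in enumerate(doc_topics))
    return total / len(query_words)

def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    def kl(a, b):
        # Terms with a probability of zero contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def long_long_similarity(p, q):
    """Turn the divergence into a similarity: 1 means identical distributions."""
    return 1.0 - js_divergence(p, q)

# Hypothetical toy model: P(word | topic) for three topics.
topic_word = [
    {"football": 0.5, "match": 0.5},
    {"stock": 0.6, "market": 0.4},
    {"movie": 0.7, "actor": 0.3},
]
# Hypothetical topic distributions P(topic | document).
news_article = [0.8, 0.1, 0.1]
user_profile = [0.7, 0.2, 0.1]
finance_report = [0.05, 0.9, 0.05]

if __name__ == "__main__":
    print(short_long_similarity(["football", "match"], news_article, topic_word))
    print(long_long_similarity(user_profile, news_article))
    print(long_long_similarity(user_profile, finance_report))
```

In this toy setup the sports-heavy user profile scores closer to the news article than to the finance report, which is the behavior a recommendation feature would rely on.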

For more details, see the Familia Wiki.

Compilation

The required third-party libraries are gflags-2.0, glog-0.3.4, and protobuf-2.5.0. The compiler must support C++11 (g++ >= 4.8). Linux and macOS are supported. The dependencies can be downloaded and installed automatically by running the following script.

$ sh build.sh

Download

$ cd model
$ sh download_model.sh

For more details, see Models.

Demo

The Familia demo includes the following functions:

  • Semantic Representation: uses topic models to infer the topic distribution of the input document.

  • Semantic Matching: computes semantic similarity between short-long or long-long documents.

  • Topic Show: displays the top words under each topic to help users understand the topics.
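
As an illustration of what a topic-show function does, the sketch below lists the most probable words of each topic. It is not the demo's code: the `top_words` helper and the in-memory dictionary of word probabilities are hypothetical stand-ins, since the model file format is not described here.

```python
import heapq

def top_words(topic_word, n=3):
    """Return, for each topic, its n most probable (word, probability) pairs, highest first."""
    return [
        heapq.nlargest(n, words.items(), key=lambda pair: pair[1])
        for words in topic_word
    ]

# Hypothetical toy model: P(word | topic) for two topics.
toy_topic_word = [
    {"football": 0.5, "match": 0.3, "coach": 0.15, "ticket": 0.05},
    {"stock": 0.6, "market": 0.25, "fund": 0.1, "ticket": 0.05},
]

if __name__ == "__main__":
    for topic_id, words in enumerate(top_words(toy_topic_word, n=2)):
        print(topic_id, words)
```

Reading the top words side by side is usually the quickest way to judge whether a topic is coherent before using it as a feature.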

For more details, see Demos.

Tips

  • If libglog.so, libgflags.so, or other dynamic libraries cannot be found, add third_party/lib to the LD_LIBRARY_PATH environment variable.

    export LD_LIBRARY_PATH=./third_party/lib:$LD_LIBRARY_PATH

Contact

Github Issues

{familia} at baidu.com

Citation

The following article describes the Familia project and industrial cases powered by topic modeling. It bundles and translates the Chinese documentation of the website. We recommend citing this article by default.

Di Jiang, Zeyu Chen, Rongzhong Lian, Siqi Bao and Chen Li. 2017. Familia: An Open-Source Toolkit for Industrial Topic Modeling. arXiv preprint arXiv:1707.09823.

@article{jiang2017familia,
  author = {Di Jiang and Zeyu Chen and Rongzhong Lian and Siqi Bao and Chen Li},
  title = {{Familia: An Open-Source Toolkit for Industrial Topic Modeling}},
  journal = {arXiv preprint arXiv:1707.09823},
  year = {2017}
}

Copyright and License

Familia is provided under the BSD-3-Clause License.