Multilingual Latent Dirichlet Allocation (LDA) Pipeline

This project clusters text using the Latent Dirichlet Allocation (LDA) algorithm. It can be adapted to any language supported by the Snowball stemmer, a dependency of this project.

Usage

from artifici_lda.lda_service import train_lda_pipeline_default


FR_STOPWORDS = [
    "le", "les", "la", "un", "de", "en",
    "a", "b", "c", "s",
    "est", "sur", "tres", "donc", "sont",
    # even slang/texto stop words:
    "ya", "pis", "yer"]
# Note: this stop word list is deliberately minimal and serves only as an example.

fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?"
]

transformed_comments, top_comments, _1_grams, _2_grams = train_lda_pipeline_default(
    fr_comments,
    n_topics=2,
    stopwords=FR_STOPWORDS,
    language='french')

print(transformed_comments)
print(top_comments)
print(_1_grams)
print(_2_grams)

Output:

array([[0.14218195, 0.85781805],
       [0.11032992, 0.88967008],
       [0.16960695, 0.83039305],
       [0.88967041, 0.11032959],
       [0.8578187 , 0.1421813 ],
       [0.83039303, 0.16960697]])

['Un super-chien aboie', 'Les super-chats aiment ronronner']

[[('chiens', 3.4911404011996545), ('super', 2.5000203653313933)],
 [('chats',  3.4911393765493255), ('super', 2.499979634668601 )]]

[[('super chiens', 2.4921035508342464)],
 [('super chats',  2.492102155345991 )]]
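
Each row of the first array gives one comment's membership in the two topics. As a minimal sketch of how to read it, the snippet below turns those rows into hard cluster labels by picking the largest value per row. The values are copied (rounded) from the output above, and `dominant_topic` is a hypothetical helper, not part of this project. On the real NumPy array, `transformed_comments.argmax(axis=1)` gives the same labels.

```python
# Topic-membership rows copied (rounded) from the output above.
transformed_comments = [
    [0.142, 0.858],
    [0.110, 0.890],
    [0.170, 0.830],
    [0.890, 0.110],
    [0.858, 0.142],
    [0.830, 0.170],
]


def dominant_topic(row):
    """Return the index of the largest membership value in one row.

    Hypothetical helper for illustration; not part of artifici_lda.
    """
    return max(range(len(row)), key=lambda i: row[i])


labels = [dominant_topic(row) for row in transformed_comments]
print(labels)
```

Here the three cat comments land in topic 1 and the three dog comments in topic 0. Topic indices are arbitrary and may swap between training runs.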

How it works

See Multilingual-LDA-Pipeline-Tutorial for an exhaustive example (intended to be read from top to bottom, not skimmed). For more details on inverse lemmatization, see Stemming-words-from-multiple-languages.
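
As a rough illustration only, the sketch below assumes that inverse lemmatization means mapping each stem back to the surface form seen most often for it, which would explain why the top words in the output above are readable forms such as 'chiens' rather than bare stems. `toy_stem` is a hypothetical stand-in for a Snowball stemmer and `build_inverse_map` is an illustrative name. Neither is taken from this project's code, so refer to the notebook for the actual method.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a Snowball stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_inverse_map(words):
    """Map each stem to the surface form that occurs most often for it."""
    forms_by_stem = defaultdict(Counter)
    for word in words:
        forms_by_stem[toy_stem(word)][word] += 1
    return {
        stem: forms.most_common(1)[0][0]
        for stem, forms in forms_by_stem.items()
    }


words = ["chats", "chats", "chat", "chiens", "chiens", "chien", "chiens"]
inverse_map = build_inverse_map(words)
print(inverse_map)
```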

Supported Languages

The following languages are supported:

Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Norwegian
Porter (the original Porter stemmer for English)
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish

You need to bring your own list of stop words. One way to build it is to compute term frequencies on your corpus (or on a larger corpus in the same language) and use some of the most common words as stop words.
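
A minimal sketch of that term-frequency approach, using only the standard library. `candidate_stopwords` is a hypothetical helper, not part of this project, and the tokenization (lowercasing, then splitting on non-word characters) is an assumption that may differ from what the pipeline does internally.

```python
import re
from collections import Counter


def candidate_stopwords(documents, top_n=10):
    """Return the top_n most frequent lowercase tokens across all documents."""
    counts = Counter()
    for doc in documents:
        # \w+ is Unicode-aware in Python 3, so accented letters stay in tokens.
        counts.update(re.findall(r"\w+", doc.lower()))
    return [word for word, _ in counts.most_common(top_n)]


fr_comments = [
    "Un super-chat marche sur le trottoir",
    "Les super-chats aiment ronronner",
    "Les chats sont ronrons",
    "Un super-chien aboie",
    "Deux super-chiens",
    "Combien de chiens sont en train d'aboyer?",
]

candidates = candidate_stopwords(fr_comments, top_n=3)
print(candidates)
```

Review the candidates by hand before using them: on a small corpus like this one, the most frequent token is "super", which also appears among the top topic words in the usage output, so dropping it would change the topics.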

Dependencies and their license

numpy==1.26.3           # BSD-3-Clause and BSD-2-Clause BSD-like and Zlib
scikit-learn==1.4.0     # BSD-3-Clause
PyStemmer==2.2.0.1      # BSD-3-Clause and MIT
snowballstemmer==2.2.0  # BSD-3-Clause and BSD-2-Clause
translitcodec==0.7.0    # MIT License
scipy==1.12.0           # BSD-3-Clause and MIT-like

Unit tests

Run pytest with ./run_tests.sh. Coverage:

----------- coverage: platform linux, python 3.6.7-final-0 -----------
Name                                       Stmts   Miss  Cover
--------------------------------------------------------------
artifici_lda/__init__.py                       0      0   100%
artifici_lda/data_utils.py                    39      0   100%
artifici_lda/lda_service.py                   31      0   100%
artifici_lda/logic/__init__.py                 0      0   100%
artifici_lda/logic/count_vectorizer.py         9      0   100%
artifici_lda/logic/lda.py                     23      7    70%
artifici_lda/logic/letter_splitter.py         36      4    89%
artifici_lda/logic/stemmer.py                 60      3    95%
artifici_lda/logic/stop_words_remover.py      61      5    92%
--------------------------------------------------------------
TOTAL                                        259     19    93%

License

This project is published under the MIT License (MIT).

Coded by Guillaume Chevalier at Neuraxio Inc.
