polyglot_toolbox

Polyglot word embeddings and their use in unsupervised language identification and related tasks.

Dependencies

pip install -r requirements.txt

Workflow

Input text is expected to contain one document per line, with punctuation stripped and tokens separated by whitespace.
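A raw corpus can be brought into this shape with a few lines of Python. This is a minimal sketch, not part of the toolbox: the names `normalize_document` and `write_corpus` are hypothetical, and replacing punctuation with a space (rather than deleting it) is an assumption.

```python
import re

# Matches anything that is not a word character or whitespace,
# plus the underscore (which \w would otherwise keep).
_PUNCT = re.compile(r"[^\w\s]|_")


def normalize_document(text: str) -> str:
    """Return `text` as one line of whitespace-separated tokens without punctuation."""
    no_punct = _PUNCT.sub(" ", text)
    # split() with no arguments also collapses newlines and repeated spaces.
    return " ".join(no_punct.split())


def write_corpus(documents, path):
    """Write one normalized document per line, skipping documents that end up empty."""
    with open(path, "w", encoding="utf-8") as out:
        for doc in documents:
            line = normalize_document(doc)
            if line:
                out.write(line + "\n")
```

Because `\w` is Unicode-aware for Python strings, accented and non-Latin letters are kept. Case is left untouched, since the expected format says nothing about casing.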

We have a sample corpus created from a mixture of 21 European languages that can be downloaded at this link.

Create a vocabulary file:

python main.py vocab europarl/europarl_full_noneval.txt > europarl/europarl_full_noneval.vocab.txt

Next, train 100-dimensional FastText skip-gram embeddings (100 is the fastText default, so no dimension flag is passed):

path/to/fasttext skipgram -input europarl/europarl_full_noneval.vocab.txt -output europarl/europarl_skipgram

Discover the appropriate value for k using either the silhouette heuristic or the elbow heuristic.

Silhouette plots for values of k from 2 through 30:

python main.py discover-silhouette europarl/europarl_full_noneval.txt europarl/europarl_skipgram.bin europarl/europarl_silhouettes 30
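For reference, the quantity behind these plots is the silhouette coefficient: for each point, (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the nearest other cluster. The sketch below is a dependency-free illustration of that definition on toy 2-D points. It is not the code in discover_silhouette.py. In practice scikit-learn's `sklearn.metrics.silhouette_score` computes the same score.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy data: two tight groups far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

A labeling that follows the true groups scores close to 1, while one that cuts across them scores below 0. This is why the k with the highest, most uniform silhouettes is preferred.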

The silhouette plot for k=21 (europarl_silhouettes21.png) shows clear, well-separated clusters.

An elbow visualization (europarl_elbow.png) plots the k-Means objective against values of k.

k=21 is consistently picked as the right k value.
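The elbow is usually read off the plot by eye, but it can also be located numerically. The sketch below is one simple heuristic, not necessarily what discover_elbow.py does: it picks the k where the discrete second difference of the objective is largest. The function name `elbow_k` is hypothetical, the objective values are made up for illustration (they are not Europarl results), and the k values are assumed to be evenly spaced.

```python
def elbow_k(ks, objectives):
    """Return the k at which the objective curve bends most sharply.

    The bend at position i is the discrete second difference
    objectives[i-1] - 2*objectives[i] + objectives[i+1].
    """
    if len(ks) != len(objectives) or len(ks) < 3:
        raise ValueError("need at least three (k, objective) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = objectives[i - 1] - 2 * objectives[i] + objectives[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Illustrative objective values only: a steep drop up to k=3, then a plateau.
ks = [1, 2, 3, 4, 5, 6]
objectives = [100, 60, 20, 18, 16, 15]
chosen = elbow_k(ks, objectives)
```

On noisy curves the second difference can be unstable, so it is worth checking the result against the plot and the silhouette heuristic.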

Finally, a k-Means model can be trained with the discovered k value:

python main.py cluster-documents europarl/europarl_full_noneval.txt europarl/europarl_skipgram.bin europarl/europarl_languages 21

This saves a model to europarl/europarl_languages_langid.joblib. It is a scikit-learn model, and language identification is done by cluster assignment.

You can get cluster label assignments for a full file (a 1000-document sample is used here) with:

python main.py dump-pred europarl/europarl_full_noneval.1000.txt europarl/europarl_skipgram.bin europarl/europarl_languages_langid.joblib europarl/europarl_full_noneval.1000.prediction.txt

As a final step, you need a human to perform the mapping from cluster number to the actual language.
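Once a person has inspected a few documents per cluster and decided which language each cluster holds, applying that mapping is mechanical. The sketch below assumes the prediction file contains one integer cluster id per line, which is an assumption about the dump-pred output rather than something stated here. The function names and the example mapping are hypothetical.

```python
def read_cluster_ids(path):
    """Read one integer cluster id per non-empty line (assumed file format)."""
    with open(path, encoding="utf-8") as handle:
        return [int(line) for line in handle if line.strip()]


def label_predictions(cluster_ids, cluster_to_language, unknown="unknown"):
    """Translate cluster ids to language names using a human-provided mapping."""
    return [cluster_to_language.get(cid, unknown) for cid in cluster_ids]


# Hypothetical mapping written down by a human after inspecting each cluster.
cluster_to_language = {0: "en", 1: "fr", 2: "de"}
languages = label_predictions([0, 2, 1, 7], cluster_to_language)
```

Cluster ids missing from the mapping fall back to "unknown", which makes clusters that still need inspection easy to spot.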

In the Wild

This technique has been used successfully in several recent papers. Their analyses spanned multiple ethnicities, dozens of low-resource languages, and noisy social-media text.

Voice for the Voiceless: Active Sampling to Detect Comments Supporting the Rohingyas
Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Jaime G. Carbonell
AAAI 2020

Hope Speech Detection: A Computational Analysis of the Voice of Peace
Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Jaime G. Carbonell
ECAI 2020

Mining Insights from Large-scale Corpora Using Fine-tuned Language Models
Shriphani Palakodety, Ashiqur R. KhudaBukhsh, Jaime G. Carbonell
ECAI 2020

Cite

@inproceedings{kashmir,
  title={Hope Speech Detection: A Computational Analysis of the Voice of Peace},
  author={Palakodety, Shriphani and KhudaBukhsh, Ashiqur R. and Carbonell, Jaime G},
  booktitle={Proceedings of ECAI 2020},
  pages={To appear},
  year={2020}
}
