An easy-to-use tool for semantic scaling of political text, based on word embeddings. Check out the working draft of our political science article (plus its online appendix) and the original NLP paper.
Clone or download the project, then go into the SemScale directory. The script scaler.py needs just the following inputs:
datadir -> A path to the directory containing the input text files for scaling (one score will be assigned per file).
embs -> A path to the file containing pre-trained word embeddings
output -> A path to the file in which the scaling results will be stored.
optional arguments:
-h, --help -> show this help message and exit
--stopwords STOPWORDS -> A path to the file containing stopwords
--emb_cutoff EMB_CUTOFF -> A cutoff on the vocabulary size of the embeddings.
The expected input is in the one-text-per-file format. Each text file in the referenced directory should contain a language code (e.g., "en") on its first line, followed by the text itself, i.e., the format should be "language\ntext of the file".
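As an illustration of the input format described above, here is a minimal sketch that writes a set of texts into a directory, one file per text, with the language code on the first line. The helper name `write_scaling_input`, the `.txt` extension, and the example file names are assumptions made for this example; they are not part of SemScale.

```python
from pathlib import Path


def write_scaling_input(datadir, texts, lang="en"):
    """Write one file per text: language code on line 1, the text after it.

    `texts` maps a (hypothetical) document name to its raw text.
    Returns the list of paths that were written.
    """
    out = Path(datadir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in texts.items():
        path = out / (name + ".txt")
        # Format expected by the scaler: "language\ntext of the file"
        path.write_text(lang + "\n" + text, encoding="utf-8")
        paths.append(path)
    return paths
```

For example, `write_scaling_input("my-input", {"speech_a": "First text ...", "speech_b": "Second text ..."})` would produce two files that can then be passed to the scaler as the datadir argument.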
For an easy set-up, we provide pre-trained FastText embeddings in a single file for the following five languages: English, French, German, Italian and Spanish. They can be obtained from here.
Nonetheless, you can easily use the tool for texts in other languages or with different word embeddings, as long as you:
- provide a language-prefixed word embedding file in the following format: for each word, the abbreviation for the language, then a double underscore, then the word, and then the word embedding. For instance, each word in a Bulgarian word embeddings file might be prefixed with "bg__".
- if you use embeddings for a language other than the five listed above, update the list of supported languages at the beginning of the code file nlp.py and at the beginning of the task script you're using (e.g., scaler.py).
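The prefixing step in the list above can be sketched as follows. This is a minimal example, assuming the embeddings are stored as plain text with one word per line followed by space-separated vector components. The function name `prefix_embeddings` and the `has_header` switch are hypothetical; whether scaler.py expects a header line is not stated here, so compare against the provided embeddings file before dropping or keeping it.

```python
def prefix_embeddings(src_path, dst_path, lang, has_header=False):
    """Copy a text embeddings file, prefixing every word with '<lang>__'.

    Assumes each line looks like: word v1 v2 ... vN
    If has_header is True, the first line (e.g. a 'vocab_size dim' line,
    as found in some embedding files) is skipped rather than prefixed.
    Returns the number of embedding lines written.
    """
    prefix = lang + "__"
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            if has_header and i == 0:
                continue
            line = line.rstrip("\n")
            if not line.strip():
                continue  # ignore blank lines
            # Split off the word; everything after the first space is the vector
            word, _, vector = line.partition(" ")
            dst.write(prefix + word + " " + vector + "\n")
            written += 1
    return written
```

For a Bulgarian file this would be called as `prefix_embeddings("bg.vec", "bg.prefixed.vec", "bg")`, where both file names are placeholders.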
The output is a simple .txt file, which will be filled with the filename and positional score of each input file.
Stopwords can be excluded automatically by passing a stopword file (one stopword per line) via the --stopwords option.
The script requires basic libraries from the Python scientific stack: numpy (tested with version 1.12.1), scipy (tested with version 0.19.0), and nltk (tested with version 3.2.3).
In the SemScale folder, just run the following command:
python scaler.py path-to-input-folder path-to-embeddings-file output.txt
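If you prefer to launch the scaler from another Python script, the command above, together with the optional arguments listed earlier, can be assembled along these lines. This is only a sketch: `build_scaler_command` is a hypothetical helper, and it assumes it is run from inside the SemScale folder so that scaler.py is found.

```python
import sys


def build_scaler_command(datadir, embs, output, stopwords=None, emb_cutoff=None):
    """Build the argument list for scaler.py.

    Positional arguments come first (datadir, embs, output); the optional
    --stopwords and --emb_cutoff flags are appended only when given.
    """
    cmd = [sys.executable, "scaler.py", datadir, embs, output]
    if stopwords is not None:
        cmd += ["--stopwords", stopwords]
    if emb_cutoff is not None:
        cmd += ["--emb_cutoff", str(emb_cutoff)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the standard library, which raises an error if the script exits with a non-zero status.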
To use the supervised scaling version of our approach (dubbed SemScores), just run:
python supervised-scaler.py
and add as final arguments the two pivot texts to be used.
We also offer a Python implementation of the well-known Wordfish algorithm for text scaling. To see how to use it, just run:
python wordfish.py -h
Additional functionalities (classification, topical-scaling) are available in the main branch of this project.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you're using this tool, please cite the following paper:
@InProceedings{glavavs-nanni-ponzetto:2017:EACLshort,
author = {Glava\v{s}, Goran and Nanni, Federico and Ponzetto, Simone Paolo},
title = {Unsupervised Cross-Lingual Scaling of Political Texts},
booktitle = {Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
month = {April},
year = {2017},
address = {Valencia, Spain},
publisher = {Association for Computational Linguistics},
pages = {688--693},
url = {http://www.aclweb.org/anthology/E17-2109}
}