- Pablo Gamallo, CiTIUS, USC
- José Ramom Pichel, imaxin|software
- Iñaki Alegria, IXA, UPV/EHU
- Perl and Bash interpreters
- Storable Perl module (you can install it with cpan)
Perplexity is used to measure the distance between languages. It is computed with character 7-gram models, smoothed by interpolation. We provide training and test corpora for three languages: Galician (gz), Portuguese (pt), and Spanish (es). The input texts are in the 'corpora' folder.
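For orientation, the textbook definition of perplexity for a character n-gram model with linear interpolation can be written as below. This is a sketch of the general technique only: the symbols (T, N, c_i, lambda_k) are notation introduced here, and the exact weights and smoothing details used by the scripts are not stated in this README.

```latex
% Perplexity of a test text T = c_1 ... c_N under a character 7-gram model
PP(T) = \left( \prod_{i=1}^{N} \frac{1}{\hat{P}(c_i \mid c_{i-6}^{i-1})} \right)^{1/N}

% Linear interpolation of the maximum-likelihood estimates of orders 1 to 7
% (assumed form; for k = 1 the history is empty, i.e. the unigram estimate)
\hat{P}(c_i \mid c_{i-6}^{i-1}) = \sum_{k=1}^{7} \lambda_k \, P_{\mathrm{ML}}(c_i \mid c_{i-k+1}^{i-1}),
\qquad \lambda_k \ge 0, \quad \sum_{k=1}^{7} \lambda_k = 1
```

A lower perplexity of a test text under a model trained on another language is read as a smaller distance between the two languages.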
You can use the script RUN.sh to run a test.
sh train.sh gz
sh train.sh pt
sh train.sh es
This generates the models for the three languages in the 'models' folder.
sh test.sh gz gz
sh test.sh gz pt
sh test.sh gz es
sh test.sh pt pt
sh test.sh pt gz
sh test.sh pt es
sh test.sh es es
sh test.sh es pt
sh test.sh es gz
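The training and test commands above can also be driven by a loop. The following is a minimal sketch, not part of the repository: the LANGS variable and the all_pairs helper are hypothetical names, and the scripts are only invoked when train.sh and test.sh are found in the current directory. Whether RUN.sh does the same thing is not stated here.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
LANGS="gz pt es"

# Print every ordered pair of language codes, one pair per line.
all_pairs() {
    for a in $LANGS; do
        for b in $LANGS; do
            echo "$a $b"
        done
    done
}

# Train and test only when the scripts are present in the current directory.
if [ -f train.sh ] && [ -f test.sh ]; then
    for l in $LANGS; do
        sh train.sh "$l"
    done
    all_pairs | while read -r a b; do
        echo "== $a $b =="
        sh test.sh "$a" "$b"
    done
fi
```

Adding a code to LANGS extends both the training loop and the list of comparisons.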
If you wish to add a new language, for instance English, you must copy two new text files into the corpora folder:
./corpora/train/en.txt
./corpora/test/en.txt
The test corpus should be shorter than the train corpus, for instance 1 MB for training and 25 KB for testing.
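Before training, it may help to verify that both files exist and that the test corpus really is the shorter one. The function below is a hypothetical pre-flight check, not part of the repository; the name check_corpus and its optional second argument (the corpora directory, assumed to default to ./corpora) are introduced here for illustration.

```shell
# Hypothetical pre-flight check, not part of the repository.
# Usage: check_corpus LANG [CORPORA_DIR]
check_corpus() {
    lang=$1
    dir=${2:-./corpora}
    train="$dir/train/$lang.txt"
    test_file="$dir/test/$lang.txt"

    # Both files must exist.
    for f in "$train" "$test_file"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            return 1
        fi
    done

    # Sizes in bytes; $(( )) strips the padding some wc versions print.
    train_size=$(($(wc -c < "$train")))
    test_size=$(($(wc -c < "$test_file")))

    if [ "$test_size" -ge "$train_size" ]; then
        echo "test corpus ($test_size bytes) is not shorter than train corpus ($train_size bytes)" >&2
        return 1
    fi
    echo "ok: train=$train_size bytes, test=$test_size bytes"
}
```

For example, `check_corpus en && sh train.sh en` would train the model only when the check passes.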
Then, you create the model as follows:
sh train.sh en
To compare English and Portuguese:
sh test.sh en pt
In the folder ./resources, you can find other corpora ready to be used for training.
These corpora have been collected from different open historical corpora and text repositories, prioritizing those that keep the original spelling. More information is available in the article:
Pichel, J.R., Pablo Gamallo, Iñaki Alegria (2018). Measuring language distance among historical varieties using perplexity. Application to European Portuguese. VARDIAL Workshop.
Gamallo, Pablo, José Ramom Pichel, Iñaki Alegria (2017). From language identification to language distance, Physica A, Vol. 484, pp. 162-172. DOI: 10.1016/j.physa.2017.05.011.
Pichel, J.R., Pablo Gamallo, Iñaki Alegria (2019). Measuring diachronic language distance using perplexity: Application to English, Portuguese, and Spanish, Natural Language Engineering, first online. DOI: 10.1017/S1351324919000378.
Pichel, J.R., Pablo Gamallo, Iñaki Alegria, Marco Neves (2020). A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity, Journal of Quantitative Linguistics, first online 1 March. DOI: 10.1080/09296174.2020.1732177.