Welcome to the working repository of my PhD research on the automatic detection of syntactic difference (DeSDA). All tools I developed and datasets I compiled for my PhD research have been uploaded here, along with relevant output. My dissertation is yet to be published.
This repository consists of three main folders, corresponding to the central chapters of my dissertation.
The folder Chapter 2 - Filter contains the tools and data described in Chapter 2 of my dissertation and in Kroon, Barbiers, Odijk and van der Pas (2019).
The folder contains two subfolders.

The `Data/` folder contains two types of relevant files:

- `*.raw` files (e.g. `de-en.raw`) consist of 400 sentence pairs from Koehn's (2005) Europarl corpus (of the two languages indicated by the language abbreviations in the file name), separated by a tab. All words have been POS tagged (`word|POS`), with the POS tags taken directly from the Europarl corpus metadata. These metadata tags have been translated into Universal Dependencies (Nivre et al. 2016) using the files in `Data/tagset_translations/`. (A minimal reading sketch follows below.)
- `*.train` files (e.g. `de-en.train`) contain the 400 sentence pairs from the `*.raw` files with a label (`Y|N`) indicating whether the sentence pair is syntactically comparable or not.

The folder furthermore contains `UDPipe_models/`, which holds models for UDPipe (Straka and Straková 2017), a dependency parser for UD, for convenience.
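As an illustration of the `*.raw` format, the sketch below reads such a file into lists of word/POS tokens. It is not part of the repository's tools; beyond what is described above (one tab-separated sentence pair per line, tokens tagged as `word|POS`), the details are assumptions.

```python
# Hypothetical sketch of reading a *.raw file as described above:
# one tab-separated sentence pair per line, every token tagged as word|POS.
def read_raw(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sent_a, sent_b = line.rstrip("\n").split("\t")
            # Split each sentence into [word, POS] token pairs.
            tokens_a = [tok.rsplit("|", 1) for tok in sent_a.split()]
            tokens_b = [tok.rsplit("|", 1) for tok in sent_b.split()]
            pairs.append((tokens_a, tokens_b))
    return pairs

pairs = read_raw("de-en.raw")
print(len(pairs))  # should print 400 according to the description above
```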
The `Tools/` folder contains all the relevant code to run the filters as described in Chapter 2 of my PhD dissertation and in Kroon, Barbiers, Odijk and van der Pas (2019).

- `AUC_evaluator.py` is used to automatically find the best parameter settings of each individual filter based on the `*.train` files (see above). The variables (which data to use and which UDPipe models to use) are changed within the file. The code reports the AUC and the best threshold setting based on Youden's J statistic (Youden 1950) and the Euclidean distance for every parameter setup (described in Chapter 2); a sketch of this threshold selection is given after this list. The script makes use of some multiprocessing, and relies on `levenshtein.py`, `senlen_ratio.py` and `networkx_GED.py`. Unfortunately, the output is too large to be uploaded.
- `AUC_evaluator.logreg.py` is used to automatically find the best parameter settings of the logistic regression filter based on the `*.train` files (see above). The variables (which data to use and which UDPipe models to use) are also changed within the file. This script reports the AUC, but not the best threshold setting (which is always 50%; the AUC is calculated to be able to compare the results). The script also makes use of some multiprocessing, and relies on `levenshtein.py`, `senlen_ratio.py` and `networkx_GED.py`. Unfortunately, the output is too large to be uploaded.
- As mentioned, `levenshtein.py`, `senlen_ratio.py` and `networkx_GED.py` are necessary to automatically find the best parameter setup for the filters using the two scripts described above.
- `levenshtein_filter.py`, `senlen_ratio_filter.py` and `networkx_GED_filter.py`, on the other hand, take manually set parameters (changed in the file) and take the `*.raw` files (see above) as input, outputting the dataset with syntactically incomparable sentence pairs filtered out.
- `logreg_filter.py` is the logistic regression filter. It allows the parameters of the filters it uses to be set manually (changed in the file), and uses the `*.train` files (see above) to train a classifier and filter syntactically incomparable sentence pairs out of the data.
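To make the threshold selection concrete, here is a minimal sketch of how a single filter score and its best threshold could be derived from a labelled `*.train` file, using Youden's J statistic and the Euclidean distance to the ideal ROC point. This is not the repository's code: the exact `*.train` file layout, the sentence-length-ratio feature shown here and the helper names are assumptions, and the actual scripts (`AUC_evaluator.py`, `senlen_ratio.py`, etc.) may work differently.

```python
# Hypothetical sketch: choose a threshold for a sentence-length-ratio filter.
# Assumed *.train layout: sentence_a <TAB> sentence_b <TAB> Y/N label per line.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def senlen_ratio(sent_a, sent_b):
    """Ratio of the shorter to the longer sentence length (in tokens)."""
    len_a, len_b = len(sent_a.split()), len(sent_b.split())
    return min(len_a, len_b) / max(len_a, len_b)

scores, labels = [], []
with open("de-en.train", encoding="utf-8") as f:
    for line in f:
        sent_a, sent_b, label = line.rstrip("\n").split("\t")
        scores.append(senlen_ratio(sent_a, sent_b))
        labels.append(1 if label == "Y" else 0)

fpr, tpr, thresholds = roc_curve(labels, scores)
print("AUC:", roc_auc_score(labels, scores))

# Youden's J statistic: J = TPR - FPR, maximised over all candidate thresholds.
best_youden = thresholds[np.argmax(tpr - fpr)]
# Euclidean distance to the perfect classifier at (FPR, TPR) = (0, 1), minimised.
best_euclidean = thresholds[np.argmin(np.sqrt(fpr ** 2 + (1 - tpr) ** 2))]
print("Best threshold (Youden's J):", best_youden)
print("Best threshold (Euclidean distance):", best_euclidean)
```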
The folder Chapter 3 - MDL contains the tools and data described in Chapter 3 of my dissertation and in Kroon, Barbiers, Odijk and van der Pas (2020). The README in that folder describes how to recreate the research.
In the folder one can find, among other things, `MDL_difference_detector.py`, the main tool to detect syntactic differences using MDL. Variables are set within the Python file. Relevant are:

- `setup`, which sets how the script should be run: with or without filtered data (the first character), and with or without superpattern subtraction (the second character). `setup` must be `(NN|NY|YN|YY)`.
- `lang_a` and `lang_b`, which correspond to the language abbreviations used in the `Data/` folder.
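As a purely illustrative example (the variable names come from the description above, but the surrounding code and default values in the file may differ), the settings inside `MDL_difference_detector.py` might look like this:

```python
# Hypothetical example of the variables set inside MDL_difference_detector.py.
setup = "YN"    # first character: filtered data (Y/N); second: superpattern subtraction (Y/N)
lang_a = "de"   # language abbreviations as used in the Data/ folder
lang_b = "en"
```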
`MDL_difference_detector.py` takes specifically formatted input; please refer to the README to recreate the research.

The output of `MDL_difference_detector.py` can be found in `Output/`.
The folder Chapter 4 - Alignment contains the tools and data described in Chapter 4 of my dissertation.
The folder contains three subfolders. For more information, please refer to the README.
The `Data/` folder contains:

- `en-hu`: contains data files relevant to word alignment with `eflomal` (Östling and Tiedemann 2016), such as input and output;
  - `en-hu.eflomal.txt`: sentence pairs formatted for input;
  - the rest are output files;
- `python`: contains an English and a Hungarian Bible (from Christodoulopoulos and Steedman 2015), with one verse ID and verse per line, but only those verses that are present in both versions of the Bible;
- `xml_aligner.py`: can be used to align the XML Bibles from Christodoulopoulos and Steedman (2015) such that the output contains only the verses present in both translations (a minimal sketch of the idea follows below).
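The idea behind that alignment can be sketched as follows. This is not the code of `xml_aligner.py`: it works on the simplified plain-text format described above (one tab-separated verse ID and verse per line) rather than on the original XML files, and all file names are hypothetical.

```python
# Hypothetical sketch of the verse-intersection idea behind xml_aligner.py,
# assuming one tab-separated verse ID and verse per line (not the original XML).
def read_verses(path):
    verses = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            verse_id, verse = line.rstrip("\n").split("\t", 1)
            verses[verse_id] = verse
    return verses

english = read_verses("english.txt")      # hypothetical file names
hungarian = read_verses("hungarian.txt")

# Keep only the verse IDs that occur in both translations, in a stable order.
shared_ids = [vid for vid in english if vid in hungarian]

with open("english.aligned.txt", "w", encoding="utf-8") as out_en, \
     open("hungarian.aligned.txt", "w", encoding="utf-8") as out_hu:
    for vid in shared_ids:
        out_en.write(f"{vid}\t{english[vid]}\n")
        out_hu.write(f"{vid}\t{hungarian[vid]}\n")
```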
The `Tools/` folder contains the three main tools developed for Chapter 4, which can be used to detect syntactic differences:
- the Data Grouper for Attribute Exploration or DGAE;
- the Generalization Tree Inducer or GTI;
- the Affix-Attribute Associator or AAA.
The `Output/` folder contains all the relevant output:

- `AAA_en-hu.txt`: the output of AAA;
- `DGAE_en-hu_deprel.txt`: the output of DGAE grouping over `deprel`;
- `DGAE_en-hu_pos.txt`: the output of DGAE grouping over `pos`;
- `DGAE_en-hu_pos_deprel.txt`: the output of DGAE grouping over `pos` and `deprel`;
- `GTI_en-hu_deprel.fragment.txt`: a fragment (first 50,000 lines) of the output of GTI pre-splitting over `deprel`;
- `GTI_en-hu_pos.fragment.txt`: a fragment (first 50,000 lines) of the output of GTI pre-splitting over `pos`;
- `GTI_en-hu_pos_deprel.fragment.txt`: a fragment (first 50,000 lines) of the output of GTI pre-splitting over `pos` and `deprel`.