This project aims to exploit Knowledge Engineering technologies to compare translations across multiple languages by formally reasoning on them. Our goal is to understand which subtle modifications occur when going back and forth from one language to another using off-the-shelf translators (such as DeepL).
We chose the open-source Europarl parallel corpus for this task. In particular, we took the first 6 sentences from the English dataset, translated them to German, Italian and Chinese, and then translated them back to English. The choice of languages is not incidental: we wanted to observe whether more differences appear when translating through languages that are increasingly distant from English, so we picked a West Germanic language (German), a Romance language (Italian) and a non-European language (Chinese).
Once the sentences had been translated as described above, we employed the machine reader FRED to encode them as knowledge graphs. With 6 sentences in 4 versions each (the English original plus the back-translations from German, Italian and Chinese), we ended up with 24 different ontologies.
The attempts we made to address the problem of computing the semantic differences between a pair of ontologies are described in the following sections.
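As a point of reference only, the most naive notion of difference between two graphs is a set difference over their triples. The sketch below uses plain Python tuples in place of real RDF terms, and the triples are invented for illustration; it is not the approach implemented in this project.

```python
def triple_diff(graph_a, graph_b):
    """Split two collections of (subject, predicate, object) triples
    into the shared part and the parts exclusive to each graph."""
    a, b = set(graph_a), set(graph_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Hypothetical triples, invented for illustration (not actual FRED output)
original = {
    ("session_1", "type", "Session"),
    ("resume_1", "theme", "session_1"),
    ("resume_1", "agent", "person_1"),
}
back_translated = {
    ("session_1", "type", "Session"),
    ("reopen_1", "theme", "session_1"),
    ("reopen_1", "agent", "person_1"),
}

diff = triple_diff(original, back_translated)
print(len(diff["shared"]), len(diff["only_a"]), len(diff["only_b"]))
```

In this toy case a single change of verb shows up as two exclusive triples on each side, even though the two sentences are close in meaning, which hints at why a purely syntactic diff is unlikely to be sufficient.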
For further information about our project, read our report here or download it here.
All the code is available on GitHub.
This tool runs on Python 3.x. The minimum required version is 3.5, but we recommend installing the latest available version of the required packages (listed below in brackets), which requires Python ≥ 3.7.
The following packages are required to run the scripts: rdflib, spacy and nltk.
Assuming that the command python points to the installation of Python 3 to be used, they can be installed with:
python -m pip install -U pip setuptools wheel --user
python -m pip install -U rdflib spacy nltk --user
(where the first command installs packages required by spacy, the flag --user is used for a local installation, and the flag -U takes care of upgrading the specified packages to the latest available version)
In case of trouble, please refer to the documentation of each package: pip, rdflib, spacy, nltk.
Before proceeding, we need to download the spacy model (needed to perform the tokenization of words):
python -m spacy download en_core_web_sm
and the WordNet® database (needed to recognize synonyms) using nltk, which requires opening a Python shell and executing:
>>> import nltk
>>> nltk.download("wordnet")
Both are one-off operations, i.e. they need to be run only once.
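Before running the scripts, it can be handy to confirm that the environment matches the requirements above. The following is a minimal sketch using only the standard library; the name check_environment is illustrative and not part of the repository.

```python
import importlib.util
import sys

REQUIRED_PACKAGES = ("rdflib", "spacy", "nltk")  # installed by the pip command above
MIN_PYTHON = (3, 5)  # minimum version stated in this README


def check_environment(packages=REQUIRED_PACKAGES, min_python=MIN_PYTHON):
    """Return a dict telling whether Python and each package are available."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print("{:<10} {}".format(key, "OK" if ok else "MISSING"))
```

Note that this only checks that the packages can be located; it does not verify that the en_core_web_sm model or the WordNet data have been downloaded.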
To deploy TranslationCoherence, first clone the repository (or download it as a .zip archive):
git clone https://github.com/Lostefra/TranslationCoherence/
and make sure the current working directory is the root directory of the project:
cd TranslationCoherence
To inspect the example results provided with this repository, please follow the instructions in the README file inside the docker_container subfolder.
There are two Python scripts that allow for the comparison of ontologies: main_compare_pair.py and main_compare_all.py.
As the names suggest, the first one is designed to run on two ontologies, while the second one takes the ontologies stored in this folder and compares them all pairwise.
To start the scripts, simply run:
python main_compare_pair.py
(after editing it, if necessary, to point to the files you want to compare), or
python main_compare_all.py
to compare them all at once.
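To get a feel for how much work a pairwise comparison of all the ontologies involves, the sketch below enumerates the pairs with the standard library. The file names and the .ttl extension are hypothetical, and it assumes that every ontology is compared once with every other one regardless of order, which may differ from what main_compare_all.py actually does.

```python
from itertools import combinations

# Hypothetical file names: 6 sentences x 4 versions (the English original plus
# the back-translations from German, Italian and Chinese)
VERSIONS = ("en", "de", "it", "zh")
files = ["sentence{}_{}.ttl".format(s, v) for s in range(1, 7) for v in VERSIONS]


def all_pairs(paths):
    """Return every unordered pair of distinct paths, each pair listed once."""
    return list(combinations(sorted(paths), 2))


pairs = all_pairs(files)
print(len(files), "ontologies ->", len(pairs), "pairs")
```

Under these assumptions, the 24 ontologies yield 276 pairs, since n items give n(n-1)/2 unordered pairs.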