MANUAL FOR SETTING UP YOUR OWN VTA DICTIONARY DEVELOPMENT
We developed a tool to facilitate translating the English CAMEO dictionary into other languages. The tool is designed to provide the following support:
- A structured interface for translation. Translators work on actions and their associated text patterns, known as "rules".
- Collaboration between translators. The tool serves as a channel between human translators (i.e., coders), who are usually geographically distant.
- Automatic generation of a translated version of the CAMEO dictionary from statistics over the feedback given by users.
The translation process has three steps:
Step 1: Data Collection
Step 2: Feedback Collection from human coders.
Step 3: Statistical Compilation of the translated version
Step 1: Data Collection
We started by extracting the verbs that are associated with at least one rule in the CAMEO dictionary. We then collected the synonym sets for each of these verbs from WordNet. Each synonym set contains synonyms in English and in the target languages of the translation (e.g., Spanish, Arabic). An example follows:
{
"conceptID": "abandon_1",
"concept": "",
"gloss": " stop maintaining or insisting on; of ideas or claims; \"He abandoned the thought of asking for her hand in marriage\"; \"Both sides have to give up some claims in these negotiations\" ",
"sets": [{
"lang": "en",
"words": ["abandon", "give up"]
}, {
"lang": "es",
"words": ["abandonar", "renunciar"]
}, { "lang" : "gr",
"words" : [list-of-synonyms-in Greek]
}]
}
Here the sets attribute contains the synonyms in each language (English and Spanish here, plus a placeholder for Greek). The verb is listed under the concept attribute. The same verb can have multiple entries, which we distinguish by conceptID. Each entry has a gloss attribute, which helps the coder identify the context associated with the translated verbs.
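As a rough illustration of how such an entry can be consumed, the sketch below parses one entry and groups its synonyms by language code. The helper name words_by_language is hypothetical, and the entry is a trimmed copy of the example above (the concept value and the Greek placeholder are left out because the example does not specify them).

```python
import json

# Trimmed copy of the example entry; not a complete record.
entry_json = """
{
  "conceptID": "abandon_1",
  "gloss": "stop maintaining or insisting on; of ideas or claims",
  "sets": [
    {"lang": "en", "words": ["abandon", "give up"]},
    {"lang": "es", "words": ["abandonar", "renunciar"]}
  ]
}
"""


def words_by_language(entry):
    """Map each language code in entry["sets"] to its list of synonyms."""
    return {s["lang"]: list(s["words"]) for s in entry["sets"]}


entry = json.loads(entry_json)
synonyms = words_by_language(entry)
print(entry["conceptID"], synonyms["es"])
```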
We also extracted rules from CAMEO.2.0.txt, which is hosted in the GitHub code repository of the PETRARCH2 project.
Step 2: Feedback Collection from human coders.
After collecting the data, we displayed it through the web UI for human validation. For the rules, we displayed the English rules together with their translations in the foreign language. Coders can verify the translations suggested by previous coders and add their own.
For verb translation, we asked the coders to provide feedback in two stages. First, they select the synonym sets of the verb that match the context of the CAMEO code. Then, within the selected synonym sets, they mark each translated word as correct, incorrect, or ambiguous. After this step we have the translated verbs approved by the coder for a given CAMEO code and CAMEO rule.
Step 3: Statistical Compilation of the translated version
After collecting all the feedback, we list the translated rules in the translated version of the dictionary. For synonyms, we first collect feedback on synonym sets. Based on the content of the "gloss", coders verify whether a synonym set is appropriate in the context of the CAMEO code and rule (feedback at the synonym set level). Once a synonym set is marked as appropriate, we consider the feedback on the words it contains (feedback at the word level). Finally, we take the feedback from all coders for a particular synonym set and make a majority-based decision on whether it is suitable for inclusion as the translated verb for a CAMEO code and rule pair.
Note: We collected gloss-based feedback on synonym sets from the Spanish coders. For Arabic and later languages, we reuse that feedback to identify the appropriate synonym sets and show the words in the respective language for feedback collection at the word level.
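The two-level majority decision described in Step 3 could be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a strict majority (more than half of the votes) is assumed because the tie-breaking rule is not specified here, and the votes in the usage example are made up.

```python
from collections import Counter


def synset_accepted(votes):
    """True when strictly more than half of the coders marked the synset appropriate."""
    return sum(votes) * 2 > len(votes)


def accepted_words(word_votes):
    """Keep the words that a strict majority of coders labelled 'correct'."""
    kept = []
    for word, labels in word_votes.items():
        if Counter(labels)["correct"] * 2 > len(labels):
            kept.append(word)
    return kept


def compile_translations(synset_votes, word_votes):
    """Word-level feedback only counts once the synset itself is accepted."""
    if not synset_accepted(synset_votes):
        return []
    return accepted_words(word_votes)


# Made-up votes from three coders for one synonym set.
synset_votes = [True, True, False]
word_votes = {
    "abandonar": ["correct", "correct", "ambiguous"],
    "renunciar": ["correct", "incorrect", "ambiguous"],
}
result = compile_translations(synset_votes, word_votes)
print(result)
```

With these votes only "abandonar" is kept: the synonym set passes with two of three votes, while "renunciar" has a single "correct" label out of three.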
Inclusion of new language:
We are building the system so that new languages can easily be integrated for translation and dictionary generation. For the time being, the system is limited to the languages supported by WordNet. Here are the steps to obtain a translated version of the dictionary in another language:
- A verified user submits a request specifying the language in which they want the translated version of the dictionary.
- The system collects the data from WordNet in the background.
- The database is updated with the new entries, which are linked to the existing entities.
- New coders start working on the downloaded dataset and provide their feedback.
Once the feedback collection process is complete, an offline dictionary-creation tool downloads the data and populates the dictionary.
Linking with existing entity:
Linking with existing entities is a very important step because it ensures that information is shown to the user correctly. To illustrate this, we use the example of a synonym set. This information is stored in the database in three different entities:
- Word - contains the English word
- SynsetEntry - contains the gloss and examples of a synonym set for a particular English word
- SynsetWord - contains the words, along with their language codes, that are part of a particular synonym set.
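The relationship between the three entities can be pictured as plain records. The sketch below is illustrative only: the field names follow the database columns used in this section, but the dataclasses, the Python types, and the helper words_in_synset are assumptions and not the system's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    id: int
    text: str  # the English word


@dataclass(frozen=True)
class SynsetEntry:
    id: int
    gloss: str
    idWord: int  # id of the Word this synonym set belongs to
    source: str
    submissionId: Optional[int] = None


@dataclass(frozen=True)
class SynsetWord:
    id: int
    idSynsetEntry: int  # id of the SynsetEntry this word is part of
    languageCode: str
    word: str
    submissionId: Optional[int] = None


def words_in_synset(entry: SynsetEntry, rows: List[SynsetWord], language_code: str) -> List[str]:
    """Return the words of one synonym set in the given language."""
    return [r.word for r in rows
            if r.idSynsetEntry == entry.id and r.languageCode == language_code]


stand = Word(4712951634198528, "stand")
entry = SynsetEntry(
    4504727190503424,
    'put into an upright position; "Can you stand the bookshelf up?"',
    stand.id,
    "wordnet",
)
rows = [
    SynsetWord(5942310055444480, entry.id, "en", "stand"),
    SynsetWord(6026654354767872, entry.id, "en", "stand up"),
    SynsetWord(5126898736693248, entry.id, "ar", "وقف"),
]
print(words_in_synset(entry, rows, "en"))
```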
An example from the database follows.
SynsetEntry:
Let us consider one entry from the SynsetEntry table.
GQL Query: select * from SynsetEntry where __key__ = KEY(SynsetEntry, 4504727190503424)
Output:
Name/ID | gloss | idWord | source | submissionId
---|---|---|---|---
id=4504727190503424 | put into an upright position; "Can you stand the bookshelf up?" | 4712951634198528 | wordnet | null
Finding the corresponding English word:
GQL Query: select * from Word where __key__ = KEY(Word, 4712951634198528)
Output:
Name/ID | text
---|---
id=4712951634198528 | stand
Here are all the words that are part of the synonym set:
GQL Query: select * from SynsetWord where idSynsetEntry = 4504727190503424
Output:
Name/ID | idSynsetEntry | languageCode | submissionId | word
---|---|---|---|---
id=4704686271627264 | 4504727190503424 | ar | null | نهض
id=4722041060065280 | 4504727190503424 | ar | null | قوم
id=4816410148601856 | 4504727190503424 | en | null | place upright
id=4845423759982592 | 4504727190503424 | ar | null | قاوم_البلى
id=4862778548420608 | 4504727190503424 | ar | null | أبحر_في_إتجاه_معين
id=5126898736693248 | 4504727190503424 | ar | null | وقف
id=5144253525131264 | 4504727190503424 | ar | null | رشح
id=5267636225048576 | 4504727190503424 | ar | null | قام
id=5284991013486592 | 4504727190503424 | ar | null | رجع
id=5408373713403904 | 4504727190503424 | ar | null | وقف_منتصبا
id=5425728501841920 | 4504727190503424 | ar | null | اقم
id=5689848690114560 | 4504727190503424 | ar | null | نصب
id=5707203478552576 | 4504727190503424 | ar | null | حمل
id=5830586178469888 | 4504727190503424 | ar | null | كان_في_موقف
id=5847940966907904 | 4504727190503424 | ar | null | وجه
id=5942310055444480 | 4504727190503424 | en | null | stand
id=5971323666825216 | 4504727190503424 | ar | null | تزج
id=5988678455263232 | 4504727190503424 | ar | null | بعد
id=6026654354767872 | 4504727190503424 | en | null | stand up
id=6252798643535872 | 4504727190503424 | ar | null | اطق
id=6270153431973888 | 4504727190503424 | ar | null | إتخذ_موقف
id=6534273620246528 | 4504727190503424 | ar | null | صطف
id=6551628408684544 | 4504727190503424 | ar | null | ظل_قائما
So whenever we collect words translated into a new language, we have to set up the reference links depicted here. All of this is handled inside our system so that the end user gets a simplified experience. For example, suppose we want to add German synonym set words for the English word "stand". We first download all the synsets for the word "stand" along with their German translations. We then match the gloss of each downloaded synset against the existing SynsetEntry records and use the id of the matching record as the idSynsetEntry of the SynsetWord entries created for the new German words.
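The gloss-matching step can be sketched roughly as below. This is not the system's implementation: the function names are hypothetical, matching is assumed to be exact after lowercasing and whitespace normalization, "de" is assumed as the language code for German, and the German word in the usage example is purely illustrative.

```python
def normalize(gloss):
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(gloss.lower().split())


def link_new_words(existing_entries, downloaded_synsets, language_code):
    """Build SynsetWord-like rows for downloaded synsets whose gloss matches an existing entry.

    existing_entries: dict mapping SynsetEntry id -> gloss
    downloaded_synsets: list of (gloss, list_of_words) pairs for the new language
    Returns (rows, unmatched_glosses).
    """
    by_gloss = {normalize(g): entry_id for entry_id, g in existing_entries.items()}
    rows, unmatched = [], []
    for gloss, words in downloaded_synsets:
        entry_id = by_gloss.get(normalize(gloss))
        if entry_id is None:
            unmatched.append(gloss)  # no existing SynsetEntry to link to
            continue
        for word in words:
            rows.append({
                "idSynsetEntry": entry_id,
                "languageCode": language_code,
                "submissionId": None,
                "word": word,
            })
    return rows, unmatched


existing = {
    4504727190503424: 'put into an upright position; "Can you stand the bookshelf up?"',
}
downloaded = [
    ('Put into an upright  position; "Can you stand the bookshelf up?"', ["aufstellen"]),
    ("a gloss with no existing entry", ["beispiel"]),
]
rows, unmatched = link_new_words(existing, downloaded, "de")
print(rows)
print(unmatched)
```

Glosses that match no existing SynsetEntry are returned separately rather than dropped, so they can be reviewed or stored as new entries.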