Transliteration training and decoding with sample Urdu data

1. If using a Mac, update the SRILM call in train_helper (comment out line 72, uncomment line 73). If using the CLSP or COE clusters, do nothing.

2. Set the JOSHUA, SRILM, and JAVA_HOME environment variables appropriately (or use my pointers at the top of train_helper and test_helper for COE machines).

3. To train a system, create a directory containing a "train" file and a "dev" file with transliteration pairs, then run (pointing to the training directory):

       ./train_main.sh demo_data/urdu

   An example directory is demo_data/urdu (note that these example train and dev files already have beginning of word and end of word symbols appended).

4. To decode a test set, run:

       ./translit_strings.sh demo_data/urdu/urdu_test demo_data/urdu

   First arg: test file (a list of single words to decode)
   Second arg: pointer to the same directory given in the training step, above

   Output is written to demo_data/urdu/urdu_test.answer. For this example, references are given in demo_data/urdu/urdu_test.ref for comparison.

5. Evaluate your output:

       python evaluate.py demo_data/urdu/urdu_test.answer demo_data/urdu/urdu_test.ref

   This simple evaluation script reports the number and percent of perfect transliterations, the average edit distance, and the average normalized edit distance (normalized by the length of the reference). A sketch of these metrics appears in the appendix at the end of this README.

-------------------------------------
Using your own data

train/dev data:
To use your own data, just create a directory with a train file and a dev file.
Script to append beginning of word and end of word symbols: addSymTrainDev.py
(run: python addSymTrainDev.py inputTrain; see the sketch in the appendix below)

language model data:
Two English language model files are included in demo_data/lm. One is taken from Wikipedia person-page titles and the other from the NE-labeled Gigaword corpus, and each includes relative frequency counts.

train your own LM:
You can use another LM; just change the pointer in train_main.sh (and the script for adding beginning of word and end of word symbols in demo_data/lm). To build from a monolingual corpus (see the sketch in the appendix below):

    cd demo_data/lm
    python make_lm.py my-input-text

Then change the LM pointer to my-input-text.wordModel.

-------------------------------------
Using Wikipedia data:

See wikipedia_data/README

-------------------------------------
Substitute transliterations into Joshua MT output

To find OOVs in Joshua MT output: find-oovs.pl

After generating transliterations for all OOV words, create a dictionary file (OOV word - tab - transliterated word) and substitute them into the Joshua output (see the sketch in the appendix below):

    python sub-oovs.py tab-separated-oov-dictionary translation-output-file

-------------------------------------
If you use this system, please cite the following:

http://www.clsp.jhu.edu/~anni/papers/irvine_amta_translit.pdf

-------------------------------------
Questions: annirvine@gmail.com
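-------------------------------------
Appendix: illustrative sketches

The Python sketches below are illustrations only, not the scripts shipped with this repository; any script name, marker symbol, or file format not documented above is an assumption.

First, a minimal sketch of the three metrics evaluate.py reports (perfect transliterations, average edit distance, average normalized edit distance). It assumes the .answer and .ref files are aligned line by line with one word per line; the real evaluate.py may handle formatting differently.

    # eval_sketch.py (hypothetical name, not the repository's evaluate.py)
    import sys

    def edit_distance(a, b):
        # standard Levenshtein distance via dynamic programming
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    answers = [line.strip() for line in open(sys.argv[1])]
    refs = [line.strip() for line in open(sys.argv[2])]
    dists = [edit_distance(a, r) for a, r in zip(answers, refs)]
    perfect = sum(d == 0 for d in dists)
    print("perfect: %d (%.1f%%)" % (perfect, 100.0 * perfect / len(dists)))
    print("avg edit distance: %.3f" % (sum(dists) / float(len(dists))))
    normed = [d / float(len(r)) for d, r in zip(dists, refs)]
    print("avg normalized edit distance: %.3f" % (sum(normed) / len(normed)))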
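Next, a sketch of the boundary-symbol step performed by addSymTrainDev.py. The input format (tab-separated pairs of space-separated characters) and the marker strings "<w>" and "</w>" are assumptions; the actual script may use different symbols.

    # add_syms_sketch.py (hypothetical name, not addSymTrainDev.py itself)
    import sys

    BOW, EOW = "<w>", "</w>"  # assumed beginning/end of word markers

    for line in open(sys.argv[1]):
        src, tgt = line.rstrip("\n").split("\t")
        # each side is a space-separated character sequence,
        # e.g. "a n n i" -> "<w> a n n i </w>"
        src = " ".join([BOW] + src.split() + [EOW])
        tgt = " ".join([BOW] + tgt.split() + [EOW])
        print("%s\t%s" % (src, tgt))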
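A sketch of building a word model with relative frequency counts from a monolingual corpus, in the spirit of make_lm.py. The output layout (one "word TAB relative frequency" entry per line, written to my-input-text.wordModel) is an assumption.

    # make_lm_sketch.py (hypothetical name, not make_lm.py itself)
    import sys
    from collections import Counter

    # count word tokens over the whole monolingual corpus
    counts = Counter()
    for line in open(sys.argv[1]):
        counts.update(line.split())

    # write each word with its relative frequency, most frequent first
    total = float(sum(counts.values()))
    with open(sys.argv[1] + ".wordModel", "w") as out:
        for word, n in counts.most_common():
            out.write("%s\t%g\n" % (word, n / total))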
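Finally, a sketch of the OOV substitution step, in the spirit of sub-oovs.py. It assumes the dictionary file has one "OOV word TAB transliteration" pair per line and that the translation output is whitespace-tokenized.

    # sub_oovs_sketch.py (hypothetical name, not sub-oovs.py itself)
    import sys

    # load the tab-separated OOV dictionary: OOV word -> transliteration
    oov_dict = {}
    for line in open(sys.argv[1]):
        oov, translit = line.rstrip("\n").split("\t")
        oov_dict[oov] = translit

    # replace any token found in the dictionary; pass all others through
    for line in open(sys.argv[2]):
        print(" ".join(oov_dict.get(tok, tok) for tok in line.split()))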