NMT: Usage

Setting up and running an experiment

This section describes the tools most commonly used to set up and run an experiment.

config

The config tool sets up a simple configuration file (config.yml) for an experiment. The configuration settings are specified on the command line, and the tool generates a valid config.yml file with those settings in the specified experiment subfolder (SIL_NLP_DATA_PATH > MT > experiments > <experiment>).
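The folder notation above can be read as nested directories. As a minimal sketch, assuming `SIL_NLP_DATA_PATH` is an environment variable that holds the data root (the helper name `experiment_config_path` is hypothetical and not part of the tool):

```python
import os
from pathlib import Path
from typing import Optional


def experiment_config_path(experiment: str, data_root: Optional[str] = None) -> Path:
    """Return the expected location of config.yml for an experiment.

    Assumption: SIL_NLP_DATA_PATH is an environment variable naming the data root.
    Pass data_root explicitly to avoid reading the environment.
    """
    root = Path(data_root if data_root is not None else os.environ["SIL_NLP_DATA_PATH"])
    # SIL_NLP_DATA_PATH > MT > experiments > <experiment> > config.yml
    return root / "MT" / "experiments" / experiment / "config.yml"
```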

usage: config.py [-h] [--src-langs [lang [lang ...]]]
[--trg-langs [lang [lang ...]]] [--vocab-size VOCAB_SIZE]
[--src-vocab-size SRC_VOCAB_SIZE]
[--trg-vocab-size TRG_VOCAB_SIZE] [--parent PARENT]
[--mirror] [--force] [--seed SEED] [--model MODEL]
experiment
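As an illustration of the usage line above, the following sketch assembles one possible invocation. The way the script is launched (`python config.py`), the experiment name `abp-en-demo`, and the seed value are assumptions made for illustration; the file base names are the examples used in the argument descriptions of this section.

```python
import shlex

# Hypothetical invocation: launcher, experiment name, and seed are illustrative only.
args = [
    "python", "config.py",
    "--src-langs", "abp-ABP",
    "--trg-langs", "en-ABPBTE",
    "--vocab-size", "32000",
    "--seed", "111",
    "abp-en-demo",  # positional experiment name comes last
]

# Join into a single shell-safe command string for display or logging.
command = shlex.join(args)
print(command)
```

The single-valued `--seed` option is placed just before the experiment name. If the tool parses its arguments with Python's argparse, as the usage format suggests, a multi-valued option such as `--trg-langs` placed directly before the positional argument would consume the experiment name as another language.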

Arguments:

Argument	Purpose	Description
`experiment`	Experiment name	The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`--src-langs [lang [lang ...]]`	Source language files	The name of one (or more) files in the source language(s). Each file must be located in the `SIL_NLP_DATA_PATH > MT > corpora` folder or the `SIL_NLP_DATA_PATH > MT > scripture` folder. Only the base of the file name is specified; e.g., to use the file `abp-ABP.txt`, specify `abp-ABP`.
`--trg-langs [lang [lang ...]]`	Target language files	The name of one (or more) files in the target language(s). Each file must be located in the `SIL_NLP_DATA_PATH > MT > corpora` folder or the `SIL_NLP_DATA_PATH > MT > scripture` folder. Only the base of the file name is specified; e.g., to use the file `en-ABPBTE.txt`, specify `en-ABPBTE`.
`--vocab-size VOCAB_SIZE`	Shared vocabulary size	Specifies the size (e.g., '32000') of the shared SentencePiece vocabulary that will be constructed from the text in the source and target files.
`--src-vocab-size SRC_VOCAB_SIZE`	Source vocabulary size	Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the source files (only). This option should be used in combination with the `--trg-vocab-size` argument.
`--trg-vocab-size TRG_VOCAB_SIZE`	Target vocabulary size	Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the target files (only). This option should be used in combination with the `--src-vocab-size` argument.
`--parent PARENT`	Parent experiment name	The name of an experiment subfolder with a trained parent model. The subfolder must be located in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`--mirror`	Mirror train and validation data sets (default: False)	Specifies that the training and validation data sets constructed from the source and target files should be mirrored. With mirroring, each source/target sentence pair is added to the training (or validation) data set as both a source/target pair and as a target/source pair. Without mirroring, each sentence pair is only added as a source/target pair.
`--force`	Overwrite existing config file	If a configuration file already exists in the specified experiment subfolder, the tool will report an error. If this argument is provided, the tool will overwrite the existing configuration file.
`--seed SEED`	Randomization seed	Specifies the randomization seed that will be used during preprocessing and training.
`--model MODEL`	Neural network model	Specifies the neural network model that will be trained. Options: TransformerBase (default), TransformerBig, SILTransformerBaseNoResidual, or SILTransformerBaseAlignmentEnhanced.
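The effect of `--mirror` on a data set can be sketched in a few lines. This is only an illustration of the description in the table: the helper name `mirror_pairs` is hypothetical, and the order in which the tool actually writes the mirrored pairs is an assumption.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def mirror_pairs(pairs: List[Pair]) -> List[Pair]:
    """Return each (source, target) pair followed by its (target, source) mirror."""
    mirrored: List[Pair] = []
    for src, trg in pairs:
        mirrored.append((src, trg))  # original direction
        mirrored.append((trg, src))  # reversed direction added by mirroring
    return mirrored
```

With mirroring, a data set of N sentence pairs therefore yields 2N training (or validation) examples.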

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

creating SentencePiece vocabulary models from the experiment's source and target files;
splitting the source and target files into the training, validation, and test data sets;
writing the train/validate/test data sets to files in the subfolder;
adapting the parent model (if one is specified) to be used by this experiment.
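The splitting step listed above can be sketched as follows. The tool's actual split strategy and set sizes are not described here, so the seeded random selection, the `val_size` and `test_size` parameters, and the helper name `split_pairs` are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def split_pairs(
    pairs: Sequence[Pair], val_size: int, test_size: int, seed: int
) -> Tuple[List[Pair], List[Pair], List[Pair]]:
    """Split parallel sentence pairs into train, validation, and test sets."""
    if val_size + test_size > len(pairs):
        raise ValueError("validation and test sets are larger than the corpus")

    # Shuffle indices with a fixed seed so the split is reproducible.
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)
    test_idx = set(indices[:test_size])
    val_idx = set(indices[test_size:test_size + val_size])

    train: List[Pair] = []
    val: List[Pair] = []
    test: List[Pair] = []
    # Keep the original corpus order inside each set.
    for i, pair in enumerate(pairs):
        if i in test_idx:
            test.append(pair)
        elif i in val_idx:
            val.append(pair)
        else:
            train.append(pair)
    return train, val, test
```

Source and target sentences are kept together as pairs so that the three sets stay aligned across both languages.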

usage: preprocess.py [-h] [--stats] experiment

Arguments:

Argument	Purpose	Description
`experiment`	Experiment name	The name of the experiment subfolder that contains the configuration file (config.yml). The subfolder must be located in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`--stats`	Output corpus statistics	Using a statistical model, calculate an alignment score for the source and target texts. Use of this option requires the SIL.Machine library to be available.

train

The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.

usage: train.py [-h] [--mixed-precision] [--memory-growth]
[--num-devices NUM_DEVICES] [--eager-execution]
experiments [experiments ...]

Arguments:

Argument	Purpose	Description
`experiments`	Experiment names	The names of the experiments to train. Each experiment name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`--mixed-precision`	Enable mixed precision
`--memory-growth`	Enable memory growth
`--num-devices NUM_DEVICES`	Number of devices to train on
`--eager-execution`	Enable TensorFlow eager execution
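Because training reads the data files that the preprocess tool writes, each experiment has to be preprocessed before it is trained. The sketch below shows that ordering as a list of commands; it only builds the argument lists and does not run anything. The `python <tool>.py` launcher and the helper name `preprocess_then_train` are assumptions.

```python
from typing import List, Sequence


def preprocess_then_train(experiments: Sequence[str]) -> List[List[str]]:
    """Build the commands needed to preprocess and then train the experiments."""
    if not experiments:
        raise ValueError("at least one experiment name is required")
    # preprocess.py accepts a single experiment, so it is invoked once per experiment.
    commands = [["python", "preprocess.py", name] for name in experiments]
    # train.py accepts one or more experiments in a single invocation.
    commands.append(["python", "train.py", *experiments])
    return commands
```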

test

The test tool uses a trained model to generate target language predictions for an experiment's test set and scores those predictions. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the testing process.

usage: test.py [-h] [--memory-growth] [--checkpoint CHECKPOINT] [--last]
[--best] [--avg] [--ref-projects [project [project ...]]]
[--force-infer] [--scorers [scorer [scorer ...]]]
[--books [book [book ...]]] [--by-book]
experiment

Arguments:

Argument	Purpose	Description
`experiment`	Experiment name	The name of the experiment to test. The experiment name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`--memory-growth`	Enable memory growth
`--checkpoint CHECKPOINT`	Test specified checkpoint	Use the specified checkpoint (e.g., '--checkpoint 6000') to generate target language predictions from the test set. The specified checkpoint must be available in the `run` subfolder of the specified experiment.
`--last`	Test the last checkpoint	Use the last training checkpoint to generate target language predictions.
`--best`	Test the best checkpoint	Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the `run > export` subfolder of the specified experiment.
`--avg`	Test the averaged checkpoint	Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the `run > avg` subfolder of the specified experiment. An averaged checkpoint can be automatically generated during training using the `train: average_last_checkpoints: _<n>_` option, or it can be manually generated after training by using the average_checkpoints tool.
`--ref-projects [project [project ...]]`	Reference projects	The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions.
`--force-infer`	Force inferencing	If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only score the predictions when it is run again on that same checkpoint. This option can be used to force the tool to re-generate the target language predictions.
`--scorers [scorer [scorer ...]]`	List of scorers	Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'chrf3', 'meteor', 'ter', and 'wer'.
`--books [book [book ...]]`	Books to score	Specifies one or more books to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s). Books must be specified using the 3-character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis).
`--by-book`	Score individual books	In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the `--books` option, individual scores are provided for each of the specified books.
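The checkpoint options in the table refer to different subfolders of the experiment. A minimal sketch of that mapping, using a hypothetical helper `checkpoint_folder`; the location of the last checkpoint is not stated above and is assumed here to be the `run` subfolder:

```python
from pathlib import Path


def checkpoint_folder(experiment_dir: Path, checkpoint: str) -> Path:
    """Return the folder where the requested checkpoint is expected to be found."""
    run_dir = experiment_dir / "run"
    if checkpoint == "best":
        return run_dir / "export"  # best checkpoint: run > export
    if checkpoint == "avg":
        return run_dir / "avg"  # averaged checkpoint: run > avg
    # Numbered checkpoints live in run; 'last' is assumed to live there as well.
    return run_dir
```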

translate

The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:

Using a trained model to translate the text in a file from the source language to a target language.
Using a trained model to translate the text in a sequence of files into a target language.
Using Google Translate to translate a USFM-formatted book in a Paratext project into a target language.

The command line arguments for each of these scenarios are described below.

usage: translate.py [-h] [--memory-growth] [--checkpoint CHECKPOINT]
[--src SRC] [--trg TRG] [--src-prefix SRC_PREFIX]
[--trg-prefix TRG_PREFIX] [--start-seq START_SEQ]
[--end-seq END_SEQ] [--src-project SRC_PROJECT]
[--book BOOK] [--trg-lang TRG_LANG]
[--output-usfm OUTPUT_USFM] [--eager-execution]
experiment

Text file

Using the combination of command line arguments described in this section, the translate command will translate the sentences in a text file from the source language to the target language, using the requested checkpoint from a trained model.

Arguments:

Argument	Purpose	Description
`experiment`	Experiment name	The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > MT > experiments` folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario).
`--memory-growth`	Enable memory growth
`--eager-execution`	Enable TensorFlow's eager execution
`--checkpoint CHECKPOINT`	Use specified checkpoint	Use the specified checkpoint to translate the source text. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the `run` subfolder of the specified experiment.
`--src SRC`	Source file	Name of a text file with the source language sentences to be translated (one sentence per line). The translate tool looks for the file in the current working directory or, if a full/relative path is specified, it looks for the file in the specified folder. Each line in the specified source file is translated and written to the specified target file.
`--trg TRG`	Target file	Name of the text file where the translated sentences will be written (one per line).
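The one-sentence-per-line contract between the source and target files can be sketched as below. The model call is replaced by a caller-supplied function, so `translate_stream` and `translate_sentence` are hypothetical names, and `str.upper` merely stands in for a real translation step.

```python
import io
from typing import Callable, TextIO


def translate_stream(src: TextIO, trg: TextIO, translate_sentence: Callable[[str], str]) -> int:
    """Translate each source line and write one output line per input line."""
    count = 0
    for line in src:
        sentence = line.rstrip("\n")
        trg.write(translate_sentence(sentence) + "\n")
        count += 1
    return count


# In-memory demonstration; real use would pass files opened with encoding="utf-8".
source = io.StringIO("hello\nworld\n")
target = io.StringIO()
translated = translate_stream(source, target, str.upper)
```

Line N of the target file always corresponds to line N of the source file, which keeps the two files aligned for later comparison.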

Sequence of Text Files

Using the combination of command line arguments described in this section, the translate command will translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.

Arguments:

Argument	Purpose	Description
`experiment`	Experiment name	The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > MT > experiments` folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario).
`--checkpoint CHECKPOINT`	Use specified checkpoint	Use the specified checkpoint to translate the source text. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the `run` subfolder of the specified experiment.
`--src-prefix SRC_PREFIX`	Source file prefix (e.g., de-news2019-)	The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory.
`--trg-prefix TRG_PREFIX`	Target file prefix (e.g., en-news2019-)	The file name prefix for the target files. The translate tool will write the translated text to a series of files with this specified file name prefix; the translated files will be written to the current working directory.
`--start-seq START_SEQ`	Starting file sequence #	The first source language file to translate (e.g., '--start-seq 0'). The source files must use a 4-digit, zero-padded numbering sequence ('de-news2019-0000.txt', 'de-news2019-0001.txt', etc.).
`--end-seq END_SEQ`	Ending file sequence #	The final source language file sequence number to translate.
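The file naming implied by the prefix and sequence arguments can be sketched as follows. The helper name `sequence_file_pairs` is hypothetical; treating the end sequence number as inclusive and giving the target files the same numbering and `.txt` extension are assumptions based on the descriptions above.

```python
from typing import List, Tuple


def sequence_file_pairs(
    src_prefix: str, trg_prefix: str, start_seq: int, end_seq: int
) -> List[Tuple[str, str]]:
    """List (source file, target file) names for an inclusive sequence range."""
    if start_seq < 0 or end_seq < start_seq:
        raise ValueError("expected 0 <= start_seq <= end_seq")
    return [
        # Sequence numbers are zero-padded to 4 digits, e.g. 7 becomes '0007'.
        (f"{src_prefix}{seq:04d}.txt", f"{trg_prefix}{seq:04d}.txt")
        for seq in range(start_seq, end_seq + 1)
    ]
```

For example, `sequence_file_pairs("de-news2019-", "en-news2019-", 0, 1)` pairs `de-news2019-0000.txt` with `en-news2019-0000.txt` and `de-news2019-0001.txt` with `en-news2019-0001.txt`.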

Paratext book (USFM file)

Using the combination of command line arguments described in this section, the translate command will translate a book from a Paratext project into the requested target language. The translated text is written into a USFM-formatted file with markup that closely follows the markup in the source book. The Paratext project and the specified target language must be supported by Google Translate, and a Google Cloud account and credentials are required.

Arguments:

Argument	Purpose	Description
`experiment`	Experiment name	The name of the experiment to use. The experiment name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`--checkpoint CHECKPOINT`	Use specified checkpoint	Use the specified checkpoint to translate the source text. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the `run` subfolder of the specified experiment.
`--src-project SRC_PROJECT`	The source project to translate	The name of the source Paratext project. The project name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > Paratext > projects` folder.
`--book BOOK`	The book to translate	The 3 character abbreviation of the book in the source Paratext project to be translated (e.g., "GEN" for Genesis). Book identifiers should follow the USFM 3.0 standard.
`--trg-lang TRG_LANG`	The target language	The ISO-639-1 abbreviation of the target language that the book will be translated into. The specified target language must be supported by Google Translate.
`--output-usfm OUTPUT_USFM`	The output USFM file path	Path for the USFM-formatted output file.
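The arguments for this scenario can be checked against the constraints in the table before a command is assembled. The sketch below is illustrative only: the helper name `usfm_translate_command` and the `python translate.py` launcher are assumptions, and the checks cover only the 3-character book identifier and the two-letter ISO-639-1 code.

```python
from typing import List


def usfm_translate_command(
    experiment: str,
    checkpoint: str,
    src_project: str,
    book: str,
    trg_lang: str,
    output_usfm: str,
) -> List[str]:
    """Build the argument list for translating one Paratext book."""
    if len(book) != 3:
        raise ValueError("book must be a 3-character identifier such as 'GEN'")
    if len(trg_lang) != 2 or not trg_lang.isalpha():
        raise ValueError("trg_lang must be a two-letter ISO-639-1 code such as 'fr'")
    return [
        "python", "translate.py",
        "--checkpoint", checkpoint,
        "--src-project", src_project,
        "--book", book.upper(),
        "--trg-lang", trg_lang.lower(),
        "--output-usfm", output_usfm,
        experiment,  # positional experiment name comes last
    ]
```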

Analyzing the results of an experiment

analyze

check_train_val_test_split

After a model has been trained and used to generate predictions for the test set, the check_train_val_test_split tool can be used to analyze the word distributions across the train, validate, and test sets for the source and target corpora. By default, the tool will generate high-level statistics regarding the occurrence of "unknown" words (i.e., words that occur in the validation set or in the test set, but not in the training set). The tool can also be used to generate detailed listings of these unknown words and their occurrence counts. It is also possible to have the tool compare these unknown words to the valid words found in the training set to identify possible misspellings. Output is saved in the word_count.xlsx file in the specified experiment folder.

Arguments:

Argument	Purpose	Description
`experiment`	Experiment name	The name of the experiment to check. The experiment name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`--details`	Show detailed word lists	Generate detailed lists of validation set and test set words that are not found in the training set. Separate lists are generated for the source and target corpora. Occurrence counts are provided for each identified word.
`--similar-words`	Find similar words	Compare each unknown word to the valid words found in the training set and identify possible misspellings in the validation and test sets. Levenshtein distance is used to identify the possible misspellings.
`--distance DISTANCE`	Maximum Levenshtein distance for word similarity	By default, a Levenshtein distance of 1 is used to identify similar words in the training set. This parameter can be used to specify a different distance.
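The unknown-word and similar-word analysis described above can be sketched with a plain Levenshtein implementation. Whitespace tokenization and the helper names (`levenshtein`, `unknown_words`, `similar_words`) are assumptions; the tool's own tokenization and output format may differ.

```python
from collections import Counter
from typing import Dict, Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def unknown_words(train: Iterable[str], other: Iterable[str]) -> Counter:
    """Count words in `other` (validation or test) that never occur in `train`."""
    known = {word for sentence in train for word in sentence.split()}
    return Counter(
        word for sentence in other for word in sentence.split() if word not in known
    )


def similar_words(
    unknown: Iterable[str], known: Iterable[str], distance: int = 1
) -> Dict[str, List[str]]:
    """Map each unknown word to the known words within the given edit distance."""
    known_sorted = sorted(set(known))
    return {
        word: [k for k in known_sorted if levenshtein(word, k) <= distance]
        for word in unknown
    }
```

An unknown word with a close match in the training vocabulary is a candidate misspelling; one with no match is more likely a genuinely new word.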

diff_predictions

The diff_predictions tool can be used to compare the test set predictions between two experiments. The tool generates a spreadsheet (diff_predictions.xlsx) with multiple comparison tabs (experiment1 (best) vs experiment2 (best), experiment1 (best) vs experiment2 (last), etc.). The comparison includes the test set source text, the target language reference text, both predictions, and the sentence-level BLEU scores for both predictions. Optionally, the tool can mark up each prediction to identify the differences between the reference text and the prediction. The source text can also be marked up to highlight test set words that are not found in the training set. Optionally, the training set source/target sentence pairs can be included in the output spreadsheet on a separate tab.

Arguments:

Argument	Purpose	Description
`exp1`	Experiment 1 name	The name of the first experiment to compare. The experiment name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`exp2`	Experiment 2 name	The name of the second experiment to compare. The experiment name must correspond to a subfolder in the `SIL_NLP_DATA_PATH > MT > experiments` folder.
`--show-diffs`	Show differences (predictions vs reference)	Mark up the predictions to indicate where they differ from the reference text.
`--show-unknown`	Show unknown words in source verse	Mark up the test set source sentences to indicate words that do not occur in the training set.
`--include-train`	Include the src/trg training corpora in the spreadsheet	Include the parallel source/target training sentence pairs in another tab in the spreadsheet.
`--preserve-case`	Score predictions with case preserved	Preserve case when calculating the sentence-level BLEU score for the source/target sentence pairs. By default, the tool will lower case the source and target. Note that this behavior is secondary to the source / target case settings specified in the config.yml file; if those settings specified lower casing, then this argument has no effect.
`--tokenize TOKENIZE`	Sacrebleu tokenizer (none,13a,intl,zh,ja-mecab,char)	Specifies the Sacrebleu tokenizer that will be used to calculate the sentence-level BLEU score for each source/target sentence pair. (Default: 13a)
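The idea behind `--show-diffs` can be illustrated with a word-level comparison from Python's standard library. The markup used by the tool in the spreadsheet is not described here, so the `[-removed-]` and `{+added+}` notation and the helper name `mark_differences` are assumptions.

```python
import difflib


def mark_differences(reference: str, prediction: str) -> str:
    """Mark where a prediction differs from the reference, word by word."""
    ref_tokens = reference.split()
    pred_tokens = prediction.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=pred_tokens, autojunk=False)

    marked = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            marked.extend(pred_tokens[j1:j2])
            continue
        if i2 > i1:
            # Words present in the reference but missing from the prediction.
            marked.append("[-" + " ".join(ref_tokens[i1:i2]) + "-]")
        if j2 > j1:
            # Words present in the prediction but not in the reference.
            marked.append("{+" + " ".join(pred_tokens[j1:j2]) + "+}")
    return " ".join(marked)
```

For instance, comparing the reference "in the beginning God created" with the prediction "in the beginning God made" yields "in the beginning God [-created-] {+made+}".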
