
NMT: Usage

Michael A. Martin edited this page Jun 17, 2021 · 23 revisions

Setting up and running an experiment

This section describes the tools most commonly used to set up and run an experiment.

config

The config tool can be used to set up a simple configuration file (config.yml) for an experiment. The configuration settings are specified on the command line, and the tool generates a valid config.yml file with those settings in the specified experiment subfolder (SIL_NLP_DATA_PATH > MT > experiments > <experiment>).
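
The folder layout described above can be sketched in a few lines of Python. This is only an illustration: the helper names experiment_dir and config_path are hypothetical, and treating SIL_NLP_DATA_PATH as an environment variable is an assumption, not something this page confirms.

```python
import os
from pathlib import Path
from typing import Optional


def experiment_dir(experiment: str, data_root: Optional[str] = None) -> Path:
    """Return <data root>/MT/experiments/<experiment> (hypothetical helper)."""
    # Assumption: the data root comes from a SIL_NLP_DATA_PATH environment
    # variable unless it is passed in explicitly.
    root = data_root or os.environ.get("SIL_NLP_DATA_PATH", ".")
    return Path(root) / "MT" / "experiments" / experiment


def config_path(experiment: str, data_root: Optional[str] = None) -> Path:
    """Return the location where config.yml is expected for an experiment."""
    return experiment_dir(experiment, data_root) / "config.yml"


if __name__ == "__main__":
    print(config_path("demo", "/data").as_posix())
```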

usage: config.py [-h] [--src-langs [lang [lang ...]]]
[--trg-langs [lang [lang ...]]] [--vocab-size VOCAB_SIZE]
[--src-vocab-size SRC_VOCAB_SIZE]
[--trg-vocab-size TRG_VOCAB_SIZE] [--parent PARENT]
[--mirror] [--force] [--seed SEED] [--model MODEL]
experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment subfolder where the configuration file will be generated. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --src-langs [lang [lang ...]] | Source language files | The name of one (or more) files in the source language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file 'abp-ABP.txt', specify 'abp-ABP'. |
| --trg-langs [lang [lang ...]] | Target language files | The name of one (or more) files in the target language(s). Each file must be located in the SIL_NLP_DATA_PATH > MT > corpora folder or the SIL_NLP_DATA_PATH > MT > scripture folder. Only the base of the file name is specified; e.g., to use the file 'en-ABPBTE.txt', specify 'en-ABPBTE'. |
| --vocab-size VOCAB_SIZE | Shared vocabulary size | Specifies the size (e.g., '32000') of the shared SentencePiece vocabulary that will be constructed from the text in the source and target files. |
| --src-vocab-size SRC_VOCAB_SIZE | Source vocabulary size | Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the source files (only). This option should be used in combination with the --trg-vocab-size argument. |
| --trg-vocab-size TRG_VOCAB_SIZE | Target vocabulary size | Specifies the size (e.g., '32000') of a SentencePiece vocabulary that will be constructed from the text in the target files (only). This option should be used in combination with the --src-vocab-size argument. |
| --parent PARENT | Parent experiment name | The name of an experiment subfolder with a trained parent model. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --mirror | Mirror train and validation data sets (default: False) | Specifies that the training and validation data sets constructed from the source and target files should be mirrored. With mirroring, each source/target sentence pair is added to the training (or validation) data set as both a source/target pair and as a target/source pair. Without mirroring, each sentence pair is only added as a source/target pair. |
| --force | Overwrite existing config file | By default, the tool reports an error if a configuration file already exists in the specified experiment subfolder. If this argument is provided, the tool overwrites the existing configuration file instead. |
| --seed SEED | Randomization seed | Specifies the randomization seed that will be used during preprocessing and training. |
| --model MODEL | Neural network model | Specifies the neural network model that will be trained. Options: TransformerBase (default), TransformerBig, SILTransformerBaseNoResidual, or SILTransformerBaseAlignmentEnhanced. |
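
The effect of the --mirror option can be illustrated with a short Python sketch. The function name mirror_pairs is hypothetical, and the order in which the real tool writes the mirrored pairs is an assumption; the sketch only shows that each sentence pair ends up in the data set in both directions.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]


def mirror_pairs(pairs: List[SentencePair]) -> List[SentencePair]:
    """Return each (source, target) pair followed by its (target, source) mirror."""
    mirrored: List[SentencePair] = []
    for src, trg in pairs:
        mirrored.append((src, trg))  # the original direction
        mirrored.append((trg, src))  # the mirrored direction
    return mirrored


if __name__ == "__main__":
    for pair in mirror_pairs([("Guten Morgen", "Good morning")]):
        print(pair)
```

Without mirroring, the data set would contain only the first pair of each couple.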

preprocess

The preprocess tool prepares the various data files needed to train a model. Preprocessing steps include:

  • creating SentencePiece vocabulary models from the experiment's source and target files;
  • splitting the source and target files into the training, validation, and test data sets;
  • writing the train/validate/test data sets to files in the experiment subfolder;
  • adapting the parent model (if one is specified) to be used by this experiment.
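
The splitting step listed above can be pictured with the following minimal sketch. It is not the preprocess tool's implementation: the function name split_pairs, the size parameters, and the default seed value are all assumptions made for illustration. It only shows how a fixed seed makes a random train/validation/test split reproducible.

```python
import random
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]


def split_pairs(
    pairs: Sequence[SentencePair], val_size: int, test_size: int, seed: int = 111
) -> Tuple[List[SentencePair], List[SentencePair], List[SentencePair]]:
    """Randomly assign sentence pairs to train, validation, and test sets."""
    if val_size + test_size > len(pairs):
        raise ValueError("validation and test sizes exceed the corpus size")
    # Shuffle indices with a private, seeded generator so the split is repeatable.
    indices = list(range(len(pairs)))
    random.Random(seed).shuffle(indices)
    test_idx = set(indices[:test_size])
    val_idx = set(indices[test_size:test_size + val_size])
    train, val, test = [], [], []
    for i, pair in enumerate(pairs):  # keep the original corpus order in each set
        if i in test_idx:
            test.append(pair)
        elif i in val_idx:
            val.append(pair)
        else:
            train.append(pair)
    return train, val, test


if __name__ == "__main__":
    corpus = [(f"src {i}", f"trg {i}") for i in range(10)]
    train, val, test = split_pairs(corpus, val_size=2, test_size=2)
    print(len(train), len(val), len(test))
```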

usage: preprocess.py [-h] [--stats] experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment subfolder that contains the configuration file. The subfolder must be located in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --stats | Output corpus statistics | Using a statistical model, calculate an alignment score for the source and target texts. Use of this option requires the SIL.Machine library to be available. |

train

The train tool trains a neural model for one or more specified experiments. The experiment's configuration file (config.yml) and the data files created by the preprocess tool are used to control the training process.

usage: train.py [-h] [--mixed-precision] [--memory-growth]
[--num-devices NUM_DEVICES] [--eager-execution]
experiments [experiments ...]

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiments | Experiment names | The names of the experiments to train. Each experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --mixed-precision | Enable mixed precision | |
| --memory-growth | Enable memory growth | |
| --num-devices NUM_DEVICES | Number of devices to train on | |
| --eager-execution | Enable TensorFlow eager execution | |

test

The test tool uses a trained model to generate target language predictions for an experiment's test set and then scores those predictions. The experiment's configuration file (config.yml), the data files created by the preprocess tool, and a checkpoint saved during training are used to generate the predictions.

usage: test.py [-h] [--memory-growth] [--checkpoint CHECKPOINT] [--last]
[--best] [--avg] [--ref-projects [project [project ...]]]
[--force-infer] [--scorers [scorer [scorer ...]]]
[--books [book [book ...]]] [--by-book]
experiment

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment to test. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --memory-growth | Enable memory growth | |
| --checkpoint CHECKPOINT | Test specified checkpoint | Use the specified checkpoint (e.g., '--checkpoint 6000') to generate target language predictions from the test set. The specified checkpoint must be available in the run subfolder of the specified experiment. |
| --last | Test the last checkpoint | Use the last training checkpoint to generate target language predictions. |
| --best | Test the best checkpoint | Use the best training checkpoint to generate target language predictions. The best checkpoint must be available in the run > export subfolder of the specified experiment. |
| --avg | Test the averaged checkpoint | Use the averaged training checkpoint to generate target language predictions. The averaged checkpoint must be available in the run > avg subfolder of the specified experiment. An averaged checkpoint can be automatically generated during training using the train: average_last_checkpoints: <n> option, or it can be manually generated after training by using the average_checkpoints tool. |
| --ref-projects [project [project ...]] | Reference projects | The generated target language predictions are typically scored using the target language test set as the reference. If multiple reference projects were configured, this option can be used to specify which of these reference projects should be considered when scoring the predictions. |
| --force-infer | Force inferencing | If the test tool has already been used to generate and score predictions for an experiment's checkpoint, it will only score the predictions when it is run again on that same checkpoint. This option can be used to force the tool to re-generate the target language predictions. |
| --scorers [scorer [scorer ...]] | List of scorers | Specifies the list of scorers to be used on the predictions. Options are 'bleu' (default), 'chrf3', 'meteor', 'ter', and 'wer'. |
| --books [book [book ...]] | Books to score | Specifies one or more books to be scored. When this option is used, the test tool will generate predictions for the entire target language test set, but provide a score only for the specified book(s). Books must be specified using the 3 character abbreviations from the USFM 3.0 standard (e.g., "GEN" for Genesis). |
| --by-book | Score individual books | In addition to providing an overall score for all the books in the test set, provide individual scores for each book in the test set. If this option is used in combination with the --books option, individual scores are provided for each of the specified books. |
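
The interaction of the --books and --by-book options can be sketched as follows. This is an illustration under stated assumptions, not the test tool's code: score_by_book is a hypothetical helper, the (book, prediction, reference) record layout is invented for the example, and exact_match_rate is a deliberately simple stand-in for the real scorers (BLEU, chrF3, and so on).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Record = Tuple[str, str, str]  # (book id, prediction, reference): assumed layout
Scorer = Callable[[List[str], List[str]], float]


def exact_match_rate(predictions: List[str], references: List[str]) -> float:
    """Stand-in scorer: the fraction of predictions identical to their reference."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(predictions)


def score_by_book(
    records: Iterable[Record], scorer: Scorer, books: Optional[Set[str]] = None
) -> Dict[str, float]:
    """Score each selected book separately, plus an overall "ALL" score."""
    grouped = defaultdict(lambda: ([], []))
    for book, prediction, reference in records:
        if books is None or book in books:  # keep only the requested books
            grouped[book][0].append(prediction)
            grouped[book][1].append(reference)
    scores = {book: scorer(preds, refs) for book, (preds, refs) in grouped.items()}
    all_preds = [p for preds, _ in grouped.values() for p in preds]
    all_refs = [r for _, refs in grouped.values() for r in refs]
    if all_preds:
        scores["ALL"] = scorer(all_preds, all_refs)
    return scores


if __name__ == "__main__":
    data = [("GEN", "a", "a"), ("GEN", "b", "c"), ("EXO", "d", "d")]
    print(score_by_book(data, exact_match_rate))
    print(score_by_book(data, exact_match_rate, books={"GEN"}))
```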

translate

The translate tool uses a trained neural model to translate text to a new language. Three translation scenarios are supported, with differing command line arguments for each scenario. The supported scenarios are:

  1. Using a trained model to translate the text in a file from the source language to a target language.
  2. Using a trained model to translate the text in a sequence of files into a target language.
  3. Using Google Translate to translate a USFM-formatted book in a Paratext project into a target language.

The command line arguments for each of these scenarios are described below.

usage: translate.py [-h] [--memory-growth] [--checkpoint CHECKPOINT]
[--src SRC] [--trg TRG] [--src-prefix SRC_PREFIX]
[--trg-prefix TRG_PREFIX] [--start-seq START_SEQ]
[--end-seq END_SEQ] [--src-project SRC_PROJECT]
[--book BOOK] [--trg-lang TRG_LANG]
[--output-usfm OUTPUT_USFM] [--eager-execution]
experiment

Text file

Using the combination of command line arguments described in this section, the translate command will translate the sentences in a text file from the source language to the target language, using the requested checkpoint from a trained model.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
| --memory-growth | Enable memory growth | |
| --eager-execution | Enable TensorFlow's eager execution | |
| --checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| --src SRC | Source file | Name of a text file with the source language sentences to be translated (one sentence per line). The translate tool looks for the file in the current working directory or, if a full/relative path is specified, it looks for the file in the specified folder. Each line in the specified source file is translated and written to the specified target file. |
| --trg TRG | Target file | Name of the text file where the translated sentences will be written (one per line). |

Sequence of Text Files

Using the combination of command line arguments described in this section, the translate command will translate sentences from a sequence of source language text files. The sentences in these source language text files are translated to the target language using the requested checkpoint from a trained model, and written to a corresponding sequence of target language text files.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment folder with the model to be used for translating the source text. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. The model must be one that supports a single target language (i.e., there is no target language argument for this scenario). |
| --checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| --src-prefix SRC_PREFIX | Source file prefix (e.g., de-news2019-) | The file name prefix for the source files. The translate tool looks for the sequence of source files in the current working directory. |
| --trg-prefix TRG_PREFIX | Target file prefix (e.g., en-news2019-) | The file name prefix for the target files. The translate tool will write the translated text to a series of files with this specified file name prefix; the translated files will be written to the current working directory. |
| --start-seq START_SEQ | Starting file sequence # | The first source language file to translate (e.g., '--start-seq 0'). The source files must use a 4 digit, zero-padded numbering sequence ('de-news2019-0000.txt', 'de-news2019-0001.txt', etc.). |
| --end-seq END_SEQ | Ending file sequence # | The final source language file sequence number to translate. |
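
The file naming convention for this scenario can be sketched in Python. The helper sequence_file_names is hypothetical, and two details are assumptions rather than documented behavior: that the ending sequence number is inclusive, and that the target files reuse the source numbering and the .txt extension.

```python
from typing import List, Tuple


def sequence_file_names(
    src_prefix: str, trg_prefix: str, start_seq: int, end_seq: int
) -> List[Tuple[str, str]]:
    """Pair each source file name with its corresponding target file name."""
    # Assumption: end_seq is inclusive; ":04d" gives the 4 digit, zero-padded
    # numbering that the source files are required to use.
    return [
        (f"{src_prefix}{seq:04d}.txt", f"{trg_prefix}{seq:04d}.txt")
        for seq in range(start_seq, end_seq + 1)
    ]


if __name__ == "__main__":
    for src, trg in sequence_file_names("de-news2019-", "en-news2019-", 0, 2):
        print(src, "->", trg)
```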

Paratext book (USFM file)

Using the combination of command line arguments described in this section, the translate command will translate a book from a Paratext project into the requested target language. The translated text is written into a USFM-formatted file with markup that closely follows the markup in the source book. The Paratext project and the specified target language must be supported by Google Translate, and a Google Cloud account and credentials are required.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --checkpoint CHECKPOINT | Use specified checkpoint | Use the specified checkpoint to generate the target language translations. A particular checkpoint number can be specified (e.g., '--checkpoint 6000'), or a logical checkpoint can be specified ('best', 'last', or 'avg'). The requested checkpoint must be available in the run subfolder of the specified experiment. |
| --src-project SRC_PROJECT | The source project to translate | The name of the source Paratext project. The project name must correspond to a subfolder in the SIL_NLP_DATA_PATH > Paratext > projects folder. |
| --book BOOK | The book to translate | The 3 character abbreviation of the book in the source Paratext project to be translated (e.g., "GEN" for Genesis). Book identifiers should follow the USFM 3.0 standard. |
| --trg-lang TRG_LANG | The target language | The ISO-639-1 abbreviation of the target language that the book will be translated into. The specified target language must be supported by Google Translate. |
| --output-usfm OUTPUT_USFM | The output USFM file path | Path for the USFM-formatted output file. |

Analyzing the results of an experiment

analyze

check_train_val_test_split

After a model has been trained and used to generate predictions for the test set, the check_train_val_test_split tool can be used to analyze the word distributions across the train, validate, and test sets for the source and target corpora. By default, the tool will generate high-level statistics regarding the occurrence of "unknown" words (i.e., words that occur in the validation set or in the test set, but not in the training set). The tool can also be used to generate detailed listings of these unknown words and their occurrence counts. It is also possible to have the tool compare these unknown words to the valid words found in the training set to identify possible misspellings. Output is saved in the word_count.xlsx file in the specified experiment folder.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| experiment | Experiment name | The name of the experiment to check. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --details | Show detailed word lists | Generate detailed lists of validation set and test set words that are not found in the training set. Separate lists are generated for the source and target corpora. Occurrence counts are provided for each identified word. |
| --similar-words | Find similar words | Compare each unknown word to the valid words found in the training set and identify possible misspellings in the validation and test sets. Levenshtein distance is used to identify the possible misspellings. |
| --distance DISTANCE | Maximum Levenshtein distance for word similarity | By default, a Levenshtein distance of 1 is used to identify similar words in the training set. This parameter can be used to specify a different distance. |
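
The unknown-word analysis and the Levenshtein comparison can be illustrated with a small, self-contained sketch. It is not the tool's implementation: the function names are hypothetical and plain whitespace tokenization is an assumption made to keep the example short.

```python
from collections import Counter
from typing import Iterable, List, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two words (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def unknown_words(train: Iterable[str], other: Iterable[str]) -> Counter:
    """Count words in the validation or test sentences that never occur in training."""
    train_vocab = {word for sentence in train for word in sentence.split()}
    counts = Counter(word for sentence in other for word in sentence.split())
    return Counter({w: c for w, c in counts.items() if w not in train_vocab})


def similar_words(word: str, train_vocab: Set[str], max_distance: int = 1) -> List[str]:
    """Training-set words within max_distance edits: possible intended spellings."""
    return sorted(w for w in train_vocab if levenshtein(word, w) <= max_distance)


if __name__ == "__main__":
    print(unknown_words(["the cat sat"], ["the caat sat down down"]))
    print(similar_words("caat", {"the", "cat", "sat"}))
```

Raising max_distance corresponds to passing a larger value to --distance: more candidate words are reported, at the cost of more false positives.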

diff_predictions

The diff_predictions tool can be used to compare the test set predictions between two experiments. The tool generates a spreadsheet (diff_predictions.xlsx) with multiple comparison tabs (experiment1 (best) vs experiment2 (best), experiment1 (best) vs experiment2 (last), etc.). The comparison includes the test set source text, the target language reference text, both predictions, and the sentence-level BLEU scores for both predictions. Optionally, the tool can mark up each prediction to identify the differences between the reference text and the prediction. The source text can also be marked up to highlight test set words that are not found in the training set. Optionally, the training set source/target sentence pairs can be included in the output spreadsheet on a separate tab.

Arguments:

| Argument | Purpose | Description |
| --- | --- | --- |
| exp1 | Experiment 1 name | The name of the first experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| exp2 | Experiment 2 name | The name of the second experiment to compare. The experiment name must correspond to a subfolder in the SIL_NLP_DATA_PATH > MT > experiments folder. |
| --show-diffs | Show differences (predictions vs reference) | Mark up the predictions to indicate where they differ from the reference text. |
| --show-unknown | Show unknown words in source verse | Mark up the test set source sentences to indicate words that do not occur in the training set. |
| --include-train | Include the src/trg training corpora in the spreadsheet | Include the parallel source/target training sentence pairs in another tab in the spreadsheet. |
| --preserve-case | Score predictions with case preserved | Preserve case when calculating the sentence-level BLEU score for the source/target sentence pairs. By default, the tool will lower case the source and target. Note that this behavior is secondary to the source / target case settings specified in the config.yml file; if those settings specified lower casing, then this argument has no effect. |
| --tokenize TOKENIZE | Sacrebleu tokenizer (none,13a,intl,zh,ja-mecab,char) | Specifies the Sacrebleu tokenizer that will be used to calculate the sentence-level BLEU score for each source/target sentence pair. (Default: 13a) |
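
The kind of markup produced by --show-diffs can be approximated with Python's standard difflib module. This is only a sketch: the function mark_differences is hypothetical, the square-bracket markup is an invented convention, and the real tool's spreadsheet formatting and alignment method may differ.

```python
import difflib


def mark_differences(reference: str, prediction: str) -> str:
    """Wrap prediction tokens that differ from the reference in square brackets."""
    ref_tokens = reference.split()
    pred_tokens = prediction.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=pred_tokens, autojunk=False)
    marked = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            marked.extend(pred_tokens[j1:j2])
        else:
            # "replace" and "insert" contribute prediction tokens to mark;
            # "delete" covers reference-only tokens, so nothing is emitted.
            marked.extend(f"[{token}]" for token in pred_tokens[j1:j2])
    return " ".join(marked)


if __name__ == "__main__":
    print(mark_differences("in the beginning God created",
                           "in the start God created"))
```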

Miscellaneous commands

average_checkpoints

export_embeddings