If you want to
- build a new bn-en training dataset from noisy parallel corpora (by filtering / cleaning pairs based on our heuristics), along with the corresponding vocabulary models, or
- normalize a new dataset before evaluating it with the model, or
- remove all evaluation pairs from the training pairs to create a new set of training / test datasets,
refer to here.
Note: This code has been refactored to support OpenNMT-py 2.0
$ cd seq2seq/
$ conda create python==3.7.9 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch -p ./env
$ conda activate ./env # or source activate ./env (for older versions of anaconda)
$ pip install --upgrade -r requirements.txt
- Note: For newer NVIDIA GPUs such as the A100 or RTX 3090, use `cudatoolkit=11.0`.
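For example, on such GPUs the environment could be created as follows (a minimal sketch: same package versions as above, only the CUDA toolkit swapped, assuming the corresponding builds are available from the `pytorch` channel):
$ conda create python==3.7.9 pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch -p ./env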
$ cd seq2seq/
$ python pipeline.py -h
usage: pipeline.py [-h] --input_dir PATH --output_dir PATH --src_lang SRC_LANG
--tgt_lang TGT_LANG
[--validation_samples VALIDATION_SAMPLES]
[--src_seq_length SRC_SEQ_LENGTH]
[--tgt_seq_length TGT_SEQ_LENGTH]
[--model_prefix MODEL_PREFIX] [--eval_model PATH]
[--train_steps TRAIN_STEPS]
[--train_batch_size TRAIN_BATCH_SIZE]
[--eval_batch_size EVAL_BATCH_SIZE]
[--gradient_accum GRADIENT_ACCUM]
[--warmup_steps WARMUP_STEPS]
[--learning_rate LEARNING_RATE] [--layers LAYERS]
[--rnn_size RNN_SIZE] [--word_vec_size WORD_VEC_SIZE]
[--transformer_ff TRANSFORMER_FF] [--heads HEADS]
[--valid_steps VALID_STEPS]
[--save_checkpoint_steps SAVE_CHECKPOINT_STEPS]
[--average_last AVERAGE_LAST] [--world_size WORLD_SIZE]
[--gpu_ranks [GPU_RANKS [GPU_RANKS ...]]]
[--train_from TRAIN_FROM] [--do_train] [--do_eval]
[--nbest NBEST] [--alpha ALPHA]
optional arguments:
-h, --help show this help message and exit
--input_dir PATH, -i PATH
Input directory
--output_dir PATH, -o PATH
Output directory
--src_lang SRC_LANG Source language
--tgt_lang TGT_LANG Target language
--validation_samples VALIDATION_SAMPLES
no. of validation samples to take out from train
dataset when no validation data is present
--src_seq_length SRC_SEQ_LENGTH
maximum source sequence length
--tgt_seq_length TGT_SEQ_LENGTH
maximum target sequence length
--model_prefix MODEL_PREFIX
Prefix of the model to save
--eval_model PATH Path to the specific model to evaluate
--train_steps TRAIN_STEPS
no of training steps
--train_batch_size TRAIN_BATCH_SIZE
training batch size (in tokens)
--eval_batch_size EVAL_BATCH_SIZE
evaluation batch size (in sentences)
--gradient_accum GRADIENT_ACCUM
gradient accum
--warmup_steps WARMUP_STEPS
warmup steps
--learning_rate LEARNING_RATE
learning rate
--layers LAYERS layers
--rnn_size RNN_SIZE rnn size
--word_vec_size WORD_VEC_SIZE
word vector size
--transformer_ff TRANSFORMER_FF
transformer feed forward size
--heads HEADS no of heads
--valid_steps VALID_STEPS
validation interval
--save_checkpoint_steps SAVE_CHECKPOINT_STEPS
model saving interval
--average_last AVERAGE_LAST
average last X models
--world_size WORLD_SIZE
world size
--gpu_ranks [GPU_RANKS [GPU_RANKS ...]]
gpu ranks
--train_from TRAIN_FROM
start training from this checkpoint
--do_train Run training
--do_eval Run evaluation
--nbest NBEST sentencepiece nbest size
--alpha ALPHA sentencepiece alpha
- Sample `input_dir` structure for bn2en training and evaluation:

input_dir/
|---> data/
|     |---> corpus.train.bn
|     |---> corpus.train.en
|     |---> RisingNews.valid.bn
|     |---> RisingNews.valid.en
|     |---> RisingNews.test.bn
|     |---> RisingNews.test.en
|     |---> sipc.test.bn
|     |---> sipc.test.en.0
|     |---> sipc.test.en.1
|     ...
|---> vocab/
|     |---> bn.model
|     |---> en.model
- Input data files inside the `data/` subdirectory must have the following format: `X.type.lang(.count)`, where `X` is any common file prefix, `type` is one of `{train, valid, test}`, and `count` is an optional integer (only applicable to the `target_lang`, when there are multiple reference files). There can be multiple `.train.` / `.valid.` file pairs. In the absence of `.valid.` files, the number of example pairs given by `validation_samples` will be randomly sampled from the training files during training (see the example layout after this list).
- The `vocab` subdirectory must hold two sentencepiece vocabulary models named `src_lang.model` and `tgt_lang.model`.
- After training / evaluation, the `output_dir` will have the following subdirectories with these contents:
  - `Models`: all the saved models
  - `Reports`: BLEU and SacreBLEU scores on the validation files for all saved models with the given `model_prefix`, and the scores on the test files for the given `eval_model` (if the corresponding reference files are present)
  - `Outputs`: detokenized model predictions
  - `data`: merged training files after applying subword regularization
  - `Preprocessed`: training and validation data shards
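For example, a conforming `input_dir` could be assembled as follows (a minimal sketch; the `rawData/` source paths are hypothetical placeholders for your own files):
$ mkdir -p inputFolder/data inputFolder/vocab
# training pair: common prefix + ".train." + language code
$ cp rawData/clean.bn inputFolder/data/corpus.train.bn
$ cp rawData/clean.en inputFolder/data/corpus.train.en
# optional validation pair; if omitted, --validation_samples pairs are sampled from the training files
$ cp rawData/dev.bn inputFolder/data/RisingNews.valid.bn
$ cp rawData/dev.en inputFolder/data/RisingNews.valid.en
# test source and its reference(s); multiple references are indexed .0, .1, ...
$ cp rawData/test.bn inputFolder/data/sipc.test.bn
$ cp rawData/test.ref0.en inputFolder/data/sipc.test.en.0
$ cp rawData/test.ref1.en inputFolder/data/sipc.test.en.1
# sentencepiece vocabulary models, named <src_lang>.model / <tgt_lang>.model
$ cp rawData/bn.model inputFolder/vocab/bn.model
$ cp rawData/en.model inputFolder/vocab/en.model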
To reproduce our results on an AWS p3.8xlarge EC2 instance (equipped with 4 Tesla V100 GPUs), run the script with the default hyperparameters. For example, for bn2en training:
$ export CUDA_VISIBLE_DEVICES=0,1,2,3
# for training
$ python pipeline.py \
--src_lang bn --tgt_lang en \
-i inputFolder/ -o outputFolder/ \
--model_prefix bn2en --do_train --do_eval
For single-GPU training, additionally provide the flags `--world_size 1` and `--gpu_ranks 0`, and adjust the effective batch size to the available GPU VRAM using `--train_batch_size X` and `--gradient_accum X`, as sketched below.
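For instance, a single-GPU bn2en run might look like the following (a sketch; the batch size and gradient-accumulation values are illustrative and should be tuned to your GPU's memory):
$ export CUDA_VISIBLE_DEVICES=0
# for single-GPU training
$ python pipeline.py \
--src_lang bn --tgt_lang en \
-i inputFolder/ -o outputFolder/ \
--model_prefix bn2en --do_train --do_eval \
--world_size 1 --gpu_ranks 0 \
--train_batch_size 4096 --gradient_accum 8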
To evaluate trained models on new test files using a single GPU, use the following snippet with the appropriate arguments:
$ python pipeline.py \
--src_lang bn --tgt_lang en \
-i inputFolder/ -o outputFolder/ \
--eval_model <path/to/model> \
--do_eval