This repository has been archived by the owner on Jul 7, 2023. It is now read-only.

Transformer base single gpu on WMT en-de result #1542

Open
stylelohan opened this issue Apr 12, 2019 · 11 comments

Comments

@stylelohan

stylelohan commented Apr 12, 2019

Description

Dear T2T team,

I was trying to reproduce the result of the Transformer base model from the original paper "Attention Is All You Need"; however, I found that the hyper-parameters in the current transformer_base_single_gpu hparams set are a bit different from those in the original paper. So I would like to confirm whether my result corresponds with yours. Many thanks for any help you can provide! It would also be very helpful if anyone is willing to share related experiment results or comments. Thanks a lot!

Below are the results and the commands I used:
For the newstest2013 dev set, uncased BLEU = 23.09 and cased BLEU = 22.63.

Training command:
CUDA_VISIBLE_DEVICES=0 t2t-trainer \
  --data_dir=$DATA_DIR \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --problem=translate_ende_wmt32k \
  --output_dir=$TRAIN_DIR

Decoding command:
TMP_DIR=t2t_datagen/dev
DECODE_FILE=$TMP_DIR/newstest2013.en
REF_FILE=$TMP_DIR/newstest2013.de
t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=4,alpha=0.6" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=translation.de
t2t-bleu --translation=translation.de --reference=$REF_FILE

Environment information

OS:
Distributor ID: Ubuntu
Description: Ubuntu 16.04.5 LTS
Release: 16.04
Codename: xenial

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.1
tensorboard==1.13.1
tensorflow-datasets==1.0.1
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
tensorflow-metadata==0.13.0
tensorflow-probability==0.6.0

$ python -V
Python 3.5.2

other info.:
GeForce GTX 1080 Ti
CUDA Version 10.0.130

@stylelohan

stylelohan commented Apr 12, 2019

Also, I wonder whether it makes sense that I can reach the BLEU score reported in the paper just by changing the batch size. The results I got with different batch sizes are shown below:

batch size = 2048, uncased bleu = 24.54 and cased bleu = 24.01
batch size = 4096, uncased bleu = 25.64 and cased bleu = 25.14
batch size = 9182, uncased bleu = 26.10 and cased bleu = 25.56
(These results are on the newstest2013 development set, while the BLEU score reported in the paper is 25.8.)

Training command I used:
BS=2048  # or 4096, 9182
CUDA_VISIBLE_DEVICES=0 t2t-trainer \
  --data_dir=$DATA_DIR \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --problem=translate_ende_wmt32k \
  --output_dir=$TRAIN_DIR \
  --hparams="batch_size=$BS"

Thanks for any help!

@vman049

vman049 commented May 29, 2019

Description

I am attempting to reproduce the base Transformer results given in Vaswani et al. (2017) (training on the WMT 2014 English-to-German dataset, validation on the newstest2014 set). According to the README:

For all translation problems, we suggest to try the Transformer model: --model=transformer. At first it is best to try the base setting, --hparams_set=transformer_base. When trained on 8 GPUs for 300K steps this should reach a BLEU score of about 28 on the English-German data-set, which is close to state-of-the art. If training on a single GPU, try the --hparams_set=transformer_base_single_gpu setting. For very good results or larger data-sets (e.g., for English-French), try the big model with --hparams_set=transformer_big.

I took this to mean that, to achieve a BLEU score of about 28 using a single GPU, I could use --hparams_set=transformer_base_single_gpu (the original paper reports a BLEU score of 27.3 on newstest2014 using the base Transformer model). Following the walkthrough (replacing the decoding/evaluation toy text with newstest2014, given here), I performed the following:

$ pip install tensorflow-gpu
$ pip install tensor2tensor[tensorflow_gpu]
$ PROB=translate_ende_wmt32k
$ MODEL=transformer
$ HPARAMS=transformer_base_single_gpu
$ DATA_DIR=~/NLP/Data/T2T/t2t_data
$ TMP_DIR=~/NLP/Data/T2T/t2t_datagen
$ TRAIN_DIR=~/NLP/Data/T2T/t2t_train/$PROB/$MODEL-$HPARAMS
$ mkdir -p $DATA_DIR $TMP_DIR $TRAIN_DIR
$ t2t-datagen --data_dir=$DATA_DIR --tmp_dir=$TMP_DIR --problem=$PROB
$ t2t-trainer --data_dir=$DATA_DIR --problem=$PROB --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR
$ BEAM_SIZE=4
$ ALPHA=0.6
$ DECODE_FILE=~/NLP/Data/T2T/newstest2014.en
$ t2t-decoder --data_dir=$DATA_DIR --problem=$PROB --model=$MODEL --hparams_set=$HPARAMS --output_dir=$TRAIN_DIR --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" --decode_from_file=$DECODE_FILE --decode_to_file=~/NLP/Results/translation.en
$ t2t-bleu --translation=~/NLP/Results/translation.en --reference=~/NLP/Data/T2T/newstest2014.de

However, I was only able to achieve BLEU scores of BLEU_uncased = 19.58 and BLEU_cased = 19.22.

Questions

  1. Is the single-gpu version of tensor2tensor with hyperparameters given by HPARAMS=transformer_base_single_gpu meant to reproduce the BLEU score of 27.3 on newstest2014 given in Vaswani et al. (2017)?

Environment information

OS:
$ lsb_release -a
LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.6 (Maipo)
Release:        7.6
Codename:       Maipo

$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.4
tensorboard==1.13.1
tensorflow-datasets==1.0.2
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
tensorflow-metadata==0.13.0
tensorflow-probability==0.6.0

$ python -V
Python 3.6.8 :: Anaconda, Inc.

GPU: NVIDIA Corporation GP102 [TITAN Xp]
CUDA version: 10.0.130

@vman049

vman049 commented May 29, 2019

Also, @stylelohan, when I train the base model without changing the batch size (as in your first post) and test on newstest2013 instead of newstest2014 (EN-DE), I get BLEU scores of BLEU_uncased = 20.49 and BLEU_cased = 20.06. Do you know what might be causing the discrepancy of about 3 BLEU between your result and mine?

@jatinganhotra

@vman049 @stylelohan

I am also trying to reproduce the results from the paper “Attention Is All You Need” (https://arxiv.org/pdf/1706.03762.pdf) as mentioned in https://github.com/tensorflow/tensor2tensor#walkthrough

I trained the model with the following settings as mentioned on Github -

PROBLEM=translate_ende_wmt32k
MODEL=transformer
HPARAMS=transformer_base_single_gpu

DATA_DIR=t2t_local_exp_runs_dir_master/t2t_data
TMP_DIR=t2t_local_exp_runs_dir_master/t2t_datagen
TRAIN_DIR=t2t_local_exp_runs_dir_master/t2t_train/$PROBLEM/$MODEL-$HPARAMS

Other settings are -

--keep_checkpoint_max=20
--local_eval_frequency=1000
--train_steps=250000
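
Put together, the full training command was along these lines (a sketch combining the variables and flags above):

t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --keep_checkpoint_max=20 \
  --local_eval_frequency=1000 \
  --train_steps=250000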

And evaluated on the newstest2014 dataset as follows -

SOURCE_TEST_TRANSLATE_DIR=t2t_local_exp_runs_dir_master/t2t_datagen/dev/newstest2014-deen-src.en.sgm
REFERENCE_TEST_TRANSLATE_DIR=t2t_local_exp_runs_dir_master/t2t_datagen/dev/newstest2014-deen-ref.en.sgm
BEAM_SIZE=4
ALPHA=0.6
TRANSLATIONS_DIR=t2t_local_exp_runs_dir_master/t2t_translations_dir
USR_DIR=t2t_local_exp_runs_dir_master/t2t_usr_dir
EVENT_DIR=t2t_local_exp_runs_dir_master/t2t_event_dir

To get the BLEU score, I followed Sec 4.1 and 4.2 here - https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/Transformer_translate.ipynb -

!t2t-translate-all \
  --source=$SOURCE_TEST_TRANSLATE_DIR \
  --model_dir=$TRAIN_DIR \
  --translations_dir=$TRANSLATIONS_DIR \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR \
  --t2t_usr_dir=$USR_DIR \
  --beam_size=$BEAM_SIZE \
  --model=$MODEL

And then,

!t2t-bleu \
   --translations_dir=$TRANSLATIONS_DIR \
   --model_dir=$TRAIN_DIR \
   --data_dir=$DATA_DIR \
   --problem=$PROBLEM \
   --hparams_set=$HPARAMS \
   --source=$SOURCE_TEST_TRANSLATE_DIR \
   --reference=$REFERENCE_TEST_TRANSLATE_DIR \
   --event_dir=$EVENT_DIR

Sec 6.1 in the paper (https://arxiv.org/pdf/1706.03762.pdf) says - “For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints.”
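
(For reference, checkpoint averaging in T2T appears to be handled by the t2t-avg-all utility; the sketch below is untested and the flag names should be double-checked against the installed version.)

# average the last 5 checkpoints from $TRAIN_DIR into a separate directory (untested sketch)
t2t-avg-all \
  --model_dir=$TRAIN_DIR \
  --output_dir=$TRAIN_DIR/avg \
  --n=5
# then decode/evaluate with --output_dir=$TRAIN_DIR/avg instead of $TRAIN_DIR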

Since I don't average the checkpoints and instead compute the BLEU score on every checkpoint, I get unexpected results -

INFO:tensorflow:Found 20 files with steps: 231000, 232000, 233000, 234000, 235000, 236000, 237000, 238000, 239000, 240000, 241000, 242000, 243000, 244000, 245000, 246000, 247000, 248000, 249000, 250000
INFO:tensorflow:Evaluating translate_ende_wmt32k-231000
INFO:tensorflow:translate_ende_wmt32k-231000: BLEU_uncased =   7.86
INFO:tensorflow:translate_ende_wmt32k-231000: BLEU_cased =   7.60
INFO:tensorflow:Evaluating translate_ende_wmt32k-232000
INFO:tensorflow:translate_ende_wmt32k-232000: BLEU_uncased =   6.83
INFO:tensorflow:translate_ende_wmt32k-232000: BLEU_cased =   6.52
INFO:tensorflow:Evaluating translate_ende_wmt32k-233000
INFO:tensorflow:translate_ende_wmt32k-233000: BLEU_uncased =  20.11
INFO:tensorflow:translate_ende_wmt32k-233000: BLEU_cased =  19.57
INFO:tensorflow:Evaluating translate_ende_wmt32k-234000
INFO:tensorflow:translate_ende_wmt32k-234000: BLEU_uncased =  21.78
INFO:tensorflow:translate_ende_wmt32k-234000: BLEU_cased =  21.00
INFO:tensorflow:Evaluating translate_ende_wmt32k-235000
INFO:tensorflow:translate_ende_wmt32k-235000: BLEU_uncased =  21.91
INFO:tensorflow:translate_ende_wmt32k-235000: BLEU_cased =  20.58
INFO:tensorflow:Evaluating translate_ende_wmt32k-236000
INFO:tensorflow:translate_ende_wmt32k-236000: BLEU_uncased =   6.13
INFO:tensorflow:translate_ende_wmt32k-236000: BLEU_cased =   5.87
INFO:tensorflow:Evaluating translate_ende_wmt32k-237000
INFO:tensorflow:translate_ende_wmt32k-237000: BLEU_uncased =  10.12
INFO:tensorflow:translate_ende_wmt32k-237000: BLEU_cased =   9.72
INFO:tensorflow:Evaluating translate_ende_wmt32k-238000
INFO:tensorflow:translate_ende_wmt32k-238000: BLEU_uncased =  17.86
INFO:tensorflow:translate_ende_wmt32k-238000: BLEU_cased =  17.47
INFO:tensorflow:Evaluating translate_ende_wmt32k-239000
INFO:tensorflow:translate_ende_wmt32k-239000: BLEU_uncased =  30.51
INFO:tensorflow:translate_ende_wmt32k-239000: BLEU_cased =  30.10
INFO:tensorflow:Evaluating translate_ende_wmt32k-240000
INFO:tensorflow:translate_ende_wmt32k-240000: BLEU_uncased =  21.67
INFO:tensorflow:translate_ende_wmt32k-240000: BLEU_cased =  21.29
INFO:tensorflow:Evaluating translate_ende_wmt32k-241000
INFO:tensorflow:translate_ende_wmt32k-241000: BLEU_uncased =  28.44
INFO:tensorflow:translate_ende_wmt32k-241000: BLEU_cased =  28.10
INFO:tensorflow:Evaluating translate_ende_wmt32k-242000
INFO:tensorflow:translate_ende_wmt32k-242000: BLEU_uncased =  25.22
INFO:tensorflow:translate_ende_wmt32k-242000: BLEU_cased =  24.89
INFO:tensorflow:Evaluating translate_ende_wmt32k-243000
INFO:tensorflow:translate_ende_wmt32k-243000: BLEU_uncased =   6.37
INFO:tensorflow:translate_ende_wmt32k-243000: BLEU_cased =   6.11
INFO:tensorflow:Evaluating translate_ende_wmt32k-244000
INFO:tensorflow:translate_ende_wmt32k-244000: BLEU_uncased =  13.57
INFO:tensorflow:translate_ende_wmt32k-244000: BLEU_cased =  13.04
INFO:tensorflow:Evaluating translate_ende_wmt32k-245000
INFO:tensorflow:translate_ende_wmt32k-245000: BLEU_uncased =  24.23
INFO:tensorflow:translate_ende_wmt32k-245000: BLEU_cased =  23.87
INFO:tensorflow:Evaluating translate_ende_wmt32k-246000
INFO:tensorflow:translate_ende_wmt32k-246000: BLEU_uncased =  29.42
INFO:tensorflow:translate_ende_wmt32k-246000: BLEU_cased =  29.07
INFO:tensorflow:Evaluating translate_ende_wmt32k-247000
INFO:tensorflow:translate_ende_wmt32k-247000: BLEU_uncased =  11.84
INFO:tensorflow:translate_ende_wmt32k-247000: BLEU_cased =  11.58
INFO:tensorflow:Evaluating translate_ende_wmt32k-248000
INFO:tensorflow:translate_ende_wmt32k-248000: BLEU_uncased =   9.34
INFO:tensorflow:translate_ende_wmt32k-248000: BLEU_cased =   9.05
INFO:tensorflow:Evaluating translate_ende_wmt32k-249000
INFO:tensorflow:translate_ende_wmt32k-249000: BLEU_uncased =   6.66
INFO:tensorflow:translate_ende_wmt32k-249000: BLEU_cased =   6.44
INFO:tensorflow:Evaluating translate_ende_wmt32k-250000
INFO:tensorflow:translate_ende_wmt32k-250000: BLEU_uncased =  13.37
INFO:tensorflow:translate_ende_wmt32k-250000: BLEU_cased =  13.11

The BLEU_uncased score is as low as 6.66 for checkpoint 249000 and as high as 30.51 for checkpoint 239000.
Table 2 in the paper (https://arxiv.org/pdf/1706.03762.pdf) reports BLEU 27.3 on newstest2014 for the Transformer (base model) on EN-DE.

Is this correct behavior? Why does the BLEU score vary so much across different checkpoints?

Were you able to reproduce the results reported in the paper for the base model?

@martinpopel

@jatinganhotra you are using the *.sgm files as the source and reference, but these files contain extra markup (<doc> <seg>) and t2t-translate-all and t2t-bleu expect plaintext input (one sentence per line). Extract the plaintext from *sgm and try again.
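
For example, something along these lines should pull the segment text out of the SGML wrapper (a sketch: it assumes one <seg>...</seg> per line, which holds for the WMT newstest files, and it does not unescape entities such as &lt;):

# extract plain text, one sentence per line, from the WMT .sgm files (sketch)
sed -n 's/.*<seg[^>]*> *\(.*\) *<\/seg>.*/\1/p' newstest2014-deen-src.en.sgm > newstest2014.en
sed -n 's/.*<seg[^>]*> *\(.*\) *<\/seg>.*/\1/p' newstest2014-deen-ref.de.sgm > newstest2014.de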

@martinpopel

@stylelohan @vman049 It is well known that the final BLEU depends on the batch size and number of GPUs. You also have to increase the number of training steps (depending on your batch size and number of GPUs) if you want to get closer to the originally reported BLEU. For details see e.g. http://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf
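
Just to make that concrete, a single-GPU run with a larger batch and more steps could look like this (a sketch only; the numbers are illustrative and have to fit your GPU memory):

t2t-trainer \
  --data_dir=$DATA_DIR \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --problem=translate_ende_wmt32k \
  --output_dir=$TRAIN_DIR \
  --train_steps=500000 \
  --hparams="batch_size=4096"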

@shizhediao

shizhediao commented Oct 20, 2022

Hi, I am a little bit confused about why we should set the reference as shown below, because in my mind the reference should be the de.sgm file. Do you have any idea? Thanks!

REFERENCE_TEST_TRANSLATE_DIR=t2t_local_exp_runs_dir_master/t2t_datagen/dev/newstest2014-deen-ref.en.sgm

@martinpopel

Yes, that was perhaps a typo in one of the previous posts.
This whole thread is about replicating English-to-German (en-de, PROBLEM=translate_ende_wmt32k) translation, so when evaluating it on WMT14 you need to use t2t-bleu with --reference=wmt14-ref.de.txt, i.e. the reference has to be in German, obviously. You can use sacrebleu -t wmt14/full -l en-de --echo ref > wmt14-ref.de.txt to download the reference file and similarly with --echo src to download the English source file. You can also use sacrebleu --tok intl instead of t2t-bleu (and it should return exactly the same scores if you don't forget the --tok intl option.)
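
Concretely, that evaluation pipeline might look like this (a sketch; translation.de stands for your detokenized system output):

sacrebleu -t wmt14/full -l en-de --echo src > wmt14-src.en.txt
sacrebleu -t wmt14/full -l en-de --echo ref > wmt14-ref.de.txt
# decode wmt14-src.en.txt with t2t-decoder as in the earlier comments, producing translation.de, then:
t2t-bleu --translation=translation.de --reference=wmt14-ref.de.txt
sacrebleu wmt14-ref.de.txt --tok intl < translation.de   # case-sensitive by default; add -lc for an uncased score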

If you want to train German-to-English, you would use PROBLEM=translate_ende_wmt32k_rev for training and then for evaluation with t2t-bleu on WMT14 --reference=wmt14-ref.en.txt.

I guess the typo/confusion stems from the fact that in WMT 2014 and older the same set of sentences was used as the test set for both directions en->de and de->en and "deen" is in the testset filenames even for the en->de direction. In WMT14 half of the sentences are from originally German newspapers/websites, the other half is originally English. You can use sacrebleu --origlang to separate those two halves (or sacrebleu --echo origlang src ref).
In WMT15 and later, the files have better names (distinguishing ende and deen depending on the direction) and in WMT19 and later, the set of sentences for each direction is different (en->de testset contains only original English sentences on the source side).
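
If you want to try the origlang split mentioned above, it would be something like this (a sketch, assuming your sacrebleu version supports --origlang for this test set):

sacrebleu -t wmt14/full -l en-de --origlang=en --tok intl < translation.de   # only the originally-English half
sacrebleu -t wmt14/full -l en-de --origlang=de --tok intl < translation.de   # only the originally-German half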

@shizhediao


Thank you very much for your reply! It perfectly resolves my confusion!

I am not familiar with sacrebleu. So does that mean there are two ways to calculate BLEU:

  1. sacrebleu --tok intl
  2. t2t-bleu \
       --translations_dir=~/t2t_translation \
       --model_dir=~/t2t_train/translate_ende_wmt32k \
       --data_dir=~/t2t_data \
       --problem=translate_ende_wmt32k \
       --hparams_set=transformer_base \
       --source=./wmt14-src-en.txt \
       --reference=./wmt14-ref.de.txt \
       --event_dir=~/t2t_event

And these two commands will return the same scores, right?
In addition, I noticed that the reference file wmt14-ref.de.txt from sacrebleu is different from newstest2014-deen-ref.de.sgm: the .sgm file includes many tags like <seg id="4">. I think we should not use it directly, right? wmt14-ref.de.txt is the right one; please correct me if I am wrong.

Thanks!

@martinpopel

SacreBLEU is now the standard way of evaluating BLEU (and other metrics such as chrF). It provides a "signature" and allows reproducible BLEU scores - so it is the recommended way of reporting BLEU for papers (including the signature). SacreBLEU has several options, e.g. which tokenization to use (the default is called 13a, but there is also intl - international and e.g. flores200 SentencePiece-based). Scores computed with different tokenization are not comparable.

t2t-bleu tries to implement the same BLEU algorithm as in SacreBLEU, but the tokenization is fixed to intl and cannot be changed. It may be useful during development because it also has the option of continuously evaluating each checkpoint and storing the results in TensorBoard event dir.

newstest2014-deen-ref.de.sgm includes many tags like <seg id="4">

Yes, you should not use the sgm (SGML) files with t2t-bleu or any other tool which expects plain text, as I wrote in one of my comments above. You need to extract the plain text (one sentence/segment per line) first and make sure to convert e.g. &lt; to < etc. - the easiest way is to use sacrebleu --echo for this.

@shizhediao

shizhediao commented Oct 20, 2022


Following your advice, I could obtain a reasonable BLEU score.
Thanks very much!
