Transformer base single gpu on WMT en-de result #1542
Also, I wonder whether it makes sense that I achieved the BLEU score reported in the paper only by changing the batch size. The results I got with different batch sizes are shown below: batch size = 2048, uncased BLEU = 24.54 and cased BLEU = 24.01. Training command I use: Thanks for any help!
Description

I am attempting to reproduce the base Transformer results given in Vaswani et al. (2017) (training on the WMT 2014 English-to-German dataset, validating on the newstest2014 dataset). According to the README:
I took this to mean that to achieve a BLEU score of about 28 using a single GPU, I could do
However, I was only able to achieve BLEU scores of

Questions
Environment information
Also, @stylelohan, when I train the base model without changing the batch size (as in your first post) and test on newstest2013 instead of newstest2014 (EN-DE), I get BLEU scores of
I am also trying to reproduce the results from the paper "Attention Is All You Need" (https://arxiv.org/pdf/1706.03762.pdf), as mentioned in https://github.com/tensorflow/tensor2tensor#walkthrough. I trained the model with the following settings, as mentioned on GitHub:
Other settings are -
And evaluate on the newstest2014 dataset as follows -
To get the BLEU score, I followed Sec. 4.1 and 4.2 here: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/Transformer_translate.ipynb
And then,
Sec. 6.1 in the paper (https://arxiv.org/pdf/1706.03762.pdf) says: "For the base models, we used a single model obtained by averaging the last 5 checkpoints, which were written at 10-minute intervals. For the big models, we averaged the last 20 checkpoints." Since I don't average the checkpoints and instead compute the BLEU score on every checkpoint, I get unexpected results:
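For reference, the checkpoint averaging described in Sec. 6.1 just takes the element-wise mean of each variable over the last N checkpoints (tensor2tensor ships a `t2t-avg-all` utility for this). Below is a minimal, framework-free sketch of the idea; the plain Python dicts and lists are hypothetical stand-ins for real TensorFlow checkpoint tensors:

```python
# Sketch of checkpoint averaging: for each variable name, take the
# arithmetic mean of its values across the given checkpoints.
# The dicts here are toy stand-ins for real checkpoints.

def average_checkpoints(checkpoints):
    """Average a list of {variable_name: [floats]} dicts element-wise."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        length = len(checkpoints[0][name])
        averaged[name] = [
            sum(ckpt[name][i] for ckpt in checkpoints) / n
            for i in range(length)
        ]
    return averaged

# Toy example: two "checkpoints" holding one weight vector each.
ckpts = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
]
print(average_checkpoints(ckpts))  # {'w': [2.0, 3.0]}
```

Averaging smooths out step-to-step noise in the weights, which is exactly why single-checkpoint BLEU scores can fluctuate so much from one evaluation to the next.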
The BLEU_uncased score is as low as

Is this correct behavior? Why does the BLEU score vary so much across different checkpoints? Were you able to reproduce the results reported in the paper for the base model?
@jatinganhotra, you are using the *.sgm files as the source and reference, but these files contain extra SGML markup.
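To illustrate the markup problem: WMT test sets ship as SGML, with each sentence wrapped in a `<seg>` element, so the plain text has to be extracted before scoring with `t2t-bleu`. A rough sketch of that extraction (the regex and function name are my own, not a tensor2tensor API):

```python
import re

# Each sentence in a WMT .sgm file sits inside <seg id="...">...</seg>;
# everything else (doc headers, attributes) is markup that would
# corrupt a BLEU computation if scored verbatim.
SEG_RE = re.compile(r"<seg[^>]*>(.*?)</seg>", re.DOTALL)

def sgm_to_lines(sgm_text):
    """Return the plain-text segments of an SGML test set, in order."""
    return [m.strip() for m in SEG_RE.findall(sgm_text)]

sample = ('<doc docid="x"><seg id="1">Hello world.</seg>\n'
          '<seg id="2">Second line.</seg></doc>')
print(sgm_to_lines(sample))  # ['Hello world.', 'Second line.']
```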
@stylelohan @vman049 It is well known that the final BLEU depends on the batch size and the number of GPUs. You also have to increase the number of training steps (depending on your batch size and number of GPUs) if you want to get closer to the originally reported BLEU. For details, see e.g. http://ufal.mff.cuni.cz/pbml/110/art-popel-bojar.pdf
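The intuition is that halving the effective batch size halves the number of tokens seen per step, so the step count has to grow to keep the total amount of training data roughly constant. A minimal sketch of that scaling rule (the function name and the 100k-steps/4096-tokens baseline are illustrative assumptions, not tensor2tensor defaults):

```python
# Sketch: scale the training-step count inversely with the effective
# batch size (per-GPU batch size times number of GPUs), so the model
# sees roughly the same total number of tokens.

def scaled_train_steps(base_steps, base_batch, new_batch, num_gpus=1):
    """Return a step count matching base_steps * base_batch total tokens."""
    effective_batch = new_batch * num_gpus
    return round(base_steps * base_batch / effective_batch)

# Hypothetical baseline: 100k steps at batch size 4096 on one GPU.
print(scaled_train_steps(100_000, 4096, 2048))  # 200000
```

This is only a first-order rule of thumb; the Popel & Bojar paper linked above discusses the batch-size/step-count interaction in much more detail.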
Hi, I am a little bit confused about why we should set
Yes, that was perhaps a typo in one of the previous posts. If you want to train German-to-English, you would use

I guess the typo/confusion stems from the fact that in WMT 2014 and older, the same set of sentences was used as the test set for both directions (en->de and de->en), and "deen" appears in the test-set filenames even for the en->de direction. In WMT14, half of the sentences come from originally German newspapers/websites; the other half is originally English. You can use
Thank you very much for your reply! Your reply perfectly resolves my confusion! I am not familiar with
And these two commands will return the same scores, right? Thanks!
SacreBLEU is now the standard way of evaluating BLEU (and other metrics such as chrF). It provides a "signature" and allows reproducible BLEU scores - so it is the recommended way of reporting BLEU for papers (including the signature). SacreBLEU has several options, e.g. which tokenization to use (the default is called
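To make the metric itself concrete, here is a stripped-down BLEU sketch (single reference, up to 4-grams, no smoothing, whitespace tokenization only); real SacreBLEU handles tokenization variants, multiple references, and signatures, so treat this purely as an illustration of what the reported scores measure:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Corpus BLEU: brevity penalty times geometric mean of n-gram precisions."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_ng & r_ng).values())  # clipped counts
            totals[n - 1] += sum(h_ng.values())
    if 0 in totals or 0 in matches:
        return 0.0  # no smoothing in this sketch
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_prec)

print(bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # 100.0
```

Because the score is a product of four n-gram precisions and a length penalty, small weight changes between checkpoints can move several of those factors at once, which is part of why per-checkpoint BLEU is so noisy.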
Yes, you should not use the sgm (SGML) files with
Following your advice, I could obtain a reasonable BLEU score.
Description
Dear T2T team,
I was trying to reproduce the result of the Transformer base model in the original paper "Attention Is All You Need"; however, I found that the hyper-parameters in the current transformer_base_single_gpu hparams set are a bit different from the original paper's. So I would like to confirm whether my results correspond with yours. Many thanks for any help you provide! It would also be very helpful if anyone is willing to share related experiment results or comments. Thanks a lot!
Below are the result and commands I used:
For newstest2013 dev, uncased bleu = 23.09 and cased bleu = 22.63.
Training command:
```shell
CUDA_VISIBLE_DEVICES=0 t2t-trainer \
  --data_dir=$DATA_DIR \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --problem=translate_ende_wmt32k \
  --output_dir=$TRAIN_DIR
```
Decoding command:
```shell
TMP_DIR=t2t_datagen/dev
DECODE_FILE=$TMP_DIR/newstest2013.en
REF_FILE=$TMP_DIR/newstest2013.de

t2t-decoder \
  --data_dir=$DATA_DIR \
  --problem=translate_ende_wmt32k \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --output_dir=$TRAIN_DIR \
  --decode_hparams="beam_size=4,alpha=0.6" \
  --decode_from_file=$DECODE_FILE \
  --decode_to_file=translation.de

t2t-bleu --translation=translation.de --reference=$REF_FILE
```
Environment information
OS:

```
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.5 LTS
Release:        16.04
Codename:       xenial
```

```
$ pip freeze | grep tensor
mesh-tensorflow==0.0.5
tensor2tensor==1.13.1
tensorboard==1.13.1
tensorflow-datasets==1.0.1
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
tensorflow-metadata==0.13.0
tensorflow-probability==0.6.0

$ python -V
Python 3.5.2
```

Other info:
GeForce GTX 1080 Ti
CUDA Version 10.0.130