DodecaDialogue model - EmpatheticDialogues task reproducing paper's results #3473
Comments
Hi there! You can find the generation settings in Table 11 in the paper.
For the BLEU scores in the DodecaDialogue paper, we computed tokenized BLEU; the default BLEU output by ParlAI is computed on the generated string and can give slightly worse values. You can see the tokenized BLEU scores by setting the corresponding flag. One final note: the avg-BLEU we report is the average of BLEU-1, -2, -3, and -4. Hope that helps!
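To make the averaging concrete, here is a minimal sketch using NLTK rather than ParlAI's own metric code; the example sentences are invented and the smoothing choice is arbitrary, but it shows avg-BLEU as the mean of BLEU-1 through BLEU-4 computed over tokenized text:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented example pair; the point is the averaging, not the exact numbers.
reference = "i am so sorry to hear that".split()
hypothesis = "i am sorry to hear that".split()

smooth = SmoothingFunction().method1
bleus = []
for n in range(1, 5):
    # Uniform weights over the first n n-gram orders give BLEU-n.
    weights = tuple(1.0 / n for _ in range(n))
    bleus.append(sentence_bleu([reference], hypothesis,
                               weights=weights, smoothing_function=smooth))

avg_bleu = sum(bleus) / len(bleus)
print([round(b, 4) for b in bleus], round(avg_bleu, 4))
```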
Edit: didn't see all the other comments before I wrote this. Leaving for posterity. I will mention that we have multiple implementations of BLEU. The default one uses a very naive tokenizer (not much more than splitting on spaces). The fairseq one uses the model's tokenizer and tends to give much better numbers (it awards partial credit more fairly). The latter is standard in many papers and was used in DodecaDialogue, IIRC.
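As a toy illustration of why the tokenizer matters (the "subword" split below is a hand-made stand-in, not fairseq's actual tokenizer), scoring the same pair of strings under two tokenizations gives different BLEU values:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
ref, hyp = "it's a lovely day outside", "its a lovely day outside"

# Naive whitespace tokenization: "it's" vs "its" earns no credit at all.
ref_ws, hyp_ws = ref.split(), hyp.split()

# Toy subword-style tokenization: the shared piece "it" now earns partial credit.
ref_sw = ["it", "'s", "a", "lovely", "day", "out", "side"]
hyp_sw = ["it", "s", "a", "lovely", "day", "out", "side"]

for name, r, h in [("whitespace", ref_ws, hyp_ws), ("subword", ref_sw, hyp_sw)]:
    print(name, round(sentence_bleu([r], h, smoothing_function=smooth), 4))
```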
Can you "pip install fairseq" and then rerun?
Running the following command and averaging the "fairseq_bleu" scores, the avg-BLEU is 6.01. I would also like to ask your opinion about comparing the Dodeca model with another transformer-based model. For the avg-BLEU score, I think the best way to make this comparison is to use the same BLEU implementation for both models. However, the tokenizers used may differ (assuming I use the fairseq implementation). Does reporting a BLEU score for both models using different tokenizers seem legitimate to you, or do you think I should follow another approach?
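For anyone reproducing this, the averaging step is just the mean of the four fairseq BLEU metrics in the eval report. The key names below are an assumption (check the exact keys your report prints), and the values are placeholders:

```python
# Placeholder report values; substitute the numbers from your own eval run.
# Key names are assumed to follow the pattern fairseq_bleu1..fairseq_bleu4.
report = {"fairseq_bleu1": 10.0, "fairseq_bleu2": 7.0,
          "fairseq_bleu3": 4.0, "fairseq_bleu4": 2.0}

avg_bleu = sum(report[f"fairseq_bleu{n}"] for n in range(1, 5)) / 4
print(f"avg-BLEU = {avg_bleu:.2f}")
```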
Could you try that same command with one more setting changed? Regarding comparing to another transformer model: indeed, different tokenizers might yield different BLEU scores. You could use the default ParlAI BLEU computation, which is string-based, if you want to compare across models, since then you're comparing the raw generated text.
The avg-BLEU is 6.2 |
Indeed, BLEU scores should only be compared if they use the same tokenization. The standard BLEU scores (the 4% ones) should be the same across different tokenizers. Unfortunately, it's definitely common in the literature to use the token-based ones. @klshuster, didn't we release a pretrained model for this? Their ppl and F1 are a little worse than ours. I'd suggest we try replicating with our released model next.
@stephenroller I already used the pre-trained model. To which pre-trained model are you referring? |
Kurt has been digging into this a little bit. He's finding discrepancies between our internal and external implementations of the metrics, but hasn't tracked down the difference yet. We confirmed we can replicate using the internal implementation, and will be looking to fix the external one.
Okay, it looks like it's an issue with macro vs. micro averages and a global correction statistic. We'll fix it sometime next week.
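For readers unfamiliar with the distinction, here is a toy illustration (again using NLTK, not ParlAI's metric code) of how macro-averaged per-sentence BLEU and micro-averaged corpus BLEU, which pools n-gram counts globally, can disagree on the same data:

```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
refs = [[["a", "b", "c", "d"]], [["e", "f", "g", "h", "i", "j"]]]
hyps = [["a", "b", "c", "d"], ["e", "x", "y", "z", "q", "r"]]

# Macro: average the per-example sentence-level BLEU scores.
macro = sum(sentence_bleu(r, h, smoothing_function=smooth)
            for r, h in zip(refs, hyps)) / len(hyps)

# Micro: pool n-gram counts across the whole corpus before computing BLEU.
micro = corpus_bleu(refs, hyps, smoothing_function=smooth)
print(f"macro = {macro:.4f}, micro = {micro:.4f}")
```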
Are there any updates on the issue? |
Once #3518 lands, it should be all good!
Going to go ahead and close this; please feel free to reopen if you run into more issues.
Hi,
I used your code in order to reproduce the results of the DodecaDialogue paper on the EmpatheticDialogues task. However, I could not reproduce the results for the avg-BLEU metric.
Can you please report the exact decoding parameters (beam-size, beam-min-length, beam-block-ngram, beam-context-block-ngram, etc.) that you used in the referenced paper for the MT+FT model on the EmpatheticDialogues task?
I am a little bit confused by the avg-BLEU metric, as I cannot reproduce the results reported in the paper (8.1 for the MT+FT and 8.4 for the MT).