
DodecaDialogue model - EmpatheticDialogues task reproducing paper's results #3473

Closed
manzar96 opened this issue Feb 26, 2021 · 14 comments

@manzar96

Hi,

I used your code to reproduce the results of the DodecaDialogue paper on the EmpatheticDialogues task, but I could not reproduce the reported avg-BLEU.

Can you please report the exact decoding parameters (beam-size, beam-min-length, beam-block-ngram, beam-context-block-ngram, etc.) that you used in the paper for the MT+FT model on the EmpatheticDialogues task?

I am a little confused by the avg-BLEU metric, as I cannot reproduce the results reported in the paper (8.1 for MT+FT and 8.4 for MT).

@klshuster klshuster self-assigned this Feb 26, 2021
@klshuster
Contributor

Hi there!

You can find the generation settings in Table 11 in the paper, but I'll just list them here:

--inference beam --beam-size 10 --beam-min-length 5 --beam-block-ngram 3 --beam-context-block-ngram -1

For the BLEU scores in the DodecaDialogue paper, we computed tokenized BLEU scores; the default BLEU output by ParlAI is computed on the generated string and can result in slightly worse values. You can see the tokenized BLEU scores by setting --compute-tokenized-bleu true at inference time.

One final note is that the avg-BLEU we report is the average of BLEU-1,2,3,4. Hope that helps!
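To spell out the arithmetic, here is a minimal sketch of how one could compute avg-BLEU from the per-order scores in an eval report. The key names and numbers below are illustrative (the tokenized scores show up as fairseq_bleu* later in this thread, but check your own report for the exact names):

```python
# Minimal sketch of "avg-BLEU = mean of BLEU-1..4", assuming the per-order
# scores have been read out of an eval report. Key names and values below
# are illustrative, not real results.
report = {
    "fairseq_bleu1": 12.3,
    "fairseq_bleu2": 6.8,
    "fairseq_bleu3": 3.1,
    "fairseq_bleu4": 1.8,
}

avg_bleu = sum(report[f"fairseq_bleu{n}"] for n in range(1, 5)) / 4
print(f"avg-BLEU = {avg_bleu:.2f}")  # mean of BLEU-1, -2, -3, -4
```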

@manzar96
Author

manzar96 commented Feb 26, 2021

Thanks a lot!

What do you mean by tokenized BLEU? I understand the "default" BLEU (computed on the generated string), but I don't understand how tokenized BLEU works.

I tried to reproduce the results of the paper for the MT+FT model on the EmpatheticDialogues task using the following command:
parlai eval_model -mf zoo:dodecadialogue/empathetic_dialogues_ft/model -t empathetic_dialogues --save-world-logs true --report-filename /home/manzar/projects/ParlAI/myoutputs2/test/ed.json --skip-generation false -dt test --metrics ppl,bleu --inference beam --beam-size 10 --beam-min-length 5 --beam-block-ngram 3 --beam-context-block-ngram 3 --compute-tokenized-bleu true
(Screenshot of evaluation results.)

The avg-BLEU score according to the above results is approximately 4.8.

However, the value reported in the paper is 8.1.
Am I missing something?

@stephenroller
Contributor

stephenroller commented Feb 27, 2021

Edit: didn't see all the other comments before I wrote this. Leaving for posterity.

I will mention that we have multiple implementations of BLEU. The default one uses a very naive tokenizer (not much more than splitting on spaces). The fairseq one uses the model's own tokenizer and tends to give much better numbers (it awards partial credit more fairly).

The latter is standard in many papers, and was used in DoDeca IIRC.
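To make the distinction concrete, here is an illustrative sketch using NLTK (not ParlAI's actual implementation): the same hypothesis/reference pair can score quite differently depending on whether BLEU is computed over whitespace-split words or over a model-style tokenization. The model_tokenize helper below is a hypothetical stand-in for a real subword tokenizer.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

hypothesis = "i'm sorry to hear that, that sounds awful"
reference = "oh no, i'm so sorry to hear that!"

smooth = SmoothingFunction().method1

# "String-based" BLEU: naive whitespace tokenization of the raw strings,
# so "that," and "that!" never match "that".
naive_score = sentence_bleu(
    [reference.split()], hypothesis.split(), smoothing_function=smooth
)

def model_tokenize(text):
    # Stand-in for a model's subword/BPE tokenizer: crudely splits off
    # punctuation and contractions. Purely illustrative.
    for punct in ["'", ",", "!", "?", "."]:
        text = text.replace(punct, f" {punct}")
    return text.split()

# "Tokenized" BLEU: computed over the model-style tokens, which awards
# partial credit for near-matches that the naive split misses.
tokenized_score = sentence_bleu(
    [model_tokenize(reference)], model_tokenize(hypothesis), smoothing_function=smooth
)

print(f"naive BLEU-4:     {naive_score:.4f}")
print(f"tokenized BLEU-4: {tokenized_score:.4f}")
```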

@stephenroller
Contributor

Can you "pip install fairseq" and then rerun?

@manzar96
Author

manzar96 commented Feb 27, 2021

Running the following command:
parlai eval_model -mf zoo:dodecadialogue/empathetic_dialogues_ft/model -t empathetic_dialogues --save-world-logs true --report-filename /home/manzar/projects/ParlAI/myoutputs2/test/ed.json --skip-generation false -dt test --metrics ppl,bleu --inference beam --beam-size 10 --beam-min-length 5 --beam-block-ngram 3 --beam-context-block-ngram 3 --compute-tokenized-bleu true
The results are:

(Screenshot of evaluation results.)

Averaging the "fairseq_bleu" scores, the avg-BLEU is 6.01.

I would also like to ask your opinion about comparing the Dodeca model with another transformer-based model. For avg-BLEU, I think the fairest comparison uses the same BLEU implementation for both models. However, if I use the fairseq implementation, the tokenizers may differ between the two models. Does reporting BLEU scores computed with different tokenizers seem legitimate to you, or should I follow another approach?

@klshuster
Contributor

Could you try the same command, but with --beam-context-block-ngram -1? That is the setting used to generate the scores in the paper.

Regarding comparing to another transformer model: indeed, different tokenizers might yield different BLEU scores. If you want to compare across models, you could use the default ParlAI BLEU computation, which is string-based, since you'd then be comparing the raw generated text.

@manzar96
Author

manzar96 commented Mar 2, 2021

(Screenshot of evaluation results.)

With --beam-context-block-ngram -1, the avg-BLEU is 6.2.

@stephenroller
Contributor

Indeed, BLEU scores should only be compared if they use the same tokenization. The standard BLEU scores (the ~4% ones) should be the same across different tokenizers. Unfortunately, it's quite common in the literature to use the token-based ones.

@klshuster didn't we release a pretrained model for this? Their ppl and f1 are a little worse than ours. I'd suggest we try replicating with our released model next.

@manzar96
Author

manzar96 commented Mar 3, 2021

@stephenroller I already used the pre-trained model. To which pre-trained model are you referring?
Is there something else I can do?

@stephenroller
Contributor

Kurt has been digging into this a little bit. He's finding discrepancies between our internal and external implementations of the metrics, but we haven't tracked down the difference yet. We confirmed we can replicate the numbers using the internal implementation, and will be looking to fix the external one.

@stephenroller
Contributor

Okay, it looks like it's an issue with macro vs. micro averaging and a global correction statistic. We'll fix it sometime next week.
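For context, here is an illustrative sketch of the macro/micro distinction using NLTK (not the actual ParlAI fix): macro-averaging scores each example independently and then averages, while corpus-level (micro) BLEU pools n-gram counts and length statistics, including the brevity penalty, over the whole set before computing a single score, and the two can differ noticeably.

```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

# Toy hypotheses/references; in practice these would come from the eval world logs.
hypotheses = [
    "i am so sorry to hear that".split(),
    "that sounds like a lot of fun".split(),
]
references = [
    ["oh no i am sorry to hear that".split()],
    ["wow that sounds like so much fun".split()],
]

smooth = SmoothingFunction().method1

# Macro average: score each example independently, then take the mean.
macro = sum(
    sentence_bleu(refs, hyp, smoothing_function=smooth)
    for refs, hyp in zip(references, hypotheses)
) / len(hypotheses)

# Micro / corpus-level: pool n-gram counts and length statistics
# (including the brevity penalty) across the whole set first.
micro = corpus_bleu(references, hypotheses, smoothing_function=smooth)

print(f"macro-averaged BLEU: {macro:.4f}")
print(f"corpus (micro) BLEU: {micro:.4f}")
```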

@manzar96
Author

Are there any updates on the issue?

@klshuster
Contributor

Once #3518 lands it should be all good!

@klshuster
Contributor

Going to go ahead and close this; please feel free to reopen if you run into more issues.
