DodecaDialogue model - EmpatheticDialogues task reproducing paper's results #3473
Comments
Hi there! You can find the generation settings in Table 11 in the paper.
For the BLEU scores in the DodecaDialogue paper, we computed tokenized BLEU; the default BLEU output by ParlAI is computed on the generated string and can give slightly worse values. You can see the tokenized BLEU scores by setting the corresponding flag. One final note: the avg-BLEU we report is the average of BLEU-1, -2, -3, and -4. Hope that helps!
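To make the averaging concrete, here is a minimal sketch using NLTK rather than ParlAI's own metric code; the example sentences are invented and the smoothing choice is arbitrary, but it shows avg-BLEU as the mean of BLEU-1 through BLEU-4 computed over tokenized text:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented example pair; the point is the averaging, not the exact numbers.
reference = "i am so sorry to hear that".split()
hypothesis = "i am sorry to hear that".split()

smooth = SmoothingFunction().method1
bleus = []
for n in range(1, 5):
    # Uniform weights over the first n n-gram orders give BLEU-n.
    weights = tuple(1.0 / n for _ in range(n))
    bleus.append(sentence_bleu([reference], hypothesis,
                               weights=weights, smoothing_function=smooth))

avg_bleu = sum(bleus) / len(bleus)
print([round(b, 4) for b in bleus], round(avg_bleu, 4))
```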
Edit: didn't see all the other comments before I wrote this. Leaving for posterity. I will mention that we have multiple implementations of BLEU. The default one uses a very naive tokenizer (not much more than splitting on spaces). The fairseq one uses the model's tokenizer and tends to give much better numbers (it awards partial credit more fairly). The latter is standard in many papers and was used in DodecaDialogue, IIRC.
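As a toy illustration of why the tokenizer matters (the "subword" split below is a hand-made stand-in, not fairseq's actual tokenizer), scoring the same pair of strings under two tokenizations gives different BLEU values:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
ref, hyp = "it's a lovely day outside", "its a lovely day outside"

# Naive whitespace tokenization: "it's" vs "its" earns no credit at all.
ref_ws, hyp_ws = ref.split(), hyp.split()

# Toy subword-style tokenization: the shared piece "it" now earns partial credit.
ref_sw = ["it", "'s", "a", "lovely", "day", "out", "side"]
hyp_sw = ["it", "s", "a", "lovely", "day", "out", "side"]

for name, r, h in [("whitespace", ref_ws, hyp_ws), ("subword", ref_sw, hyp_sw)]:
    print(name, round(sentence_bleu([r], h, smoothing_function=smooth), 4))
```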
Can you "pip install fairseq" and then rerun?
Running the following command and averaging the "fairseq_bleu" scores, the avg-BLEU is 6.01. I would also like to ask your opinion about comparing the Dodeca model with another transformer-based model. For the avg-BLEU score, I think the best way to make this comparison is to use the same BLEU implementation for both models. However, the tokenizers used may differ (assuming I use the fairseq implementation). Does reporting a BLEU score for both models using different tokenizers seem legitimate to you, or do you think I should follow another approach?
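For anyone reproducing this, the averaging step is just the mean of the four fairseq BLEU metrics in the eval report. The key names below are an assumption (check the exact keys your report prints), and the values are placeholders:

```python
# Placeholder report values; substitute the numbers from your own eval run.
# Key names are assumed to follow the pattern fairseq_bleu1..fairseq_bleu4.
report = {"fairseq_bleu1": 10.0, "fairseq_bleu2": 7.0,
          "fairseq_bleu3": 4.0, "fairseq_bleu4": 2.0}

avg_bleu = sum(report[f"fairseq_bleu{n}"] for n in range(1, 5)) / 4
print(f"avg-BLEU = {avg_bleu:.2f}")
```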
Could you try that same command with one more setting changed? Regarding comparing to another transformer model: indeed, different tokenizers might yield different BLEU scores. You could use the default ParlAI BLEU computation, which is string-based, if you want to compare across models, since then you're comparing the raw generated text.
The avg-BLEU is 6.2 |
Indeed, BLEU scores should only be compared if they use the same tokenization. The standard BLEU scores (the 4% ones) should be the same across different tokenizers. Unfortunately, it's definitely common in the literature to use the token-based ones. @klshuster, didn't we release a pretrained model for this? Their ppl and F1 are a little worse than ours. I'd suggest we try replicating with our released model next.
@stephenroller I already used the pre-trained model. To which pre-trained model are you referring? |
Kurt has been digging into this a little bit. He's finding discrepancies between our internal and external implementations of the metrics, but hasn't tracked down the difference yet. We confirmed we can replicate using the internal implementation, and will be looking to fix the external one.
Okay, it looks like it's an issue with macro vs. micro averages and a global correction statistic. We'll fix it sometime next week.
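For readers unfamiliar with the distinction, here is a toy illustration (again using NLTK, not ParlAI's metric code) of how macro-averaged per-sentence BLEU and micro-averaged corpus BLEU, which pools n-gram counts globally, can disagree on the same data:

```python
from nltk.translate.bleu_score import sentence_bleu, corpus_bleu, SmoothingFunction

smooth = SmoothingFunction().method1
refs = [[["a", "b", "c", "d"]], [["e", "f", "g", "h", "i", "j"]]]
hyps = [["a", "b", "c", "d"], ["e", "x", "y", "z", "q", "r"]]

# Macro: average the per-example sentence-level BLEU scores.
macro = sum(sentence_bleu(r, h, smoothing_function=smooth)
            for r, h in zip(refs, hyps)) / len(hyps)

# Micro: pool n-gram counts across the whole corpus before computing BLEU.
micro = corpus_bleu(refs, hyps, smoothing_function=smooth)
print(f"macro = {macro:.4f}, micro = {micro:.4f}")
```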
Are there any updates on the issue? |
Once #3518 lands, it should be all good!
Going to go ahead and close this; please feel free to reopen if you run into more issues.
Hi,
I used your code in order to reproduce the results of the DodecaDialogue paper on the EmpatheticDialogues task. However, I could not reproduce the results for the avg-BLEU metric.
Can you please report the exact decoding parameters (beam-size, beam-min-length, beam-block-ngram, beam-context-block-ngram, etc.) that you used in the referenced paper for the MT+FT model on the EmpatheticDialogues task?
I am a little bit confused by the avg-BLEU metric, as I cannot reproduce the results reported in the paper (8.1 for the MT+FT and 8.4 for the MT).