[Metrics] Fairseq BLEU Re-Implementation #3518
Merged
Patch description
Prior experiments, e.g. for dodecadialogue, have used fairseq's tokenized BLEU scores. Inspection of the fairseq generation script shows that it computes BLEU at the corpus (macro) level, considering all predictions and references together. Our original implementation did not account for the brevity penalty, which is computed from totals across all predictions/references, and thus our scores were misaligned. I've therefore re-implemented fairseq's BLEU scoring within ParlAI's metrics system, and verified that the computed BLEU score for, e.g., the command referenced in #3473 is nearly identical to the score reported in the paper.
Testing steps
Added tests to CI:
Additionally: