This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[Metrics] Fairseq BLEU Re-Implementation #3518

Merged
merged 8 commits into from
Mar 16, 2021

Conversation

@klshuster (Contributor) commented Mar 12, 2021

Patch description
Prior experiments (e.g., for dodecadialogue) have used fairseq's tokenized BLEU scores to measure BLEU. Inspection of the fairseq generation script shows that it computes BLEU at the corpus (macro) level, considering all predictions and references together. Our original implementation did not account for the brevity penalty, which is computed once over all predictions/references combined, so our scores were misaligned.

I've thus re-implemented fairseq's BLEU scoring within ParlAI's metrics system. I have verified that the computed BLEU score for, e.g., the command referenced in #3473 nearly matches the score reported in the paper.
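To illustrate the distinction described above, here is a minimal sketch (not ParlAI's or fairseq's actual implementation; all names are illustrative) of corpus-level BLEU: n-gram match counts and the brevity penalty are accumulated over all prediction/reference pairs before any score is computed, rather than averaging per-sentence scores.

```python
# Sketch of corpus-level ("macro") BLEU, assuming pre-tokenized inputs.
# Illustrative only; function and variable names are not from the PR.
import math
from collections import Counter


def ngrams(tokens, n):
    """Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(predictions, references, max_n=4):
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # candidate n-gram counts, per order
    pred_len = ref_len = 0  # corpus totals for the brevity penalty
    for pred, ref in zip(predictions, references):
        pred_len += len(pred)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            p, r = ngrams(pred, n), ngrams(ref, n)
            matches[n - 1] += sum(min(c, r[g]) for g, c in p.items())
            totals[n - 1] += sum(p.values())
    if min(matches) == 0:
        return 0.0
    # geometric mean of the n-gram precisions
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # brevity penalty computed ONCE over corpus-level lengths --
    # this is the part a per-sentence average gets wrong
    log_bp = min(0.0, 1.0 - ref_len / pred_len)
    return 100.0 * math.exp(log_prec + log_bp)
```

Because the brevity penalty depends on the ratio of total prediction length to total reference length, averaging sentence-level BLEU scores gives a different (and here, misaligned) result from this corpus-level computation.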

Testing steps
Added tests to CI:

$ pytest -k TestFairseqBleuMetric -v
==== test session starts ====
...
collected 477 items / 476 deselected / 1 selected

test_metrics.py::TestFairseqBleuMetric::test_scorer PASSED                                                                                                                     [100%]

==== slowest 10 durations ====
0.12s call     tests/test_metrics.py::TestFairseqBleuMetric::test_scorer

(2 durations < 0.005s hidden.  Use -vv to show these durations.)
==== 1 passed, 476 deselected, 3 warnings in 5.13s ====

Additionally:

$ parlai eval_model -mf zoo:dodecadialogue/empathetic_dialogues_ft/model -t empathetic_dialogues --skip-generation false -dt test --metrics ppl,bleu --inference beam --beam-size 10 --beam-min-length 5 --beam-block-ngram 3 --beam-context-block-ngram -1 --compute-tokenized-bleu true -bs 32
.
.
.
 fairseq_bleu1  fairseq_bleu2  fairseq_bleu3  fairseq_bleu4
         15.61          8.018          5.159          3.485

@stephenroller (Contributor) left a comment:
Great sleuthing

Resolved review threads: parlai/core/metrics.py, tests/test_metrics.py
@klshuster klshuster merged commit fe543b5 into master Mar 16, 2021
@klshuster klshuster deleted the fairseq_bleu branch March 16, 2021 22:08
3 participants