[Metrics] Fairseq BLEU Re-Implementation #3518
Merged
Patch description
Prior experiments, e.g. for dodecadialogue, have used fairseq's tokenized BLEU scores. Inspection of the fairseq generation script shows that it computes BLEU at the corpus (macro) level, considering all predictions and references together. Our original implementation did not account for the brevity penalty, which is computed from totals across all predictions/references, and thus our scores were misaligned. I've therefore re-implemented fairseq's BLEU scoring within ParlAI's metrics system, and verified that the computed BLEU score for, e.g., the command referenced in #3473 is nearly identical to the score reported in the paper.
Testing steps
Added tests to CI:
Additionally: