Possible errors in ROUGE-L evaluation #28

Closed
grusky opened this issue Jul 17, 2023 · 1 comment

Comments

grusky (Contributor) commented Jul 17, 2023

ROUGE-L is sensitive to sentence tokenization.

The data format used by the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline does not appear to keep track of sentence tokenization. When sentence tokenization is not provided to ROUGE-1.5.5, multi-sentence references and hypotheses are evaluated as one long sentence. As a result, ROUGE-L scores produced by this evaluation pipeline (1) may be lower than expected compared to more standard ROUGE evaluation that uses tokenized sentences, and (2) probably do not match the ROUGE-L scores in the Newsroom paper, which are computed using sentence tokenization.
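To illustrate the effect, here is a minimal sketch of why the score changes. This is not the ROUGE-1.5.5 implementation: tokenization, stemming, and the union-LCS bookkeeping are all simplified, and it uses a plain F1 rather than ROUGE's recall-weighted F-measure. The toy texts and helper names are made up for the example.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f(lcs, ref_len, hyp_len):
    """ROUGE-L style F-score from an LCS count (plain F1 for simplicity)."""
    if lcs == 0:
        return 0.0
    r, p = lcs / ref_len, lcs / hyp_len
    return 2 * r * p / (r + p)

# Same two sentences in both texts, but in a different order.
ref_sents = [["the", "cat", "sat"], ["the", "dog", "ran"]]
hyp_sents = [["the", "dog", "ran"], ["the", "cat", "sat"]]

ref_flat = [t for s in ref_sents for t in s]
hyp_flat = [t for s in hyp_sents for t in s]

# (1) No sentence tokenization: each text is one long "sentence".
flat_f = rouge_l_f(lcs_len(ref_flat, hyp_flat), len(ref_flat), len(hyp_flat))

# (2) With sentence tokenization: credit each reference sentence separately
# (a simplification of ROUGE-1.5.5's union LCS, which also avoids
# double-counting matched tokens).
split_lcs = sum(lcs_len(s, hyp_flat) for s in ref_sents)
split_f = rouge_l_f(split_lcs, len(ref_flat), len(hyp_flat))

print(f"one long sentence: {flat_f:.3f}")  # 0.500
print(f"sentence-level:    {split_f:.3f}")  # 1.000
```

With sentence boundaries every sentence matches fully, but as one long sequence the LCS can only follow one sentence ordering, so the score drops even though the content is identical.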

I would recommend adding a notice to the README that the evaluation pipeline does not keep track of sentence tokenization, which may result in lower-than-expected ROUGE-L scores, and that this pipeline should not be used for publishable evaluation.

yoavartzi (Member) commented

Updated the README at the root and added a note in the evaluation directory. Thanks!
