Possible errors in ROUGE-L evaluation #28

Closed
grusky opened this issue Jul 17, 2023 · 1 comment

Comments

grusky (Contributor) commented Jul 17, 2023

ROUGE-L is sensitive to sentence tokenization.

The data format used by the newsroom-run -> newsroom-score -> newsroom-tables evaluation pipeline does not appear to keep track of sentence tokenization. When sentence tokenization is not provided to ROUGE-1.5.5, multi-sentence references and hypotheses are evaluated as one long sentence. As a result, ROUGE-L scores produced by this evaluation pipeline (1) may be lower than expected compared to more standard ROUGE evaluation that uses tokenized sentences, and (2) probably do not match the ROUGE-L scores in the Newsroom paper, which are computed using sentence tokenization.
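To illustrate the effect, here is a minimal sketch of why the score changes. This is not the ROUGE-1.5.5 implementation: tokenization, stemming, and the union-LCS bookkeeping are all simplified, and it uses a plain F1 rather than ROUGE's recall-weighted F-measure. The toy texts and helper names are made up for the example.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f(lcs, ref_len, hyp_len):
    """ROUGE-L style F-score from an LCS count (plain F1 for simplicity)."""
    if lcs == 0:
        return 0.0
    r, p = lcs / ref_len, lcs / hyp_len
    return 2 * r * p / (r + p)

# Same two sentences in both texts, but in a different order.
ref_sents = [["the", "cat", "sat"], ["the", "dog", "ran"]]
hyp_sents = [["the", "dog", "ran"], ["the", "cat", "sat"]]

ref_flat = [t for s in ref_sents for t in s]
hyp_flat = [t for s in hyp_sents for t in s]

# (1) No sentence tokenization: each text is one long "sentence".
flat_f = rouge_l_f(lcs_len(ref_flat, hyp_flat), len(ref_flat), len(hyp_flat))

# (2) With sentence tokenization: credit each reference sentence separately
# (a simplification of ROUGE-1.5.5's union LCS, which also avoids
# double-counting matched tokens).
split_lcs = sum(lcs_len(s, hyp_flat) for s in ref_sents)
split_f = rouge_l_f(split_lcs, len(ref_flat), len(hyp_flat))

print(f"one long sentence: {flat_f:.3f}")  # 0.500
print(f"sentence-level:    {split_f:.3f}")  # 1.000
```

With sentence boundaries every sentence matches fully, but as one long sequence the LCS can only follow one sentence ordering, so the score drops even though the content is identical.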

I would recommend adding a notice to the README that the evaluation pipeline does not keep track of sentence tokenization, which may result in lower-than-expected ROUGE-L scores, and that this pipeline should not be used for publishable evaluation.

yoavartzi (Member) commented

Updated the README at the root and added a note in the evaluation directory. Thanks!
