BLEU: add COCO/PTB tokenization mode (tokenizer_name) for parity with pycocoevalcap #698

bavi404 · 2025-08-13T08:49:11Z

fixes #693

Summary

Add tokenizer_name option to bleu metric.
New modes: "coco"/"ptb" (COCO PTBTokenizer; requires pycocoevalcap) and "whitespace".
Default behavior unchanged (tokenizer_13a).

Motivation

Address the discrepancy with pycocoevalcap BLEU scores caused by differences between PTB and 13a tokenization, especially around commas and periods.
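
A minimal sketch of why this matters, using simplified stand-ins rather than the real tokenizers. It assumes that 13a-style tokenization keeps punctuation as separate tokens while the COCO pipeline discards punctuation tokens after tokenizing. The names split_punct, coco_like, ngram_precision and the DROPPED set are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

# Punctuation tokens the COCO-style stand-in discards.
# Illustrative subset only (an assumption, not the full pycocoevalcap list).
DROPPED = {".", ",", "?", "!", ":", ";"}


def split_punct(text):
    """Rough stand-in for 13a-style splitting: punctuation becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


def coco_like(text):
    """Rough stand-in for COCO/PTB-style output: same split, punctuation tokens removed."""
    return [tok for tok in split_punct(text) if tok not in DROPPED]


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision, the per-order quantity that BLEU combines."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total if total else 0.0


prediction = "a dog runs, fast."
reference = "a dog runs fast"

for name, tokenize in (("punctuation kept", split_punct), ("punctuation dropped", coco_like)):
    cand_tokens, ref_tokens = tokenize(prediction), tokenize(reference)
    scores = [round(ngram_precision(cand_tokens, ref_tokens, n), 3) for n in (1, 2)]
    print(name, cand_tokens, scores)
```

With punctuation kept, the comma and period count as unmatched unigrams and also break up bigrams, so both precisions fall even though the words are identical. With punctuation dropped, the two sentences tokenize identically.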

Changes

metrics/bleu/tokenizer_13a.py: add CocoPTBTokenizer, WhitespaceTokenizer.
metrics/bleu/bleu.py: new tokenizer_name param; accepts "coco"|"ptb"|"whitespace".
metrics/bleu/README.md: document new option and example.
tests/test_bleu_coco_tokenization.py: focused tests; skipped if pycocoevalcap not installed.
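
For reviewers, a rough sketch of how a tokenizer_name value could be resolved to a tokenizer. This is not the code in the diff: resolve_tokenizer, its default argument, and its coco_factory argument are hypothetical names used only here, and this WhitespaceTokenizer is a simplified stand-in for the class added in tokenizer_13a.py. The availability check uses importlib.util.find_spec so the optional dependency is only required when "coco" or "ptb" is requested.

```python
import importlib.util


class WhitespaceTokenizer:
    """Simplified stand-in: split on runs of whitespace, no punctuation handling."""

    def __call__(self, text):
        return text.split()


def resolve_tokenizer(tokenizer_name, default, coco_factory):
    """Hypothetical helper mapping tokenizer_name to a callable tokenizer.

    default      -- tokenizer used when tokenizer_name is None (13a in this PR)
    coco_factory -- zero-argument callable that builds the COCO/PTB tokenizer
    """
    if tokenizer_name is None:
        # Keeps existing behavior for callers that never pass tokenizer_name.
        return default
    if tokenizer_name == "whitespace":
        return WhitespaceTokenizer()
    if tokenizer_name in ("coco", "ptb"):
        # Only demand the optional dependency when it is actually needed.
        if importlib.util.find_spec("pycocoevalcap") is None:
            raise ImportError('tokenizer_name="coco"/"ptb" requires pycocoevalcap')
        return coco_factory()
    raise ValueError(f"Unknown tokenizer_name: {tokenizer_name!r}")


def fallback(text):
    """Placeholder for the default tokenizer in this sketch."""
    return text.split(" ")


tokenizer = resolve_tokenizer("whitespace", default=fallback, coco_factory=lambda: None)
print(tokenizer("a dog  runs, fast ."))
```

Passing None returns the default untouched, which is how the sketch mirrors the "default behavior unchanged" guarantee.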

Backwards compatibility

Notes

bleu: add tokenizer_name option with COCO/PTB tokenizer and whitespace tokenizer; add focused tests and docs

8f0fc2d