BLEU and CHRF report wrong scores when any hypothesis is empty #239

SantiagoEG · 2023-09-12T11:06:11Z

Hello,

Thank you for your work on this library. I am running into a problem when computing BLEU and CHRF if some of the hypotheses are empty strings. The following code reproduces the problem:

import sacrebleu as s

print("Version:", s.version)
bleu = s.BLEU()
chrf = s.CHRF()

hypothesis_1 = ['A B C', 'B C D', 'C D E']
hypothesis_2 = ['', 'B C D', 'C D E']
hypothesis_3 = ['A B C', '', 'C D E']
hypothesis_4 = ['A B C', '', '']

refs = [['A B C'], ['B C D'], ['C D E']]

print()
print("hypothesis_1 CHRF:", chrf.corpus_score(hypothesis_1, refs).score)
print("hypothesis_1 BLEU:", chrf.corpus_score(hypothesis_1, refs).score)
print()
print("hypothesis_2 CHRF:", chrf.corpus_score(hypothesis_2, refs).score)
print("hypothesis_2 BLEU:", chrf.corpus_score(hypothesis_2, refs).score)
print()
print("hypothesis_3 CHRF:", chrf.corpus_score(hypothesis_3, refs).score)
print("hypothesis_3 BLEU:", chrf.corpus_score(hypothesis_3, refs).score)
print()
print("hypothesis_4 CHRF:", chrf.corpus_score(hypothesis_4, refs).score)
print("hypothesis_4 BLEU:", chrf.corpus_score(hypothesis_4, refs).score)

This code produces the following outputs:

Version: 2.3.1

hypothesis_1 CHRF: 100.0
hypothesis_1 BLEU: 100.0

hypothesis_2 CHRF: 0.0
hypothesis_2 BLEU: 0.0

hypothesis_3 CHRF: 100.0
hypothesis_3 BLEU: 100.0

hypothesis_4 CHRF: 100.0
hypothesis_4 BLEU: 100.0

I have not seen this problem with TER.
Would you recommend computing the metrics at sentence level and taking the mean instead?
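
To make the question concrete, here is a minimal sketch of the sentence-level averaging I have in mind. The helper `mean_sentence_score` and the toy scorer `exact_match` are hypothetical names made up for illustration and are not part of sacrebleu. The toy scorer only stands in for a real sentence-level metric so that the example runs without any dependencies.

```python
from statistics import mean
from typing import Callable, Sequence


def mean_sentence_score(
    hypotheses: Sequence[str],
    references: Sequence[Sequence[str]],
    score_fn: Callable[[str, Sequence[str]], float],
) -> float:
    """Average a sentence-level score over all segments.

    references[i] holds the reference(s) for hypotheses[i].
    (Hypothetical helper, not a sacrebleu function.)
    """
    if len(hypotheses) != len(references):
        raise ValueError("need exactly one reference list per hypothesis")
    if not hypotheses:
        raise ValueError("cannot average over an empty corpus")
    return mean(score_fn(hyp, refs) for hyp, refs in zip(hypotheses, references))


def exact_match(hyp: str, refs: Sequence[str]) -> float:
    """Toy stand-in scorer: 100.0 if the hypothesis equals any reference."""
    return 100.0 if hyp in refs else 0.0


refs = [['A B C'], ['B C D'], ['C D E']]
hypothesis_1 = ['A B C', 'B C D', 'C D E']
hypothesis_4 = ['A B C', '', '']

score_1 = mean_sentence_score(hypothesis_1, refs, exact_match)
score_4 = mean_sentence_score(hypothesis_4, refs, exact_match)
print("hypothesis_1 mean:", score_1)
print("hypothesis_4 mean:", score_4)
```

With sacrebleu, I assume the scorer could be something like `lambda h, r: bleu.sentence_score(h, r).score`. Under this scheme the two empty hypotheses would each contribute a zero to the average, which is what I expected from the corpus-level call. I am aware that a mean of sentence-level scores is generally not the same number as a corpus-level score.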

Best
