BLEUScore from torchmetrics gives different results compared to NLTK #1074

Closed
icedpanda opened this issue Jun 7, 2022 · 6 comments · Fixed by #1075
Assignees
Labels
bug / fix (Something isn't working), help wanted (Extra attention is needed)

Comments

@icedpanda

🐛 Bug

To Reproduce

The BLEU scores from torchmetrics and NLTK are different.

I only get the same result when k = 1; otherwise, the two libraries return different BLEU scores.

Code sample

from typing import List

from nltk.translate.bleu_score import sentence_bleu
from torchmetrics.functional import bleu_score

k = 3
predictions = "I am handsome and i love animals"
truth = "pad I am smart and i love animals pad"

def compute_bleu(preds: str, answers: List[str], k: int):
    # put all the weight on the k-gram precision, zero on the other orders
    weights = [0] * 4
    weights[k - 1] = 1
    # need to tokenize sentences first for nltk bleu
    preds = preds.split(" ")
    answers = [a.split(" ") for a in answers]
    return sentence_bleu(references=answers, hypothesis=preds, weights=weights)

# torchmetrics bleu
print("Bleu from torchmetrics: ", bleu_score(preds=predictions, target=[truth], n_gram=k))
# nltk bleu
print("Bleu from nltk: ", compute_bleu(predictions, [truth], k=k))

#  output k=3
# Bleu from torchmetrics:  tensor(0.4595)
# Bleu from nltk:  0.3005909172301144
# output k=1
# Bleu from torchmetrics:  tensor(0.6441)
# Bleu from nltk:  0.6441233940645307

Expected behavior

I would expect torchmetrics and NLTK to return the same BLEU score.

Environment

  • TorchMetrics version: 0.9.0
  • NLTK version: 3.4.5
  • Python version: 3.7
  • OS: Linux

Additional context

@icedpanda icedpanda added the bug / fix and help wanted labels on Jun 7, 2022
@github-actions

github-actions bot commented Jun 7, 2022

Hi! Thanks for your contribution, great first issue!

@SkafteNicki
Member

cc: @stancld is probably the best to answer this, but from our code I can tell this is the comparison function we use for our implementation:
https://github.com/PyTorchLightning/metrics/blob/3856db48f65c8c7cd17bda4fceaa3584ceeab593/tests/text/test_bleu.py#L30-L35
where corpus_bleu comes from:

from nltk.translate.bleu_score import corpus_bleu
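
For context, here is a minimal, hypothetical sketch of how corpus_bleu can be called for a single hypothesis; it is not the linked test code, and the variable names are illustrative:

from nltk.translate.bleu_score import corpus_bleu

hypothesis = "I am handsome and i love animals"
reference = "pad I am smart and i love animals pad"

n_gram = 3
# corpus_bleu expects, per hypothesis, a list of tokenized references,
# plus the hypotheses themselves as token lists; uniform weights mirror
# torchmetrics' default weighting
weights = [1.0 / n_gram] * n_gram
score = corpus_bleu([[reference.split()]], [hypothesis.split()], weights=weights)
print(score)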

@stancld
Contributor

stancld commented Jun 7, 2022

Hello @icedpanda, thanks for raising this issue. The difference lies in the fact that torchmetrics uses uniform weights of 1/n for weighting the n-gram BLEU score. That means that if you set the nltk weights to [1/3, 1/3, 1/3, 0], you obtain the same results.

Bleu from torchmetrics:  tensor(0.4595)
Bleu from nltk:  0.45946931172542343
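
As a quick check, a sketch that reproduces the matching numbers above, using the same sentences as the code sample and spelling out the uniform weights explicitly:

from nltk.translate.bleu_score import sentence_bleu
from torchmetrics.functional import bleu_score

predictions = "I am handsome and i love animals"
truth = "pad I am smart and i love animals pad"

k = 3
# uniform 1/n weights over the first n orders, as described above
weights = [1 / 3, 1 / 3, 1 / 3, 0]
print(sentence_bleu([truth.split()], predictions.split(), weights=weights))
print(bleu_score(preds=predictions, target=[truth], n_gram=k))
# both should print ≈ 0.4595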

Question for @Borda & @SkafteNicki -> Don't we want to allow setting the weights manually, with the default behaviour following the current implementation?

@SkafteNicki
Member

@stancld I would be fine with adding a new argument:

weights: Optional[List[float]] = None

where None means uniform weights, or else the user can provide a list of floats (see the sketch below).
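
A rough sketch of how such a default could be resolved; the helper name _resolve_weights is hypothetical and this is not the implementation merged in #1075:

from typing import List, Optional

def _resolve_weights(n_gram: int, weights: Optional[List[float]] = None) -> List[float]:
    # None keeps the current behaviour: uniform 1/n weighting over the n-gram orders
    if weights is None:
        return [1.0 / n_gram] * n_gram
    if len(weights) != n_gram:
        raise ValueError(f"Expected `weights` to have {n_gram} entries, got {len(weights)}")
    return weights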

@stancld stancld self-assigned this Jun 7, 2022
@icedpanda
Author

Thanks for the prompt reply, makes sense now.

@stancld
Contributor

stancld commented Jun 7, 2022

@SkafteNicki I'll send a PR
