BLEUScore from torchmetrics gives different results compared to NLTK #1074

Closed
icedpanda opened this issue Jun 7, 2022 · 6 comments · Fixed by #1075
Assignees
Labels
bug / fix (Something isn't working), help wanted (Extra attention is needed)

Comments

@icedpanda

🐛 Bug

To Reproduce

The BLEU scores from torchmetrics and NLTK are different.

I only get the same result when k = 1; otherwise, the two libraries return different BLEU scores.

Code sample

from typing import List

from nltk.translate.bleu_score import sentence_bleu
from torchmetrics.functional import bleu_score

k = 3
predictions = "I am handsome and i love animals"
truth = "pad I am smart and i love animals pad"

def compute_bleu(preds: str, answers: List[str], k: int):
    # put all the weight on the k-gram precision, zero on the other orders
    weights = [0] * 4
    weights[k - 1] = 1
    # need to tokenize sentences first for nltk bleu
    preds = preds.split(" ")
    answers = [a.split(" ") for a in answers]
    return sentence_bleu(references=answers, hypothesis=preds, weights=weights)

# torchmetrics bleu
print("Bleu from torchmetrics: ", bleu_score(preds=predictions, target=[truth], n_gram=k))
# nltk bleu
print("Bleu from nltk: ", compute_bleu(predictions, [truth], k=k))

#  output k=3
# Bleu from torchmetrics:  tensor(0.4595)
# Bleu from nltk:  0.3005909172301144
# output k=1
# Bleu from torchmetrics:  tensor(0.6441)
# Bleu from nltk:  0.6441233940645307

Expected behavior

I would expect torchmetrics and NLTK to return the same BLEU score.

Environment

  • TorchMetrics version: 0.9.0
  • NLTK version: 3.4.5
  • Python version: 3.7
  • OS: Linux

Additional context

@icedpanda icedpanda added the bug / fix and help wanted labels on Jun 7, 2022
@github-actions

github-actions bot commented Jun 7, 2022

Hi! Thanks for your contribution, great first issue!

@SkafteNicki
Member

cc: @stancld is probably the best to answer this, but from our code I can tell this is the comparison function we use for our implementation:
https://github.com/PyTorchLightning/metrics/blob/3856db48f65c8c7cd17bda4fceaa3584ceeab593/tests/text/test_bleu.py#L30-L35
where corpus_bleu comes from:

from nltk.translate.bleu_score import corpus_bleu
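
For context, here is a minimal, hypothetical sketch of how corpus_bleu can be called for a single hypothesis; it is not the linked test code, and the variable names are illustrative:

from nltk.translate.bleu_score import corpus_bleu

hypothesis = "I am handsome and i love animals"
reference = "pad I am smart and i love animals pad"

n_gram = 3
# corpus_bleu expects, per hypothesis, a list of tokenized references,
# plus the hypotheses themselves as token lists; uniform weights mirror
# torchmetrics' default weighting
weights = [1.0 / n_gram] * n_gram
score = corpus_bleu([[reference.split()]], [hypothesis.split()], weights=weights)
print(score)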

@stancld
Contributor

stancld commented Jun 7, 2022

Hello @icedpanda, thanks for raising this issue. The difference lies in the fact that torchmetrics uses uniform weights of 1/n for weighting the n-gram BLEU score. That means that if you set the nltk weights to [1/3, 1/3, 1/3, 0], you obtain the same results.

Bleu from torchmetrics:  tensor(0.4595)
Bleu from nltk:  0.45946931172542343
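
As a quick check, a sketch that reproduces the matching numbers above, using the same sentences as the code sample and spelling out the uniform weights explicitly:

from nltk.translate.bleu_score import sentence_bleu
from torchmetrics.functional import bleu_score

predictions = "I am handsome and i love animals"
truth = "pad I am smart and i love animals pad"

k = 3
# uniform 1/n weights over the first n orders, as described above
weights = [1 / 3, 1 / 3, 1 / 3, 0]
print(sentence_bleu([truth.split()], predictions.split(), weights=weights))
print(bleu_score(preds=predictions, target=[truth], n_gram=k))
# both should print ≈ 0.4595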

Question for @Borda & @SkafteNicki -> Don't we want to allow setting the weights manually, with the default behaviour following the current implementation?

@SkafteNicki
Member

@stancld I would be fine with adding a new argument:

weights: Optional[List[float]] = None

where None means uniform weights, or else the user can provide a list of floats (see the sketch below).
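
A rough sketch of how such a default could be resolved; the helper name _resolve_weights is hypothetical and this is not the implementation merged in #1075:

from typing import List, Optional

def _resolve_weights(n_gram: int, weights: Optional[List[float]] = None) -> List[float]:
    # None keeps the current behaviour: uniform 1/n weighting over the n-gram orders
    if weights is None:
        return [1.0 / n_gram] * n_gram
    if len(weights) != n_gram:
        raise ValueError(f"Expected `weights` to have {n_gram} entries, got {len(weights)}")
    return weights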

@stancld stancld self-assigned this Jun 7, 2022
@icedpanda
Author

Thanks for the prompt reply, makes sense now.

@stancld
Contributor

stancld commented Jun 7, 2022

@SkafteNicki I'll send a PR
