
Unify input order of ROUGEScore and BERTScore with other NLG metrics #687

Closed · 10 commits (not merged)

Conversation

@ashutoshml (Contributor) commented on Dec 18, 2021

What does this PR do?

Fixes #686

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃
Yes

@codecov bot commented on Dec 18, 2021

Codecov Report

Merging #687 (92c637a) into master (293af54) will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff          @@
##           master   #687   +/-   ##
=====================================
- Coverage      95%    95%   -0%     
=====================================
  Files         166    166           
  Lines        6413   6413           
=====================================
- Hits         6105   6103    -2     
- Misses        308    310    +2     

@stancld (Contributor) left a comment:

LGTM!

tests/text/test_bertscore.py (outdated review thread, resolved)
torchmetrics/text/bert.py (outdated review thread, resolved)
_inputs_error_rate_batch_size_1 = Input(**ERROR_RATES_BATCHES_1)

_inputs_error_rate_batch_size_2 = Input(**ERROR_RATES_BATCHES_2)

_inputs_multiple_sentences_multiple_reference = Input(**ARTICLES_INPUT)
Contributor:

Actually, there's a single reference for a given hypothesis.

Contributor Author:

Yes. Should I call it _inputs_multiple_sentences_single_reference?

Contributor:

Possibly we can. Or maybe we can leave the references in the test file for now. We aim to adjust BERTScore to handle multiple references (#647), with similar updates to the ones you made for ROUGEScore, so we can eventually use the already defined _inputs_multiple_references.

Contributor Author:

Right. For now, I'll keep it as is in the current PR. We can rename it once issue #647 is completed.

I also had a concern that we should standardize the naming conventions for preds (called hypothesis in some places) and targets (called references in some places) across all NLG metrics.

Member:

IMO we should go with predictions and targets everywhere, since this is then more consistent with metrics in other domains.

Member:

agree here :]

Co-authored-by: Daniel Stancl <46073029+stancld@users.noreply.github.com>
@Borda added the "API / design" and "refactoring and code health" labels on Dec 18, 2021
@Borda Borda added this to the v0.7 milestone Dec 18, 2021
@awaelchli (Contributor) left a comment:

LGTM. I'm not familiar with these metrics in particular, but I assume this change can be done because metric(pred, target) = metric(target, pred).

@ashutoshml (Contributor Author) commented on Dec 19, 2021

> LGTM. Not familiar with these metrics in particular, but I assume this change can be done because metric(pred, target) = metric(target, pred)

Symmetry might not hold completely. The precision and recall values flip in the ROUGEScore calculation, and in BERTScore, sending X [SEP] Y yields a different score than Y [SEP] X. Also, since ROUGEScore now allows multi-reference inputs, the API will throw an error if we interchange preds with targets. Multi-reference support is also planned for BERTScore.
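The precision/recall flip can be illustrated with a toy unigram-overlap helper (hypothetical, not the torchmetrics implementation) in the ROUGE-1 style:

```python
# Minimal sketch showing why swapping preds and targets is NOT a no-op for
# ROUGE-like metrics: unigram precision and recall flip when inputs flip.
from collections import Counter


def unigram_precision_recall(pred: str, target: str) -> tuple:
    """Return (precision, recall) of unigram overlap, ROUGE-1 style."""
    pred_counts = Counter(pred.split())
    target_counts = Counter(target.split())
    overlap = sum((pred_counts & target_counts).values())  # clipped matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(target_counts.values()), 1)
    return precision, recall


p1, r1 = unigram_precision_recall("the cat sat", "the cat sat on the mat")
p2, r2 = unigram_precision_recall("the cat sat on the mat", "the cat sat")

# Swapping the arguments swaps precision and recall, so any F-measure
# weighting precision and recall differently changes value:
assert (p1, r1) == (1.0, 0.5)
assert (p2, r2) == (0.5, 1.0)
```

So unless the metric reduces to a symmetric combination (plain F1 on a single reference), the argument order matters.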

@justusschock (Member) left a comment:

Major concerns about backward compatibility here. The logic itself is fine, though.


Comment on lines 453 to +454
references: Union[List[str], Dict[str, Tensor]],
predictions: Union[List[str], Dict[str, Tensor]],
Member:

This is a breaking change. Not sure we can do it that easily.
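Why swapping positional parameters breaks callers can be sketched with toy stand-in functions (hypothetical names, not the real torchmetrics API): existing positional call sites keep running without error but silently compute the wrong thing.

```python
# Toy before/after signatures illustrating the backward-compatibility hazard.
def score_v06(predictions, references):
    # pre-change signature: predictions first
    return {"preds": tuple(predictions), "refs": tuple(references)}


def score_v07(references, predictions):
    # post-change signature: references first
    return {"preds": tuple(predictions), "refs": tuple(references)}


preds = ["a generated summary"]
refs = ["the reference text"]

old = score_v06(preds, refs)
new = score_v07(preds, refs)  # unchanged positional call site, no error raised

# The inputs are silently interpreted in the wrong roles:
assert old == {"preds": ("a generated summary",), "refs": ("the reference text",)}
assert new == {"preds": ("the reference text",), "refs": ("a generated summary",)}
```

This is why such a swap usually needs a deprecation period with a warning rather than a silent signature change.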

targets: Union[str, Sequence[str], Sequence[Sequence[str]]],
preds: Union[str, Sequence[str]],
Member:

Same concerns about a breaking change.

@@ -192,15 +192,15 @@ def __init__(
self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
self.user_tokenizer = False

-    def update(self, predictions: List[str], references: List[str]) -> None:  # type: ignore
+    def update(self, references: List[str], predictions: List[str]) -> None:  # type: ignore
Member:

Also a breaking change.

@@ -126,7 +126,7 @@ def __init__(
self.add_state(f"{rouge_key}_{score}", [], dist_reduce_fx=None)

def update( # type: ignore
-        self, preds: Union[str, Sequence[str]], targets: Union[str, Sequence[str], Sequence[Sequence[str]]]
+        self, targets: Union[str, Sequence[str], Sequence[Sequence[str]]], preds: Union[str, Sequence[str]]
Member:

Also a breaking change.

@mergify mergify bot removed the has conflicts label Dec 22, 2021
@SkafteNicki (Member) commented:

Changes are not backward compatible, as @justusschock mentions. Maybe it's worth inserting a warning in the __init__ of the class and in the functional version:

import warnings
warnings.warn(
    "Input order of preds and targets was changed to targets first and predictions second in v0.7. This warning will be removed in v0.8."
)
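A minimal sketch of where such a transition warning could sit (hypothetical class name; in the real change it would go into the metric's __init__ and the functional version):

```python
# Sketch: emit the deprecation-style warning once, at construction time.
import warnings


class ROUGEScoreSketch:
    def __init__(self) -> None:
        warnings.warn(
            "Input order of preds and targets was changed to targets first "
            "and predictions second in v0.7. This warning will be removed in v0.8.",
            UserWarning,
        )


# Verify the warning is emitted on construction:
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ROUGEScoreSketch()

assert len(caught) == 1
assert "targets first" in str(caught[0].message)
```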

@ashutoshml (Contributor Author) commented:

Since the requirements of this PR have been redesigned, it would require a complete rework. I'll submit a new PR soon. Closing this one.

@ashutoshml ashutoshml closed this Dec 23, 2021
@ashutoshml ashutoshml deleted the orderfixation branch December 24, 2021 07:44
Successfully merging this pull request may close these issues.

Unify the input order for text (NLG) metrics
7 participants