-
Notifications
You must be signed in to change notification settings - Fork 27.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Inconsistent NER Grouping (Pipeline) #4987
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4987 +/- ##
==========================================
- Coverage 77.83% 76.43% -1.41%
==========================================
Files 141 141
Lines 24634 24638 +4
==========================================
- Hits 19175 18832 -343
- Misses 5459 5806 +347
Continue to review full report at Codecov.
|
Looks great, but this sort of code/feature also looks like a perfect candidate for more unit-testing coverage. What do you think? |
Agree, can add these as test cases in test_pipelines. |
That would be great @enzoampil! |
@julien-c I've added the original issue bug as a test case (Number 1 in the original post). Do note that I only included it in the For future PRs to add new test cases coming from issues found on top of this (e.g. those from issue #5077), I was hoping to get some guidance on how we'd include them to the test coverage without making it too heavy. For context, different cases are typically based on different models, which means we'll have to run separate models to add them as test cases. |
I think we should try to make the tests more unitary, meaning that for instance you would feed them fixed model outputs (no actual forward pass) and check that the actual formatted output is correct. This might require splitting the call method in smaller more testable functions, which is totally fine IMO. |
I see what you mean. Yes, that makes more sense than running different models. Will work on this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, I was looking for this !
It worked using a Camembert model with 8 different entities, but only when there are entities, else it raises an IndexError
.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-65-2315ccc0d111> in <module>()
1 seq6 = "De nombreux particuliers s’interrogent sur la meilleure manière de se dessaisir d’un véhicule accidenté."
----> 2 ner(seq6)
/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
948 if self.model.config.id2label[label_idx] not in self.ignore_labels
949 ]
--> 950 last_idx, _ = filtered_labels_idx[-1]
951
952 for idx, label_idx in filtered_labels_idx:
IndexError: list index out of range
It runs by setting ignore_labels=[]
in the pipeline instance.
[{'entity_group': 'O',
'score': 0.969656412601471,
'word': '<s> De nombreux particuliers s’interrogent sur la meilleure manière de se dessaisir d’un véhicule accidenté.</s>'}]
So I suggest adding a little condition.
Hope it could help !
Co-authored-by: ColleterVi <36503688+ColleterVi@users.noreply.github.com>
…splitting call method to testable functions)
@julien-c @LysandreJik I've performed the following adjustments to the PR:
The test simply confirms if the expected formatted output (grouped) is equivalent to the actual formatted output given the raw model outputs. For the test cases, I used the two samples from the original PR post. It should be straight forward to continue adding test cases moving forward. Please do let me know what you guys think! 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very clean! What do you think @julien-c, @mfuntowicz? (Let's wait for v3.0.2 before merging)
Yes, looks good. I would add some typings to (at least) the |
@LysandreJik @julien-c Thanks for the feedback. I've added typings for the |
This PR solves issue #4816 by:
B
andI
)Running the sample script below (based on reference issue #4816) returns the expected results. Do note that the
entity_group
is based on theentity_type
of the first entity in the group.I also ran another test to ensure that number 2 (separate entity at the last index) is working properly. I confirmed that it is working properly now.
You can test these out yourself in this colab notebook.
cc @dav009 @mfuntowicz