Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make transformer_ner continue processing other entities after the first non-matching #309

Merged
merged 1 commit into from
Feb 20, 2023

Conversation

baixiac
Copy link
Member

@baixiac baixiac commented Feb 14, 2023

Currently, transformer_ner stops processing as soon as the span of a recognised entity (e.g., a subword) does not match any text tokens so the other remaining entities won’t be inspected or added to the Doc object. This PR fixed that.

@mart-r
Copy link
Collaborator

mart-r commented Feb 14, 2023

The reason it previously (somewhat) silently failed was because of the overly eager except clause catching the ValueError when trying to get the minimum of an empty list.

So this PR prevents that from happening by making sure the list isn't empty before attempting to call min on it, which is great.

However, perhaps this would be a good time to tackle the underlying catch-(almost)-all exception issue as well?

Do we really want to catch any Exception in here?
Especially since we don't really handle them. Since we're expecting a broad type of exception, we may leave our code in an unexpected state which may cause it to fail somewhere else down the line and it might be a lot more difficult to get back to the culprit from there.

Do we know what type of exceptions we are actually expecting to see (and catch) in this block?
We should identify and deal with those and (re)raise all other exceptions.

Or perhaps we can identify the part of the code that we want to catch exceptions for?
It could be the ner_pipe, the iteration over its results, creating main annotations, making pretty labels, or mapping of the entities to groups. And if we think multiple parts may raise exceptions, perhaps we'd be better off creating multiple try-except blocks for each part?

It's entirely possible that you don't know the answers to these questions (I certainly don't). But I thought I'd bring them up in case we know of a better approach.

@baixiac
Copy link
Member Author

baixiac commented Feb 14, 2023

Good questions. I tried to fix the bug with minimum change and avoided making assumptions about what the original developer(s) were thinking. It is possible that their goal was simply to log the "forgivable" error and move on to the next doc.

In terms of exception handling, it's generally recommended to create and handle specific types of exceptions when necessary. Given catch-all exception handling is not uncommon throughout the code base, it may be worthwhile to improve them altogether on a dedicated ticket, following a broader discussion.

@mart-r
Copy link
Collaborator

mart-r commented Feb 14, 2023

Good questions. I tried to fix the bug with minimum change and avoided making assumptions about what the original developer(s) were thinking.

Fair enough. It might not be worth the time to look into it in this specific context.

In terms of exception handling, it's generally recommended to create and handle specific types of exceptions when necessary. Given catch-all exception handling is not uncommon throughout the code base, it may be worthwhile to improve them altogether on a dedicated ticket, following a broader discussion.

I think that's a good point as well. I'm just somewhat afraid we might end up in a situation where we don't feel this is worth the time. It's not ideal, but such is life.

@baixiac
Copy link
Member Author

baixiac commented Feb 15, 2023

Alright. I have logged a new ticket and feel free to add any thoughts/observations.

Copy link
Collaborator

@mart-r mart-r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Achieves its goal.

@tomolopolis tomolopolis merged commit 8de6af5 into master Feb 20, 2023
@tomolopolis tomolopolis deleted the CU-8677craqe branch February 20, 2023 20:22
mart-r pushed a commit to mart-r/MedCAT that referenced this pull request Jun 14, 2023
Make transformer_ner continue processing other entities after the first non-matching
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants