Added support for biomedical datasets with multiple entity types #3387
…: Directly training on multiple BigBio datasets?
@WangXII thank you very much for implementing all the changes and going through all datasets.
Concerning the integration of the task description, however, I would opt for another solution (rather than materializing the prompts into the CoNLL files), since in the end the user should be able to use the following API:
from flair.data import Sentence
from flair.nn import Classifier

ner_tagger = Classifier.load("hunflair2")
sentence = Sentence("TP53 is an onco-gene.")
ner_tagger.predict(sentence)
This means the end user should not be responsible for, or bothered with, prepending the task prompts etc. I will commit a suggestion for an implementation right away.
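To illustrate the idea of adding the prompt at prediction time instead of writing it into the data files, here is a minimal sketch. It does not use flair at all: `PromptedTagger`, `dummy_tag_fn`, and the prompt wording are hypothetical names and values chosen for illustration, not the implementation that was committed.

```python
from typing import Callable, List, Sequence


class PromptedTagger:
    """Hypothetical wrapper that adds the task prompt at prediction time."""

    def __init__(self, tag_fn: Callable[[List[str]], List[str]], entity_types: Sequence[str]):
        self.tag_fn = tag_fn
        # Assumed prompt wording, for illustration only.
        self.prompt_tokens = ["[", "Tag"] + list(entity_types) + ["]"]

    def predict(self, tokens: List[str]) -> List[str]:
        # The caller passes plain tokens; the prompt is prepended internally.
        augmented = self.prompt_tokens + tokens
        tags = self.tag_fn(augmented)
        # Drop the tags that belong to the prompt so callers only see their own tokens.
        return tags[len(self.prompt_tokens):]


def dummy_tag_fn(tokens: List[str]) -> List[str]:
    # Stand-in for a trained model: tags the token "TP53" as a gene.
    return ["B-Gene" if token == "TP53" else "O" for token in tokens]


tagger = PromptedTagger(dummy_tag_fn, ["gene", "disease"])
result = tagger.predict(["TP53", "is", "an", "onco-gene", "."])
```

With this shape, the prompt never appears in the training files or in the user's input, which is the property the API example above asks for.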
flair/datasets/biomedical.py
Outdated
@@ -46,6 +46,9 @@

SENTENCE_TAG = "[__SENT__]"

MULTI_TASK_LEARNING = False
I think a global flag isn't a proper solution. We should refactor this, so that it becomes a parameter of the ConllWriter. Moreover, I'm not completely convinced by the strategy to materialize the task description, i.e., write the task prompt to the conll file.
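A rough sketch of what such a refactoring could look like is given here. `SketchConllWriter` is a hypothetical stand-in, not flair's actual writer class, and the behavior tied to the flag (writing prompt tokens before each sentence) is an assumption based on the discussion in this thread.

```python
import io
from typing import Iterable, List, Sequence, TextIO, Tuple

TaggedSentence = List[Tuple[str, str]]


class SketchConllWriter:
    """Hypothetical writer: the flag is instance state, not a module-level global."""

    def __init__(
        self,
        multi_task_learning: bool = False,
        task_prompt: Sequence[str] = ("[", "Tag", "genes", "]"),  # assumed wording
    ):
        self.multi_task_learning = multi_task_learning
        self.task_prompt = list(task_prompt)

    def write(self, sentences: Iterable[TaggedSentence], out: TextIO) -> None:
        for sentence in sentences:
            if self.multi_task_learning:
                # Assumed behavior: prompt tokens are written with the outside tag.
                for token in self.task_prompt:
                    out.write(f"{token} O\n")
            for token, tag in sentence:
                out.write(f"{token} {tag}\n")
            out.write("\n")  # blank line separates sentences


sentences = [[("TP53", "B-Gene"), ("is", "O")]]
plain, prompted = io.StringIO(), io.StringIO()
SketchConllWriter().write(sentences, plain)
SketchConllWriter(multi_task_learning=True).write(sentences, prompted)
```

Because the setting lives on the instance, two writers with different configurations can coexist in one process, which a global flag does not allow.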
flair/datasets/biomedical.py
Outdated
@@ -46,6 +46,9 @@

SENTENCE_TAG = "[__SENT__]"

MULTI_TASK_LEARNING = False
IGNORE_NEGATIVE_SAMPLES = False
If I understand correctly, this flag is used to remove sentences without any labels. Why is this necessary? Are there (strong) performance differences?
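For reference, the filtering being discussed could be sketched roughly as follows, assuming sentences are lists of (token, tag) pairs and that "O" marks unlabeled tokens. The function name `drop_negative_samples` is hypothetical and not taken from the pull request.

```python
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]


def drop_negative_samples(sentences: List[TaggedSentence]) -> List[TaggedSentence]:
    """Keep only sentences that contain at least one non-"O" tag."""
    return [s for s in sentences if any(tag != "O" for _, tag in s)]


corpus = [
    [("TP53", "B-Gene"), ("is", "O"), ("mutated", "O")],
    [("No", "O"), ("entities", "O"), ("here", "O")],
]
filtered = drop_negative_samples(corpus)
```

Training once with `corpus` and once with `filtered` would be one way to measure whether the flag makes a difference.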
…ith task description-augmented sentences.
…move to own package
Cool new feature! Thanks @WangXII and @mariosaenger for adding this!
In this pull request, we've added support for datasets that annotate multiple entity types, such as BC5CDR, which covers chemical and disease entities.
This enables training models that recognize multiple entity types at once, without needing a specialized model for each entity type. An example of this implementation is the class HUNER_ALL_BIORED, which tags genes, chemicals, cell lines, diseases and species in a single corpus.