This release adds major new support for biomedical text analytics! It adds improved biomedical NER and a state-of-the-art model for biomedical entity linking. Other new features include (1) support for parameter-efficient fine-tuning and (2) various new datasets, bug fixes and enhancements! We also removed a few dependencies, so Flair should install faster and take up less space!
Biomedical NER and Entity Linking
With Flair 0.14.0, you can now detect and normalize biomedical entities in text.
For example, to analyze the sentence "We correlate genetic variants in IFNAR2 and POLG with long-COVID syndrome", use this code snippet:
from flair.models import EntityMentionLinker
from flair.nn import Classifier
from flair.data import Sentence

# A sentence from biomedical literature
sentence = Sentence("We correlate genetic variants in IFNAR2 and POLG with long-COVID syndrome.")

# Tag named entities in the text
ner_tagger = Classifier.load("hunflair2")
ner_tagger.predict(sentence)

# Normalize disease names
disease_linker = EntityMentionLinker.load("disease-linker")
disease_linker.predict(sentence)

# Normalize gene names
gene_linker = EntityMentionLinker.load("gene-linker")
gene_linker.predict(sentence)

# Iterate over predicted entities and print
for label in sentence.get_labels():
    print(label)
This should print out:
Span[5:6]: "IFNAR2" → Gene (1.0)
Span[5:6]: "IFNAR2" → 3455/name=IFNAR2
Span[7:8]: "POLG" → Gene (1.0)
Span[7:8]: "POLG" → 5428/name=POLG
Span[9:11]: "long-COVID syndrome" → Disease (1.0)
Span[9:11]: "long-COVID syndrome" → MESH:D000094024/name=Post-Acute COVID-19 Syndrome
The printout shows that:
- "IFNAR2" is a gene. Further, it is recognized as gene 3455 ("interferon alpha and beta receptor subunit 2") in the NCBI database.
- "POLG" is a gene. Further, it is recognized as gene 5428 ("DNA polymerase gamma, catalytic subunit") in the NCBI database.
- "long-COVID syndrome" is a disease. Further, it is uniquely linked to "Post-Acute COVID-19 Syndrome" in the MESH database.
Big thanks to @sg-wbi @WangXII @mariosaenger @helpmefindaname for all their work:
- Entity Mention Linker by @helpmefindaname in #3388
- Support for biomedical datasets with multiple entity types by @WangXII in #3387
- Update documentation for Hunflair2 release by @mariosaenger in #3410
- Improve nel tutorial by @helpmefindaname in #3369
- Incorporate hunflair2 docs to docpage by @helpmefindaname in #3442
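The label lines printed in the biomedical example above pair each mention with either an entity type or a database identifier of the form identifier/name=canonical name. If you only have that textual printout (for instance in a log file), the following minimal sketch shows one way to turn it into structured records. The names LinkedMention, parse_label_line and split_identifier are hypothetical helpers, and the regular expression is an assumption derived solely from the six lines shown above; it is not part of Flair. In a live pipeline you would normally read these fields from the label objects rather than parse strings.

```python
import re
from typing import NamedTuple, Optional, Tuple


class LinkedMention(NamedTuple):
    """One parsed printout line (hypothetical helper type, not a Flair class)."""
    start: int
    end: int
    text: str
    value: str


# Assumption: the line layout is inferred only from the printout in these notes.
LABEL_LINE = re.compile(r'^Span\[(\d+):(\d+)\]: "(.+)" → (.+)$')


def parse_label_line(line: str) -> Optional[LinkedMention]:
    """Parse a printed label line; return None if it does not match the layout."""
    match = LABEL_LINE.match(line.strip())
    if match is None:
        return None
    start, end, text, value = match.groups()
    return LinkedMention(int(start), int(end), text, value)


def split_identifier(value: str) -> Tuple[str, Optional[str]]:
    """Split '3455/name=IFNAR2' into ('3455', 'IFNAR2').

    Values without '/name=' (such as the NER tag lines) are returned
    unchanged, with None as the second element.
    """
    identifier, separator, name = value.partition("/name=")
    return (identifier, name) if separator else (value, None)


mention = parse_label_line('Span[5:6]: "IFNAR2" → 3455/name=IFNAR2')
if mention is not None:
    print(mention.text, split_identifier(mention.value))
```

Splitting on the literal "/name=" keeps prefixed identifiers such as MESH:D000094024 intact, since the colon belongs to the identifier itself.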
Parameter-Efficient Fine-Tuning
Flair 0.14.0 also adds support for PEFT.
For instance, to fine-tune a BERT model on the TREC question classification task using LoRA, use the following snippet:
from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# Note: you need to install peft to use this feature!
from peft import LoraConfig, TaskType

# Get corpus and make label dictionary
corpus: Corpus = TREC_6()
label_type = "question_class"
label_dict = corpus.make_label_dictionary(label_type=label_type)

# Define embeddings with LoRA fine-tuning
document_embeddings = TransformerDocumentEmbeddings(
    "bert-base-uncased",
    fine_tune=True,
    # set LoRA config
    peft_config=LoraConfig(
        task_type=TaskType.FEATURE_EXTRACTION,
        inference_mode=False,
    ),
)

# define model
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type=label_type)

# train model
trainer = ModelTrainer(classifier, corpus)
trainer.fine_tune(
    "resources/taggers/question-classification-with-transformer",
    learning_rate=5.0e-4,
    mini_batch_size=4,
    max_epochs=1,
)
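To give a feel for why LoRA counts as parameter-efficient, here is a small back-of-the-envelope sketch. LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in instead. The numbers below assume a single 768 x 768 projection (768 is the hidden size of bert-base-uncased) and a rank of 8, which is picked for illustration; the snippet above does not set a rank explicitly. Which matrices actually receive adapters depends on the LoRA configuration, and the function names here are hypothetical.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Number of weights updated when the whole matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Number of trainable weights in the two low-rank LoRA factors.

    One factor has shape (rank, d_in), the other (d_out, rank).
    """
    return rank * (d_in + d_out)


hidden_size = 768  # hidden size of bert-base-uncased
rank = 8           # assumed rank, for illustration only

full = full_matrix_params(hidden_size, hidden_size)
lora = lora_params(hidden_size, hidden_size, rank)

# Prints: 12288 of 589824 weights (2.08%)
print(f"{lora} of {full} weights ({lora / full:.2%})")
```

Under these assumptions, each adapted matrix trains roughly 2% of the weights it would need under full fine-tuning.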
Big thanks to @janpf for this new feature!
Smaller Library
We've removed dependencies such as gensim from the core package, since they increased the size of the Flair library and caused some compatibility/maintenance issues. This means the core package is now smaller and faster to install.
Install as always with:
pip install flair
For certain features, such as training a model that uses classic word embeddings, you still need gensim. For this use case, install with:
pip install flair[word-embeddings]
Or just install gensim separately.
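If your own code relies on one of these now-optional packages, it can help to check for it up front and fail with a clear message. The following is a minimal sketch using only the Python standard library. The helper names has_module and require_gensim are hypothetical, and this is not how Flair itself performs the check.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def require_gensim() -> None:
    """Raise a helpful error if gensim is missing (hypothetical helper)."""
    if not has_module("gensim"):
        raise ImportError(
            "This feature needs gensim. Install it with "
            "'pip install flair[word-embeddings]' or 'pip install gensim'."
        )


if has_module("gensim"):
    print("gensim is available: classic word embeddings can be used")
else:
    print("gensim is not installed: only the core package features are available")
```

Checking with find_spec avoids importing the package just to test for its presence, which keeps startup fast when the optional feature is never used.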
Big thanks to @helpmefindaname for this new feature!
- Make gensim optional by @helpmefindaname in #3493
- Update models for v0.14.0 by @alanakbik in #3505
- Relax version constraint for konoha by @himkt in #3394
- Dependencies maintenance updates by @helpmefindaname in #3402
- Make janome optional by @himkt in #3405
- Bump min. version of bpemb by @stefan-it in #3468
Other Improvements
New Features and Improvements
- Speed up euclidean distance calculation by @sheldon-roberts in #3485
- Add DataTriples which act just like DataPairs by @janpf in #3481
- Add random seed parameter to dataset splitting and downsampling for better reproducibility by @MattGPT-ai in #3475
- Allow cpu device even if gpu available by @drbh in #3417
- Add prediction label type for span classifier by @helpmefindaname in #3432
- Character embeddings store their embedding name too by @helpmefindaname in #3477
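To illustrate why a random seed parameter helps reproducibility when splitting or downsampling a dataset, here is a generic sketch in plain Python. It does not use Flair's API; the function split_with_seed and its parameters are hypothetical and only demonstrate the idea that the same seed yields the same split.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_with_seed(items: Sequence[T], fraction: float, seed: int) -> Tuple[List[T], List[T]]:
    """Shuffle deterministically and split off `fraction` of the items.

    A local random.Random instance is used so the global random state
    is left untouched.
    """
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    cut = round(len(items) * fraction)
    first = [items[i] for i in indices[:cut]]
    second = [items[i] for i in indices[cut:]]
    return first, second


data = list(range(100))
dev_a, train_a = split_with_seed(data, fraction=0.1, seed=42)
dev_b, train_b = split_with_seed(data, fraction=0.1, seed=42)

# The same seed reproduces exactly the same split.
print(dev_a == dev_b and train_a == train_b)
```

Passing the seed explicitly, rather than seeding a global generator, makes each split reproducible independently of whatever other random operations ran before it.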
Bug Fixes
- TextPairRegressor: Fix data point iteration by @ya0guang in #3413
- TextPairRegressor: Fix GPU memory leak by @MattGPT-ai in #3490
- TextRegressor: Fix label_name bug by @sheldon-roberts in #3491
- SequenceTagger: Fix _all_scores_for_token in ViterbiDecoder by @mauryaland in #3455
- SentenceSplitter: Fix linking of sentences by @mariosaenger in #3397
- SentenceSplitter: Fix case where split was performed on special characters by @helpmefindaname in #3404
- Classifier: Fix loading by moving error message to main load function by @alanakbik in #3504
- Trainer: Fix edge case by loading best model at end, even when there is no final evaluation by @helpmefindaname in #3470
- TransformerEmbeddings: Fix special tokens by not replacing replace_additional_special_tokens by @helpmefindaname in #3451
- Unit tests: Fix double data_folder in unit test by @ya0guang in #3412
New Datasets
- Add revision support for all Universal Dependencies datasets by @stefan-it in #3420
- NER_ESTONIAN_NOISY: Support for Estonian NER dataset with noise by @teresaloeffelhardt in #3463
- MASAKHA_POS: Support for two new languages by @stefan-it in #3421
- UD_BAVARIAN_MAIBAAM: Add support for new Bavarian MaiBaam UD by @stefan-it in #3426
Documentation
- Minor readme fixes by @stefan-it in #3424
- Fix typo transformer-embeddings.md by @abhisheklomsh in #3500
- Fix typo in how-model-training-works.md by @abhisheklomsh in #3499
Build Management
- Fix black and ruff by @stefan-it in #3423
- Remove zappr yaml by @helpmefindaname in #3435
- Fix tests package being incorrectly included in builds by @asumagic in #3440
New Contributors
- @ya0guang made their first contribution in #3413
- @drbh made their first contribution in #3417
- @asumagic made their first contribution in #3440
- @MattGPT-ai made their first contribution in #3475
- @janpf made their first contribution in #3481
- @sheldon-roberts made their first contribution in #3485
- @abhisheklomsh made their first contribution in #3500
- @teresaloeffelhardt made their first contribution in #3463
Full Changelog: v0.13.1...v0.14.0