Releases: flairNLP/flair
Release 0.4.1
Release 0.4.1 with lots of new features, new embeddings (RNN, Transformer and BytePair embeddings), new languages (Japanese, Spanish, Basque), new datasets, bug fixes and speed improvements (2x training speed for language models).
New Embeddings
Biomedical Embeddings
Added the first embeddings trained over PubMed data, namely Flair embeddings and ELMo embeddings.
Load these, for instance, with:
# Flair embeddings PubMed
flair_embedding_forward = FlairEmbeddings('pubmed-forward')
flair_embedding_backward = FlairEmbeddings('pubmed-backward')
# ELMo embeddings PubMed
elmo_embeddings = ELMoEmbeddings('pubmed')
Byte Pair Embeddings
Added the byte pair embeddings library by @bheinzerling, with support for 275 languages. These are very useful if you want to train small models. Load them, for instance, with:
# initialize embeddings
embeddings = BytePairEmbeddings(language='en')
Transformer-XL Embeddings
Transformer-XL embeddings added by @stefan-it. Load with:
# initialize embeddings
embeddings = TransformerXLEmbeddings()
ELMo Transformer Embeddings
Experimental transformer version of ELMo embeddings added by @stefan-it.
DocumentRNNEmbeddings
The new DocumentRNNEmbeddings class replaces the now-deprecated DocumentLSTMEmbeddings. This class allows you to choose which type of RNN you want to use. By default, it uses a GRU.
Initialize like this:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
glove_embedding = WordEmbeddings('glove')
document_lstm_embeddings = DocumentRNNEmbeddings([glove_embedding], rnn_type='LSTM')
New languages
Japanese
FlairEmbeddings for Japanese, trained by @frtacoa and @minh-agent:
# forward and backward embedding
embeddings_fw = FlairEmbeddings('japanese-forward')
embeddings_bw = FlairEmbeddings('japanese-backward')
Spanish
Added pre-computed FlairEmbeddings for Spanish. Embeddings were computed over Wikipedia by @iamyihwa (see #80).
To load Spanish FlairEmbeddings, simply do:
# default forward and backward embedding
embeddings_fw = FlairEmbeddings('spanish-forward')
embeddings_bw = FlairEmbeddings('spanish-backward')
# CPU-friendly forward and backward embedding
embeddings_fw_fast = FlairEmbeddings('spanish-forward-fast')
embeddings_bw_fast = FlairEmbeddings('spanish-backward-fast')
Basque
- @stefan-it trained FlairEmbeddings for Basque, which we now include. Load with:
forward_lm_embeddings = FlairEmbeddings('basque-forward')
backward_lm_embeddings = FlairEmbeddings('basque-backward')
- add Basque FastText embeddings, load with:
wikipedia_embeddings = WordEmbeddings('eu-wiki')
crawl_embeddings = WordEmbeddings('eu-crawl')
New Datasets
- IMDB dataset #410 - load with
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.IMDB)
- TREC_6 and TREC_50 #450 - load with
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.TREC_6)
- adds download routines for Basque Universal Dependencies and Named Entities, load with
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_BASQUE)
corpus_ner = NLPTaskDataFetcher.load_corpus(NLPTask.NER_BASQUE)
Other features
FlairEmbeddings for long text
FlairEmbeddings can now be generated for arbitrarily long strings without causing out-of-memory errors. See #444.
Function for calculating perplexity of a string #531
Use like this:
from flair.embeddings import FlairEmbeddings
# get language model
language_model = FlairEmbeddings('news-forward-fast').lm
# calculate perplexity for grammatical sentence
grammatical = 'The company made a profit'
perplexity_grammatical_sentence = language_model.calculate_perplexity(grammatical)
# calculate perplexity for ungrammatical sentence
ungrammatical = 'Nook negh qapla!'
perplexity_ungrammatical_sentence = language_model.calculate_perplexity(ungrammatical)
# print both
print(f'"{grammatical}" - perplexity is {perplexity_grammatical_sentence}')
print(f'"{ungrammatical}" - perplexity is {perplexity_ungrammatical_sentence}')
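As background, perplexity is the exponential of the average negative log-probability a language model assigns to each character, so lower values mean less surprising text. The following is a minimal sketch of that formula only, not Flair's implementation; the helper name perplexity_from_probs and the probability values are made up for illustration.

```python
import math


def perplexity_from_probs(char_probs):
    """Perplexity = exp(mean negative log-probability) over the characters."""
    if not char_probs:
        raise ValueError("need at least one probability")
    mean_nll = -sum(math.log(p) for p in char_probs) / len(char_probs)
    return math.exp(mean_nll)


# hypothetical per-character probabilities assigned by some language model
likely_text = [0.5, 0.5, 0.5, 0.5]
unlikely_text = [0.1, 0.1, 0.1, 0.1]

# the less probable string ends up with the higher perplexity
print(perplexity_from_probs(likely_text))
print(perplexity_from_probs(unlikely_text))
```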
Bug fixes
- Overflow error in text generation #322
- Sentence embeddings are now vectors #368
- macro average F-score computation #521
- character embeddings on CUDA #434
- accuracy calculation #553
Speed improvements
Release 0.4
Release 0.4 with new models, lots of new languages, experimental multilingual models, hyperparameter selection methods, BERT and ELMo embeddings, etc.
New Features
Support for new languages
Flair embeddings
We now include new language models for:
In addition to English and German. You can load FlairEmbeddings for Dutch for instance with:
flair_embeddings = FlairEmbeddings('dutch-forward')
Word Embeddings
We now include pre-trained FastText Embeddings for 30 languages: English, German, Dutch, Italian, French, Spanish, Swedish, Danish, Norwegian, Czech, Polish, Finnish, Bulgarian, Portuguese, Slovenian, Slovakian, Romanian, Serbian, Croatian, Catalan, Russian, Hindi, Arabic, Chinese, Japanese, Korean, Hebrew, Turkish, Persian, Indonesian.
Each language has embeddings trained over Wikipedia or web crawls. Instantiate them with:
# German embeddings computed over Wikipedia
german_wikipedia_embeddings = WordEmbeddings('de-wiki')
# German embeddings computed over web crawls
german_crawl_embeddings = WordEmbeddings('de-crawl')
Named Entity Recognition
Thanks to the Flair community, we now include NER models for:
Next to the previous models for English and German.
Part-of-Speech Tagging
Thanks to the Flair community, we now include PoS models for:
Multilingual models
As a major new feature, we now include models that can tag text in various languages.
12-language Part-of-Speech Tagging
We include a PoS model trained over 12 different languages (English, German, Dutch, Italian, French, Spanish, Portuguese, Swedish, Norwegian, Danish, Finnish, Polish, Czech).
# load model
tagger = SequenceTagger.load('pos-multi')
# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort kaufte er einen Hut .')
# predict PoS tags
tagger.predict(sentence)
# print sentence with predicted tags
print(sentence.to_tagged_string())
4-language Named Entity Recognition
We include a NER model trained over 4 different languages (English, German, Dutch, Spanish).
# load model
tagger = SequenceTagger.load('ner-multi')
# text with English and German sentences
sentence = Sentence('George Washington went to Washington . Dort traf er Thomas Jefferson .')
# predict NER tags
tagger.predict(sentence)
# print sentence with predicted tags
print(sentence.to_tagged_string())
This model also works to some extent on other languages, such as French.
Pre-trained classification models (issue 70)
Flair now also includes two pre-trained classification models:
- de-offensive-language: detecting offensive language in German text (GermEval 2018 Task 1)
- en-sentiment: detecting positive and negative sentiment in English text (IMDB)
Simply load the TextClassifier using the preferred model, for example TextClassifier.load('en-sentiment').
BERT and ELMo embeddings
We added both BERT and ELMo embeddings so you can try them out, and mix and match them with Flair embeddings or any other embedding types. We hope this will enable the research community to better compare and combine approaches.
BERT Embeddings (issue 251)
We added BERT embeddings to Flair. We are using the implementation of huggingface. The embeddings can be used as any other embedding type in Flair:
from flair.embeddings import BertEmbeddings
# init embedding
embedding = BertEmbeddings()
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
ELMo Embeddings (issue 260)
Flair now also includes ELMo embeddings. We use the implementation of AllenNLP. As this implementation comes with a lot of sub-dependencies, you need to first install the library via pip install allennlp before you can use it in Flair. Using the embeddings is as simple as using any other embedding type:
from flair.embeddings import ELMoEmbeddings
# init embedding
embedding = ELMoEmbeddings()
# create a sentence
sentence = Sentence('The grass is green .')
# embed words in sentence
embedding.embed(sentence)
Multi-Dataset Training (issue 232)
You can now train a model on multiple datasets with the MultiCorpus object. We use this to train our multilingual models.
Just create multiple corpora and put them into MultiCorpus:
english_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
german_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_GERMAN)
dutch_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_DUTCH)
multi_corpus = MultiCorpus([english_corpus, german_corpus, dutch_corpus])
The multi_corpus can now be used for training, just like any other corpus. Check the tutorial for more details.
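Conceptually, a combined corpus pools the splits of its members. The sketch below illustrates that idea in plain Python; SimpleCorpus and merge_corpora are hypothetical stand-ins and do not reflect how MultiCorpus is implemented in Flair.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleCorpus:
    """Hypothetical stand-in for a corpus with train, dev and test splits."""
    train: List[str] = field(default_factory=list)
    dev: List[str] = field(default_factory=list)
    test: List[str] = field(default_factory=list)


def merge_corpora(corpora):
    """Concatenate the matching splits of several corpora into one new corpus."""
    merged = SimpleCorpus()
    for corpus in corpora:
        merged.train.extend(corpus.train)
        merged.dev.extend(corpus.dev)
        merged.test.extend(corpus.test)
    return merged


english = SimpleCorpus(train=["The grass is green ."], dev=["I like it ."], test=["She runs ."])
german = SimpleCorpus(train=["Das Gras ist grün ."], dev=["Ich mag es ."], test=["Sie läuft ."])

multi = merge_corpora([english, german])
print(len(multi.train), len(multi.dev), len(multi.test))  # 2 2 2
```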
Parameter Selection using Hyperopt (issue 242)
We built a wrapper around hyperopt to allow you to search for the best hyperparameters for your downstream task.
Define your search space and start training using several different parameter settings. The results are written to a file called param_selection.txt in the result directory. Check the tutorial for more details.
NLP Dataset Downloader (issue 243)
To make it as easy as possible to start training models, we have a new feature for automatically downloading publicly available NLP datasets. For instance, by running this code:
corpus = NLPTaskDataFetcher.load_corpus(NLPTask.UD_ENGLISH)
you download the Universal Dependencies corpus for English and can immediately start training models. The list of available datasets can be found in the tutorial.
Model training features
We added various other features to model training.
Saving training log (issue 212)
The training log output is now automatically saved in the result directory you provide for training.
The log is saved in training.log.
Resuming training (issue 217)
It is now possible to stop training at any point in time and to resume it later by training with checkpoint set to True. Check the tutorial for more details.
Custom Optimizers (issue 220)
You can now choose other optimizers besides SGD, i.e. any PyTorch optimizer, plus our own modified implementations of SGD and Adam, namely SGDW and AdamW.
Learning Rate Finder (issue 228)
A new helper method to assist you in finding a good learning rate for model training.
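One common way to look for a good learning rate is to sweep rates on a log scale, record the loss at each step, and pick a rate near where the loss falls fastest. The sketch below shows that general idea under stated assumptions; it is not the helper added in Flair, and the function names and loss values are hypothetical.

```python
def exponential_lr_sweep(start_lr, end_lr, steps):
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    if steps < 2:
        raise ValueError("need at least two steps")
    ratio = (end_lr / start_lr) ** (1 / (steps - 1))
    return [start_lr * ratio ** i for i in range(steps)]


def steepest_drop(lrs, losses):
    """Return the learning rate at which the loss fell most between consecutive steps."""
    drops = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    best = max(range(len(drops)), key=drops.__getitem__)
    return lrs[best + 1]


lrs = exponential_lr_sweep(1e-4, 1.0, 5)
# hypothetical losses recorded after one training step at each rate
losses = [2.0, 1.9, 1.2, 1.1, 3.0]
print(steepest_drop(lrs, losses))
```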
Breaking Changes
This release introduces breaking changes. The most important are:
Unified Model Trainer (issue 189)
Instead of maintaining two separate trainer classes for sequence labeling and text classification, we now have one model training class, namely ModelTrainer. This replaces the earlier classes SequenceTaggerTrainer and TextClassifierTrainer.
Downstream task models now implement the new flair.nn.Model interface. So, both the SequenceTagger and TextClassifier now inherit from flair.nn.Model. This allows both models to be trained with the ModelTrainer, like this:
# Training sequence tagger
tagger = SequenceTagger(512, embeddings, tag_dictionary, 'ner')
trainer = ModelTrainer(tagger, corpus)
trainer.train('results')
# Training text classifier
classifier = TextClassifier(document_embedding, label_dictionary=label_dict)
trainer = ModelTrainer(classifier, corpus)
trainer.train('results')
The advantage is that all training parameters and training procedures are now the same for sequence labeling and text classification, which reduces redundancy and hopefully makes them easier to understand.
Metric class
The metric class is now refactored to compute micro and macro averages for F1 and accuracy. There is also a new enum EvaluationMetric which you can pass to the ModelTrainer to tell it what to use for evaluation.
Updates and Bug Fixes
Torch 1.0 (issue 176)
Flair now builds on torch 1.0.
Use Pathlib (issue 176)
...
Release 0.3.2
This is an update over release 0.3.1 with some critical bug fixes, a few new features and a lot more pre-packaged embeddings.
New Features
Embeddings
More word embeddings (#194 )
We added FastText embeddings for 10 languages ('en', 'de', 'fr', 'pl', 'it', 'es', 'pt', 'nl', 'ar', 'sv'), load using the two-letter language code, like this:
french_embedding = WordEmbeddings('fr')
More character LM embeddings (#204 #187 )
Thanks to a contribution by @stefan-it, we added CharLMEmbeddings for Bulgarian and Slovenian. Load like this:
flm_embeddings = CharLMEmbeddings('slovenian-forward')
blm_embeddings = CharLMEmbeddings('slovenian-backward')
Custom embeddings (#170 )
Added an explanation of how to use your own custom word embeddings. Simply convert them to gensim.KeyedVectors and point the embedding class to the file:
custom_embedding = WordEmbeddings('path/to/your/custom/embeddings.gensim')
New embeddings type: DocumentPoolEmbeddings (#191)
Add a new embedding class for document-level embeddings. You can now choose between different pooling options, e.g. min, max and average. Create the new embeddings like this:
word_embeddings = WordEmbeddings('glove')
pool_embeddings = DocumentPoolEmbeddings([word_embeddings], mode='min')
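To make the pooling options concrete, the following sketch pools a few word vectors element-wise into a single document vector. It is a plain-Python illustration of min, max and mean pooling, not the DocumentPoolEmbeddings code; pool_vectors and the example vectors are hypothetical.

```python
def pool_vectors(vectors, mode="mean"):
    """Pool a list of equal-length word vectors element-wise into one document vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    columns = list(zip(*vectors))  # group the values by dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "min":
        return [min(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# hypothetical 3-dimensional word vectors for a two-word document
word_vectors = [[1.0, 4.0, -2.0], [3.0, 0.0, 2.0]]

print(pool_vectors(word_vectors, "mean"))  # [2.0, 2.0, 0.0]
print(pool_vectors(word_vectors, "min"))   # [1.0, 0.0, -2.0]
print(pool_vectors(word_vectors, "max"))   # [3.0, 4.0, 2.0]
```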
Language model
New method: generate_text() (#167)
The LanguageModel class now has a built-in generate_text() method to sample the LM. Run code like this:
# load your language model
model = LanguageModel.load_language_model('path/to/your/lm')
# generate 20000 characters
text = model.generate_text(20000)
print(text)
Metrics
Class-based metrics in Metric class (#164)
Refactored Metric class to provide class-based metrics, as well as micro and macro averaged F1 scores.
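The difference between the two averages can be shown with a small worked example. This is a sketch of the standard definitions, not the Metric class itself; the function names and the per-class counts are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(counts):
    """counts maps class name -> (tp, fp, fn); returns (micro F1, macro F1)."""
    # macro: average the per-class F1 scores, so every class counts equally
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    # micro: pool the counts over all classes first, then compute a single F1
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn), macro


# hypothetical counts: a frequent class tagged well, a rare class tagged poorly
counts = {"PER": (90, 10, 10), "MISC": (1, 4, 4)}
micro, macro = micro_macro_f1(counts)
print(f"micro F1: {micro:.3f}, macro F1: {macro:.3f}")
```

In this example the macro average is pulled down by the rare class, while the micro average is dominated by the frequent one.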
Bug Fixes
Fix serialization error for MacOS and Windows (#174 )
On these setups, we got errors when serializing or loading large models. We've put in place a workaround that limits model size so it works on those systems. An added bonus is that models are now smaller.
"Frozen" dropout (#184 )
Potentially big issue in which dropout was frozen in the first epoch in embeddings produced from the character LM, meaning that throughout training the same dimensions stayed dropped. Fixed this.
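The effect of such a bug can be illustrated with a toy example: a mask that is sampled once and reused always drops the same dimensions, whereas the intended behaviour samples a new mask each time. This is only a sketch of the general idea in plain Python; dropout_mask is a hypothetical helper and the code is unrelated to the actual fix.

```python
import random


def dropout_mask(size, drop_prob, rng):
    """Sample a fresh 0/1 mask; 0 marks a dropped dimension."""
    return [0 if rng.random() < drop_prob else 1 for _ in range(size)]


rng = random.Random(0)

# "frozen" behaviour: one mask sampled up front and reused for every batch
frozen = dropout_mask(8, 0.5, rng)
frozen_masks = [frozen for _ in range(3)]

# intended behaviour: a new mask sampled for every batch
fresh_masks = [dropout_mask(8, 0.5, rng) for _ in range(3)]

print(frozen_masks)
print(fresh_masks)
```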
Testing step in language model trainer (#178 )
Previously, the language model was never applied to test data during training. A final testing step has been added again.
Testing
Distinguish between unit and integration tests (#183)
Instructions on how to run tests with pipenv (#161 )
Optimizations
Disable autograd during testing and prediction (#175)
Since autograd is unused here this gives us minor speedups.
Release 0.3.1
This is a stability update over release 0.3.0 with small optimizations, refactorings and bug fixes. For a list of new features, refer to 0.3.0.
Optimizations
Retain Token embeddings in memory by default (#146 )
Allow for faster training of text classifier on large datasets by keeping token embeddings in memory.
Always clear embeddings after prediction (#149 )
After prediction, remove embeddings from memory to avoid filling up memory.
Refactorings
Aligned TextClassificationTrainer and SequenceTaggerTrainer (#148)
Align signatures and features of the two training classes to make it easier to understand training options.
Updated DocumentLSTMEmbeddings (#150 )
Remove unused flag and code from DocumentLSTMEmbeddings
Removed unneeded AWS and Jinja2 dependencies (#158 )
Some dependencies are no longer required.
Bug Fixes
Fixed error when predicting over empty sentences. (#157)
Serialization: reset cache settings when saving a model. (#153 )
Release 0.3.0
Breaking Changes
New Label class with confidence score (#38)
A tag prediction is no longer a simple string but a Label, which holds a value and a confidence score. To obtain the tag name, call tag.value. To get the score, call tag.score. This can help you build applications in which you only want to use predictions that lie above a specific confidence threshold.
LockedDropout moved to the new flair.nn module (#48)
New Features
Multi-token spans (#54, #97)
Entities can now be wrapped into multi-token spans (type: Span). This is helpful for entities that span multiple words, such as "George Washington". A Span contains the position of the entity in the original text, the tag, a confidence score, and its text. You can get spans from a sentence by using the get_spans() method, like so:
from flair.data import Sentence
from flair.models import SequenceTagger
# make a sentence
sentence = Sentence('George Washington went to Washington .')
# load and run NER
tagger = SequenceTagger.load('ner')
tagger.predict(sentence)
# get span entities, together with tag and confidence score
for entity in sentence.get_spans('ner'):
    print('{} {} {}'.format(entity.text, entity.tag, entity.score))
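To illustrate how per-token tags can be grouped into multi-token spans, here is a minimal sketch that assumes a simple BIO tagging scheme. It does not show how get_spans() is implemented; the function bio_to_spans, the tag scheme and the example tags are assumptions made for illustration.

```python
def bio_to_spans(tokens, tags):
    """Group tokens tagged B-X / I-X into (text, type, start_index) spans."""
    spans, current, current_type, start = [], [], None, None
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        inside = tag.startswith("I-") and tag[2:] == current_type
        if current and not inside:
            # the running entity ends here
            spans.append((" ".join(current), current_type, start))
            current, current_type, start = [], None, None
        if tag.startswith("B-") or (tag.startswith("I-") and not current):
            # a new entity begins (a stray I- tag is treated as a beginning)
            current, current_type, start = [token], tag[2:], i
        elif inside:
            current.append(token)
    if current:
        spans.append((" ".join(current), current_type, start))
    return spans


tokens = ["George", "Washington", "went", "to", "Washington", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]  # hypothetical predictions
for text, entity_type, start in bio_to_spans(tokens, tags):
    print(text, entity_type, start)
```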
Predictions with confidence score (#38)
Predicted tags are no longer simple strings, but objects of type Label that contain a value and a confidence score. These scores are extracted during prediction from the sequence tagger or text classifier and indicate how confident the model is of a prediction. Print confidence scores of tags like this:
from flair.data import Sentence
from flair.models import SequenceTagger
# make a sentence
sentence = Sentence('George Washington went to Washington .')
# load the POS tagger
tagger = SequenceTagger.load('pos')
# run POS over sentence
tagger.predict(sentence)
# print token, predicted POS tag and confidence score
for token in sentence:
    print('{} {} {}'.format(token.text, token.get_tag('pos').value, token.get_tag('pos').score))
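Confidence scores make it possible to keep only predictions above a chosen threshold. The sketch below shows that filtering step with a hypothetical stand-in class, so it runs without Flair; the Label dataclass here only mirrors the value and score attributes described above, and the predictions are made up.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """Hypothetical stand-in for a predicted label: a value plus a confidence score."""
    value: str
    score: float


def confident_labels(labels, threshold=0.8):
    """Keep only predictions whose confidence score reaches the threshold."""
    return [label for label in labels if label.score >= threshold]


predictions = [Label("PROPN", 0.99), Label("VERB", 0.62), Label("ADP", 0.91)]
for label in confident_labels(predictions):
    print(label.value, label.score)
```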
Visualization routines (#61)
flair now includes visualizations for plotting training curves and weights when training a sequence tagger or text classifier. We also added visualization routines for plotting embeddings and highlighting tags in a sentence. For instance, to visualize contextual string embeddings, do this:
from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
from flair.embeddings import CharLMEmbeddings
from flair.visual import Visualizer
# get a list of Sentence objects
corpus = NLPTaskDataFetcher.fetch_data(NLPTask.CONLL_03).downsample(0.1)
sentences = corpus.train + corpus.test + corpus.dev
# init embeddings (can also be a StackedEmbedding)
embeddings = CharLMEmbeddings('news-forward-fast')
# embed corpus batch-wise
batches = [sentences[x:x + 8] for x in range(0, len(sentences), 8)]
for batch in batches:
    embeddings.embed(batch)
# visualize
visualizer = Visualizer()
visualizer.visualize_word_emeddings(embeddings, sentences, 'data/visual/embeddings.html')
Implementation of different dropouts (#48)
Different dropout possibilities (Locked Dropout and Word Dropout) were added and can be used during training.
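As a rough illustration of the two variants, word dropout removes whole token vectors, while locked dropout samples one mask over the embedding dimensions and applies it to every token in the sequence. The sketch below uses plain Python lists and is not the flair.nn implementation; the function names, the rescaling by the keep probability and the example values are assumptions.

```python
import random


def word_dropout(sequence, drop_prob, rng):
    """Zero out whole token vectors, each with probability drop_prob."""
    return [[0.0] * len(vec) if rng.random() < drop_prob else list(vec) for vec in sequence]


def locked_dropout(sequence, drop_prob, rng):
    """Sample one mask over dimensions and apply it to every token in the sequence."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep = 1.0 - drop_prob
    # kept dimensions are rescaled by 1 / keep (an assumption of this sketch)
    mask = [1.0 / keep if rng.random() < keep else 0.0 for _ in range(len(sequence[0]))]
    return [[value * m for value, m in zip(vec, mask)] for vec in sequence]


rng = random.Random(42)
# hypothetical sequence of three 4-dimensional token vectors
sequence = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]

dropped_words = word_dropout(sequence, 0.5, rng)
locked = locked_dropout(sequence, 0.5, rng)
print(dropped_words)
print(locked)
```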
Memory management for training on large data sets (#137)
flair now stores contextual string embeddings on disk to speed up training and allow for training on larger datasets.
Pre-trained language models for Polish
Added pre-trained language models for Polish, donated by (Borchmann et al., 2018). Load the Polish embeddings like this:
flm_embeddings = CharLMEmbeddings('polish-forward')
blm_embeddings = CharLMEmbeddings('polish-backward')
Bug Fixes
Fix evaluation of sequence tagger (#79, #75)
The script eval.pl for sequence tagger contained bugs. flair now uses its own evaluation methods.
Fix bugs in text classifier (#108)
Fixed bugs in single label training and out-of-memory errors during evaluation.
Others
Standardize logging output (#16)
Logging output for sequence tagger and text classifier is improved and standardized.
Update torch version (#34, #106)
flair now uses torch version 0.4.1
Updated documentation (#138, #89)
Expanded documentation and tutorials.
Version 0.2.0
Breaking Changes
Reorganized package structure #12
There are now two packages: flair.models and flair.trainers for the models and model trainers respectively.
Models package
flair.models contains 3 model classes: SequenceTagger, TextClassifier and LanguageModel.
Trainers package
flair.trainers contains 3 model trainer classes: SequenceTaggerTrainer, TextClassifierTrainer and LanguageModelTrainer.
Direct import from package
You can import these classes directly from the packages. For instance, the SequenceTagger is now instantiated as:
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner')
Reorganized embeddings #12
Clear distinction between token-level and document-level embeddings by adding two classes, namely TokenEmbeddings and DocumentEmbeddings, from which the respective embeddings need to inherit.
New Features
LanguageModelTrainer #24 #17
Added LanguageModelTrainer class to train your own LM embeddings.
Document Classification #10
Added experimental TextClassifier model for document-level text classification. Also added corresponding model trainer class, i.e. TextClassifierTrainer.
Batch prediction #7
Added batching to the prediction method for faster sequence tagging.
CPU-friendly pre-trained models #29
Added pre-trained models with smaller LM embeddings for faster CPU inference.
You can load them by adding '-fast' to the model name. Only for English at present.
from flair.models import SequenceTagger
tagger = SequenceTagger.load('ner-fast')
Learning Rate Scheduling #19
Added learning rate schedulers to all trainer classes for improved learning rate annealing functionality and control.
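A typical annealing strategy reduces the learning rate whenever the development loss stops improving for a number of epochs. The following sketch shows that strategy in plain Python under stated assumptions; it is not the scheduler used by the trainer classes, and the function name, parameter names and loss values are hypothetical.

```python
def anneal_on_plateau(dev_losses, initial_lr=0.1, anneal_factor=0.5, patience=1):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by anneal_factor once the dev loss has failed to
    improve for more than `patience` consecutive epochs.
    """
    lr, best, bad_epochs, history = initial_lr, float("inf"), 0, []
    for loss in dev_losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                lr *= anneal_factor
                bad_epochs = 0
        history.append(lr)
    return history


# hypothetical development losses over seven epochs
dev_losses = [1.0, 0.8, 0.9, 0.85, 0.7, 0.75, 0.72]
print(anneal_on_plateau(dev_losses))
```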
Auto-spawn on GPUs #19
All model classes now automatically spawn on GPUs if available. The separate .cuda() call is no longer necessary.
Bug Fixes
Retagging error #23
Fixed error that occurred when using multiple pre-trained taggers on the same sentence.
Empty sentence error #33
Fixed error that caused data fetchers to sometimes create empty sentences.
Other
Unit Tests #15
Added a large set of automated unit tests for better stability.
Documentation #15
Expanded documentation and tutorials. Also expanded descriptions of APIs.
Code Simplifications in sequence tagger #19
A number of code simplifications all around, hopefully making the code easier to understand.
Version 0.1.0
First release of Flair Framework
Static word embeddings:
- includes prepared word embeddings from GloVe, FastText, Numberbatch and Extvec
- includes prepared word embeddings for English, German and Swedish
Contextual string embeddings:
- includes pre-trained models for English and German
Text embeddings:
- Two experimental methods for full-text embeddings (LSTM and Mean)
Sequence labeling:
- pre-trained models for English (PoS-tagging, chunking and NER)
- pre-trained models for German (PoS-tagging and NER)
- experimental semantic frame detector for English