
allow finetuning of word-embeddings #2491

Merged
8 commits merged into flairNLP:master on Oct 30, 2021

Conversation

helpmefindaname (Collaborator)

This PR reworks the WordEmbeddings to add a few features:

  • embeddings are stored as a torch Embedding instead of a gensim KeyedVectors. That way, version issues can no longer arise when gensim breaks backwards compatibility (which has happened to me a few times over the last 2 years)
  • embeddings can be kept on the GPU if wanted, via WordEmbeddings('<name>', force_cpu=False)
  • embeddings can be fine-tuned, via WordEmbeddings('<name>', fine_tune=True) (see the usage sketch below the list)
  • the lru_cache now stores indices per word instead of vector lookups, since the lookups themselves are cheap when all indices are gathered at once (especially on GPU). The regex operations and repeated index lookups remain cached.
  • the lru_cache size is increased, since each cached entry now takes (<embedding_length> - 1) * 4 bytes less
  • allows usage of "stable embeddings" → https://arxiv.org/abs/2110.02861, which seem to "reduce gradient variance that comes from the highly non-uniform distribution of input tokens"
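
A minimal usage sketch of the new options (assuming this PR's version of flair; `Sentence` and `embed()` are the usual flair APIs):

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# fine_tune=True unfreezes the embedding weights so they receive gradients;
# force_cpu=False keeps the embedding matrix on the GPU with the rest of the model.
glove = WordEmbeddings("glove", fine_tune=True, force_cpu=False)

sentence = Sentence("Berlin is the capital of Germany")
glove.embed(sentence)

print(sentence[0].embedding.shape)  # 100-dimensional for flair's glove vectors
```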

@alanakbik (Collaborator)

@helpmefindaname thanks a lot for adding this, really cool features!

I ran some experiments with the vanilla Flair configuration on CoNLL-03 (glove + Flair embeddings) with three settings: no-finetune (i.e. the same as before), finetune-normal (with fine-tuning) and finetune-stable (with fine-tuning and stable=True). Each configuration ran 5 times; averages and standard deviations are below (a setup sketch follows the table):

| Approach | Test F1 |
| --- | --- |
| no-finetune | 93.01 ± 0.12 |
| finetune-normal | 93.03 ± 0.08 |
| finetune-stable | 92.76 ± 0.04 |
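
For reference, a hedged sketch of how such a CoNLL-03 run could be set up (not the exact script used for these experiments; the hyperparameters shown are assumptions):

```python
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()  # requires the CoNLL-03 data to be available locally

embeddings = StackedEmbeddings([
    # finetune-normal setting; add stable=True for finetune-stable,
    # or drop fine_tune=True entirely for the no-finetune baseline
    WordEmbeddings("glove", fine_tune=True),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=corpus.make_tag_dictionary(tag_type="ner"),
    tag_type="ner",
)

ModelTrainer(tagger, corpus).train("resources/taggers/conll03-finetune", max_epochs=150)
```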

@alanakbik merged commit 06a78c0 into flairNLP:master on Oct 30, 2021
@alanakbik (Collaborator)

@helpmefindaname we just realized that gensim is pinned to a maximum of 3.8 in Flair, which does not work with Python 3.9.

So I was thinking of upgrading to gensim version 4, but there are lots of changes in the syntax, so I guess that would mean that gensim 3 no longer works. Do you think requiring gensim >= 4 makes sense in Flair?
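
To illustrate the kind of syntax change involved (a hedged example, not code from Flair itself): gensim 4 renamed the vocabulary accessors on KeyedVectors, so code written against gensim 3 breaks.

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load("glove.gensim")  # hypothetical path

# gensim 3.x style (removed in 4.x):
# index = kv.vocab["berlin"].index

# gensim 4.x style:
index = kv.key_to_index["berlin"]
vector = kv.vectors[index]  # same as kv["berlin"]
```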

@helpmefindaname deleted the bf/word2vec branch on December 9, 2021, 20:38