
allow finetuning of word-embeddings #2491

Merged
8 commits merged into flairNLP:master on Oct 30, 2021

Conversation

helpmefindaname (Collaborator)

This PR reworks the WordEmbeddings to add a few features:

  • embeddings are stored as a torch Embedding instead of a gensim KeyedVectors. That way, version issues can no longer arise when gensim breaks backwards compatibility (which has happened to me a few times over the last 2 years)
  • embeddings can be kept on the GPU if wanted, via WordEmbeddings('<name>', force_cpu=False)
  • embeddings can be fine-tuned, via WordEmbeddings('<name>', fine_tune=True) (see the usage sketch below the list)
  • the lru_cache now stores indices per word instead of vector lookups, since the lookups themselves are cheap when all indices are gathered at once (especially on GPU). The regex operations and repeated index lookups remain cached.
  • the lru_cache size is increased, since each cached entry now takes (<embedding_length> - 1) * 4 bytes less
  • allows usage of "stable embeddings" → https://arxiv.org/abs/2110.02861, which seem to "reduce gradient variance that comes from the highly non-uniform distribution of input tokens"
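
A minimal usage sketch of the new options (assuming this PR's version of flair; `Sentence` and `embed()` are the usual flair APIs):

```python
from flair.data import Sentence
from flair.embeddings import WordEmbeddings

# fine_tune=True unfreezes the embedding weights so they receive gradients;
# force_cpu=False keeps the embedding matrix on the GPU with the rest of the model.
glove = WordEmbeddings("glove", fine_tune=True, force_cpu=False)

sentence = Sentence("Berlin is the capital of Germany")
glove.embed(sentence)

print(sentence[0].embedding.shape)  # 100-dimensional for flair's glove vectors
```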

@alanakbik (Collaborator)

@helpmefindaname thanks a lot for adding this, really cool features!

I ran some experiments with the vanilla Flair configuration on CoNLL-03 (glove + Flair embeddings) with three settings: no-finetune (i.e. the same as before), finetune-normal (with fine-tuning) and finetune-stable (with fine-tuning and stable=True). Each configuration ran 5 times; averages and standard deviations are below (a setup sketch follows the table):

| Approach | Test F1 |
| --- | --- |
| no-finetune | 93.01 ± 0.12 |
| finetune-normal | 93.03 ± 0.08 |
| finetune-stable | 92.76 ± 0.04 |
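
For reference, a hedged sketch of how such a CoNLL-03 run could be set up (not the exact script used for these experiments; the hyperparameters shown are assumptions):

```python
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()  # requires the CoNLL-03 data to be available locally

embeddings = StackedEmbeddings([
    # finetune-normal setting; add stable=True for finetune-stable,
    # or drop fine_tune=True entirely for the no-finetune baseline
    WordEmbeddings("glove", fine_tune=True),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=corpus.make_tag_dictionary(tag_type="ner"),
    tag_type="ner",
)

ModelTrainer(tagger, corpus).train("resources/taggers/conll03-finetune", max_epochs=150)
```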

@alanakbik merged commit 06a78c0 into flairNLP:master on Oct 30, 2021
@alanakbik (Collaborator)

@helpmefindaname we just realized that gensim is pinned to a maximum of 3.8 in Flair, which does not work with Python 3.9.

So I was thinking of upgrading to gensim version 4, but there are lots of changes in the syntax, so I guess that would mean that gensim 3 no longer works. Do you think requiring gensim >= 4 makes sense in Flair?
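
To illustrate the kind of syntax change involved (a hedged example, not code from Flair itself): gensim 4 renamed the vocabulary accessors on KeyedVectors, so code written against gensim 3 breaks.

```python
from gensim.models import KeyedVectors

kv = KeyedVectors.load("glove.gensim")  # hypothetical path

# gensim 3.x style (removed in 4.x):
# index = kv.vocab["berlin"].index

# gensim 4.x style:
index = kv.key_to_index["berlin"]
vector = kv.vectors[index]  # same as kv["berlin"]
```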

@helpmefindaname deleted the bf/word2vec branch on December 9, 2021, 20:38