
Use of "Alternative Weighting Schemes for ELMo Embeddings" #647

Closed
stefan-it opened this issue Apr 8, 2019 · 8 comments

@stefan-it (Member) commented Apr 8, 2019

Hi,

this recent NAACL paper (@nreimers) proposes new weighting schemes for ELMo embeddings:

ELMo embeddings (Peters et al., 2018) had a huge impact on the NLP community and many recent publications use these embeddings to boost the performance for downstream NLP tasks. However, integration of ELMo embeddings in existing NLP architectures is not straightforward. In contrast to traditional word embeddings, like GloVe or word2vec embeddings, the bi-directional language model of ELMo produces three 1024-dimensional vectors per token in a sentence. Peters et al. proposed to learn a task-specific weighting of these three vectors for downstream tasks. However, this proposed weighting scheme is not feasible for certain tasks, and, as we will show, it does not necessarily yield optimal performance. We evaluate different methods that combine the three vectors from the language model in order to achieve the best possible performance in downstream NLP tasks. We notice that the third layer of the published language model often decreases the performance. By learning a weighted average of only the first two layers, we are able to improve the performance for many datasets. Due to the reduced complexity of the language model, we have a training speed-up of 19-44% for the downstream task.

Find the "Alternative Weighting Schemes for ELMo Embeddings" paper here.

In our current ELMoEmbeddings implementation we use a torch.cat of all three layers. I think we could also implement an alternative weighting scheme :)
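
For illustration, here is a minimal PyTorch sketch (not Flair's actual code) of the combination options discussed in the paper, assuming `layers` holds the three 1024-dim vectors the biLM produces for one token:

```python
import torch
import torch.nn as nn

# Dummy stand-in for the three ELMo layer outputs of a single token.
layers = torch.randn(3, 1024)

# Current Flair behaviour: concatenation -> 3072-dim embedding.
concatenated = torch.cat(list(layers), dim=0)             # shape: (3072,)

# ELMo-style weighting: learned scalar weights over the layers -> 1024 dims.
scalar_weights = nn.Parameter(torch.zeros(3))             # trained jointly with the downstream task
weighted = (torch.softmax(scalar_weights, dim=0).unsqueeze(1) * layers).sum(dim=0)  # shape: (1024,)

# Scheme highlighted in the paper: use only the first two layers, e.g. their average.
first_two_avg = layers[:2].mean(dim=0)                    # shape: (1024,)
```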

@stefan-it stefan-it added the enhancement Improving of an existing feature label Apr 8, 2019
@stefan-it stefan-it self-assigned this Apr 15, 2019
@alanakbik (Collaborator)

Thanks for sharing this paper - interesting results. Table 1, for instance, shows that concatenating all three layers is worse on some tasks than using just one of the layers. This seems counterintuitive: if you concatenate everything, the downstream NN can pick whatever information it needs instead of being limited to one layer. Any idea why this happens? Is this an overfitting issue?

@nreimers

@alanakbik Note that the differences are in general rather small; for SNLI, for example, 88.48 vs. 88.50. These differences are likely due to randomness.

Adding more features and letting the classifier do the work of feature selection does not work in all cases, as it can lead to heavier overfitting on the training data. Reducing the dimensionality is in many cases a useful way to make the system more stable and to get better test performance.

@alanakbik (Collaborator)


@nreimers interesting, thanks! Since our embeddings are generally very large, perhaps reducing dimensionality would also work for us - definitely something to look into.

@nreimers

Reducing the embedding size, for example by averaging instead of concatenating the 3 ELMo vectors, can give quite a speed-up in training & inference, as less complex operations are performed and fewer parameters must be trained for the biLSTM.

In terms of F1, reducing the dimensionality can sometimes give a little boost (less overfitting), but even without an F1 boost, it can give a nice training & inference speed-up.
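
To make the parameter argument concrete, here is a rough sketch of the first biLSTM layer's parameter count with concatenated vs. averaged ELMo inputs (the hidden size of 256 is an assumption for illustration, not a number from the paper):

```python
import torch.nn as nn

def num_params(module):
    return sum(p.numel() for p in module.parameters())

# Downstream biLSTM fed with concatenated (3072-dim) vs. averaged (1024-dim) ELMo vectors.
lstm_concat = nn.LSTM(input_size=3072, hidden_size=256, bidirectional=True)
lstm_avg = nn.LSTM(input_size=1024, hidden_size=256, bidirectional=True)

print(num_params(lstm_concat))  # ~6.8M parameters
print(num_params(lstm_avg))     # ~2.6M parameters
```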

@falcaopetri (Contributor)


@stefan-it @alanakbik

Context

I was training a SequenceTagger model with custom ELMo embeddings and comparing it with a similar sequence labelling model that I implemented myself. The issue was that the SequenceTagger was taking much more time to train than my custom implementation.

Cause

Flair's ELMoEmbeddings concatenates the output of the three ELMo layers, yielding a 3072-dim embedding. This ends up increasing the model size (see #1433 for example). Meanwhile, my custom model was taking an average of the three layers, therefore yielding a 1024-dim vector.

Observations

I'm not sure if concatenating the 3 vectors is an obvious behavior, but I think that it could at least be documented at docs/embeddings/ELMO_EMBEDDINGS.md.

Also, the printed representations of ELMoEmbeddings and StackedEmbeddings were not very helpful for debugging this, since they do not show the embedding_length property:
[screenshot: printed model without the embedding_length property]

(Side note: the printed size of embedding2nn is of course the StackedEmbeddings' length, but I did not notice it at first.)
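
In the meantime, one way to check the effective dimensionality up front is to query the embedding_length property directly (a small sketch; constructor arguments for custom ELMo models will differ):

```python
from flair.embeddings import ELMoEmbeddings, StackedEmbeddings

elmo = ELMoEmbeddings()            # default: concatenation of the three layers
print(elmo.embedding_length)       # 3072

stacked = StackedEmbeddings([elmo])
print(stacked.embedding_length)    # sum of the lengths of all stacked embeddings
```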

Feature requests

So, some suggestions, if they make sense:

@alanakbik (Collaborator)


@falcaopetri thanks, those are good points - we'll update the documentation! And printing the embedding_length is a good idea. Would you be interested in doing a PR for the modifications in the ELMoEmbeddings class?

@falcaopetri (Contributor)


@alanakbik, I could certainly implement the different ways to combine the ELMo layers' embeddings (all, top, average), with all being the default.
Would you also be interested in migrating from allennlp.commands.ElmoEmbedder to allennlp.modules.elmo.Elmo? In that case, though, I basically do not know anything about AllenNLP's ELMo internals. I could give it a try anyway, but I guess this would require a deeper discussion.
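
For reference, a sketch of what the user-facing option could look like (parameter name and values follow the proposal above; check the merged PR for the final API):

```python
from flair.embeddings import ELMoEmbeddings

elmo_all = ELMoEmbeddings(embedding_mode="all")      # concatenate all layers -> 3072 dims (default)
elmo_top = ELMoEmbeddings(embedding_mode="top")      # top layer only         -> 1024 dims
elmo_avg = ELMoEmbeddings(embedding_mode="average")  # average of the layers  -> 1024 dims
```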

falcaopetri added commits to falcaopetri/flair that referenced this issue Apr 26, 2020
alanakbik added a commit that referenced this issue Apr 28, 2020
GH-647: add different ways to combine ELMo layers
@alanakbik (Collaborator)


The PR was merged and will be part of the next Flair release!
