
Use of "Alternative Weighting Schemes for ELMo Embeddings" #647

Closed
stefan-it opened this issue Apr 8, 2019 · 8 comments

@stefan-it (Member) commented Apr 8, 2019

Hi,

this recent NAACL paper (@nreimers) proposes new weighting schemes for ELMo embeddings:

ELMo embeddings (Peters et al., 2018) had a huge impact on the NLP community and many recent publications use these embeddings to boost the performance for downstream NLP tasks. However, integration of ELMo embeddings in existing NLP architectures is not straightforward. In contrast to traditional word embeddings, like GloVe or word2vec embeddings, the bi-directional language model of ELMo produces three 1024-dimensional vectors per token in a sentence. Peters et al. proposed to learn a task-specific weighting of these three vectors for downstream tasks. However, this proposed weighting scheme is not feasible for certain tasks, and, as we will show, it does not necessarily yield optimal performance. We evaluate different methods that combine the three vectors from the language model in order to achieve the best possible performance in downstream NLP tasks. We notice that the third layer of the published language model often decreases the performance. By learning a weighted average of only the first two layers, we are able to improve the performance for many datasets. Due to the reduced complexity of the language model, we have a training speed-up of 19-44% for the downstream task.

Find the "Alternative Weighting Schemes for ELMo Embeddings" paper here.

In our current ELMoEmbeddings implementation we use a torch.cat of all three layers. I think we could also implement an alternative weighting scheme :)
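
For illustration, here is a minimal PyTorch sketch (not Flair's actual code) of the combination options discussed in the paper, assuming `layers` holds the three 1024-dim vectors the biLM produces for one token:

```python
import torch
import torch.nn as nn

# Dummy stand-in for the three ELMo layer outputs of a single token.
layers = torch.randn(3, 1024)

# Current Flair behaviour: concatenation -> 3072-dim embedding.
concatenated = torch.cat(list(layers), dim=0)             # shape: (3072,)

# ELMo-style weighting: learned scalar weights over the layers -> 1024 dims.
scalar_weights = nn.Parameter(torch.zeros(3))             # trained jointly with the downstream task
weighted = (torch.softmax(scalar_weights, dim=0).unsqueeze(1) * layers).sum(dim=0)  # shape: (1024,)

# Scheme highlighted in the paper: use only the first two layers, e.g. their average.
first_two_avg = layers[:2].mean(dim=0)                    # shape: (1024,)
```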

@stefan-it stefan-it added the enhancement Improving of an existing feature label Apr 8, 2019
@stefan-it stefan-it self-assigned this Apr 15, 2019
@alanakbik (Collaborator)

Thanks for sharing this paper - interesting results. Table 1, for instance, shows that concatenating all three layers is worse on some tasks than using just one of the layers. This seems counterintuitive: if you concatenate everything, the downstream NN can pick whatever information it needs instead of being limited to one layer. Any idea why this happens? Is this an overfitting issue?

@nreimers

@alanakbik Note that the differences are in general rather small; for SNLI, for example, 88.48 vs. 88.50. These differences are likely due to randomness.

Adding more features and letting the classifier do the work of feature selection does not work in all cases, as it can lead to heavier overfitting on the training data. Reducing the dimensionality is in many cases a useful way to make the system more stable and to get better test performance.

@alanakbik (Collaborator)


@nreimers interesting, thanks! Since our embeddings are generally very large, perhaps reducing dimensionality would also work for us - definitely something to look into.

@nreimers

Reducing the embedding size, for example by averaging instead of concatenating the 3 ELMo vectors, can give quite a speed-up in training & inference, as less complex operations are performed and fewer parameters must be trained for the biLSTM.

In terms of F1, reducing the dimensionality can sometimes give a little boost (less overfitting), but even without an F1 boost, it can give a nice training & inference speed-up.
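
To make the parameter argument concrete, here is a rough sketch of the first biLSTM layer's parameter count with concatenated vs. averaged ELMo inputs (the hidden size of 256 is an assumption for illustration, not a number from the paper):

```python
import torch.nn as nn

def num_params(module):
    return sum(p.numel() for p in module.parameters())

# Downstream biLSTM fed with concatenated (3072-dim) vs. averaged (1024-dim) ELMo vectors.
lstm_concat = nn.LSTM(input_size=3072, hidden_size=256, bidirectional=True)
lstm_avg = nn.LSTM(input_size=1024, hidden_size=256, bidirectional=True)

print(num_params(lstm_concat))  # ~6.8M parameters
print(num_params(lstm_avg))     # ~2.6M parameters
```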

@falcaopetri (Contributor)


@stefan-it @alanakbik

Context

I was training a SequenceTagger model with custom ELMo embeddings and comparing it with a similar sequence labelling model that I implemented myself. The issue was that the SequenceTagger was taking much more time to train than my custom implementation.

Cause

Flair's ELMoEmbeddings concatenates the output of the three ELMo layers, yielding a 3072-dim embedding. This ends up increasing the model size (see #1433 for example). Meanwhile, my custom model was taking an average of the three layers, therefore yielding a 1024-dim vector.

Observations

I'm not sure if concatenating the 3 vectors is an obvious behavior, but I think that it could at least be documented at docs/embeddings/ELMO_EMBEDDINGS.md.

Also, the printed representations of ELMoEmbeddings and StackedEmbeddings were not very helpful for debugging this, since they do not show the embedding_length property:
[screenshot: printed model without the embedding_length property]

(Side note: the printed size of embedding2nn is of course the StackedEmbeddings' length, but I did not notice it at first.)
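
In the meantime, one way to check the effective dimensionality up front is to query the embedding_length property directly (a small sketch; constructor arguments for custom ELMo models will differ):

```python
from flair.embeddings import ELMoEmbeddings, StackedEmbeddings

elmo = ELMoEmbeddings()            # default: concatenation of the three layers
print(elmo.embedding_length)       # 3072

stacked = StackedEmbeddings([elmo])
print(stacked.embedding_length)    # sum of the lengths of all stacked embeddings
```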

Feature requests

So, some suggestions, if they make sense:

@alanakbik (Collaborator)


@falcaopetri thanks, those are good points - we'll update the documentation! And printing the embedding_length is a good idea. Would you be interested in doing a PR for the modifications in the ELMoEmbeddings class?

@falcaopetri (Contributor)


@alanakbik, I could certainly implement the different ways to combine the ELMo layers' embeddings (all, top, average), with all being the default.
Would you also be interested in migrating from allennlp.commands.ElmoEmbedder to allennlp.modules.elmo.Elmo? In that case, though, I basically do not know anything about AllenNLP's ELMo internals. I could give it a try anyway, but I guess this would require a deeper discussion.
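
For reference, a sketch of what the user-facing option could look like (parameter name and values follow the proposal above; check the merged PR for the final API):

```python
from flair.embeddings import ELMoEmbeddings

elmo_all = ELMoEmbeddings(embedding_mode="all")      # concatenate all layers -> 3072 dims (default)
elmo_top = ELMoEmbeddings(embedding_mode="top")      # top layer only         -> 1024 dims
elmo_avg = ELMoEmbeddings(embedding_mode="average")  # average of the layers  -> 1024 dims
```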

falcaopetri added commits to falcaopetri/flair that referenced this issue Apr 26, 2020
alanakbik added a commit that referenced this issue Apr 28, 2020
GH-647: add different ways to combine ELMo layers
@alanakbik (Collaborator)


The PR was merged and will be part of the next Flair release!
