Use of "Alternative Weighting Schemes for ELMo Embeddings" #647
Comments
Thanks for sharing this paper - interesting results. Table 1 for instance shows that concatenating all three layers is worse on some tasks than just using one of the layers. This seems counterintuitive: if you concatenate everything, the downstream task NN can pick whatever information it needs instead of being limited to one layer. Any idea why this happens? Is this an overfitting issue?
@alanakbik Note that the differences are in general rather small, for example, 88.48 vs. 88.50 for SNLI. These differences are likely due to randomness. Adding more features and letting the classifier do the work of feature selection does not work in all cases, as it can lead to heavier overfitting on the training data. Reducing the dimensionality is in many cases a useful way to make the system more stable and can lead to better test performance.
@nreimers interesting, thanks! Since our embeddings are generally very large, perhaps reducing dimensionality would also work for us - definitely something to look into.
Reducing the embedding size, for example by averaging instead of concatenating the 3 ELMo embeddings, can give quite a speed-up in training & inference, as less complex operations are performed and fewer parameters must be trained for the biLSTM. In terms of F1, reducing the dimensionality can sometimes give a little boost (less overfitting), but even without an F1 boost it can often yield a nice training & inference speedup.
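For illustration, here is a minimal PyTorch sketch (not Flair code) contrasting the two combination modes discussed above, assuming each of the three ELMo layers yields a [sequence_length, 1024] tensor:

```python
import torch

seq_len, dim = 12, 1024
layers = [torch.randn(seq_len, dim) for _ in range(3)]  # three ELMo layer outputs

concatenated = torch.cat(layers, dim=-1)       # shape: [12, 3072] -> larger downstream biLSTM
averaged = torch.stack(layers, dim=0).mean(0)  # shape: [12, 1024] -> smaller, faster downstream model

print(concatenated.shape, averaged.shape)
```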
Context: I was training a SequenceTagger model with custom ELMo embeddings and comparing it with a similar sequence labelling model which I implemented. The issue I was having was that the SequenceTagger was taking much more time to train than my custom implementation.
Cause: Flair's ELMoEmbeddings concatenates all three ELMo layers, so the resulting per-token embedding is three times larger than a single layer.
Observations: I'm not sure if concatenating the 3 vectors is an obvious behavior, but I think that it could at least be documented at docs/embeddings/ELMO_EMBEDDINGS.md. Also, the outputs of ELMoEmbeddings and StackedEmbeddings were not very helpful to debug this, since they do not show the resulting embedding size.
Feature request: So, some suggestions, if it makes sense: document the concatenation behavior, and show the embedding size in the printed output of the embedding classes.
@falcaopetri thanks, those are good points - we'll update the documentation! And printing the embedding size is a good idea.
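As a side note, the effective per-token size can already be inspected via the embedding_length property. A hedged sketch, assuming a Flair version that ships ELMoEmbeddings (which requires the allennlp package):

```python
from flair.embeddings import ELMoEmbeddings, StackedEmbeddings

elmo = ELMoEmbeddings()                         # 'original' ELMo model by default
stacked = StackedEmbeddings(embeddings=[elmo])

# embedding_length exposes the per-token dimensionality; for the original ELMo
# model, the three concatenated layers give 3 x 1024 = 3072 dimensions.
print(elmo.embedding_length)
print(stacked.embedding_length)
```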
GH-647: add ELMo layer documentation
@alanakbik, sure, I could implement the different ways to combine the ELMo layers' embeddings.
GH-647: add different ways to combine ELMo layers
PR was merged and will be part of the next Flair release!
Hi,
this recent NAACL paper (@nreimers) proposes new weighting schemes for ELMo embeddings:
Find the paper "Alternative Weighting Schemes for ELMo Embeddings" here.
In our current ELMoEmbeddings implementation we use a torch.cat of all three layers. I think we could also implement an alternative weighting scheme :)