
Train NER for Swedish #3

Closed
alanakbik opened this issue Jul 9, 2018 · 16 comments
Assignees
Labels
good first issue Good for newcomers help wanted Extra attention is needed new language New languages wontfix This will not be worked on

Comments

@alanakbik
Collaborator

alanakbik commented Jul 9, 2018

Train a simple NER tagger for Swedish, for instance over this dataset.

For this task, we need to adapt the NLPTaskDataFetcher for the appropriate Swedish dataset and train a simple model using Swedish word embeddings. How to train a model is illustrated here.

Swedish word embeddings can now be loaded with

embeddings = WordEmbeddings('sv-fasttext')
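For reference, a minimal training sketch with the current flair API (`ColumnCorpus` has since replaced the `NLPTaskDataFetcher` mentioned above; the data directory, file names, and column layout below are assumptions, not part of this thread):

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# assumed column layout: token in column 0, NER tag in column 1
columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus("data/suc3", columns,
                      train_file="train.txt",
                      dev_file="dev.txt",
                      test_file="test.txt")

tag_dictionary = corpus.make_label_dictionary(label_type="ner")
embeddings = WordEmbeddings("sv-fasttext")  # Swedish fastText embeddings

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner")

trainer = ModelTrainer(tagger, corpus)
trainer.train("resources/taggers/sv-ner", max_epochs=10)
```

This trains an embeddings + BiLSTM-CRF tagger and writes the model to the given output directory.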

For issue #2

@alanakbik alanakbik added help wanted Extra attention is needed good first issue Good for newcomers labels Jul 9, 2018
@EmilStenstrom

I'm on vacation right now, so just for reference, if someone else wants to start on this: here is code to parse SUC 3.0 into IOB format. It's licensed so that you can copy parts of it directly into flair: https://github.com/EmilStenstrom/suc_to_iob
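For anyone picking this up, reading such a one-token-per-line IOB file back into sentences is straightforward. A minimal sketch, assuming the common format of tab-separated token/tag pairs with a blank line between sentences:

```python
def read_iob(lines):
    """Parse IOB-formatted lines (token<TAB>tag, blank line between
    sentences) into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:                           # file may not end with a blank line
        sentences.append(current)
    return sentences
```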

@roshammar

If no one else is working on this, I might give it a go!

@EmilStenstrom

@roshammar I'm not working on it, but am very interested in anything you can get working!

@stefan-it
Member

Great, I'm currently training a backward LM for Dutch (the forward one is already complete), so just let me know if you need an LM for Swedish :)

@roshammar

@stefan-it Sure, I'd be very interested in that!

@stefan-it
Member

I'm currently training the forward LM and will post back when the training has finished :)

@stefan-it
Member

stefan-it commented Nov 23, 2018

LMs for Swedish are uploaded now:

wget https://schweter.eu/cloud/flair-lms/lm-sv-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-large-backward-v0.1.pt

On Universal Dependencies (v1.2), an accuracy of 96.59% can be achieved using only fastText embeddings. Using the forward + backward language models, an accuracy of 98.32% can be achieved. The current state of the art is Yasunaga et al. (2017), whose adversarial training achieves an accuracy of 96.70%.

Feel free to integrate the language models in flair!

For the NER task: the dataset (suc_3.0_iob.txt) mentioned in the first two posts is not split into training, dev and test portions, so we need to split the original dataset manually -> maybe @alanakbik could do the splitting :)
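A manual split like the one suggested above can be done in a few lines. A sketch, assuming the corpus has already been read into a list of sentences (the fraction and seed values are arbitrary choices, not from this thread):

```python
import random

def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle sentences and split them into train/dev/test portions."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test
```

Splitting at the sentence level (rather than by raw lines) keeps each sentence intact in exactly one portion.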

@alanakbik
Collaborator Author

Hey this is great! We will absolutely include this in the 0.4 release - looks like we're getting serious about multilinguality :)

For 0.4 we just pushed a PR that does random sampling to get dev data from train if no dev data exists. I think we can add a similar thing for test data in this case!

@roshammar

So, finally I had the time to look at this.

I have now trained a first model on Swedish (dataset SUC 3.0), using only PRS, LOC, and ORG as entities.

Overall test score is 0.9121 (LOC 0.8575, ORG 0.6383, PRS: 0.9298).

I have observed many errors in the training data, both TP and FP, so I will try to improve the data and run more experiments to get even better results.
I will also train other models with more NER tags than PRS, ORG and LOC.

A thought: currently this is trained at sentence level. Would it not be beneficial to train at document level, since the same entity might then be mentioned several times, increasing our confidence? What sequence lengths can be handled?

@roshammar

And, of course, a big thank you to @stefan-it for the models!

@alanakbik
Collaborator Author

Hey this is great - thanks for sharing the results!

Yes, we've been thinking a lot about how to get better document-level information into the classifier. We have a simple prototype embeddings class for one way to do this in the current release-0.4 branch - called FlairEmbeddings. It embeds and averages over all sentence words in a batch and also keeps a memory of previously embedded words. It looks like this gives us an F1-score boost, but we are still tinkering around, so the class might still change.

Any ideas / contributions in this space will be very welcome :)

@EmilStenstrom

Is there anything else that's required to train an NER model for Swedish? Can I help somehow? Any updates @roshammar @stefan-it ?

@thak123

thak123 commented Nov 12, 2019

What is the location of the model?

@stefan-it
Member

Hi @thak123 you can use the Flair embeddings with:

from flair.embeddings import FlairEmbeddings
forward_embeddings = FlairEmbeddings("sv-forward")
backward_embeddings = FlairEmbeddings("sv-backward")

Details about the amount of data for training these embeddings can be found here :)

@stale

stale bot commented Apr 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Apr 29, 2020
@stale stale bot closed this as completed May 6, 2020
@EmilStenstrom

@stefan-it no news I guess? :/

marcelmmm added a commit that referenced this issue May 30, 2020
alanakbik pushed a commit that referenced this issue Oct 15, 2020
whoisjones added a commit that referenced this issue Feb 9, 2021
alanakbik pushed a commit that referenced this issue Apr 19, 2021
whoisjones added a commit that referenced this issue Nov 9, 2021