Train NER for Swedish #3
I'm on vacation right now, so just for reference, if someone else wants to start on this: here is code to parse SUC 3.0 into IOB format. It's licensed so you can just copy parts of it into flair. https://github.com/EmilStenstrom/suc_to_iob
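For context, the IOB format such a converter produces is one `token<TAB>tag` pair per line, with blank lines separating sentences. A minimal reader for that format might look like this (a sketch for illustration, not code from the suc_to_iob repository; the example tokens are made up):

```python
# Hedged sketch: parse IOB-formatted text (one "token<TAB>tag" pair
# per line, blank line between sentences) into a list of sentences,
# where each sentence is a list of (token, tag) tuples.

def read_iob(text):
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

sample = "Anna\tB-PRS\nbor\tO\ni\tO\nStockholm\tB-LOC\n\nHej\tO\n"
print(read_iob(sample))
```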
If no one else is working on this, I might give it a go!
@roshammar I'm not working on it, but am very interested in anything you can get working!
Great, I'm currently training a backward LM for Dutch (forward is already completed), so just let me know if you need an LM for Swedish :)
@stefan-it Sure, I'd be very interested in that!
I'm currently training the forward LM and will post back when the training has finished :)
LMs for Swedish are uploaded now:

```bash
wget https://schweter.eu/cloud/flair-lms/lm-sv-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-large-backward-v0.1.pt
```

On Universal Dependencies (v1.2) an accuracy of 96.59 % can be achieved with only using … Feel free to integrate the language models in flair! For the NER task: the dataset (…
Hey this is great! We will absolutely include this in the 0.4 release - looks like we're getting serious about multilinguality :) For 0.4 we just pushed a PR that does random sampling to get dev data from train if no dev data exists. I think we can add a similar thing for test data in this case!
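The dev-from-train sampling mentioned above can be sketched roughly like this (an illustrative stand-in, not the code from the actual PR; function and parameter names are made up):

```python
import random

def split_dev_from_train(train_sentences, dev_fraction=0.1, seed=42):
    """Hold out a random fraction of the training data as a dev set
    when the corpus ships without a dedicated dev split (sketch only)."""
    rng = random.Random(seed)
    n_dev = max(1, int(len(train_sentences) * dev_fraction))
    dev_indices = set(rng.sample(range(len(train_sentences)), n_dev))
    dev = [s for i, s in enumerate(train_sentences) if i in dev_indices]
    train = [s for i, s in enumerate(train_sentences) if i not in dev_indices]
    return train, dev

train, dev = split_dev_from_train([f"sent-{i}" for i in range(100)])
print(len(train), len(dev))  # 90 10
```

Seeding the RNG keeps the split reproducible across runs, so reported scores stay comparable.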
So, finally I had the time to look at this. I have now trained a first model on Swedish (dataset SUC 3.0), using only PRS, LOC, and ORG as entities. Overall test score is 0.9121 (LOC 0.8575, ORG 0.6383, PRS 0.9298). I have observed many errors in the training data, both TP and FP, so I will try to improve the data and do more experiments to get even better results.

A thought: currently this is trained on sentence level. Would it not be beneficial to train on document level, since then the same entity might be mentioned several times, increasing our confidence? How long a sequence can be handled?
And, of course, a big thank you to @stefan-it for the models!
Hey this is great - thanks for sharing the results! Yes, we've been thinking a lot about how to get better document-level info into the classifier. We have a simple prototype embeddings class for one way to do this in the current release-0.4 branch, called … Any ideas / contributions in this space will be very welcome :)
Is there anything else that's required to train an NER model for Swedish? Can I help somehow? Any updates @roshammar @stefan-it ?
What is the location of the model?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@stefan-it no news I guess? :/
…nd used it in the NER_SWEDISH corpus class to add IOB2 tags to the Swedish dataset https://github.com/klintan/swedish-ner-corpus/
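Adding IOB2 tags to a corpus whose annotations are plain entity labels can be sketched as follows. This is an illustrative stand-in, not the referenced commit's code, and it assumes a "token tag" per-line file format with "0" marking non-entity tokens (the exact format of klintan/swedish-ner-corpus is an assumption here):

```python
# Hedged sketch: convert plain per-token entity labels (e.g. PER, ORG,
# LOC, with "0" for non-entities) to IOB2, where the first token of
# each run of an entity label gets B- and the rest get I-.

def add_iob2_tags(lines):
    out = []
    prev = "0"
    for line in lines:
        if not line.strip():
            # Blank line: sentence boundary, reset the previous label.
            out.append(line)
            prev = "0"
            continue
        token, tag = line.split()
        if tag == "0":
            out.append(f"{token} O")
        elif tag == prev:
            out.append(f"{token} I-{tag}")
        else:
            out.append(f"{token} B-{tag}")
        prev = tag
    return out

sample = ["Stefan PER", "Löfven PER", "besökte 0", "Göteborg LOC"]
print(add_iob2_tags(sample))
```

One limitation of this simple scheme: two distinct same-type entities that are directly adjacent would be merged into one span, since only a label change triggers a B- tag.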
Train a simple NER tagger for Swedish, for instance over this dataset.
For this task, we need to adapt the NLPTaskDataFetcher for the appropriate Swedish dataset and train a simple model using Swedish word embeddings. How to train a model is illustrated here.
Swedish word embeddings can now be loaded with:

```python
embeddings = WordEmbeddings('sv-fasttext')
```
For issue #2