
Train NER for Swedish #3

Closed
alanakbik opened this issue Jul 9, 2018 · 16 comments
Assignees
Labels
good first issue Good for newcomers help wanted Extra attention is needed new language New languages wontfix This will not be worked on

Comments

@alanakbik
Collaborator

alanakbik commented Jul 9, 2018

Train a simple NER tagger for Swedish, for instance over this dataset.

For this task, we need to adapt the NLPTaskDataFetcher for the appropriate Swedish dataset and train a simple model using Swedish word embeddings. How to train a model is illustrated here.

Swedish word embeddings can now be loaded with

embeddings = WordEmbeddings('sv-fasttext')
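For reference, a minimal training sketch with the current flair API (`ColumnCorpus` has since replaced the `NLPTaskDataFetcher` mentioned above; the data directory, file names, and column layout below are assumptions, not part of this thread):

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# assumed column layout: token in column 0, NER tag in column 1
columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus("data/suc3", columns,
                      train_file="train.txt",
                      dev_file="dev.txt",
                      test_file="test.txt")

tag_dictionary = corpus.make_label_dictionary(label_type="ner")
embeddings = WordEmbeddings("sv-fasttext")  # Swedish fastText embeddings

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner")

trainer = ModelTrainer(tagger, corpus)
trainer.train("resources/taggers/sv-ner", max_epochs=10)
```

This trains an embeddings + BiLSTM-CRF tagger and writes the model to the given output directory.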

For issue #2

@alanakbik alanakbik added help wanted Extra attention is needed good first issue Good for newcomers labels Jul 9, 2018
@EmilStenstrom

I'm on vacation right now, so just for reference, if someone else wants to start on this: here is code to parse SUC 3.0 into IOB format. It's licensed so that you can copy parts of it directly into flair: https://github.com/EmilStenstrom/suc_to_iob
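For anyone picking this up, reading such a one-token-per-line IOB file back into sentences is straightforward. A minimal sketch, assuming the common format of tab-separated token/tag pairs with a blank line between sentences:

```python
def read_iob(lines):
    """Parse IOB-formatted lines (token<TAB>tag, blank line between
    sentences) into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                      # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:                           # file may not end with a blank line
        sentences.append(current)
    return sentences
```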

@roshammar

If no one else is working on this, I might give it a go!

@EmilStenstrom

@roshammar I'm not working on it, but am very interested in anything you can get working!

@stefan-it
Member

Great, I'm currently training a backward LM for Dutch (the forward one is already complete), so just let me know if you need an LM for Swedish :)

@roshammar

@stefan-it Sure, I'd be very interested in that!

@stefan-it
Member

I'm currently training the forward LM and will post back when the training has finished :)

@stefan-it
Member

stefan-it commented Nov 23, 2018

LMs for Swedish are uploaded now:

wget https://schweter.eu/cloud/flair-lms/lm-sv-large-forward-v0.1.pt
wget https://schweter.eu/cloud/flair-lms/lm-sv-large-backward-v0.1.pt

On Universal Dependencies (v1.2), an accuracy of 96.59% can be achieved using only fastText embeddings. Using the forward + backward language models, an accuracy of 98.32% can be achieved. The current state of the art is Yasunaga et al. (2017), whose adversarial training achieves an accuracy of 96.70%.

Feel free to integrate the language models in flair!

For the NER task: the dataset (suc_3.0_iob.txt) mentioned in the first two posts is not split into training, dev and test portions, so we need to split the original dataset manually -> maybe @alanakbik could do the splitting :)
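A manual split like the one suggested above can be done in a few lines. A sketch, assuming the corpus has already been read into a list of sentences (the fraction and seed values are arbitrary choices, not from this thread):

```python
import random

def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle sentences and split them into train/dev/test portions."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test
```

Splitting at the sentence level (rather than by raw lines) keeps each sentence intact in exactly one portion.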

@alanakbik
Collaborator Author

Hey this is great! We will absolutely include this in the 0.4 release - looks like we're getting serious about multilinguality :)

For 0.4 we just pushed a PR that does random sampling to get dev data from train if no dev data exists. I think we can add a similar thing for test data in this case!

@roshammar

So, finally I had the time to look at this.

I have now trained a first model on Swedish (dataset SUC 3.0), using only PRS, LOC, and ORG as entities.

Overall test score is 0.9121 (LOC 0.8575, ORG 0.6383, PRS: 0.9298).

I have observed many errors in the training data, both TP and FP, so I will try to improve the data and run more experiments to get even better results.
I will also train other models with more NER tags than PRS, ORG and LOC.

A thought: currently this is trained at sentence level. Would it not be beneficial to train at document level, since the same entity might then be mentioned several times, increasing our confidence? What sequence lengths can be handled?

@roshammar

And, of course, a big thank you to @stefan-it for the models!

@alanakbik
Collaborator Author

Hey this is great - thanks for sharing the results!

Yes, we've been thinking a lot about how to get better document-level information into the classifier. We have a simple prototype embeddings class for one way to do this in the current release-0.4 branch - called FlairEmbeddings. It embeds and averages over all sentence words in a batch and also keeps a memory of previously embedded words. It looks like this gives us an F1-score boost, but we are still tinkering around, so the class might still change.

Any ideas / contributions in this space will be very welcome :)

@EmilStenstrom

Is there anything else that's required to train an NER model for Swedish? Can I help somehow? Any updates @roshammar @stefan-it ?

@thak123

thak123 commented Nov 12, 2019

What is the location of the model?

@stefan-it
Member

Hi @thak123 you can use the Flair embeddings with:

from flair.embeddings import FlairEmbeddings
forward_embeddings = FlairEmbeddings("sv-forward")
backward_embeddings = FlairEmbeddings("sv-backward")

Details about the amount of data for training these embeddings can be found here :)

@stale

stale bot commented Apr 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Apr 29, 2020
@stale stale bot closed this as completed May 6, 2020
@EmilStenstrom

@stefan-it no news I guess? :/

marcelmmm added a commit that referenced this issue May 30, 2020
alanakbik pushed a commit that referenced this issue Oct 15, 2020
whoisjones added a commit that referenced this issue Feb 9, 2021
alanakbik pushed a commit that referenced this issue Apr 19, 2021
whoisjones added a commit that referenced this issue Nov 9, 2021