Support for more languages? #2

Closed
EmilStenstrom opened this issue Jul 5, 2018 · 26 comments
Labels
enhancement (Improving of an existing feature), language model (Related to language model), wontfix (This will not be worked on)

Comments

@EmilStenstrom

Hi! Flair looks amazing. Clean code, easy to use. Thanks for making it open source!

I was wondering if you plan to add support for more languages? Maybe all the languages where Zalando operates? :) I work for a company that needs NLP code that works across pretty much the same set of countries.

Looking at the different available libraries, pre-trained models for more than just English (and German in this case!) are lacking in all the others.

@alanakbik
Collaborator

Hello Emil! Thanks for the interest - we are thinking of adding more models in more languages. In particular, we are currently looking at French, Italian and Dutch. Which languages / tasks are you most interested in?

@EmilStenstrom
Author

EmilStenstrom commented Jul 5, 2018

It's a bit different depending on if it's for hobby projects or work projects.

Hobby projects: Swedish / POS, Swedish / NER
Business projects: Nordics (Swedish/Norwegian/Danish/Finnish), German, Spanish, English, Polish. POS and NER.

@alanakbik
Collaborator

Ok great! The German models (POS/NER) will be put online probably sometime next week.

We will also progressively add more languages in the near future. Of your list, I think Polish and Spanish are the most likely to be added soonish, though I can't say exactly when.

@EmilStenstrom
Author

Is there something about adding a new language that I could help with?

For instance, there is one big Swedish dataset with POS and NER tags called SUC 3.0. It's available for download here: https://spraakbanken.gu.se/eng/resource/suc3

alanakbik added the enhancement label Jul 9, 2018
@alanakbik
Collaborator

Yes, if you're interested you could train a new model for Swedish POS or NER. You would probably need to adapt the NLPTaskDataFetcher for the task you want to train it on, but otherwise could probably use pretty much the same code as given here (and in the experiments section).

I've added Swedish word embeddings to the project. I will also add issues for this task if you are interested!

@EmilStenstrom
Author

Fantastic! If you add the issues I’ll see where I can help.

@eduardompereira

Are you planning to work on the Portuguese language?

@alanakbik
Collaborator

@eduardompereira I am not sure how quickly we can get around to Portuguese, so we'd welcome contributions here! If it helps, we could package standard word embeddings for Portuguese with the next release? Are you aware of good NER datasets for Portuguese?

tabergma added the language model label Oct 4, 2018
@mhham

mhham commented Oct 24, 2018

Hi there.
Any news on the French models?
For NER and POS tagging there is the WikiNER French dataset, which comes in a quite easily adaptable format:
https://github.com/dice-group/FOX/tree/master/input/Wikiner

For the word embeddings one can also use the French fastText embeddings:
https://fasttext.cc/docs/en/crawl-vectors.html
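
A minimal sketch of what "adapting" such a file could look like, assuming each line holds one sentence and each token is written as word|POS|NER-tag (this layout is an assumption about the WikiNER files; check it against the actual download). It rewrites the data as one token per line with a blank line between sentences, the column layout commonly used for sequence-labeling corpora. The function names and the sample sentence, including its POS tags, are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, NER tag)


def parse_wikiner_line(line: str) -> List[Token]:
    """Split one sentence line into (word, pos, ner) triples."""
    tokens = []
    for item in line.split():
        # rsplit keeps any '|' that is part of the word itself inside the word;
        # an item with fewer than two '|' raises ValueError
        word, pos, ner = item.rsplit("|", 2)
        tokens.append((word, pos, ner))
    return tokens


def to_column_format(lines: Iterable[str]) -> Iterator[str]:
    """Yield one 'word pos ner' row per token and a blank row after each sentence."""
    for line in lines:
        tokens = parse_wikiner_line(line)
        if not tokens:
            continue  # skip empty lines between sentences or documents
        for word, pos, ner in tokens:
            yield f"{word} {pos} {ner}"
        yield ""


# Invented sample sentence in the assumed input layout
sample = ["Paris|NAM|I-LOC est|V|O belle|ADJ|O .|PUNCT|O"]
print("\n".join(to_column_format(sample)))
```

The generator can be fed an open file object directly, and its rows written out to train, dev and test files as needed.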

@alanakbik
Collaborator

Hi @mhham thanks for the pointers - more languages are definitely planned and French is high up on our priority list. I am hoping that the next release will be a lot more multilingual than currently, but I am not sure how quickly we can get around to which language. Of course contributions are always welcome!

@lz-chen

lz-chen commented Feb 13, 2019

Hi, thanks for the great work! I wonder how many languages Flair supports for NER now? From what I see in release 0.4, it seems that English, German, Dutch, French, Italian, Spanish, Portuguese and Polish are supported?
Btw, are there any updates on Nordic language models @EmilStenstrom? I am currently working with NER in Norwegian so it would be very useful :) Thanks!

@stefan-it
Member

I've trained FlairEmbeddings on Wikipedia dumps + OPUS (1 epoch) for some more languages:

no, fa, ar, id, pl, da, hi, nl, eu, sl, he, hr, fi, bg, cs and sv.

I'll provide them as soon as I have checked their performance on UD :)

@lz-chen

lz-chen commented Feb 13, 2019

Thanks for the reply! @stefan-it
I just read Tutorial 2, so pretrained NER models are available for German, French and Dutch, right?

@alanakbik
Collaborator

Yes, you could also test our multilingual NER model, which can detect entities in English, German, Dutch and Spanish (and even other languages a little) even though it is only one model.

@lz-chen

lz-chen commented Feb 13, 2019

Thanks for the pointer! Will try that out:)

@pvcastro
Contributor

Hi @alanakbik. I study NER for Portuguese, and for "general" NER models, I believe the best dataset is the one spaCy uses, which is the one from WikiNER (Learning multilingual named entity recognition from Wikipedia).
As for Portuguese word embeddings, there's a lab from a university here in Brazil that trained many different models of word embeddings for Portuguese here. In order for them to be available in Flair, should they be added to embeddings.WordEmbeddings?

@alanakbik
Collaborator

@pvcastro yes good idea. We've already added the WikiNER dataset for Portuguese (see tutorial). You can load it with:

from flair.data_fetcher import NLPTaskDataFetcher, NLPTask
original_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.WIKINER_PORTUGUESE)

Aside from this, I think it would be good to support a downloading and conversion routine for word embeddings such as the ones you linked, to make it easy to start experimenting with them!

@pvcastro
Contributor

OK, great. I'll work on this and submit a PR soon.
Thanks @alanakbik!

@jimkts

jimkts commented May 20, 2019

Hi guys! Flair is amazing... I am reading about your project because I am writing my MSc thesis in NLP. I was wondering if Flair supports the Greek language?

@alanakbik
Collaborator

Hello @jimkts - only one embedding type currently supports Greek, namely BytePairEmbeddings, which you could use to embed sentences and train models for Greek:

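from flair.data import Sentence
from flair.embeddings import BytePairEmbeddings

# load the byte-pair embeddings for Greek ("el")
embeddings = BytePairEmbeddings("el")

sentence = Sentence('Αγαπώ την Ελλάδα')
embeddings.embed(sentence)

# print each token followed by its embedding vector
for token in sentence:
    print(token)
    print(token.embedding)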

In order to train a model, you would need to add a Greek training dataset. For instance, the Greek Universal Dependency Treebank or a dataset for Named Entity Recognition. You can check out the tutorials on how to read in your own datasets or train your own models. If you have questions do let us know - we'd be happy to add Greek support.

@stefan-it
Member

@jimkts I could train Flair embeddings for Greek if you want :)

Meanwhile, you could also try the multilingual BERT model (it also includes Greek, trained on Wikipedia).

@jimkts

jimkts commented Jul 1, 2019

Hello @alanakbik... I trained gensim Word2Vec on a big Greek corpus (~17 Gb and ~3500000 words). How can I use this pre-trained model in Flair?

@stale

stale bot commented Apr 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@marcomoriatbi

Dear Alan, is there any support for Italian NER? Does Italian NER require training a new model? Thanks

@alanakbik
Collaborator

Hello @marcomoriatbi, there is no pre-trained model for Italian NER yet. You could try 'ner-multi', which was trained on 4 languages and also kind of works for languages related to those it was trained on. I tried this model for French and it worked ok, so maybe that extends to Italian as well.

Otherwise, you would need to train your own Italian NER model. There are Italian Flair embeddings included, but on the dataset side, we currently only include NER datasets for Italian that were automatically generated: WIKINER_ITALIAN, WIKIANN and XTREME (see here for more info). I think there are better NER datasets for Italian out there.

whoisjones added a commit that referenced this issue Feb 4, 2021
whoisjones added a commit that referenced this issue Feb 4, 2021
alanakbik pushed a commit that referenced this issue Jun 8, 2021
whoisjones added a commit that referenced this issue Nov 9, 2021
@longsc2603

Hi, I am looking through Flair and wondering whether it supports Vietnamese. If not, will it in the future? Thank you!
