Support for more languages? #2
Comments
Hello Emil! Thanks for the interest - we are thinking of adding more models in more languages. In particular, we are currently looking at French, Italian and Dutch. Which languages / tasks are you most interested in? |
It depends on whether it's for hobby projects or work projects. Hobby projects: Swedish / POS, Swedish / NER |
Ok great! The German models (POS/NER) will be put online probably sometime next week. We will also progressively add more languages in the near future. Of your list, I think Polish and Spanish are the most likely to be added soonish, though I can't say exactly when. |
Is there anything about adding a new language that I could help with? For instance, there is one big Swedish dataset with POS and NER tags called SUC 3.0. It's available for download here: https://spraakbanken.gu.se/eng/resource/suc3 |
Yes, if you're interested you could train a new model for Swedish POS or NER. You would probably need to adapt the NLPTaskDataFetcher for the task you want to train it on, but otherwise could probably use pretty much the same code as given here (and in the experiments section). I've added Swedish word embeddings to the project. I will also add issues for this task if you are interested! |
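As a rough illustration of the data preparation this involves: sequence labeling corpora for this kind of training are commonly stored in a CoNLL-style column format, with one token and its tag per line and a blank line between sentences. The sketch below only produces that text layout, using the standard library. The function name `to_column_format`, the sample sentences and the UD-style tags are assumptions for illustration; they are not part of Flair and not taken from SUC 3.0, and the exact layout a given data fetcher expects should be checked against the Flair tutorials.

```python
from typing import Iterable, List, Tuple

# One sentence is a list of (token, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def to_column_format(sentences: Iterable[TaggedSentence]) -> str:
    """Render tagged sentences as two-column text.

    Each token goes on its own line as "token tag"; sentences are
    separated by a single blank line. Tokens are assumed not to
    contain whitespace.
    """
    blocks = []
    for sentence in sentences:
        lines = [f"{token} {tag}" for token, tag in sentence]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical sample data; a real corpus would be read from its own files.
sample = [
    [("Jag", "PRON"), ("bor", "VERB"), ("i", "ADP"),
     ("Stockholm", "PROPN"), (".", "PUNCT")],
    [("Hej", "INTJ"), ("!", "PUNCT")],
]

column_text = to_column_format(sample)
print(column_text)
```

The resulting text can be written to train, dev and test files, which is typically the shape a column-based corpus loader reads.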
Fantastic! If you add the issues I’ll see where I can help. |
Are you planning to work on the Portuguese language? |
@eduardompereira I am not sure how quickly we can get around to Portuguese, so we'd welcome contributions here! If it helps, we could package standard word embeddings for Portuguese with the next release? Are you aware of good NER datasets for Portuguese? |
Hi there. For the word embeddings, one can also use the French fastText embeddings: |
Hi @mhham thanks for the pointers - more languages are definitely planned and French is high up on our priority list. I am hoping that the next release will be a lot more multilingual than currently, but I am not sure how quickly we can get around to which language. Of course contributions are always welcome! |
Hi, thanks for the great work! I wonder how many languages Flair supports for NER now. From what I see in release 0.4, it seems that English, German, Dutch, French, Italian, Spanish, Portuguese and Polish are supported? |
I've trained FlairEmbeddings on Wikipedia dumps + OPUS (1 epoch) for some more languages: no, fa, ar, id, pl, da, hi, nl, eu, sl, he, hr, fi, bg, cs and sv. I'll provide them as soon as I have checked their performance on UD :) |
Thanks for the reply! @stefan-it |
Yes, you could also test our multilingual NER model, which can detect entities in English, German, Dutch and Spanish (and even other languages a little) even though it is only one model. |
Thanks for the pointer! Will try that out:) |
Hi @alanakbik. I study NER for Portuguese, and for "general" NER models, I believe the best dataset is the one spaCy uses, which is the one from WikiNER (Learning multilingual named entity recognition from Wikipedia). |
@pvcastro yes good idea. We've already added the WikiNER dataset for Portuguese (see tutorial). You can load it with: original_corpus = NLPTaskDataFetcher.load_corpus(NLPTask.WIKINER_PORTUGUESE) Aside from this, I think it would be good to support a downloading and conversion routine for word embeddings such as the ones you linked, to make it easy to start experimenting with them! |
OK, great. I'll work on this and submit a PR soon. |
Hi guys! Flair is amazing... I am reading about your project because I am writing my MSc thesis in NLP. I was wondering if Flair supports the Greek language? |
Hello @jimkts - only one embedding type currently supports Greek, namely BytePairEmbeddings:
from flair.data import Sentence
from flair.embeddings import BytePairEmbeddings
# load byte-pair embeddings for Greek ("el")
embeddings = BytePairEmbeddings("el")
sentence = Sentence('Αγαπώ την Ελλάδα')
embeddings.embed(sentence)
# print each token together with its embedding vector
for token in sentence:
    print(token)
    print(token.embedding)
In order to train a model, you would need to add a Greek training dataset, for instance the Greek Universal Dependency Treebank or a dataset for Named Entity Recognition. You can check out the tutorials on how to read in your own datasets or train your own models. If you have questions, do let us know - we'd be happy to add Greek support. |
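To give an idea of what reading such a treebank involves: Universal Dependencies treebanks are distributed in CoNLL-U format, with ten tab-separated columns per token (the word form in column 2, the universal POS tag in column 4), comment lines starting with `#`, and a blank line after each sentence. The sketch below extracts (token, tag) pairs using only the standard library. The function name `read_conllu_pos` is hypothetical and not part of Flair, and the sample is a hand-written fragment with unused columns left as `_`, not an excerpt from the actual Greek treebank.

```python
from typing import Iterator, List, Tuple


def read_conllu_pos(text: str) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (form, upos) pairs per CoNLL-U sentence."""
    sentence: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." are skipped.
            continue
        columns = line.split("\t")
        token_id, form, upos = columns[0], columns[1], columns[3]
        # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        sentence.append((form, upos))
    if sentence:
        # Handle input that does not end with a blank line.
        yield sentence


# Hand-written illustrative fragment (assumption, not real treebank data).
sample = (
    "# text = Αγαπώ την Ελλάδα\n"
    "1\tΑγαπώ\t_\tVERB\t_\t_\t_\t_\t_\t_\n"
    "2\tτην\t_\tDET\t_\t_\t_\t_\t_\t_\n"
    "3\tΕλλάδα\t_\tPROPN\t_\t_\t_\t_\t_\t_\n"
    "\n"
)

sentences = list(read_conllu_pos(sample))
print(sentences)
```

Pairs extracted this way could then be turned into whatever corpus layout the training code expects.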
@jimkts I could train Flair embeddings for Greek if you want :) Meanwhile, you could also try the multilingual BERT model (it also includes Greek, trained on Wikipedia). |
Hello @alanakbik... I trained a gensim Word2Vec model on a big Greek corpus (~17 GB and ~3500000 words). How can I use this pre-trained model in Flair? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
Dear Alan, is there any support for Italian NER? Or does Italian NER require training a new model? Thanks |
Hello @marcomoriatbi there is no pre-trained model for Italian NER yet. You could try ' Otherwise, you would need to train your own Italian NER model. There are Italian Flair embeddings included, but on the dataset side, we currently only include NER datasets for Italian that were automatically generated: |
Hi, I am looking through Flair and wondering if it supports Vietnamese or not. If not, will it in the future? Thank you! |
Hi! Flair looks amazing. Clean code, easy to use. Thanks for making it open source!
I was wondering if you plan to add support for more languages? Maybe all the languages where Zalando operates? :) I'm working for a company that needs NLP code that works across pretty much the same set of countries.
Looking at the different available libraries, pre-trained models for more than just English (and German in this case!) are lacking in all of the others.