Chinese language support #68
Hi Guoao! Thank you for your message and for your help! We are pleased to have you here! For the "things I can do", which one would you be most interested in starting with? "Translate documents & tutorials into Chinese" is very cool and very useful; first, though, we will need to set up a system that allows this kind of translation, basically using Sphinx internationalization. Unfortunately, I do not speak or write Chinese, and I have never done NLP on Chinese texts. What are the most popular Chinese NLP Python tools that Chinese NLP developers are using now? Do you think it will be possible to develop something like that?
Also, do you think we will need to provide support for both Simplified and Traditional Chinese? Regards,
I would prefer to start with adding Chinese support for the preprocessing module. The most common Chinese NLP tools right now should be jieba, HanLP, and pkuseg. Also, spaCy has integrated jieba for Chinese word segmentation. I think creating a distinct module for Chinese preprocessing could work.
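For context, a minimal sketch of what word segmentation with jieba looks like (jieba, HanLP, and pkuseg all expose similar segment-a-string APIs; the sample sentence and the exact output below are illustrative, not from this thread):

```python
# Minimal word-segmentation sketch with jieba (pip install jieba).
import jieba

text = "我爱自然语言处理"  # "I love natural language processing"
tokens = list(jieba.cut(text))  # jieba.cut returns a generator of segments
print(tokens)
# illustrative output: ['我', '爱', '自然语言', '处理']
```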
Simplified and Traditional Chinese share almost the same grammar and usage, but differ in specific words. For example, "software" is written as "软件" in Simplified and "軟體" in Traditional (whose simplified form is "软体"). So we usually just convert Traditional characters to Simplified ones, or vice versa. Since most Chinese NLP tools are developed on Simplified Chinese, I think support for Simplified Chinese should be enough.
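As an illustration of that Traditional-to-Simplified normalization, here is a sketch using the OpenCC Python bindings; the package choice and profile name are assumptions based on the common OpenCC distributions, not something discussed in this thread:

```python
# Sketch: normalize Traditional Chinese to Simplified with OpenCC
# (e.g. pip install opencc-python-reimplemented; the official binding
# names the profile "t2s.json" instead of "t2s").
from opencc import OpenCC

converter = OpenCC("t2s")  # t2s = Traditional-to-Simplified
print(converter.convert("軟體"))  # -> "软体"
```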
Hey! Thank you for your exhaustive answer! Adding Chinese support sounds super good to me! Before starting with the details of the implementation, we need to figure out how Texthero should provide multi-language support. I opened a discussion issue about this: #84. What is your opinion on that? Imagine, for instance, that we have a Pandas Series composed of 4 different languages ... what would be the most elegant and easiest solution for text segmentation there?
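One conceivable answer, sketched below, is per-row language detection followed by dispatch to a per-language tokenizer. This is purely hypothetical: langdetect, the dispatch table, and the fallback are illustrative choices, not Texthero's actual design:

```python
# Hypothetical sketch: detect each row's language, then dispatch to a
# language-specific tokenizer. Not Texthero's actual implementation.
import pandas as pd
import jieba
from langdetect import detect  # pip install langdetect


def tokenize_any(text: str) -> list:
    lang = detect(text)            # e.g. 'zh-cn', 'en', 'fr'
    if lang.startswith("zh"):      # Chinese needs word segmentation
        return list(jieba.cut(text))
    return text.split()            # naive whitespace fallback for other languages


s = pd.Series(["我爱自然语言处理", "I love NLP", "J'aime le TAL"])
print(s.apply(tokenize_any))
```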
Hey @AlfredWGA! Great, thank you! Is that word_segmentation the same as tokenization? Looking forward to seeing what you come up with!
@jbesomi I'm also confused about the difference. This reference says that word segmentation is a process prior to tokenization, but in practice we treat them as almost equivalent.
Hey @AlfredWGA! For simplicity, and to keep the same pipeline as the other languages, I would say to consider word segmentation as simply tokenization. Regarding your question, it probably makes sense to just focus on the specific functions we need for the Chinese language. Some general remarks: in principle, we prefer to install as few external packages as possible, so using spaCy, which Texthero already depends on, would be ideal. If you look at Texthero's source code (for example, where the spaCy model is loaded), just by replacing "en_core_web_sm" with "zh_core_web_sm" we might be able to tokenize (word-segment) Chinese text. Is that true? Out of curiosity, I also looked at the Chinese stopwords; the output is a large set of common function words.
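A minimal sketch of how one might check both things, assuming spaCy ≥ 2.3 with the Chinese language data and the zh_core_web_sm model installed (the sample sentence and printed outputs are illustrative):

```python
# Sketch, assuming spaCy >= 2.3 with zh_core_web_sm installed
# (python -m spacy download zh_core_web_sm).
import spacy
from spacy.lang.zh.stop_words import STOP_WORDS

# Inspect the bundled Chinese stopword list (a set of function words).
print(len(STOP_WORDS))
print(sorted(STOP_WORDS)[:10])

# Tokenize (word-segment) a Chinese sentence with the zh model.
nlp = spacy.load("zh_core_web_sm")
doc = nlp("我爱自然语言处理")
print([token.text for token in doc])
# illustrative output: ['我', '爱', '自然', '语言', '处理']
```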
Do you think we can basically just load "zh_core_web_sm" instead of "en_core_web_sm" and we will be able to provide Chinese support? It would be great if you tried that in a Jupyter notebook to see how it works. I cannot do it myself, as I don't know how an NLP pipeline on Chinese text works. Other than tokenization, are there other functions that would need Chinese-specific treatment? Thanks! 🎉
Hi @jbesomi. Also, as I mentioned above, spaCy provides an API that lets the user directly call jieba and pkuseg for word segmentation (https://spacy.io/usage/models#chinese). The output is the segmented token sequence.
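A sketch of that API as it was documented for spaCy v2.3 at the page linked above; treat the exact config keys as version-dependent (in spaCy v3 this moved to a "segmenter" setting):

```python
# Sketch of spaCy v2.3's Chinese tokenizer configuration.
from spacy.lang.zh import Chinese

# Character segmentation (the default)
nlp = Chinese()

# jieba segmentation (pip install jieba)
cfg = {"use_jieba": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})

# pkuseg segmentation with the "default" pretrained pkuseg model
cfg = {"pkuseg_model": "default", "require_pkuseg": True}
nlp = Chinese(meta={"tokenizer": {"config": cfg}})

doc = nlp("我爱自然语言处理")
print([t.text for t in doc])  # illustrative: ['我', '爱', '自然', '语言', '处理']
```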
Although using spaCy's wrapper is convenient, we could still expose the underlying segmenters and let the user decide which one they want to use. I just looked through Texthero's API, and I think Chinese can use the same pipeline parts as the current ones.
Thank you for your exhaustive reply @AlfredWGA! For your information, right now I'm okay with letting the user select which segmenter to use, for instance through a keyword argument on the tokenization function, where the argument names one of the available segmenters.
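A hypothetical sketch of what such an API could look like; the function name, the `segmenter` parameter, and the dispatch are illustrative, not Texthero's actual interface:

```python
# Hypothetical API sketch; not Texthero's actual interface.
import pandas as pd
import jieba


def tokenize_zh(s: pd.Series, segmenter: str = "jieba") -> pd.Series:
    """Segment a Series of Chinese text with the chosen backend."""
    if segmenter == "jieba":
        return s.apply(lambda text: list(jieba.cut(text)))
    raise NotImplementedError(f"segmenter {segmenter!r} is not wired up yet")


s = pd.Series(["我爱自然语言处理"])
print(tokenize_zh(s, segmenter="jieba"))
```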
This goes a bit in the opposite direction of the general purpose of Texthero, though. The idea of Texthero is that we evaluate the different options ("spacy", "jieba", "pkuseg", etc.) and pick the best for the user, so that they do not have to make a choice. This rests on the assumptions that users do not know which one is better (exactly like us), that we have tested and picked the best, and that in most cases any solution is good enough (for very specific problems, Texthero does not work either ...). Having said that, it would be even better if we compared the three tokenizers (or found papers or articles that compare them) and picked the best one. Also, if we find that one of them is clearly superior, we can simply make it the default.
Hi, @jbesomi. You've made a good point. The Chinese model of spaCy was originally released from howl-anderson/Chinese_models_for_SpaCy; however, there hasn't been much information about its performance compared to other tools. jieba is stable and widely used (over 23.6k GitHub stars), so I would suggest that we use the jieba API from spaCy for starters, and see how spaCy's Chinese model performs later.
Hey @AlfredWGA, I gave it a quick look. Your idea of setting the language through a dedicated parameter sounds good to me. I'm looking forward to seeing your implementation. For any question, do not hesitate to ask! 😃
I just found at https://spacy.io/usage/models#chinese that spaCy's Chinese model uses a custom pkuseg model trained on OntoNotes.
Hello, my name is Guoao Wei. I am a Chinese student interested in NLP, and I can help with Chinese language support for this amazing repository.
About me
I received a bachelor's degree in Software Engineering in China. I worked as a research intern at the Chinese Academy of Sciences for a year, focusing on NLP-related topics.
I have been searching for tools that save time on writing redundant preprocessing code when dealing with text data (I wrote my own simple one, AlfredWGA/nlputils), until I found Texthero. Therefore I am happy to contribute to this toolkit.
Things I can do