How to provide multilingual support #84
For Asian languages (Chinese, Japanese, ...), word segmentation is an essential preprocessing step. We usually remove non-textual characters from the corpus, making it look like naturally written text, and then segment the text with NLP tools. Its difference from the current pipeline seems mainly to be that segmentation has to happen before steps like stopword and punctuation removal. |
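For context, a sketch of what segmentation looks like with jieba, one common Chinese segmenter (Japanese needs a different tool, e.g. MeCab); the example sentence is the one from jieba's README:

```python
import jieba  # pip install jieba

# accurate mode: split a sentence into the most plausible words
print(list(jieba.cut("我来到北京清华大学")))
# ['我', '来到', '北京', '清华大学']
```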
Just used Texthero for the first time yesterday, on Portuguese. The preprocessing pipeline seems fine, except for stopwords. A solution like the one @AlfredWGA mentioned would be very much appreciated inside hero, i.e. building and running a custom pipeline. In my case, I would only need to change the stopwords function. |
Hey @Jkasnese, if you look at the getting-started guide, you will see that you can define your own custom pipeline:
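A sketch of such a pipeline, assuming the `texthero.preprocessing` function names listed in the getting-started (`fillna`, `lowercase`, `remove_whitespace`):

```python
import pandas as pd
import texthero as hero
from texthero import preprocessing

df = pd.DataFrame({"text": ["Texthero é útil!", None]})

# every pipeline step is a function from pd.Series to pd.Series
custom_pipeline = [
    preprocessing.fillna,
    preprocessing.lowercase,
    preprocessing.remove_whitespace,
]
df["clean_text"] = hero.clean(df["text"], custom_pipeline)
```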
Is that what you were looking for? |
Great! |
Yes, I think so! Thanks! I'll check later how to pass arguments to the remove_stopwords function. It would be nice if it were possible to pass a custom stopword list to remove_stopwords directly inside the pipeline.
But I'm guessing it's done with **kwargs, right? I don't know, I'm a newbie haha, gonna check this later. Thanks again! |
You can solve it like this:
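A sketch of that solution: every pipeline item must be a one-argument function of a Series, so the extra `stopwords` argument can be bound beforehand with `functools.partial` (NLTK's Portuguese list here is just one possible source of stopwords):

```python
from functools import partial

import nltk
import pandas as pd
import texthero as hero
from texthero import preprocessing

nltk.download("stopwords", quiet=True)
pt_stopwords = set(nltk.corpus.stopwords.words("portuguese"))

# bind the extra argument so the step is still a one-argument function
remove_pt_stopwords = partial(preprocessing.remove_stopwords, stopwords=pt_stopwords)

df = pd.DataFrame({"text": ["o Texthero é útil"]})
df["clean_text"] = hero.clean(
    df["text"],
    [preprocessing.fillna, preprocessing.lowercase, remove_pt_stopwords],
)
```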
All items in the pipeline must be functions that take a Pandas Series and return a Pandas Series. I agree that this is not trivial. We will make sure that this is well explained both in the docstring of the `clean` function and in the getting-started guide. |
I'd like that! Can't make any guarantees though, since I'm already involved in many projects ): I'll try to do it by Saturday. I still have some questions, which might be good, since I can address them in the docstring. Should I create a new issue (since this is a bit off-topic here) or message you privately? |
Both solutions work; either open an issue or send me an email: jonathanbesomi__AT__gmail.com |
@jbesomi I think it would also be good to add a `language` argument to `remove_stopwords`. |
I perfectly agree with what you are proposing, i.e. to permit removing stopwords of a specific language. The only big question (and the main purpose of this discussion) is to understand how. There are different alternatives:

1. Add a `language` argument to every language-dependent function, e.g. `remove_stopwords(s, language="pt")`.
2. Have a global language setting that all language-dependent functions consult.
3. Detect the language of the corpus automatically.
In general, I'm against adding too many arguments to functions, as this generally makes them more complex to understand and use. Also, something we always need to keep in mind is that, from a multilingual perspective, Texthero is composed of two kinds of functions: language-dependent ones (mostly in the preprocessing module) and language-independent ones (most of representation and visualization).
Only some of the preprocessing functions actually need to know the language. Your opinions? Technically, do you think it is feasible, and not too complex, to have a global setting for the language? |
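To make the question concrete, a minimal sketch of what such a global setting could look like; everything here (`set_language`, the stopword table, this `remove_stopwords`) is hypothetical, not existing Texthero API:

```python
import re
import pandas as pd

_LANGUAGE = "en"  # module-level default

def set_language(language: str) -> None:
    """Hypothetical global switch consulted by language-dependent functions."""
    global _LANGUAGE
    _LANGUAGE = language

# stand-in for per-language resources a real implementation would ship
_STOPWORDS = {"en": {"the", "a", "an"}, "pt": {"o", "a", "um"}}

def remove_stopwords(s: pd.Series, stopwords=None) -> pd.Series:
    """Fall back to the stopwords of the globally configured language."""
    words = stopwords if stopwords is not None else _STOPWORDS[_LANGUAGE]
    pattern = r"\b(?:{})\b".format("|".join(map(re.escape, sorted(words))))
    return s.str.replace(pattern, "", regex=True)

set_language("pt")  # after this, no per-call language argument is needed
```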
I think @AlfredWGA's solution of having a global setting is a better idea than adding a language argument to every single function. #3 is also very interesting and might be an even better idea, as it automates the process. It aligns perfectly well with Texthero's purpose of being easy to use and understand. |
I found a problem with using a global language setting. Some functions cannot be applied to Asian languages as they are now; e.g. remove_punctuation should be integrated into remove_stopwords after tokenization. |
Hey @AlfredWGA! Apologies, what do you mean by "integrated" ("remove_punctuation is integrated into remove_stopwords after tokenization")?

I agree. To understand the problem better, we should create a document (by opening a new issue, or using a Google doc or similar) and, for the different languages, make a list of the necessary functions. Presumably, except for some functions in preprocessing, most of them will be shared.

Once we have this, it will be easier to see how to solve the problem. Hopefully we will notice some patterns, and we will be able to group together languages that share a common preprocessing pipeline. Do you agree?

Then, a simple idea might be to have a default pipeline per language (or per group of languages).

What are your thoughts? |
Sorry for the confusion. The default pipeline for preprocessing Chinese text should look like this:
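A sketch of that order, with jieba standing in for the segmenter; `tokenize_zh` is hypothetical, and `remove_urls`/`remove_html_tags` are assumed to be the corresponding `texthero.preprocessing` functions:

```python
import jieba
import pandas as pd
from texthero import preprocessing

def tokenize_zh(s: pd.Series) -> pd.Series:
    """Hypothetical Chinese tokenizer: jieba word segmentation per row."""
    return s.apply(lambda text: list(jieba.cut(text)))

chinese_pipeline = [
    preprocessing.fillna,
    preprocessing.remove_urls,       # safe before segmentation
    preprocessing.remove_html_tags,  # safe before segmentation
    tokenize_zh,                     # segmentation first ...
    # ... then remove punctuation + stopwords from the token lists
]
```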
Punctuation and stopwords should be removed after tokenization (as they might affect the word segmentation results). We can put the punctuation marks into the list of stopwords and remove them together using remove_stopwords. In this case, if we use a global language setting, it is not only individual functions but the default pipeline itself that has to change with the language. |
Hey @AlfredWGA, sorry for the late reply. I agree that a series of language-specific default pipelines is probably the way to go.

Regarding tokenize: in the case of Western languages, a simple regex-based tokenizer is good enough, and that is the main reason we have not required tokenization before remove_stopwords until now. For Asian languages, tokenize has to do real word segmentation, and therefore has to come before punctuation and stopword removal.

One more thing regarding what you were proposing (removing punctuation and stopwords together, after tokenization): as an alternative to multiple per-language pipelines, could we tokenize first for every language and apply all the remaining cleaning steps on tokens?
Looking forward to hearing from you! 👍 |
From my perspective, yes, except for some strings that won't interfere with word segmentation (urls, html tags, etc.); those can still be removed before tokenization.
Then the cleaning process for Western and Asian languages would be unified. What do you think? |
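A minimal sketch of that unified flow; all names here are hypothetical, and only the segmenter changes per language:

```python
import pandas as pd

def clean_unified(s: pd.Series, segment, stopwords) -> pd.Series:
    """Hypothetical unified cleaner: noise removal -> segmentation -> token filtering."""
    # 1. strings that never interfere with segmentation can be removed first
    s = s.str.replace(r"https?://\S+", "", regex=True)  # urls
    s = s.str.replace(r"<[^>]+>", "", regex=True)       # html tags
    # 2. language-specific word segmentation
    tokens = s.apply(segment)
    # 3. token-level cleaning, shared by all languages
    return tokens.apply(lambda ts: [t for t in ts if t not in stopwords])
```

For English, `segment` could simply be `str.split`; for Chinese, something like `lambda t: list(jieba.cut(t))`.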
Sounds good! We just need to find the right name for that pre-tokenization cleaning step. |
Yes, @AlfredWGA. How do you suggest we proceed, in relation to how you plan to contribute? |
I'll start implementing it, then. |
OK! |
Text preprocessing might be very language-dependent.
It would be great for Texthero to offer text preprocessing in all different languages.
There are probably two ways to support multiple languages: extend the current functions with a language argument, or provide separate, language-specific functions and pipelines.
We might also have cases where a dataset is composed of many languages. What would be the best solution in that case?
The first question we probably have to answer is: do different languages require very different preprocessing pipelines, and therefore different functions?