Add initial Chinese support for hero.lang.zh.preprocessing #128
base: master
General feedback:
To me it seems like the best way to implement this is the following:
We have some config where we can set the language. If a user does hero.set_language("Chinese"), the library automatically starts calling the functions that are in /texthero/lang/chinese/... (or sth. like that) if they are implemented there, and if they're not there, it just calls the standard texthero function.
This could probably be implemented by some other config file in texthero/lang/chinese/... that lists all functions that are overwritten for Chinese. So all in all, what I think should happen (but I'm of course open to discussion) is:
User calls texthero function xy -> texthero checks what language is set. If set to e.g. Chinese, texthero goes to texthero/lang/Chinese etc. and looks at a config file there. If xy is in the config file there, texthero calls the function from the Chinese module. If it's not, texthero calls the standard version.
This way, we don't add unnecessary overhead. Especially, developers working on standard functions don't have to worry about implementing everything in the language subfolders. Developers working on the language specific stuff can focus on that.
It seems to me like your way introduces some unnecessary overhead down the line.
Interested in what you think, maybe I misunderstand what you're trying to do!
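To make the flow above concrete, here is a minimal sketch of such a lookup with a fallback. Everything in it is an assumption for illustration: `set_language`, `register`, `dispatch`, the `_overrides` registry, and the toy tokenizers are hypothetical names and are not part of texthero.

```python
from typing import Callable, Dict, List

# Hypothetical global setting (not an existing texthero API).
_current_language = "default"

# Hypothetical per-language "config": function name -> override.
_overrides: Dict[str, Dict[str, Callable]] = {}


def set_language(language: str) -> None:
    """Select the language whose overrides should be used."""
    global _current_language
    _current_language = language.lower()


def register(language: str, name: str):
    """Record the decorated function as the override of `name` for `language`."""
    def decorator(func: Callable) -> Callable:
        _overrides.setdefault(language.lower(), {})[name] = func
        return func
    return decorator


def dispatch(name: str, standard: Callable) -> Callable:
    """Return the override of `name` for the current language, else `standard`."""
    return _overrides.get(_current_language, {}).get(name, standard)


def _standard_tokenize(text: str) -> List[str]:
    return text.split()


def _standard_lowercase(text: str) -> str:
    return text.lower()


@register("chinese", "tokenize")
def _chinese_tokenize(text: str) -> List[str]:
    # Placeholder only: one token per character. A real version would
    # call a proper word segmenter.
    return [ch for ch in text if not ch.isspace()]


# Public functions: overridden for Chinese only where an override exists.
def tokenize(text: str) -> List[str]:
    return dispatch("tokenize", _standard_tokenize)(text)


def lowercase(text: str) -> str:
    return dispatch("lowercase", _standard_lowercase)(text)
```

With this shape, `lowercase` keeps using the standard version after `set_language("Chinese")` because no override is registered for it, while `tokenize` switches to the Chinese one.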
texthero/_helper.py
"""
@wrapt.decorator
Is the wrapt module really necessary here? Maybe have a look at the decorator implemented a few lines above that uses just the built-in functools from the standard library. Then the new dependency isn't needed.
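For reference, a decorator of this general kind can be written with only `functools`. The sketch below is an assumption about what such a decorator might do (forward the call to the same-named function in another namespace); `delegate_to` and the `standard` stand-in are hypothetical names, not the code under review.

```python
import functools
from types import SimpleNamespace


def delegate_to(namespace):
    """Hypothetical decorator factory: call the same-named function in `namespace`."""
    def decorator(func):
        @functools.wraps(func)  # copies __name__, __doc__, etc. from `func`
        def wrapper(*args, **kwargs):
            target = getattr(namespace, func.__name__)
            return target(*args, **kwargs)
        return wrapper
    return decorator


# Stand-in for a module holding the standard implementation.
standard = SimpleNamespace(lowercase=lambda text: text.lower())


@delegate_to(standard)
def lowercase(text):
    """Lowercase all text."""
```

`functools.wraps` preserves the wrapped function's name and docstring, which is usually what matters for generated documentation.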
I saw in your PR #69 that wrapt is used too. This module looks clean, so I just wanted to try it. It isn't really necessary, and I'm fine with removing it. 😃
return hero.preprocessing.clean(s, pipeline)

@root_caller(hero.preprocessing)
Is there no way to only "overwrite" those functions that really are different for Chinese, and not mention the others at all? I can see this getting really tedious when we introduce more and more languages: if we now want to add a new function, we have to add a function that does nothing (just pass) in every language module.
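One possible way to get this behaviour, sketched here as an assumption rather than as what the PR does, is a module-level `__getattr__` (PEP 562, Python 3.7+): the language module defines only the functions that differ, and every other name is looked up in the standard module. The two modules below are stand-ins built with `types.ModuleType`, and their contents are hypothetical.

```python
import types

# Stand-in for the standard module (hypothetical contents).
standard = types.ModuleType("standard_preprocessing")
standard.lowercase = lambda text: text.lower()
standard.tokenize = lambda text: text.split()

# Stand-in for a Chinese module that overrides only `tokenize`.
chinese = types.ModuleType("chinese_preprocessing")
chinese.tokenize = lambda text: [ch for ch in text if not ch.isspace()]


def _fallback(name):
    """Called only when `name` is not defined in the Chinese module."""
    try:
        return getattr(standard, name)
    except AttributeError:
        raise AttributeError(f"no function named {name!r}") from None


# In a real module file this would be written as a top-level
# `def __getattr__(name): ...` inside the language module itself.
chinese.__getattr__ = _fallback
```

Adding a new standard function then requires no change in any language module; it becomes reachable through the fallback automatically.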
(see my general comment above)
Hi @henrifroese, you've made a good point. My starting point is that a lot of standard functions are not applicable to Chinese. For example,
Your suggestion is great; I had the same thought. However, how can we deal with the problem above? Should we hide these N/A APIs from the user?
Dear @AlfredWGA and @henrifroese,
Thank you for the PR and the respective review; good job! 👍
Regarding the order of operations in the
The only drawback of this solution is that it might be slower for certain operations. For instance,
Regarding the fact that some of the functions are not relevant for a specific language (i.e.
Regarding the other comment(s): agree with Henri.
P.S. @AlfredWGA, should I call you Alfred or Guoao?
What do you think about this: When the language is set to e.g. Chinese and a user calls a function
This also allows us to gradually add support. At the beginning, many functions might still be in the second list; after implementing solutions for Chinese, they are moved to the first list. However, I also agree with this, maybe there's a well-proven path to doing this:
One other thing: maybe convert this to a draft PR. On the right side, under "Reviewers", it should say "Still in progress? Convert to draft"; click that. This way, everyone can see this is a work in progress and not ready to be merged yet.
@jbesomi, thanks for the suggestion, I'll try to look for solutions through other NLP libraries. Also you can call me Guoao. 😃
@henrifroese thanks for your idea. But think of a situation: without documentation or docstring, when a user tries to call a function, he/she won't know if it supports Chinese until that function returns (correct result or Exception), which is kind of annoying, isn't it? How about I do this under
This way we still maintain a list of supported and unsupported functions, and we also prevent users from calling unexpected functions and getting exceptions. What do you think?
That is extremely simple and does the job! Makes sense!
I only fail to see how we would still maintain a list of unsupported functions. The way you suggest, if someone calls a texthero function that's not implemented for Chinese yet and doesn't work for Chinese, Python will just give an error saying it does not know the function? Or would you keep a separate list of functions that need to be changed for Chinese, but aren't implemented yet?
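The behaviour being asked about can be illustrated with a small sketch. It is only one reading of the exchange, under the assumption that a language module exposes an explicit list of names: reusable standard functions are re-exported, language-specific ones are defined locally, and anything else is simply absent. The namespaces and the placeholder implementations below are hypothetical stand-ins.

```python
from types import SimpleNamespace

# Stand-in for the standard module; the implementations are placeholders.
standard = SimpleNamespace(
    fillna=lambda values: ["" if v is None else v for v in values],
    remove_diacritics=lambda text: text,  # placeholder, assumed not applicable to Chinese
)

# Hypothetical explicit list of standard functions that also work for Chinese.
SUPPORTED_FROM_STANDARD = ["fillna"]

# The Chinese namespace re-exports only the listed functions ...
chinese = SimpleNamespace(
    **{name: getattr(standard, name) for name in SUPPORTED_FROM_STANDARD}
)
# ... and adds its own language-specific ones (placeholder tokenizer).
chinese.tokenize = lambda text: [ch for ch in text if not ch.isspace()]
```

Under this reading, the names exposed by the language namespace act as the list of supported functions, and calling anything else fails with an `AttributeError` before any processing happens.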
Note that this PR introduces these major changes:
Review:
@henrifroese I think every standard function that isn't included in the language module
Makes sense, I think that's good enough for users 👌
@jbesomi Maybe put these functions in
Exactly! What if we do that in a separate PR? Would you like to do it?
Thank you!! :)
Should we avoid calling variables and functions across modules? For example, in https://github.com/jbesomi/texthero/blob/master/texthero/preprocessing.py#L295, I think in most cases it is unnecessary for modules to call each other. Regulating this might make it easier to support other languages. What do you guys think?
Hey Guoao, I agree with you that this might be redundant. As a general comment: since we are not offering multilingual support for now, any idea that improves flexibility, cleanness, or code robustness is very welcome. I just don't understand what you are proposing or how you would solve that.
Just added a custom Series type to align with the standard
I am thinking of simply not calling functions or variables from other modules, unless that module is language agnostic. For example,
Currently I have found only a few such cases in the code, like
By the way, Chinese word segmentation package
Yes. And we will have to make it clear to users that if they want to use
For the rest: I agree!
Should I create another PR? That issue should be solved before this PR can continue.
@henrifroese Do you think we should proceed with that?
@AlfredWGA I'm not sure what you mean. It's difficult for us not to import functions from other modules at the moment (e.g. I'm not sure how we would not use the tokenize function in representation right now?). Or maybe I am misunderstanding you?
Hi @henrifroese. Some modules need to deal with specific languages and therefore shouldn't be imported directly within other modules (because the user chooses what language they want to deal with). Currently
For
Okay I get it now, thanks 👌. So presumably solution would be
I think a separate PR for this definitely makes sense.
Why does the error
See #171
Just fixed the problem, thank you! @jbesomi
This is the first step towards Chinese support. All the docstrings are the same as the original.
hero.lang.hero_zh module
preprocessing.py for Chinese, removing some inapplicable functions from the original API
See #68.