[feature requests] support utf-8 characters in feature name #2478
Comments
I guess feature names in xgb/cat are maintained on the Python side, so UTF-8 encoding is easy there. But that requires a separate implementation in each language package, and a different model save/load solution. In LightGBM, feature names are maintained on the cpp side and saved in the model file, which makes UTF-8 hard. If we want to support UTF-8 feature names, the model save/load logic may have to change, which would cause more backward compatibility problems. |
A workaround is to save an additional file for the feature names and force its name to be <model_file_name>+".fn". The encoding of that file could be UTF-8, and it would be loaded automatically by Python/R itself when loading the model file, not by cpp. |
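The sidecar-file idea above could look roughly like this on the Python side. This is only a sketch under assumptions, not LightGBM's API: the helper names `save_feature_names` and `load_feature_names` are hypothetical, and storing the names as a JSON array is my own choice. Only the `.fn` suffix comes from the comment above.

```python
import json
from pathlib import Path
from typing import List


def _sidecar_path(model_file: str) -> Path:
    # The comment above proposes <model_file_name> + ".fn".
    return Path(model_file + ".fn")


def save_feature_names(model_file: str, feature_names: List[str]) -> None:
    """Write the feature names next to the model file as UTF-8 (hypothetical helper)."""
    payload = json.dumps(list(feature_names), ensure_ascii=False)
    _sidecar_path(model_file).write_text(payload, encoding="utf-8")


def load_feature_names(model_file: str) -> List[str]:
    """Read the feature names back from the sidecar file (hypothetical helper)."""
    return json.loads(_sidecar_path(model_file).read_text(encoding="utf-8"))
```

JSON is used here so that names containing line breaks or other separators survive the round trip. A wrapper would call `save_feature_names` right after saving the model and `load_feature_names` right after loading it.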
Could LightGBM automatically replace and restore UTF-8 feature names on the Python/R side, before and after the cpp part? Or maintain the feature-name transformation in Python/R and support UTF-8 indirectly? The transformation dict could also be written to the model file. |
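A minimal sketch of the replace-and-restore idea, assuming it is done entirely in user code before and after calling LightGBM. The function names and the `feature_<index>` placeholder scheme are hypothetical. ASCII names are left untouched so the model file stays readable.

```python
from typing import Dict, List, Tuple


def encode_feature_names(names: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Swap non-ASCII names for ASCII placeholders; return new names and a restore map."""
    taken = set(names)
    ascii_names = []
    restore = {}
    for index, name in enumerate(names):
        if name.isascii():
            ascii_names.append(name)
            continue
        placeholder = f"feature_{index}"
        # Avoid clashing with a real feature that already has this name.
        while placeholder in taken:
            placeholder += "_"
        taken.add(placeholder)
        restore[placeholder] = name
        ascii_names.append(placeholder)
    return ascii_names, restore


def decode_feature_names(ascii_names: List[str], restore: Dict[str, str]) -> List[str]:
    """Map placeholders back to the original names; other names pass through."""
    return [restore.get(name, name) for name in ascii_names]
```

The `restore` dict is the "transformation dict" mentioned above. It has to be stored somewhere alongside the model so the original names can be recovered after loading.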
@OnlyFor I am not familiar with the string encoding, but I think that is not trivial. |
@guolinke Encoding feature names will hurt the model file readability for humans, I guess. |
@guolinke WDYT #2478 (comment)? |
Closed in favor of being in #2302. We decided to keep all feature requests in one place. Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature. |
Hello everyone, I get the error below. How can I solve it? |
For now, there cannot be any non-ASCII symbols in string model representation. So, you should simply rename your feature names before passing them into LightGBM. |
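Before renaming, it may help to see which names are affected. The helper below is a hypothetical sketch (it is not part of LightGBM) that lists every feature name containing a non-ASCII character, together with the offending characters.

```python
from typing import Dict, List


def find_non_ascii(feature_names: List[str]) -> Dict[str, List[str]]:
    """Map each offending feature name to the non-ASCII characters it contains."""
    offenders = {}
    for name in feature_names:
        bad_chars = [ch for ch in name if ord(ch) > 127]
        if bad_chars:
            offenders[name] = bad_chars
    return offenders
```

An empty result means every name is plain ASCII, so the error would have to come from something else.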
Hello everyone, I get the error below. How can I solve it? How can I solve this? |
@rajibrj43 Hi! Indeed, there are no non-ASCII symbols in your feature names, and I cannot reproduce your issue: LightGBM trains just fine with those feature names.
|
Hello, I noticed this issue recently and I think the current behavior is not great, so for now I have put together a workaround for Python.
import types
# gbm is an instance of LGBMModel.
# feature_names is your own list of feature names
gbm.booster_._feature_names = feature_names
gbm.booster_.feature_name = types.MethodType(lambda self: self._feature_names, gbm.booster_)
# NOTICE: `pickle` can't dump `lambda`, so you can use `dill` or `cloudpickle`
In the future, I (or someone) will rework the python-package so that it stores feature_names itself (outside cpp). |
OK, I will give it a try, thanks.
(In reply to OMOTO Tsukasa's comment of November 18, 2019, 3:07 PM.)
|
@StrikerRUS : This bug is more frequent than you seem to realize. It arises not because of original database columns / feature names being non-ASCII, but from the use of one-hot encoder (e.g |
Because you used random numbers for all variables, including Name:) Try some non-ASCII characters with one-hot encoder... @StrikerRUS, once you replicate, could you please /reopen and repair? |
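To illustrate the point about one-hot encoding: even when every column name is ASCII, the generated feature names pick up whatever characters appear in the category values. The sketch below is pure Python and uses a `<column>_<value>` naming scheme as an assumption. The encoder actually used in the report is not shown in the thread.

```python
from typing import Dict, List


def one_hot_feature_names(columns: Dict[str, List[str]]) -> List[str]:
    """Build one-hot feature names as <column>_<value> for each distinct value."""
    names = []
    for column, values in columns.items():
        for value in sorted(set(values)):
            names.append(f"{column}_{value}")
    return names
```

With a column `Name` holding the values `Zoë` and `Bob`, this yields `Name_Bob` and `Name_Zoë`, so a non-ASCII feature name appears although the column name itself is ASCII.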
@mirekphd I see. Speaking of reopening, please refer to #2478 (comment). |
This commit reverts 0d59859. Also see:
- microsoft#2226
- microsoft#2478
- microsoft#2229
I reproduced the issue and, as @kidotaka gave us a great survey in microsoft#2226, I don't conclude that the cause is UTF-8, but "an empty string (character)". Therefore, I revert "throw error when meet non ascii (microsoft#2229)" whose commit hash is 0d59859, and add support for feature names as UTF-8 again.
* Support UTF-8 characters in feature name again
This commit reverts 0d59859. Also see:
- #2226
- #2478
- #2229
I reproduced the issue and, as @kidotaka gave us a great survey in #2226, I don't conclude that the cause is UTF-8, but "an empty string (character)". Therefore, I revert "throw error when meet non ascii (#2229)" whose commit hash is 0d59859, and add support for feature names as UTF-8 again.
* add tests
* fix check-docs tests
* update
* fix tests
* update .travis.yml
* fix tests
* update test_r_package.sh
* update test_r_package.sh
* update test_r_package.sh
* add a test for R-package
* update test_r_package.sh
* update test_r_package.sh
* update test_r_package.sh
* fix test for R-package
* update test_r_package.sh
* update test_r_package.sh
* update test_r_package.sh
* update test_r_package.sh
* update
* updte
* update
* remove unneeded comments
@henry0312 Can we mark this issue as resolved via #2976, or is it better to wait for @jameslamb's PR for the R part? |
I believe that we can, but it's better for us to wait for the R tests to pass because I'm not an expert in R. |
I already created #2983 to capture the R-specific work @henry0312 |
Thank you @jameslamb! Sorry, I didn't notice it. Then I'm putting a tick in our feature requests hub for this issue, because the model file supports UTF-8 after #2976. R-specific progress will be tracked in your separate issue. |
import lightgbm as lgb
param = {'objective': 'regression',
lgbm = lgb.train(params=param,
y_pred_lgbm = lgbm.predict(X_cv)
I'm getting an error: |
@Shubhammishra-21 please use the latest master branch. |
Recode variables (names) with tildes, ... |
I have got the error |
"Do not support special JSON characters in feature name": how can I use LightGBM on columns with Russian names? I have the training part of a dataframe with Russian column names:
And when I apply LightGBM to it, it raises the same error:
So how can I use LightGBM with this kind of feature name? Do I need to translate them into English column names? |
Hi @antoinecomp !
This error is not about Russian feature names. LightGBM handles them fine now. This issue is about JSON special chars, i.e. LightGBM/include/LightGBM/utils/common.h Lines 848 to 854 in 792c930
You should just remove commas ( |
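A hedged sketch of how such characters could be stripped from column names before training. The set in `JSON_SPECIAL_CHARS` is an assumption on my part (the referenced snippet from `common.h` is not reproduced in this thread, which only mentions commas), so adjust it to whatever your LightGBM version rejects. The function name is hypothetical.

```python
from typing import List

# Assumed set of characters treated as "special JSON characters".
JSON_SPECIAL_CHARS = '",:[]{}'


def sanitize_feature_names(names: List[str], replacement: str = "_") -> List[str]:
    """Replace assumed JSON special characters and keep the results unique."""
    seen = set()
    cleaned = []
    for name in names:
        candidate = "".join(
            replacement if ch in JSON_SPECIAL_CHARS else ch for ch in name
        )
        unique = candidate
        suffix = 1
        # Two different names may collapse to the same string; add a counter.
        while unique in seen:
            unique = f"{candidate}{replacement}{suffix}"
            suffix += 1
        seen.add(unique)
        cleaned.append(unique)
    return cleaned
```

Cyrillic letters pass through unchanged; only the listed punctuation is replaced. With a pandas dataframe the result could be assigned back with something like `df.columns = sanitize_feature_names(list(df.columns))`.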
Do not support non-ascii characters in feature name?
Could you please consider backward compatibility?
I use xgboost, catboost and sklearn at the same time, and only lightgbm has encoding compatibility problems...
thx