[feature requests] support utf-8 characters in feature name #2478

Closed
OnlyFor opened this issue Sep 30, 2019 · 28 comments

Comments

@OnlyFor

OnlyFor commented Sep 30, 2019

"Do not support non-ascii characters in feature name"?
Could you please consider backward compatibility?

I use XGBoost, CatBoost, and scikit-learn at the same time; only LightGBM has encoding compatibility problems.

Thanks.

@guolinke
Collaborator

I guess the feature_name function in XGBoost/CatBoost is maintained on the Python side, so UTF-8 encoding is easy there. But that requires a different implementation in each language package, and a different model save/load solution.

In LightGBM, feature names are maintained on the C++ side and saved in the model file, which makes UTF-8 hard to support.

If we want to support UTF-8 feature names, the model save/load logic may have to change, which would cause more backward compatibility problems.

@guolinke
Collaborator

A workaround is to save an additional file for the feature names and force its name to be <model_file_name>+".fn". That file could be encoded in UTF-8 and loaded automatically by the Python/R package itself when loading the model file, not by the C++ code.
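
A minimal sketch of this sidecar-file idea, using only the Python standard library. The helper names (`sidecar_path`, `save_feature_names`, `load_feature_names`) and the JSON layout are assumptions made for illustration; they are not part of LightGBM.

```python
import json
from pathlib import Path


def sidecar_path(model_path):
    """Return the companion path '<model_file_name>.fn' next to the model file."""
    return Path(str(model_path) + ".fn")


def save_feature_names(model_path, feature_names):
    """Write the feature names as UTF-8 JSON next to the model file."""
    text = json.dumps(list(feature_names), ensure_ascii=False)
    sidecar_path(model_path).write_text(text, encoding="utf-8")


def load_feature_names(model_path):
    """Read the names back; return None when no sidecar file exists."""
    path = sidecar_path(model_path)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

A wrapper could call `save_feature_names` right after saving the model and `load_feature_names` right after loading it, so the native code never sees the original names.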

@OnlyFor
Author

OnlyFor commented Sep 30, 2019

Could LightGBM automatically replace and restore UTF-8 feature names on the Python/R side, before and after the C++ part? Or maintain the feature name transformation in Python/R and support UTF-8 indirectly? The transformation dict could also be written to the model file.
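
The replace-and-restore idea could look roughly like the sketch below. The placeholder scheme (`f0`, `f1`, ...) and both function names are hypothetical, chosen only to illustrate the mapping; nothing here is LightGBM API.

```python
def encode_feature_names(feature_names):
    """Map each original name to an ASCII placeholder such as 'f0', 'f1', ..."""
    placeholders = [f"f{i}" for i in range(len(feature_names))]
    mapping = dict(zip(placeholders, feature_names))
    return placeholders, mapping


def restore_feature_names(placeholders, mapping):
    """Translate placeholders (e.g. from an importance report) back to the originals."""
    return [mapping.get(name, name) for name in placeholders]


original = ["wiek", "miasto_Łódź", "sub_type_Комиксы"]
safe_names, mapping = encode_feature_names(original)
# safe_names would be handed to the native library; mapping stays in Python.
restored = restore_feature_names(safe_names, mapping)
```

The mapping has to be stored somewhere alongside the model, otherwise the original names cannot be recovered after loading.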

@guolinke
Collaborator

@OnlyFor I am not familiar with string encoding, but I think that is not trivial.
Maybe we can use something like base64 to encode and decode feature names.
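
As a rough illustration of the base64 idea, the sketch below round-trips a feature name through the standard `base64` module. The function names are hypothetical, and the URL-safe alphabet is an assumption made here so that the encoded form only contains letters, digits, `-`, `_` and `=`.

```python
import base64


def encode_name(name):
    """Encode a UTF-8 feature name as URL-safe base64 (ASCII only)."""
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")


def decode_name(encoded):
    """Recover the original feature name from its base64 form."""
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")


encoded = encode_name("sub_type_Комиксы, манга")
decoded = decode_name(encoded)
```

The encoded names are pure ASCII but no longer readable by a human looking at the model file.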

StrikerRUS changed the title from "[python][feature requests] support utf-8 characters in feature name" to "[feature requests] support utf-8 characters in feature name" on Sep 30, 2019
@StrikerRUS
Collaborator

@guolinke Encoding feature names will hurt the model file readability for humans, I guess.

@StrikerRUS
Collaborator

@guolinke Should this issue be included in #2302?

@StrikerRUS
Collaborator

@guolinke WDYT #2478 (comment)?

@StrikerRUS
Collaborator

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

@PointCloudNiphon

Hello everyone, I get the error below. How can I solve it?
LightGBMError: Do not support non-ascii characters in feature name.

@StrikerRUS
Collaborator

@PointCloudNiphon Hi!

For now, there cannot be any non-ASCII symbols in the string model representation. So you should simply rename your features before passing them into LightGBM.
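
One way to do such a renaming is sketched below with the standard `unicodedata` module. The function name `to_ascii_names` and the `feature_<index>` fallback are assumptions for illustration. Accented Latin letters are reduced to their base letter, while characters without an ASCII decomposition (Cyrillic, CJK, etc.) are dropped.

```python
import unicodedata


def to_ascii_names(feature_names):
    """Return ASCII-only, unique replacements for the given feature names."""
    result = []
    seen = set()
    for index, name in enumerate(feature_names):
        # Decompose accented letters (e.g. 'é' -> 'e' + combining accent),
        # then drop everything that is not ASCII.
        decomposed = unicodedata.normalize("NFKD", name)
        ascii_name = decomposed.encode("ascii", "ignore").decode("ascii").strip()
        # Fall back to a positional name when nothing usable is left.
        candidate = ascii_name or f"feature_{index}"
        # Append a numeric suffix until the name is unique.
        unique = candidate
        suffix = 0
        while unique in seen:
            suffix += 1
            unique = f"{candidate}_{suffix}"
        seen.add(unique)
        result.append(unique)
    return result


renamed = to_ascii_names(["café", "日本語", "age"])
```

This transformation is one-way, so keep the original names if they are needed later for reporting.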

@rajibrj43

Hello everyone, I get the error below. How can I solve it?
LightGBMError: Do not support non-ascii characters in feature name.
Below are my features. I cannot find any non-ASCII character in them.
| ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club Logo | Value | Wage | Special | Preferred Foot | International Reputation | Weak Foot | Skill Moves | Work Rate | Body Type | Real Face | Position | Jersey Number | Joined | Loaned From | Contract Valid Until | Height | Weight | LS | ST | RS | LW | LF | CF | RF | RW | LAM | CAM | RAM | LM | LCM | CM | RCM | RM | LWB | LDM | CDM | RDM | RWB | LB | LCB | CB | RCB | RB | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | FKAccuracy | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Reactions | Balance | ShotPower | Jumping | Stamina | Strength | LongShots | Aggression | Interceptions | Positioning | Vision | Penalties | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause

How can I solve this?

@StrikerRUS
Collaborator

@rajibrj43 Hi! Indeed, there are no non-ASCII symbols in your feature names. And I cannot reproduce your issue: LightGBM trains just fine with those feature names.

import numpy as np
import lightgbm as lgb

feature_names_from_comment = "| ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club Logo | Value | Wage | Special | Preferred Foot | International Reputation | Weak Foot | Skill Moves | Work Rate | Body Type | Real Face | Position | Jersey Number | Joined | Loaned From | Contract Valid Until | Height | Weight | LS | ST | RS | LW | LF | CF | RF | RW | LAM | CAM | RAM | LM | LCM | CM | RCM | RM | LWB | LDM | CDM | RDM | RWB | LB | LCB | CB | RCB | RB | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | FKAccuracy | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Reactions | Balance | ShotPower | Jumping | Stamina | Strength | LongShots | Aggression | Interceptions | Positioning | Vision | Penalties | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause"
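# Split the pasted header row on '|' and drop the empty leading piece.
feature_names = [i.strip() for i in feature_names_from_comment.split('|') if i]

# Random data is enough to check whether the names themselves are accepted.
X = np.random.random((100, len(feature_names)))
y = np.random.random((100,))

lgb.LGBMRegressor().fit(X, y, feature_name=feature_names)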

@henry0312
Contributor

Hello,

I noticed this issue recently, and I think the current behavior is not great.
However, I also agree with #2478 (comment) and #2226 (comment).

So for now, I have put together a workaround for Python.

import types

# `gbm` is a fitted instance of LGBMModel and
# `feature_names` is your own list of names.
gbm.booster_._feature_names = feature_names
gbm.booster_.feature_name = types.MethodType(lambda self: self._feature_names, gbm.booster_)
# NOTE: `pickle` can't dump a `lambda`, so use `dill` or `cloudpickle` instead.

In the future, I (or someone else) will rework the python-package to keep feature_names inside it (outside the C++ code).

@PointCloudNiphon

PointCloudNiphon commented Nov 18, 2019 via email

@mirekphd

mirekphd commented Jan 28, 2020

> @PointCloudNiphon Hi!
>
> For now, there cannot be any non-ASCII symbols in string model representation. So, you should simply rename your feature names before passing them into LightGBM.

@StrikerRUS: This bug is more frequent than you seem to realize. It arises not because the original database columns / feature names are non-ASCII, but from the use of a one-hot encoder (e.g. get_dummies() from pandas), which appends non-ASCII feature levels (variable values) to these ASCII column names. So after such encoding, even categorical feature levels cannot contain regional characters... and they usually do (outside of the US). Your closest competitor, XGBoost, does not impose such arbitrary and US-centric restrictions.
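
To see which one-hot columns trigger the error, a small check like the one below can help. The column names are hypothetical, shaped like one-hot output (`<column>_<level>`), and `non_ascii_names` is a helper made up for this sketch. `str.isascii()` requires Python 3.7 or later.

```python
def non_ascii_names(feature_names):
    """Return the names that contain at least one non-ASCII character."""
    return [name for name in feature_names if not name.isascii()]


# Hypothetical column names produced by one-hot encoding a 'city' column.
columns = ["age", "city_Warszawa", "city_Łódź", "city_Kraków"]
offending = non_ascii_names(columns)
```

Here the original column name `city` is plain ASCII; only the appended levels make two of the generated names non-ASCII.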

@mirekphd

mirekphd commented Jan 28, 2020

> @rajibrj43 Hi! Indeed, there are no non-ASCII symbols in your feature names. And I cannot reproduce your issue: LightGBM trains just fine with those feature names.

Because you used random numbers for all variables, including Name:) Try some non-ASCII characters with one-hot encoder... @StrikerRUS, once you replicate, could you please /reopen and repair?

@StrikerRUS
Collaborator

@mirekphd I see, get_dummies() adds some headache and requires one more manual step for renaming column names. But please note that LightGBM doesn't require one-hot encoding for categorical variables and normally you won't use that function during a preprocessing phase: https://lightgbm.readthedocs.io/en/latest/Quick-Start.html#categorical-feature-support.
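
A sketch of the alternative to one-hot encoding: keep one column per categorical variable and encode its levels as integer codes, so no level text ever ends up in a feature name. The helper `encode_categories` and the sample data are hypothetical. Passing the resulting column to LightGBM through the `categorical_feature` argument is described in the linked documentation and is not shown here.

```python
def encode_categories(values):
    """Map each distinct level to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical 'city' column with non-ASCII levels.
city = ["Łódź", "Kraków", "Łódź", "Gdańsk"]
city_codes, city_levels = encode_categories(city)
# The column keeps its ASCII name 'city'; only its values become integer codes.
```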

Speaking about reopening, please refer to #2478 (comment).

henry0312 added a commit to henry0312/LightGBM that referenced this issue Apr 6, 2020
This commit reverts 0d59859.
Also see:
- microsoft#2226
- microsoft#2478
- microsoft#2229

I reproduced the issue and, as @kidotaka gave us a great survey in microsoft#2226,
I conclude that the cause is not UTF-8 but "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (microsoft#2229)", whose commit hash
is 0d59859, and add support for feature names as UTF-8 again.
henry0312 added a commit that referenced this issue Apr 10, 2020
* Support UTF-8 characters in feature name again

This commit reverts 0d59859.
Also see:
- #2226
- #2478
- #2229

I reproduced the issue and, as @kidotaka gave us a great survey in #2226,
I conclude that the cause is not UTF-8 but "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (#2229)", whose commit hash
is 0d59859, and add support for feature names as UTF-8 again.

* add tests

* fix check-docs tests

* update

* fix tests

* update .travis.yml

* fix tests

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* add a test for R-package

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* fix test for R-package

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* update

* update

* update

* remove unneeded comments
@StrikerRUS
Collaborator

@henry0312 Can we mark this issue as resolved via #2976, or is it better to wait for @jameslamb's PR for the R part?

@henry0312
Contributor

I believe that we can, but it's better for us to wait for the R tests to pass because I'm not an expert in R.

@jameslamb
Collaborator

I already created #2983 to capture the R-specific work @henry0312

@StrikerRUS
Collaborator

Thank you @jameslamb ! Sorry, I didn't notice it. Then I'm putting a tick in our feature requests hub for this issue, because model file supports UTF-8 after #2976. R-specific progress will be tracked in your separate issue.

@Shubhammishra-21

# Imports assumed by the snippet.
from math import sqrt

import lightgbm as lgb
import numpy as np
from sklearn.metrics import mean_squared_log_error

# X_train, y_train, X_cv and y_cv are the poster's own data (not shown).
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_cv, label=y_cv)

param = {
    'objective': 'regression',
    'boosting': 'gbdt',
    'metric': 'l2_root',
    'learning_rate': 0.05,
    'num_iterations': 350,
    'num_leaves': 31,
    'max_depth': -1,
    'min_data_in_leaf': 15,
    'bagging_fraction': 0.85,
    'bagging_freq': 1,
    'feature_fraction': 0.55,
}

lgbm = lgb.train(params=param,
                 verbose_eval=50,
                 train_set=train_data,
                 valid_sets=[test_data])

y_pred_lgbm = lgbm.predict(X_cv)
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_lgbm))))

I'm getting this error:
LightGBMError: Do not support non-ASCII characters in feature name.

@guolinke
Collaborator

@Shubhammishra-21 please use the latest master branch.

@mercedesmedaly

Recode the variable names that contain tildes or other symbols that are not allowed in American English.

@franktoffel

I got the error "LightGBMError: Do not support special JSON characters in feature name." when using LightGBM 3.0 on a Windows 10 machine. It seems that issues with special characters were fixed in this release, but perhaps not on Windows?

@antoinecomp

Do not support special JSON characters in feature name: how to use LGBM in columns with Russian names?

I have a train part of a dataframe with Russian names:

shop__56 | sub_type_Сумки, Альбомы, Коврики д/мыши | shop__46 | sub_type_Для дома и офиса (Цифра) | shop__49 | shop__58 | shop__37 | sub_type_Служебные | sub_type_CD локального производства | shop__22 | ... | shop__48 | sub_type_Артбуки, энциклопедии | sub_type_Подарочные издания | sub_type_PSN | shop_id | sub_type_CD фирменного производства | sub_type_DVD | sub_type_PSVita | sub_type_Комиксы, манга | sub_type_Дополнительные издания
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 0 | 0

And when I apply LightGBM to it, it raises the same issue:

---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
<ipython-input-125-711fbc08b2b9> in <module>
     11                            min_split_gain=0.0222415,
     12                            min_child_weight=40)
---> 13 model_lgb.fit(X_train, y_train)

/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    777                                        verbose=verbose, feature_name=feature_name,
    778                                        categorical_feature=categorical_feature,
--> 779                                        callbacks=callbacks, init_model=init_model)
    780         return self
    781 

/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    615                               evals_result=evals_result, fobj=self._fobj, feval=eval_metrics_callable,
    616                               verbose_eval=verbose, feature_name=feature_name,
--> 617                               callbacks=callbacks, init_model=init_model)
    618 
    619         if evals_result:

/opt/conda/lib/python3.7/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    229     # construct booster
    230     try:
--> 231         booster = Booster(params=params, train_set=train_set)
    232         if is_valid_contain_train:
    233             booster.set_train_data_name(train_data_name)

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in __init__(self, params, train_set, model_file, model_str, silent)
   2051                     break
   2052             # construct booster object
-> 2053             train_set.construct()
   2054             # copy the parameters from train_set
   2055             params.update(train_set.get_params())

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in construct(self)
   1323                                 init_score=self.init_score, predictor=self._predictor,
   1324                                 silent=self.silent, feature_name=self.feature_name,
-> 1325                                 categorical_feature=self.categorical_feature, params=self.params)
   1326             if self.free_raw_data:
   1327                 self.data = None

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
   1149             raise TypeError('Wrong predictor type {}'.format(type(predictor).__name__))
   1150         # set feature names
-> 1151         return self.set_feature_name(feature_name)
   1152 
   1153     def __init_from_np2d(self, mat, params_str, ref_dataset):

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in set_feature_name(self, feature_name)
   1630                 self.handle,
   1631                 c_array(ctypes.c_char_p, c_feature_name),
-> 1632                 ctypes.c_int(len(feature_name))))
   1633         return self
   1634 

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in _safe_call(ret)
     53     """
     54     if ret != 0:
---> 55         raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
     56 
     57 

LightGBMError: Do not support special JSON characters in feature name.

So how can I use LightGBM with this kind of feature name? Do I need to transform them into English column names?

@StrikerRUS
Collaborator

Hi @antoinecomp !

> Do not support special JSON characters in feature name

This error is not about Russian feature names. LightGBM handles them fine now. This issue is about JSON special chars, i.e.

if (char_code == 34 // "
|| char_code == 44 // ,
|| char_code == 58 // :
|| char_code == 91 // [
|| char_code == 93 // ]
|| char_code == 123 // {
|| char_code == 125 // }

You should just remove commas (,) (and any other special characters listed above, if any) from your feature names.
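
A minimal sketch of such a cleanup in Python. It replaces the seven characters listed in the check above with an underscore. The function name `sanitize_feature_name` and the choice of `_` as the replacement are assumptions made for this example.

```python
# Characters named in the check quoted above: " , : [ ] { }
JSON_SPECIAL_CHARS = '",:[]{}'


def sanitize_feature_name(name, replacement="_"):
    """Replace each JSON special character in a feature name."""
    table = str.maketrans({ch: replacement for ch in JSON_SPECIAL_CHARS})
    return name.translate(table)


names = ["shop__56", "sub_type_Сумки, Альбомы, Коврики д/мыши"]
clean = [sanitize_feature_name(n) for n in names]
```

With a pandas DataFrame `df` (hypothetical here), the same function could be applied as `df.columns = [sanitize_feature_name(c) for c in df.columns]`. Two different names can collapse into the same cleaned name, so it is worth checking that the result is still unique.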
