[feature requests] support utf-8 characters in feature name #2478

OnlyFor · 2019-09-30T02:09:11Z

Do not support non-ascii characters in feature name?
Could you please consider backward compatibility?

I use xgboost and catboost and sklearn at the same time, only lightgbm has encoding compatibility problems...

thx

guolinke · 2019-09-30T02:30:28Z

I guess feature names in xgb/cat are maintained on the Python side, so UTF-8 encoding is easy there. But that requires a different implementation in each language package, and a different model save/load solution.

In LightGBM, feature names are maintained on the C++ side and saved in the model file, which makes UTF-8 hard.

If we want to support UTF-8 feature names, the model save/load logic may have to change, which could cause more backward compatibility problems.

guolinke · 2019-09-30T02:41:32Z

A workaround is to save an additional file for the feature names and force its name to be <model_file_name>+".fn". That file could be encoded as UTF-8 and loaded automatically by Python/R itself when loading the model file, not by C++.
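
A minimal sketch of that sidecar-file idea in Python, assuming the names are stored one per line. The helpers `save_feature_names` and `load_feature_names` are hypothetical and not part of LightGBM; the model path is only used to derive the ".fn" file name.

```python
import os
import tempfile


def save_feature_names(model_path, feature_names):
    """Write feature names to '<model_path>.fn', one per line, as UTF-8."""
    with open(model_path + ".fn", "w", encoding="utf-8", newline="\n") as f:
        for name in feature_names:
            if "\n" in name:
                raise ValueError("feature names must not contain newlines")
            f.write(name + "\n")


def load_feature_names(model_path):
    """Read the sidecar file back; return None if it does not exist."""
    path = model_path + ".fn"
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Round trip in a temporary directory (no real model file is needed here).
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.txt")
    names = ["shop_id", "sub_type_Служебные", "Größe"]
    save_feature_names(model_path, names)
    restored = load_feature_names(model_path)
    missing = load_feature_names(os.path.join(tmp, "other_model.txt"))
```

Returning None for a missing sidecar file lets a loader fall back to whatever names the model file itself carries.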

OnlyFor · 2019-09-30T03:17:39Z

Could LightGBM automatically replace and restore UTF-8 feature names on the Python/R side, before and after the C++ part? Or maintain the feature name transformation in Python/R and support UTF-8 indirectly? The transformation dict could also be written to the model file.
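
A rough sketch of the replace-and-restore idea, assuming positional ASCII placeholders are acceptable to the C++ side. The function names and the `Column_<i>` placeholder pattern are illustrative choices, not an existing LightGBM API.

```python
def to_ascii_placeholders(feature_names):
    """Return (placeholders, mapping); placeholders contain ASCII only."""
    placeholders = [f"Column_{i}" for i in range(len(feature_names))]
    mapping = dict(zip(placeholders, feature_names))
    return placeholders, mapping


def restore_names(placeholder_names, mapping):
    """Map placeholders back to the original names; unknown names pass through."""
    return [mapping.get(name, name) for name in placeholder_names]


original = ["shop_id", "sub_type_Служебные", "城市"]
ascii_names, mapping = to_ascii_placeholders(original)
# ascii_names would go to the C++ part; mapping stays on the Python/R side.
restored = restore_names(ascii_names, mapping)
```

The mapping would have to be persisted next to the model (or inside it) so that the original names survive a save/load cycle.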

guolinke · 2019-09-30T03:51:37Z

@OnlyFor I am not familiar with string encoding, but I don't think that is trivial.
Maybe we can use something like base64 to encode and decode the feature names.
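
For illustration, a small sketch of what base64-encoding feature names could look like, using only the Python standard library. The helper names are hypothetical; nothing here reflects how LightGBM actually stores names.

```python
import base64


def encode_name(name):
    """Encode a feature name as ASCII-only URL-safe base64 text."""
    return base64.urlsafe_b64encode(name.encode("utf-8")).decode("ascii")


def decode_name(token):
    """Invert encode_name and recover the original UTF-8 feature name."""
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")


name = "sub_type_Служебные"
token = encode_name(name)
round_trip = decode_name(token)
plain = encode_name("Age")
```

Every name is transformed, not only the non-ASCII ones: the plain name `Age` is stored as `QWdl`.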

StrikerRUS · 2019-10-02T11:25:16Z

@guolinke Encoding feature names will hurt the model file readability for humans, I guess.

StrikerRUS · 2019-10-09T14:16:07Z

@guolinke Should this issue be included in #2302?

StrikerRUS · 2019-10-13T02:09:03Z

dmlc/xgboost#4937 (comment).

StrikerRUS · 2019-11-06T22:47:11Z

@guolinke WDYT #2478 (comment)?

StrikerRUS · 2019-11-11T15:14:17Z

Closed in favor of being in #2302. We decided to keep all feature requests in one place.

Welcome to contribute this feature! Please re-open this issue (or post a comment if you are not a topic starter) if you are actively working on implementing this feature.

PointCloudNiphon · 2019-11-14T12:27:48Z

Hello everyone, I get the error below. How can I solve it?
LightGBMError: Do not support non-ascii characters in feature name.

StrikerRUS · 2019-11-14T14:29:50Z

@PointCloudNiphon Hi!

For now, there cannot be any non-ASCII symbols in string model representation. So, you should simply rename your feature names before passing them into LightGBM.
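
One way to find which names need renaming is to scan them for characters outside the ASCII range first. This is a plain-Python sketch; `find_non_ascii` is a hypothetical helper, and the sample names are made up.

```python
def find_non_ascii(feature_names):
    """Return {name: [offending characters]} for names with non-ASCII chars."""
    report = {}
    for name in feature_names:
        bad = [ch for ch in name if ord(ch) > 127]
        if bad:
            report[name] = bad
    return report


names = ["Age", "Club Logo", "Größe", "città"]
report = find_non_ascii(names)
for name, chars in report.items():
    print(f"{name!r} contains non-ASCII characters: {chars}")
```

Names that do not appear in the report can be passed through unchanged.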

rajibrj43 · 2019-11-15T15:38:57Z

How can I solve this?

StrikerRUS · 2019-11-15T18:44:56Z

@rajibrj43 Hi! Indeed, there aren't any non-ASCII symbols in your feature names, and I cannot reproduce your issue: LightGBM trains just fine with those feature names.

import numpy as np
import lightgbm as lgb

feature_names_from_comment = "| ID | Name | Age | Photo | Nationality | Flag | Overall | Potential | Club | Club Logo | Value | Wage | Special | Preferred Foot | International Reputation | Weak Foot | Skill Moves | Work Rate | Body Type | Real Face | Position | Jersey Number | Joined | Loaned From | Contract Valid Until | Height | Weight | LS | ST | RS | LW | LF | CF | RF | RW | LAM | CAM | RAM | LM | LCM | CM | RCM | RM | LWB | LDM | CDM | RDM | RWB | LB | LCB | CB | RCB | RB | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | FKAccuracy | LongPassing | BallControl | Acceleration | SprintSpeed | Agility | Reactions | Balance | ShotPower | Jumping | Stamina | Strength | LongShots | Aggression | Interceptions | Positioning | Vision | Penalties | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes | Release Clause"
feature_names = [i.strip() for i in feature_names_from_comment.split('|') if i]

X = np.random.random((100, len(feature_names)))
y = np.random.random((100,))

lgb.LGBMRegressor().fit(X, y, feature_name=feature_names)

henry0312 · 2019-11-18T07:09:52Z

Hello,

I noticed this issue recently, and I don't think the current behavior is great;
however, I also agree with #2478 (comment) and #2226 (comment).

So for now, I have created a workaround for Python.

import types

# gbm is an instance of LGBMModel.
# you have feature_names
gbm.booster_._feature_names = feature_names
gbm.booster_.feature_name = types.MethodType(lambda self: self._feature_names, gbm.booster_)
# NOTICE: `pickle` can't dump `lambda`, so you can use `dill` or `cloudpickle`

In the future, I (or someone) will remake python-package to include feature_names in it (outside cpp).
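
To show the mechanism of that workaround in isolation, here is a sketch on a stand-in object. `FakeBooster` is a hypothetical class used only so the example runs without a trained model; it is not a LightGBM type.

```python
import types


class FakeBooster:
    """Hypothetical stand-in whose feature_name() returns ASCII defaults."""

    def feature_name(self):
        return ["Column_0", "Column_1"]


booster = FakeBooster()
booster._feature_names = ["年齢", "身長"]
# Bind a replacement method to this one instance; the class is left untouched.
booster.feature_name = types.MethodType(
    lambda self: self._feature_names, booster
)

other = FakeBooster()
patched = booster.feature_name()
unpatched = other.feature_name()
```

Because the override lives on the instance, other objects of the same class keep the original behavior.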

PointCloudNiphon · 2019-11-18T15:37:07Z

OK, I will have a try, thanks.

mirekphd · 2020-01-28T21:26:22Z

> @PointCloudNiphon Hi!
>
> For now, there cannot be any non-ASCII symbols in string model representation. So, you should simply rename your feature names before passing them into LightGBM.

@StrikerRUS: This bug is more frequent than you seem to realize. It arises not because the original database columns / feature names are non-ASCII, but from the use of a one-hot encoder (e.g. get_dummies() from pandas), which appends non-ASCII feature levels (variable values) to these ASCII column names. So after such encoding, even categorical feature levels cannot contain regional characters... and they usually do (outside of the US). Your closest competitor, XGBoost, does not impose such arbitrary and US-centric restrictions.
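
A plain-Python emulation of how this happens, assuming the default one-hot naming scheme of column name, underscore, level. It does not call pandas; `dummy_column_names` and the sample data are hypothetical.

```python
def dummy_column_names(column, levels, prefix_sep="_"):
    """Emulate one-hot column naming: '<column><sep><level>' per distinct level."""
    return [f"{column}{prefix_sep}{level}" for level in sorted(set(levels))]


# The column name "city" is pure ASCII, but its values are not.
levels = ["Kraków", "Łódź", "Poznań", "Kraków"]
names = dummy_column_names("city", levels)
non_ascii = [n for n in names if not n.isascii()]
```

Here every generated column name ends up non-ASCII even though the original column name was not.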

mirekphd · 2020-01-28T21:31:48Z

> @rajibrj43 Hi! Indeed, there aren't any non-ASCII symbols in your feature names, and I cannot reproduce your issue: LightGBM trains just fine with those feature names.

Because you used random numbers for all variables, including Name:) Try some non-ASCII characters with one-hot encoder... @StrikerRUS, once you replicate, could you please /reopen and repair?

StrikerRUS · 2020-01-29T18:30:56Z

@mirekphd I see, get_dummies() adds some headache and requires one more manual step for renaming column names. But please note that LightGBM doesn't require one-hot encoding for categorical variables and normally you won't use that function during a preprocessing phase: https://lightgbm.readthedocs.io/en/latest/Quick-Start.html#categorical-feature-support.

Speaking about reopening, please refer to #2478 (comment).

@kidotaka

This commit reverts 0d59859. Also see microsoft#2226, microsoft#2478 and microsoft#2229. I reproduced the issue and, following the great survey @kidotaka gave us in microsoft#2226, I conclude that the cause is not UTF-8 but "an empty string (character)". Therefore, I revert "throw error when meet non ascii (microsoft#2229)", whose commit hash is 0d59859, and add support for UTF-8 feature names again.

@kidotaka

Support UTF-8 characters in feature name again

This commit reverts 0d59859. Also see #2226, #2478 and #2229. I reproduced the issue and, following the great survey @kidotaka gave us in #2226, I conclude that the cause is not UTF-8 but "an empty string (character)". Therefore, I revert "throw error when meet non ascii (#2229)", whose commit hash is 0d59859, and add support for UTF-8 feature names again.

* add tests
* fix check-docs tests
* update
* fix tests
* update .travis.yml
* fix tests
* update test_r_package.sh
* update test_r_package.sh
* update test_r_package.sh
* add a test for R-package
* update test_r_package.sh
* update test_r_package.sh
* update test_r_package.sh
* fix test for R-package
* update test_r_package.sh
* update test_r_package.sh
* update test_r_package.sh
* update test_r_package.sh
* update
* updte
* update
* remove unneeded comments

StrikerRUS · 2020-04-10T14:16:29Z

@henry0312 Can we mark this issue as resolved via #2976, or is it better to wait for @jameslamb's PR for the R part?

henry0312 · 2020-04-10T14:56:39Z

I believe we can, but it's better to wait until the R tests pass, because I'm not an expert in R.

jameslamb · 2020-04-10T21:31:01Z

I already created #2983 to capture the R-specific work @henry0312

StrikerRUS · 2020-04-10T22:20:37Z

Thank you @jameslamb! Sorry, I didn't notice it. Then I'm putting a tick in our feature requests hub for this issue, because the model file supports UTF-8 after #2976. R-specific progress will be tracked in your separate issue.

Shubhammishra-21 · 2020-04-27T07:52:33Z

from math import sqrt

import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_squared_log_error

# X_train, y_train, X_cv and y_cv are the poster's own data (not shown).
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_cv, label=y_cv)

param = {'objective': 'regression',
         'boosting': 'gbdt',
         'metric': 'l2_root',
         'learning_rate': 0.05,
         'num_iterations': 350,
         'num_leaves': 31,
         'max_depth': -1,
         'min_data_in_leaf': 15,
         'bagging_fraction': 0.85,
         'bagging_freq': 1,
         'feature_fraction': 0.55
         }

# Train and report the validation metric every 50 iterations.
lgbm = lgb.train(params=param,
                 verbose_eval=50,
                 train_set=train_data,
                 valid_sets=[test_data])

y_pred_lgbm = lgbm.predict(X_cv)
print('RMSLE:', sqrt(mean_squared_log_error(np.exp(y_cv), np.exp(y_pred_lgbm))))

I'm getting this error:
LightGBMError: Do not support non-ASCII characters in feature name.

guolinke · 2020-04-27T10:18:29Z

@Shubhammishra-21 please use the latest master branch.

mercedesmedaly · 2020-07-19T07:40:56Z

Recode the variable names that contain tildes, accents and other symbols that are not used in American English.

franktoffel · 2020-09-19T15:58:30Z

I got the error "LightGBMError: Do not support special JSON characters in feature name." when using LightGBM 3.0 on a Windows 10 machine. It seems that issues with special characters were fixed in this release, but perhaps not on Windows?

antoinecomp · 2021-03-11T11:41:07Z

Do not support special JSON characters in feature name: how to use LGBM in columns with Russian names?

I have a train part of a dataframe with Russian names:

shop__56 | sub_type_Сумки, Альбомы, Коврики д/мыши | shop__46 | sub_type_Для дома и офиса (Цифра) | shop__49 | shop__58 | shop__37 | sub_type_Служебные | sub_type_CD локального производства | shop__22 | ... | shop__48 | sub_type_Артбуки, энциклопедии | sub_type_Подарочные издания | sub_type_PSN | shop_id | sub_type_CD фирменного производства | sub_type_DVD | sub_type_PSVita | sub_type_Комиксы, манга | sub_type_Дополнительные издания
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | 0
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 0 | 0

And when I apply LightGBM to it, it raises the same issue:

---------------------------------------------------------------------------
LightGBMError                             Traceback (most recent call last)
<ipython-input-125-711fbc08b2b9> in <module>
     11                            min_split_gain=0.0222415,
     12                            min_child_weight=40)
---> 13 model_lgb.fit(X_train, y_train)

/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    777                                        verbose=verbose, feature_name=feature_name,
    778                                        categorical_feature=categorical_feature,
--> 779                                        callbacks=callbacks, init_model=init_model)
    780         return self
    781 

/opt/conda/lib/python3.7/site-packages/lightgbm/sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
    615                               evals_result=evals_result, fobj=self._fobj, feval=eval_metrics_callable,
    616                               verbose_eval=verbose, feature_name=feature_name,
--> 617                               callbacks=callbacks, init_model=init_model)
    618 
    619         if evals_result:

/opt/conda/lib/python3.7/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    229     # construct booster
    230     try:
--> 231         booster = Booster(params=params, train_set=train_set)
    232         if is_valid_contain_train:
    233             booster.set_train_data_name(train_data_name)

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in __init__(self, params, train_set, model_file, model_str, silent)
   2051                     break
   2052             # construct booster object
-> 2053             train_set.construct()
   2054             # copy the parameters from train_set
   2055             params.update(train_set.get_params())

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in construct(self)
   1323                                 init_score=self.init_score, predictor=self._predictor,
   1324                                 silent=self.silent, feature_name=self.feature_name,
-> 1325                                 categorical_feature=self.categorical_feature, params=self.params)
   1326             if self.free_raw_data:
   1327                 self.data = None

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in _lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
   1149             raise TypeError('Wrong predictor type {}'.format(type(predictor).__name__))
   1150         # set feature names
-> 1151         return self.set_feature_name(feature_name)
   1152 
   1153     def __init_from_np2d(self, mat, params_str, ref_dataset):

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in set_feature_name(self, feature_name)
   1630                 self.handle,
   1631                 c_array(ctypes.c_char_p, c_feature_name),
-> 1632                 ctypes.c_int(len(feature_name))))
   1633         return self
   1634 

/opt/conda/lib/python3.7/site-packages/lightgbm/basic.py in _safe_call(ret)
     53     """
     54     if ret != 0:
---> 55         raise LightGBMError(decode_string(_LIB.LGBM_GetLastError()))
     56 
     57 

LightGBMError: Do not support special JSON characters in feature name.

So how can I use LightGBM on this kind of features? Do I need to transform them to English column names?

StrikerRUS · 2021-03-12T00:21:43Z

Hi @antoinecomp !

> Do not support special JSON characters in feature name

This error is not about Russian feature names. LightGBM handles them fine now. This issue is about JSON special chars, i.e.

LightGBM/include/LightGBM/utils/common.h, lines 848 to 854 in 792c930:

    if (char_code == 34      // "
        || char_code == 44   // ,
        || char_code == 58   // :
        || char_code == 91   // [
        || char_code == 93   // ]
        || char_code == 123  // {
        || char_code == 125  // }

You should just remove the commas (,) (and any other special characters listed above, if present) from your feature names.
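
A minimal sketch of such a cleanup step, assuming that replacing each of the seven characters listed above with an underscore is acceptable for your data. `sanitize_feature_names` is a hypothetical helper, not a LightGBM function.

```python
import re

# The seven characters from the snippet above: " , : [ ] { }
_JSON_SPECIAL = re.compile(r'[",:\[\]{}]')


def sanitize_feature_names(feature_names, replacement="_"):
    """Replace JSON special characters; fail loudly if names collide."""
    cleaned = [_JSON_SPECIAL.sub(replacement, name) for name in feature_names]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("sanitizing produced duplicate feature names")
    return cleaned


names = ["shop__56", "sub_type_Сумки, Альбомы, Коврики д/мыши", "sub_type_PSN"]
cleaned = sanitize_feature_names(names)
```

With a pandas DataFrame, the same helper could presumably be applied as `X_train.columns = sanitize_feature_names(X_train.columns)` before calling `fit`. Cyrillic characters are left untouched; only the JSON special characters change.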

StrikerRUS changed the title from "[python][feature requests] support utf-8 characters in feature name" to "[feature requests] support utf-8 characters in feature name" on Sep 30, 2019

StrikerRUS added enhancement feature request labels Sep 30, 2019

guolinke mentioned this issue Nov 11, 2019

Feature Requests & Voting Hub #2302

Open

StrikerRUS closed this as completed Nov 11, 2019

henry0312 mentioned this issue Apr 6, 2020

Support UTF-8 characters in feature name again #2976

Merged

jameslamb mentioned this issue Nov 21, 2023

Lift restrictions on feature names ("LightGBMError: Do not support special JSON characters in feature name") #6202

Closed
