
Wrong size of feature_names #2226

Closed
jacquespeeters opened this issue Jun 12, 2019 · 11 comments · Fixed by #2229

Comments

@jacquespeeters

Environment info

Operating System:

No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 9.5 (stretch)
Release:	9.5
Codename:	stretch

CPU/GPU model:

C++/Python/R version:

Python 3.6.6 (default, Jul 27 2018, 16:21:42) 
[GCC 6.3.0 20170516] on linux

LightGBM version or commit hash:
lightgbm==2.2.3

Error message

Reproducible examples

It is difficult to make it reproducible :)
I've checked various related issues, such as #379 (which led to merge #426) and #540.

Please find below the code I used:

    lgb_train = lgb.Dataset(X_train, label=y_train, categorical_feature=categorical_feature)
    lgb_valid = lgb.Dataset(X_valid, label=y_valid, categorical_feature=categorical_feature)

    param = {
        # 'objective': 'binary',
        'objective': 'xentropy',
        # 'metric': 'auc',
        'random_state': 1,
        "verbosity": -1,
        'learning_rate': 0.05,
        'num_threads': 16,
    }

    X_cols
    print('len(X_cols)', len(X_cols))
    print('len(X_cols)', len(list(set(X_cols))))
    print('len set with replace space with underscore ', len(set([c.replace(' ', '_') for c in X_cols])))
    X_train
    model_gbm = lgb.train(params=param,
                          train_set=lgb_train,
                          num_boost_round=4000,
                          valid_sets=[lgb_train, lgb_valid],
                          early_stopping_rounds=20,
                          feature_name=X_cols,
                          verbose_eval=20,
                          )

Which leads to this:

[screenshot: LightGBMError "Wrong size of feature_names"]

I have the feeling that the problem is linked to #426, but I'm unsure how to solve it.

I'm currently commenting out feature_name=X_cols to avoid the bug in my pipeline.

@guolinke
Collaborator

Could you provide the content of X_cols?

@jacquespeeters
Author

X_cols.txt
Sorry, here it is.

@guolinke
Collaborator

It seems there are some non-ASCII characters in the feature names; I think they cause this failure.
We only accept ASCII for most strings in LightGBM.
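For reference, a quick way to check which entries of X_cols contain non-ASCII characters (a standalone snippet that only assumes X_cols is the list of column names discussed above):

    # List the feature names with at least one character outside the ASCII range.
    non_ascii_names = [c for c in X_cols if any(ord(ch) > 127 for ch in c)]
    print(non_ascii_names)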

@jacquespeeters
Author

Thank you for your answer. I'll have a look whenever possible.
Should I close the issue or not (e.g. if you want to improve the error message)?

@guolinke
Collaborator

Maybe we can do this on the Python side.
@StrikerRUS could we throw an exception when users pass non-ASCII characters in feature_names?
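A minimal sketch of what such a Python-side guard could look like; the helper name _check_feature_names_ascii is hypothetical and is not LightGBM's actual API (#2229 is the change that actually implemented the error):

    def _check_feature_names_ascii(feature_names):
        """Illustrative sketch: raise if any feature name contains a non-ASCII character."""
        for name in feature_names:
            if any(ord(ch) > 127 for ch in name):
                raise ValueError("Feature name contains non-ASCII characters: {!r}".format(name))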

@StrikerRUS
Collaborator

@guolinke I think yes, we can. But why not do this in a more general way on the cpp side? And I guess it's not only the feature_names field that is affected.

@guolinke
Collaborator

@StrikerRUS It is not very straightforward to support non-ASCII characters (e.g. UTF-8) on the cpp side.
It would also change the encoding of the model file, since the feature names are saved to the model file.
That may hurt backward compatibility.

@StrikerRUS
Collaborator

@guolinke I didn't mean supporting UTF-8, which would require changing the model file encoding, but raising an error when a non-ASCII symbol is encountered. Is that possible on the cpp side?

@StrikerRUS
Collaborator

@guolinke Something like this: https://stackoverflow.com/questions/48212992/how-to-find-out-if-there-is-any-non-ascii-character-in-a-string-with-a-file-path.

@kidotaka

Is it really an encoding problem?
I have not seen the error with a lot of Japanese Kanji.
I inspected this issue after the #2229 merge because I want to use non-ASCII characters.

The X_cols.txt attached above contains 2 non-ASCII characters:
codepoint=224 (U+00E0) name=LATIN SMALL LETTER A WITH GRAVE
codepoint=233 (U+00E9) name=LATIN SMALL LETTER E WITH ACUTE
These characters are used in French and Italian.

The c_str function in basic.py uses UTF-8 encoding, so these characters are safely converted to UTF-8-encoded bytes.

    def c_str(string):
        """Convert a Python string to C string."""
        return ctypes.c_char_p(string.encode('utf-8'))

The model_to_string function in basic.py calls decode with no argument.
The type of string_buffer.value is bytes, so the default UTF-8 is used to decode.

    ret = string_buffer.value.decode()
    ret += _dump_pandas_categorical(self.pandas_categorical)
    return ret
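As a small illustration of that round trip (the feature name here is hypothetical), a string containing U+00E9 survives encode('utf-8') followed by the default decode() unchanged, which supports the point that the encoding itself does not break the names:

    name = 'prix_élevé'           # hypothetical feature name containing U+00E9
    raw = name.encode('utf-8')    # what c_str() hands to the C API
    print(raw.decode() == name)   # True: decode() defaults to UTF-8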

How about model file encoding?
The binary option is used in gbdt_model_text.cpp.

    bool GBDT::SaveModelToFile(int start_iteration, int num_iteration, const char* filename) const {
      /*! \brief File to write models */
      std::ofstream output_file;
      output_file.open(filename, std::ios::out | std::ios::binary);

I think it's not an encoding problem.
Actually, I can use the above 2 characters with Python 3.6.6 and lightgbm 2.2.3 in a conda env on Windows.

I suspect a blank feature name.
X_cols.txt has a blank line at line 1726.
A blank feature name easily causes LightGBMError: "Wrong size of feature names."
This is a kind of split problem (see the small sketch below); my full reproduction script follows it.
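A small self-contained sketch of that kind of split problem (illustrative only, not LightGBM's actual code path): when names are joined with a space separator and later split back, an empty name silently disappears and the counts no longer match:

    names = ['col_à', 'col_é', '', 'col_3']   # one blank feature name
    joined = ' '.join(names)                  # the blank name becomes two adjacent spaces
    recovered = joined.split()                # split() collapses whitespace, dropping the blank
    print(len(names), len(recovered))         # 4 3 -> size mismatch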

import lightgbm as lgb
import pandas as pd
import codecs
import numpy as np
import unicodedata

def main():
    X_cols = create_X_cols()
    report_non_ascii(X_cols)
    rows = 1
    X_train = pd.DataFrame(data=np.zeros((rows,len(X_cols))), columns=X_cols)
    y_train = pd.DataFrame(data=np.ones((rows,1)), columns=['label'])

    lgb_train = lgb.Dataset(X_train, label=y_train)

    param = {
        'objective': 'xentropy',
        'random_state' : 1,
        'verbosity': -1,
        'learning_rate': 0.05,
        'num_threads': 4,
    }

    model = lgb.train(params=param,
                    train_set=lgb_train,
                    num_boost_round=10,
                    feature_name=X_cols,
                    verbose_eval=20)
    print(len(model.feature_name()))
    model.save_model("model.txt")
    
def create_X_cols():
    with codecs.open('X_cols.txt', mode='r', encoding='utf-8') as f:
        lines = f.readlines()
    print("length={}".format(len(lines)))
    
    print("LF contains? {}".format(any(map(lambda x: "\n" in x, lines))))
    LF_removed = list(map(lambda x: x.replace("\n",""), lines))
    print("LF contains? {}".format(any(map(lambda x: "\n" in x, LF_removed))))
    print("space contains? {}".format(any(map(lambda x: " " in x, LF_removed))))
    no_duplicate = list(set(LF_removed))
    print("blank line contains? {}".format(any(map(lambda x: "" == x, no_duplicate))))
    no_blank_line = [c for c in  no_duplicate if c != '']
    print("blank line contains? {}".format(any(map(lambda x: "" == x, no_blank_line))))
    print("length={}".format(len(no_blank_line)))
    return sorted(no_blank_line)

def report_non_ascii(columns):
    non_ascii_codepoints = sorted(set([ord(c) for col in columns for c in col if ord(c) > 127]))
    print("non ascii codepoints={}".format(non_ascii_codepoints))
    for codepoint in non_ascii_codepoints:
        print("codepoint={} \tname={}".format(codepoint, unicodedata.name(chr(codepoint))))

if __name__ == '__main__':
    main()


henry0312 added a commit to henry0312/LightGBM that referenced this issue Apr 6, 2020
This commit reverts 0d59859.
Also see:
- microsoft#2226
- microsoft#2478
- microsoft#2229

I reproduced the issue and, as @kidotaka's thorough survey in microsoft#2226 shows,
I don't conclude that the cause is UTF-8, but rather "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (microsoft#2229)", whose commit hash
is 0d59859, and add support for feature names as UTF-8 again.
henry0312 added a commit that referenced this issue Apr 10, 2020
* Support UTF-8 characters in feature name again

This commit reverts 0d59859.
Also see:
- #2226
- #2478
- #2229

I reproduced the issue and, as @kidotaka's thorough survey in #2226 shows,
I don't conclude that the cause is UTF-8, but rather "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (#2229)", whose commit hash
is 0d59859, and add support for feature names as UTF-8 again.

* add tests

* fix check-docs tests

* update

* fix tests

* update .travis.yml

* fix tests

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* add a test for R-package

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* fix test for R-package

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* update

* update

* update

* remove unneeded comments