
Wrong size of feature_names #2226

Closed
jacquespeeters opened this issue Jun 12, 2019 · 11 comments · Fixed by #2229

Comments

@jacquespeeters

Environment info

Operating System:

No LSB modules are available.
Distributor ID:	Debian
Description:	Debian GNU/Linux 9.5 (stretch)
Release:	9.5
Codename:	stretch

CPU/GPU model:

C++/Python/R version:

Python 3.6.6 (default, Jul 27 2018, 16:21:42) 
[GCC 6.3.0 20170516] on linux

LightGBM version or commit hash:
lightgbm==2.2.3

Error message

Reproducible examples

It is difficult to make it reproducible :)
I've checked various related issues, such as #379 (which led to merge #426) and #540.

Please find below the code I used:

    lgb_train = lgb.Dataset(X_train, label=y_train, categorical_feature=categorical_feature)
    lgb_valid = lgb.Dataset(X_valid, label=y_valid, categorical_feature=categorical_feature)

    param = {
        # 'objective': 'binary',
        'objective': 'xentropy',
        # 'metric': 'auc',
        'random_state': 1,
        "verbosity": -1,
        'learning_rate': 0.05,
        'num_threads': 16,
    }

    X_cols
    print('len(X_cols)', len(X_cols))
    print('len(X_cols)', len(list(set(X_cols))))
    print('len set with replace space with underscore ', len(set([c.replace(' ', '_') for c in X_cols])))
    X_train
    model_gbm = lgb.train(params=param,
                          train_set=lgb_train,
                          num_boost_round=4000,
                          valid_sets=[lgb_train, lgb_valid],
                          early_stopping_rounds=20,
                          feature_name=X_cols,
                          verbose_eval=20,
                          )

Which leads to this:

[screenshot: LightGBMError "Wrong size of feature_names"]

I have the feeling that the problem is linked to #426, but I'm unsure how to solve it.

I'm currently commenting out feature_name=X_cols to avoid the bug in my pipeline.

@guolinke
Collaborator

Could you provide the content of X_cols?

@jacquespeeters
Author

X_cols.txt
Sorry, here it is.

@guolinke
Collaborator

It seems there are some non-ASCII characters in the feature names; I think they cause this failure.
We only accept ASCII for most strings in LightGBM.
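For reference, a quick way to check which entries of X_cols contain non-ASCII characters (a standalone snippet that only assumes X_cols is the list of column names discussed above):

    # List the feature names with at least one character outside the ASCII range.
    non_ascii_names = [c for c in X_cols if any(ord(ch) > 127 for ch in c)]
    print(non_ascii_names)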

@jacquespeeters
Author

Thank you for your answer. I'll have a look whenever possible.
Should I close the issue or not (e.g. if you want to improve the error message)?

@guolinke
Collaborator

Maybe we can do this on the Python side.
@StrikerRUS could we throw an exception when users pass non-ASCII characters in feature_names?
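A minimal sketch of what such a Python-side guard could look like; the helper name _check_feature_names_ascii is hypothetical and is not LightGBM's actual API (#2229 is the change that actually implemented the error):

    def _check_feature_names_ascii(feature_names):
        """Illustrative sketch: raise if any feature name contains a non-ASCII character."""
        for name in feature_names:
            if any(ord(ch) > 127 for ch in name):
                raise ValueError("Feature name contains non-ASCII characters: {!r}".format(name))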

@StrikerRUS
Collaborator

@guolinke I think yes, we can. But why not do this in a more general way on the cpp side? And I guess it's not only the feature_names field that is affected.

@guolinke
Collaborator

@StrikerRUS It is not very straightforward to support non-ASCII characters (e.g. UTF-8) on the cpp side.
It would also change the encoding of the model file, since the feature names are saved to the model file.
That may hurt backward compatibility.

@StrikerRUS
Collaborator

@guolinke I didn't mean supporting UTF-8, which would require changing the model file encoding, but raising an error when a non-ASCII symbol is encountered. Is that possible on the cpp side?

@StrikerRUS
Collaborator

@guolinke Something like this: https://stackoverflow.com/questions/48212992/how-to-find-out-if-there-is-any-non-ascii-character-in-a-string-with-a-file-path.

@kidotaka

Is it really an encoding problem?
I have not seen the error with a lot of Japanese Kanji.
I inspected this issue after the #2229 merge because I want to use non-ASCII characters.

The X_cols.txt attached above contains 2 non-ASCII characters:
codepoint=224 (U+00E0) name=LATIN SMALL LETTER A WITH GRAVE
codepoint=233 (U+00E9) name=LATIN SMALL LETTER E WITH ACUTE
These characters are used in French and Italian.

The c_str function in basic.py uses UTF-8 encoding, so these characters are safely converted to UTF-8-encoded bytes.

    def c_str(string):
        """Convert a Python string to C string."""
        return ctypes.c_char_p(string.encode('utf-8'))

The model_to_string function in basic.py calls decode with no argument.
The type of string_buffer.value is bytes, so the default UTF-8 is used to decode.

    ret = string_buffer.value.decode()
    ret += _dump_pandas_categorical(self.pandas_categorical)
    return ret
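As a small illustration of that round trip (the feature name here is hypothetical), a string containing U+00E9 survives encode('utf-8') followed by the default decode() unchanged, which supports the point that the encoding itself does not break the names:

    name = 'prix_élevé'           # hypothetical feature name containing U+00E9
    raw = name.encode('utf-8')    # what c_str() hands to the C API
    print(raw.decode() == name)   # True: decode() defaults to UTF-8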

How about model file encoding?
The binary option is used in gbdt_model_text.cpp.

    bool GBDT::SaveModelToFile(int start_iteration, int num_iteration, const char* filename) const {
      /*! \brief File to write models */
      std::ofstream output_file;
      output_file.open(filename, std::ios::out | std::ios::binary);

I think it's not an encoding problem.
Actually, I can use the above 2 characters with Python 3.6.6 and lightgbm 2.2.3 in a conda env on Windows.

I suspect a blank feature name.
X_cols.txt has a blank line at line 1726.
A blank feature name easily causes LightGBMError: "Wrong size of feature names."
This is a kind of split problem (see the small sketch below); my full reproduction script follows it.
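A small self-contained sketch of that kind of split problem (illustrative only, not LightGBM's actual code path): when names are joined with a space separator and later split back, an empty name silently disappears and the counts no longer match:

    names = ['col_à', 'col_é', '', 'col_3']   # one blank feature name
    joined = ' '.join(names)                  # the blank name becomes two adjacent spaces
    recovered = joined.split()                # split() collapses whitespace, dropping the blank
    print(len(names), len(recovered))         # 4 3 -> size mismatch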

import lightgbm as lgb
import pandas as pd
import codecs
import numpy as np
import unicodedata

def main():
    X_cols = create_X_cols()
    report_non_ascii(X_cols)
    rows = 1
    X_train = pd.DataFrame(data=np.zeros((rows,len(X_cols))), columns=X_cols)
    y_train = pd.DataFrame(data=np.ones((rows,1)), columns=['label'])

    lgb_train = lgb.Dataset(X_train, label=y_train)

    param = {
        'objective': 'xentropy',
        'random_state' : 1,
        'verbosity': -1,
        'learning_rate': 0.05,
        'num_threads': 4,
    }

    model = lgb.train(params=param,
                    train_set=lgb_train,
                    num_boost_round=10,
                    feature_name=X_cols,
                    verbose_eval=20)
    print(len(model.feature_name()))
    model.save_model("model.txt")
    
def create_X_cols():
    with codecs.open('X_cols.txt', mode='r', encoding='utf-8') as f:
        lines = f.readlines()
    print("length={}".format(len(lines)))
    
    print("LF contains? {}".format(any(map(lambda x: "\n" in x, lines))))
    LF_removed = list(map(lambda x: x.replace("\n",""), lines))
    print("LF contains? {}".format(any(map(lambda x: "\n" in x, LF_removed))))
    print("space contains? {}".format(any(map(lambda x: " " in x, LF_removed))))
    no_duplicate = list(set(LF_removed))
    print("blank line contains? {}".format(any(map(lambda x: "" == x, no_duplicate))))
    no_blank_line = [c for c in  no_duplicate if c != '']
    print("blank line contains? {}".format(any(map(lambda x: "" == x, no_blank_line))))
    print("length={}".format(len(no_blank_line)))
    return sorted(no_blank_line)

def report_non_ascii(columns):
    non_ascii_codepoints = sorted(set([ord(c) for col in columns for c in col if ord(c) > 127]))
    print("non ascii codepoints={}".format(non_ascii_codepoints))
    for codepoint in non_ascii_codepoints:
        print("codepoint={} \tname={}".format(codepoint, unicodedata.name(chr(codepoint))))

if __name__ == '__main__':
    main()


henry0312 added a commit to henry0312/LightGBM that referenced this issue Apr 6, 2020
This commit reverts 0d59859.
Also see:
- microsoft#2226
- microsoft#2478
- microsoft#2229

I reproduced the issue and, as @kidotaka's thorough survey in microsoft#2226 shows,
I don't conclude that the cause is UTF-8, but rather "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (microsoft#2229)", whose commit hash
is 0d59859, and add support for feature names as UTF-8 again.
henry0312 added a commit that referenced this issue Apr 10, 2020
* Support UTF-8 characters in feature name again

This commit reverts 0d59859.
Also see:
- #2226
- #2478
- #2229

I reproduced the issue and, as @kidotaka's thorough survey in #2226 shows,
I don't conclude that the cause is UTF-8, but rather "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (#2229)", whose commit hash
is 0d59859, and add support for feature names as UTF-8 again.

* add tests

* fix check-docs tests

* update

* fix tests

* update .travis.yml

* fix tests

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* add a test for R-package

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* fix test for R-package

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* update test_r_package.sh

* update

* update

* update

* remove unneeded comments