
OverflowError when training with 100k+ iterations #2265

Closed

louis925 opened this issue Jul 15, 2019 · 6 comments

Comments

louis925 commented Jul 15, 2019

Environment info

Operating System: Windows 7 SP2 (and same issue on macOS 10.13.6 but it crashes python kernel without any message)

CPU/GPU model: CPU

C++/Python/R version: Python 3.6

LightGBM version or commit hash: 2.2.3 (and 2.2.0)

Error message

When training LightGBM with more than 100,000 iterations, the model finishes training (there is still enough memory) but fails when it tries to exit the training process.

[358000]	training's mape: 0.000139252
[360000]	training's mape: 0.00013805
[362000]	training's mape: 0.000136836
[364000]	training's mape: 0.000135664
[366000]	training's mape: 0.000134525
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-22-f940fa105e9d> in <module>()
     11 
     12 # train model
---> 13 model = lgb.train(params, lgb_train, valid_sets=lgb_train, **lgb_other_params)
     14 
     15 y_pred = model.predict(df_test[cols_feats])

c:\python36\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    240         booster.best_score[dataset_name][eval_name] = score
    241     if not keep_training_booster:
--> 242         booster.model_from_string(booster.model_to_string(), False).free_dataset()
    243     return booster
    244 

c:\python36\lib\site-packages\lightgbm\basic.py in model_to_string(self, num_iteration, start_iteration)
   2096         # if buffer length is not long enough, re-allocate a buffer
   2097         if actual_len > buffer_len:
-> 2098             string_buffer = ctypes.create_string_buffer(actual_len)
   2099             ptr_string_buffer = ctypes.c_char_p(*[ctypes.addressof(string_buffer)])
   2100             _safe_call(_LIB.LGBM_BoosterSaveModelToString(

c:\python36\lib\ctypes\__init__.py in create_string_buffer(init, size)
     58         return buf
     59     elif isinstance(init, int):
---> 60         buftype = c_char * init
     61         buf = buftype()
     62         return buf

OverflowError: The '_length_' attribute is too large

However, if I set keep_training_booster=True, the entire training finishes without a problem. So this seems to happen only when LightGBM tries to convert the model into a string before freeing it.

Reproducible examples

You can reproduce this with any regression problem with ~50,000 samples and 150 features, trained for ~300,000 iterations with a small learning rate such as 0.001:

import lightgbm as lgb

params = {
    'boosting_type': 'gbdt', 'task': 'train', 'objective': 'mse', 'metric': 'mse',
    'feature_fraction': 0.9, 'learning_rate': 0.001, 'num_leaves': 255,
}
lgb_other_params = {'num_boost_round': 366000, 'verbose_eval': 2000}
# df_train is a pandas DataFrame; cols_feats lists the feature columns, col_target the label
lgb_train = lgb.Dataset(df_train[cols_feats], df_train[col_target]).construct()
model = lgb.train(params, lgb_train, valid_sets=lgb_train, **lgb_other_params)

Here df_train has about 50,000 samples and 150 features, which still fits in our 16 GB of memory during training. Training fails only on exit, with keep_training_booster=False.

guolinke (Collaborator) commented

c:\python36\lib\ctypes\__init__.py in create_string_buffer(init, size)
     58         return buf
     59     elif isinstance(init, int):
---> 60         buftype = c_char * init
     61         buf = buftype()
     62         return buf

OverflowError: The '_length_' attribute is too large

It seems this error is caused by ctypes...
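For reference, the limit lives in ctypes itself: an array type's _length_ must fit in a Py_ssize_t. A minimal sketch that reproduces the same error directly (the exact threshold depends on the Python build, but 2 ** 63 exceeds Py_ssize_t even on 64-bit interpreters):

import ctypes

# ctypes array lengths must fit in a Py_ssize_t; requesting more raises
# the same OverflowError shown in the traceback above.
try:
    ctypes.create_string_buffer(2 ** 63)
except OverflowError as e:
    print(e)  # The '_length_' attribute is too large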

louis925 (Author) commented Jul 15, 2019

Just curious: do you think it is possible to bypass the model_from_string(booster.model_to_string()) part? I noticed LightGBM spends a lot of time converting the model to a string (which uses only one thread) before crashing in this case.

guolinke (Collaborator) commented

keep_training_booster=True is the only solution for now.
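In other words, with keep_training_booster=True the engine.py code shown in the traceback skips the model_to_string()/model_from_string() round-trip entirely. A sketch, reusing params, lgb_train, and lgb_other_params from the repro above:

# The training booster is returned as-is; no string serialization happens,
# so the oversized ctypes buffer is never allocated.
model = lgb.train(params, lgb_train, valid_sets=lgb_train,
                  keep_training_booster=True, **lgb_other_params)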

StrikerRUS (Collaborator) commented

@guolinke Do you think that this issue is fixable?

guolinke (Collaborator) commented Jul 24, 2019

There could be a workaround, for example returning multiple small strings from the C API and concatenating them outside ctypes.
A quick fix is to use a file to save/restore the model instead of a string.
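A sketch of the file-based route using the existing Booster.save_model / lgb.Booster(model_file=...) API, assuming model was trained with keep_training_booster=True as above:

# Serialize straight to disk and reload, so no multi-gigabyte string
# buffer is ever created on the Python side.
model.save_model('model.txt')
fresh = lgb.Booster(model_file='model.txt')
y_pred = fresh.predict(df_test[cols_feats])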

StrikerRUS (Collaborator) commented

Closed in favor of #2302; we decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
