
OverflowError when training with 100k+ iterations #2265

Closed

louis925 opened this issue Jul 15, 2019 · 6 comments

Comments

louis925 commented Jul 15, 2019

Environment info

Operating System: Windows 7 SP2 (and same issue on macOS 10.13.6 but it crashes python kernel without any message)

CPU/GPU model: CPU

C++/Python/R version: Python 3.6

LightGBM version or commit hash: 2.2.3 (and 2.2.0)

Error message

When training LightGBM with more than 100,000 iterations, the model finishes training (there is still enough memory) but fails when it tries to exit the training process.

[358000]	training's mape: 0.000139252
[360000]	training's mape: 0.00013805
[362000]	training's mape: 0.000136836
[364000]	training's mape: 0.000135664
[366000]	training's mape: 0.000134525
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
<ipython-input-22-f940fa105e9d> in <module>()
     11 
     12 # train model
---> 13 model = lgb.train(params, lgb_train, valid_sets=lgb_train, **lgb_other_params)
     14 
     15 y_pred = model.predict(df_test[cols_feats])

c:\python36\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    240         booster.best_score[dataset_name][eval_name] = score
    241     if not keep_training_booster:
--> 242         booster.model_from_string(booster.model_to_string(), False).free_dataset()
    243     return booster
    244 

c:\python36\lib\site-packages\lightgbm\basic.py in model_to_string(self, num_iteration, start_iteration)
   2096         # if buffer length is not long enough, re-allocate a buffer
   2097         if actual_len > buffer_len:
-> 2098             string_buffer = ctypes.create_string_buffer(actual_len)
   2099             ptr_string_buffer = ctypes.c_char_p(*[ctypes.addressof(string_buffer)])
   2100             _safe_call(_LIB.LGBM_BoosterSaveModelToString(

c:\python36\lib\ctypes\__init__.py in create_string_buffer(init, size)
     58         return buf
     59     elif isinstance(init, int):
---> 60         buftype = c_char * init
     61         buf = buftype()
     62         return buf

OverflowError: The '_length_' attribute is too large

However, if I set keep_training_booster=True, the entire training finishes without a problem. So this seems to happen only when LightGBM tries to convert the model into a string before freeing it.

Reproducible examples

You can reproduce this with any regression problem with ~50,000 samples and 150 features, trained for ~300,000 iterations with a small learning rate such as 0.001:

import lightgbm as lgb

params = {
    'boosting_type': 'gbdt', 'task': 'train', 'objective': 'mse', 'metric': 'mse',
    'feature_fraction': 0.9, 'learning_rate': 0.001, 'num_leaves': 255,
}
lgb_other_params = {'num_boost_round': 366000, 'verbose_eval': 2000}
# df_train is a pandas DataFrame; cols_feats lists the feature columns, col_target the label
lgb_train = lgb.Dataset(df_train[cols_feats], df_train[col_target]).construct()
model = lgb.train(params, lgb_train, valid_sets=lgb_train, **lgb_other_params)

Here df_train has about 50,000 samples and 150 features, which still fits in our 16 GB of memory during training. Training fails only on exit, with keep_training_booster=False.

guolinke (Collaborator) commented

c:\python36\lib\ctypes\__init__.py in create_string_buffer(init, size)
     58         return buf
     59     elif isinstance(init, int):
---> 60         buftype = c_char * init
     61         buf = buftype()
     62         return buf

OverflowError: The '_length_' attribute is too large

It seems this error is caused by ctypes...
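For reference, the limit lives in ctypes itself: an array type's _length_ must fit in a Py_ssize_t. A minimal sketch that reproduces the same error directly (the exact threshold depends on the Python build, but 2 ** 63 exceeds Py_ssize_t even on 64-bit interpreters):

import ctypes

# ctypes array lengths must fit in a Py_ssize_t; requesting more raises
# the same OverflowError shown in the traceback above.
try:
    ctypes.create_string_buffer(2 ** 63)
except OverflowError as e:
    print(e)  # The '_length_' attribute is too large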

louis925 (Author) commented Jul 15, 2019

Just curious: do you think it is possible to bypass the model_from_string(booster.model_to_string()) part? I noticed LightGBM spends a lot of time converting the model to a string (which uses only one thread) before crashing in this case.

guolinke (Collaborator) commented

keep_training_booster=True is the only solution for now.
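In other words, with keep_training_booster=True the engine.py code shown in the traceback skips the model_to_string()/model_from_string() round-trip entirely. A sketch, reusing params, lgb_train, and lgb_other_params from the repro above:

# The training booster is returned as-is; no string serialization happens,
# so the oversized ctypes buffer is never allocated.
model = lgb.train(params, lgb_train, valid_sets=lgb_train,
                  keep_training_booster=True, **lgb_other_params)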

StrikerRUS (Collaborator) commented

@guolinke Do you think that this issue is fixable?

guolinke (Collaborator) commented Jul 24, 2019

There could be a workaround, for example returning multiple small strings from the C API and concatenating them outside ctypes.
A quick fix is to use a file to save/restore the model instead of a string.
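A sketch of the file-based route using the existing Booster.save_model / lgb.Booster(model_file=...) API, assuming model was trained with keep_training_booster=True as above:

# Serialize straight to disk and reload, so no multi-gigabyte string
# buffer is ever created on the Python side.
model.save_model('model.txt')
fresh = lgb.Booster(model_file='model.txt')
y_pred = fresh.predict(df_test[cols_feats])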

StrikerRUS (Collaborator) commented

Closed in favor of #2302; we decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
