
Trained tree unexpectedly contains only root #4826

Open
FiksII opened this issue Nov 24, 2021 · 7 comments

@FiksII

FiksII commented Nov 24, 2021

Description

There is a synthetic dataset with 1 feature and 1 target; the number of records is 120.
On the full dataset, I get just 1 trained tree without splits, only the root.
But on a dataset with fewer records, I get normal results.

Reproducible example

import lightgbm
import pandas

df_data = pandas.read_csv('issued_dataset.csv')
X = df_data[['X']]
y = df_data['y']

lgbm_params = {
    'boosting_type': 'gbrt',
    'n_estimators': 100,
    'learning_rate': 0.1,
    'max_depth': -1,
    'min_child_samples': 10,
    'min_child_weight': 0,
    'min_split_gain': 0.0,
    'n_jobs': -1,
    'num_leaves': 10,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 1,
    'subsample_for_bin': 200000,
    'subsample_freq': 1,
    'verbose': -1,
    'metric': 'l1',
}

for n in range(10, len(df_data) + 1):
    dataset = lightgbm.Dataset(data=X.iloc[:n], label=y.iloc[:n], categorical_feature=[])
    model = lightgbm.train(lgbm_params, dataset)
    print(f'{n=}, {model.num_trees()}')

n=10, 1
...
n=19, 1
n=20, 100
...
n=113, 100
n=114, 1
n=115, 100
...
n=119, 100
n=120, 1

dataset = lightgbm.Dataset(data=X, label=y, categorical_feature=[])
lgbm_params['verbose'] = 2
lightgbm.train(lgbm_params, dataset)

[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.016667
[LightGBM] [Debug] init for col-wise cost 0.000005 seconds, init for row-wise cost 0.000230 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000245 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 31
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 1
[LightGBM] [Info] Start training from score 15.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 1 and depth = 1

Environment info

lightgbm version 3.3.1
python 3.8.8
Windows 10 x64

Additional Comments

Necessary file:
issued_dataset.csv

@jameslamb
Collaborator

Thanks very much for your interest in LightGBM and for creating this issue.

I created the following minimal, reproducible example by modifying your provided code in the following ways:

  • adding import statements
  • reading the dataset over the internet to avoid dealing with local filepaths
  • removing parameters until I had the smallest possible set that still reproduces the problem
  • reducing num_boost_round

import lightgbm
import pandas

data_url = "https://github.com/microsoft/LightGBM/files/7597692/issued_dataset.csv"
df_data = pandas.read_csv(data_url)

X = df_data[['X']]
y = df_data['y']

for n in range(10, len(df_data) + 1):
    dataset = lightgbm.Dataset(
        data=X.iloc[:n],
        label=y.iloc[:n],
        categorical_feature=[],
        params={"max_bin": 256},
        free_raw_data=False
    )
    model = lightgbm.train(
        params={
            'boosting_type': 'gbrt',
            'objective': 'regression',
            'verbose': -1,
            'min_data_in_leaf': 1,
        },
        train_set=dataset,
        num_boost_round=50
    )
    print(f'{n=}, {model.num_trees()}')

I can see the same behavior you've described. Using small amounts of data, LightGBM builds num_boost_round trees. At exactly 120 observations (the full dataset), LightGBM only builds a single tree and shows the following warning at every iteration.

[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements

If you're curious, this warning comes from here:

Log::Warning("Stopped training because there are no more leaves that meet the split requirements");

I expect that adding one more observation changed the distribution of either the target or the feature (there is only 1) in a way that led to this stopping condition. With so few observations (120), a single sample can have a large impact.

I'll investigate this a bit more.

@jameslamb
Collaborator

aha! I think I figured it out!

The target in this dataset only has two unique values.

import numpy as np

np.unique(y)
# array([10., 20.])

Adding the final row of the dataset makes the mean of y exactly 15.0: 60 observations where y = 10 and 60 where y = 20.

In your dataset, X is monotonically increasing, and after that last observation is added, the mean of X is identical for y=10 and y=20.

pandas.DataFrame(
    {
        "y": y,
        "x": X.values.flatten(),
        "cumulative_sum": np.cumsum(y),
        "cumulative_mean": np.cumsum(y) / (np.arange(120) + 1)
    }
).groupby(["y"]).mean()

[image: per-group means of x, cumulative_sum, and cumulative_mean]

I think that in this situation it's not possible for LightGBM to create a split with positive gain for the regression objective, even after setting min_data_in_leaf=1, min_gain_to_split=0.0, and min_sum_hessian_in_leaf=0.0.

import lightgbm
import pandas

data_url = "https://github.com/microsoft/LightGBM/files/7597692/issued_dataset.csv"
df_data = pandas.read_csv(data_url)

X = df_data[['X']]
y = df_data['y']

n = 120
dataset = lightgbm.Dataset(
    data=X.iloc[:n],
    label=y.iloc[:n],
    categorical_feature=[],
)
model = lightgbm.train(
    params={
        'boosting_type': 'gbrt',
        'objective': 'regression',
        'verbose': -1,
        'min_data_in_leaf': 1,
        'min_gain_to_split': 0.0,
        'min_sum_hessian_in_leaf': 0.0,
    },
    train_set=dataset,
    num_boost_round=50
)
print(f'{n=}, {model.num_trees()}')
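For intuition, here's a sketch of the split gain LightGBM evaluates for regression (omitting regularization terms and constant factors): gain = G_L^2/H_L + G_R^2/H_R - (G_L + G_R)^2/(H_L + H_R), where G and H are the sums of gradients and hessians in each child. The data below is made up (not the attached CSV) but has the same symmetric structure: each candidate threshold leaves equal numbers of y=10 and y=20 on each side, so every gain is exactly zero.

```python
import numpy as np

# Made-up symmetric toy data (not the attached CSV): each distinct X value
# pairs one y=10 with one y=20, so every candidate split stays balanced.
X = np.repeat(np.arange(10.0), 2)      # 0,0,1,1,...,9,9
y = np.tile([10.0, 20.0], 10)

# For L2 regression the initial score is the target mean, gradients are
# g_i = score - y_i, and hessians are all 1.  Split gain (no regularization):
#   gain = G_L^2/H_L + G_R^2/H_R - (G_L + G_R)^2/(H_L + H_R)
g = y.mean() - y
h = np.ones_like(y)

gains = []
for t in np.unique(X)[:-1]:            # thresholds between distinct X values
    left = X <= t
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[~left].sum(), h[~left].sum()
    gains.append(GL**2 / HL + GR**2 / HR - (GL + GR)**2 / (HL + HR))

print(gains)  # every candidate split has zero gain
```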

I think this plot illustrates the problem well.

[image: scatter plot of y against X]

import seaborn as sns
sns.scatterplot(x=X[["X"]].values.flatten(), y=y)

In this case, there's very little that a tree-based supervised learning approach can tell you about how to predict y from X.

@FiksII
Author

FiksII commented Nov 29, 2021

@jameslamb First of all, thank you for your reply. Yes, this is my fault; I made an incorrect example.
Here is almost the same dataset. I just added another column, 'x2', which contains the values 1 and 2 and splits the data into 2 independent groups.
issued_dataset_2.csv

We can see 2 groups:

import pandas
import lightgbm
import matplotlib.pyplot as plt

df_data = pandas.read_csv('issued_dataset_2.csv')

X = df_data[['x1', 'x2']]
y = df_data['y']

idx = X['x2'] == 1
plt.plot(X.loc[idx, 'x1'].values, y.loc[idx].values)

[image: plot of y against x1 for the x2 == 1 group]

and

idx = X['x2'] == 2
plt.plot(X.loc[idx, 'x1'].values, y.loc[idx].values)

[image: plot of y against x1 for the x2 == 2 group]

But LightGBM returns the same "tree".

n = 120

dataset = lightgbm.Dataset(
    data=X.iloc[:n],
    label=y.iloc[:n],
    categorical_feature=[],
)
model = lightgbm.train(
    params={
        'boosting_type': 'gbrt',
        'objective': 'regression',
        'verbose': -1,
        'min_data_in_leaf': 1,
        'min_gain_to_split': 0.0,
        'min_sum_hessian_in_leaf': 0.0,
    },
    train_set=dataset,
    num_boost_round=50
)
print(f'{n=}, {model.num_trees()}')

n=120, 1

At the same time, DecisionTreeRegressor copes with it:

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.metrics import mean_absolute_error

estimator = DecisionTreeRegressor(max_depth=50)
estimator.fit(X, y)
print(mean_absolute_error(estimator.predict(X), y))

0

And it builds the tree:

estimator = DecisionTreeRegressor(max_depth=5)
estimator.fit(X, y)
plot_tree(estimator, fontsize=9)

[image: plotted decision tree]

@FiksII
Author

FiksII commented Nov 29, 2021

I think I get it. The problem is in the splitting criterion: our data is perfectly symmetric with respect to the partition by x1 and x2, so, as I understand it, the first split can only produce 2 datasets that are equal in terms of gain. With another criterion we wouldn't have this problem, or the first split would not have to maximize the variance gain. But here, it seems this is what caused the issue.

Still, I'm surprised that LightGBM can't handle such a simple case.

@jameslamb
Collaborator

here we could split only by 2 equal (in terms of gain) datasets

Yep, exactly! If you look at the nodes produced by the first split from your DecisionTreeRegressor example, note that they both have leaf value 15.0 (exactly equal to the mean of the target). LightGBM wouldn't make such a split because it would say "such a split won't change anything about the model's predictions".

If you want to exert tighter control over the tree-building process, you can force LightGBM to make a given split by using a concept called "forcedsplits".

The code below could be used to try to force LightGBM to reproduce results similar to what you saw with DecisionTreeRegressor.

import lightgbm
import pandas
import json
from sklearn.metrics import mean_squared_error

data_url = "https://github.com/microsoft/LightGBM/files/7616028/issued_dataset_2.csv"
df_data = pandas.read_csv(data_url)

X = df_data[['x1', 'x2']]
y = df_data['y']

forced_split = {
    "feature": 1,
    "threshold": 1.5,
    "right": {
        "feature": 0,
        "threshold": 25.0,
    },
    "left": {
        "feature": 0,
        "threshold": 25.0,
    }
}

with open("forced_split.json", "w") as f:
    f.write(json.dumps(forced_split))

n = 120
dataset = lightgbm.Dataset(
    data=X.iloc[:n],
    label=y.iloc[:n],
    categorical_feature=[],
)
model = lightgbm.train(
    params={
        'boosting_type': 'gbrt',
        'boost_from_average': False,
        'objective': 'regression',
        'verbose': 1,
        'min_data_in_leaf': 1,
        'min_gain_to_split': 0.0,
        'min_sum_hessian_in_leaf': 0.0,
        'forcedsplits_filename': "forced_split.json"
    },
    train_set=dataset,
    num_boost_round=500
)
print(f'{n=}, {model.num_trees()}')

That code produces a model with num_boost_round trees, and the predictions approach those you saw with DecisionTreeRegressor: LightGBM is able to split off some observations to either 10.0 or 20.0, and it locks in on predicting close to 15.0 (the target mean) for the other observations.

mean_squared_error(
    model.predict(X[['x1', 'x2']]),
    y
)
# 7.5000000000000115

model.predict(X[['x1', 'x2']])

[image: plot of model predictions]

Please note that manually overriding LightGBM's tree-building decisions is a complex task that can be difficult to get right (see the discussion in #4591 and #4725, for examples).

Broadly speaking, for a wide range of datasets and use cases, gradient boosting should outperform non-boosted tree-based models like DecisionTreeRegressor. However, for cases like this one you might find that DecisionTreeRegressor, or even just application code with if statements, is a better choice.

@shiyu1994
Collaborator

I've tried commenting out the following line:

CHECK_GE(min_gain_to_split, 0.0);

That allows setting min_gain_to_split to a negative value, and doing so produces 50 trees with the above dataset.
So I think we may remove the check for min_gain_to_split. That would let LightGBM explore in the early stages when fitting datasets like Y = X1 XOR X2, where X1 and X2 are features taking values 0 and 1.
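To make that XOR case concrete, here's a sketch using scikit-learn's DecisionTreeRegressor, which (unlike LightGBM's current check) tolerates zero-gain splits: no single split on either feature reduces variance, yet a depth-2 tree fits the target exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Y = X1 XOR X2 with binary features: every single split has zero variance
# gain, so gain-based stopping halts at the root, but a greedy CART tree
# that accepts zero-gain splits separates the classes at depth 2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict(X))  # [0. 1. 1. 0.]
```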

@AtroXWorf

From the Parameters description:

Note: the forced split logic will be ignored, if the split makes gain worse

I would also second allowing forced splits to be applied even when they are worse than doing nothing. Sometimes you are forced to keep certain features in a model, e.g. because of business constraints, even if that makes model performance worse.
