
Trained tree unexpectedly contains only root #4826

Open
FiksII opened this issue Nov 24, 2021 · 7 comments

@FiksII

FiksII commented Nov 24, 2021

Description

There is a synthetic dataset with 1 feature and 1 target; the number of records is 120.
On the full dataset, I get just 1 trained tree without splits, only the root.
But on a dataset with fewer records, I get normal results.

Reproducible example

import lightgbm
import pandas

df_data = pandas.read_csv('issued_dataset.csv')
X = df_data[['X']]
y = df_data['y']

lgbm_params = {
    'boosting_type': 'gbrt',
    'n_estimators': 100,
    'learning_rate': 0.1,
    'max_depth': -1,
    'min_child_samples': 10,
    'min_child_weight': 0,
    'min_split_gain': 0.0,
    'n_jobs': -1,
    'num_leaves': 10,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 1,
    'subsample_for_bin': 200000,
    'subsample_freq': 1,
    'verbose': -1,
    'metric': 'l1',
}

for n in range(10, len(df_data) + 1):
    dataset = lightgbm.Dataset(data=X.iloc[:n], label=y.iloc[:n], categorical_feature=[])
    model = lightgbm.train(lgbm_params, dataset)
    print(f'{n=}, {model.num_trees()}')

n=10, 1
...
n=19, 1
n=20, 100
...
n=113, 100
n=114, 1
n=115, 100
...
n=119, 100
n=120, 1

dataset = lightgbm.Dataset(data=X, label=y, categorical_feature=[])
lgbm_params['verbose'] = 2
lightgbm.train(lgbm_params, dataset)

[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.016667
[LightGBM] [Debug] init for col-wise cost 0.000005 seconds, init for row-wise cost 0.000230 seconds
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000245 seconds.
You can set force_col_wise=true to remove the overhead.
[LightGBM] [Info] Total Bins 31
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 1
[LightGBM] [Info] Start training from score 15.000000
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Debug] Trained a tree with leaves = 1 and depth = 1

Environment info

lightgbm version 3.3.1
python 3.8.8
Windows 10 x64

Additional Comments

Necessary file:
issued_dataset.csv

@jameslamb
Collaborator

Thanks very much for your interest in LightGBM and for creating this issue.

I created the following minimal, reproducible example by modifying your provided code in the following ways:

  • adding import statements
  • reading the dataset over the internet to avoid dealing with local filepaths
  • removing parameters until I had the smallest possible set that still reproduces the problem
  • reducing num_boost_round

import lightgbm
import pandas

data_url = "https://github.com/microsoft/LightGBM/files/7597692/issued_dataset.csv"
df_data = pandas.read_csv(data_url)

X = df_data[['X']]
y = df_data['y']

for n in range(10, len(df_data) + 1):
    dataset = lightgbm.Dataset(
        data=X.iloc[:n],
        label=y.iloc[:n],
        categorical_feature=[],
        params={"max_bin": 256},
        free_raw_data=False
    )
    model = lightgbm.train(
        params={
            'boosting_type': 'gbrt',
            'objective': 'regression',
            'verbose': -1,
            'min_data_in_leaf': 1,
        },
        train_set=dataset,
        num_boost_round=50
    )
    print(f'{n=}, {model.num_trees()}')

I can see the same behavior you've described. Using small amounts of data, LightGBM builds num_boost_round trees. At exactly 120 observations (the full dataset), LightGBM only builds a single tree and shows the following warning at every iteration.

[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements

If you're curious, this warning comes from here:

Log::Warning("Stopped training because there are no more leaves that meet the split requirements");

I expect that adding one more observation changed the distribution of either the target or the feature (there is only 1) in a way that led to this stopping condition. With so few observations (120), a single sample can have a large impact.

I'll investigate this a bit more.

@jameslamb
Collaborator

aha! I think I figured it out!

The target in this dataset only has two unique values.

import numpy as np

np.unique(y)
# array([10., 20.])

Adding the final row of the dataset makes the mean of y exactly 15.0: 60 observations where y = 10 and 60 where y = 20.

In your dataset, X is monotonically increasing, and after that last observation is added, the mean of X is identical for y=10 and y=20.

pandas.DataFrame(
    {
        "y": y,
        "x": X.values.flatten(),
        "cumulative_sum": np.cumsum(y),
        "cumulative_mean": np.cumsum(y) / (np.arange(120) + 1)
    }
).groupby(["y"]).mean()

[image: per-group means of x, cumulative_sum, and cumulative_mean]

I think that in this situation it's not possible for LightGBM to create a split with positive gain for the regression objective, even after setting min_data_in_leaf=1, min_gain_to_split=0.0, and min_sum_hessian_in_leaf=0.0.

import lightgbm
import pandas

data_url = "https://github.com/microsoft/LightGBM/files/7597692/issued_dataset.csv"
df_data = pandas.read_csv(data_url)

X = df_data[['X']]
y = df_data['y']

n = 120
dataset = lightgbm.Dataset(
    data=X.iloc[:n],
    label=y.iloc[:n],
    categorical_feature=[],
)
model = lightgbm.train(
    params={
        'boosting_type': 'gbrt',
        'objective': 'regression',
        'verbose': -1,
        'min_data_in_leaf': 1,
        'min_gain_to_split': 0.0,
        'min_sum_hessian_in_leaf': 0.0,
    },
    train_set=dataset,
    num_boost_round=50
)
print(f'{n=}, {model.num_trees()}')
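For intuition, here's a sketch of the split gain LightGBM evaluates for regression (omitting regularization terms and constant factors): gain = G_L^2/H_L + G_R^2/H_R - (G_L + G_R)^2/(H_L + H_R), where G and H are the sums of gradients and hessians in each child. The data below is made up (not the attached CSV) but has the same symmetric structure: each candidate threshold leaves equal numbers of y=10 and y=20 on each side, so every gain is exactly zero.

```python
import numpy as np

# Made-up symmetric toy data (not the attached CSV): each distinct X value
# pairs one y=10 with one y=20, so every candidate split stays balanced.
X = np.repeat(np.arange(10.0), 2)      # 0,0,1,1,...,9,9
y = np.tile([10.0, 20.0], 10)

# For L2 regression the initial score is the target mean, gradients are
# g_i = score - y_i, and hessians are all 1.  Split gain (no regularization):
#   gain = G_L^2/H_L + G_R^2/H_R - (G_L + G_R)^2/(H_L + H_R)
g = y.mean() - y
h = np.ones_like(y)

gains = []
for t in np.unique(X)[:-1]:            # thresholds between distinct X values
    left = X <= t
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[~left].sum(), h[~left].sum()
    gains.append(GL**2 / HL + GR**2 / HR - (GL + GR)**2 / (HL + HR))

print(gains)  # every candidate split has zero gain
```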

I think this plot illustrates the problem well.

[image: scatter plot of y against X]

import seaborn as sns
sns.scatterplot(x=X[["X"]].values.flatten(), y=y)

In this case, there's very little that a tree-based supervised learning approach can tell you about how to predict y from X.

@FiksII
Author

FiksII commented Nov 29, 2021

@jameslamb First of all, thank you for your reply. Yes, this is my fault; I made an incorrect example.
Here is almost the same dataset. I just added another column, 'x2', which contains the values 1 and 2 and splits the data into 2 independent groups.
issued_dataset_2.csv

We can see 2 groups:

import pandas
import lightgbm
import matplotlib.pyplot as plt

df_data = pandas.read_csv('issued_dataset_2.csv')

X = df_data[['x1', 'x2']]
y = df_data['y']

idx = X['x2'] == 1
plt.plot(X.loc[idx, 'x1'].values, y.loc[idx].values)

[image: plot of y against x1 for the x2 == 1 group]

and

idx = X['x2'] == 2
plt.plot(X.loc[idx, 'x1'].values, y.loc[idx].values)

[image: plot of y against x1 for the x2 == 2 group]

But LightGBM returns the same "tree".

n = 120

dataset = lightgbm.Dataset(
    data=X.iloc[:n],
    label=y.iloc[:n],
    categorical_feature=[],
)
model = lightgbm.train(
    params={
        'boosting_type': 'gbrt',
        'objective': 'regression',
        'verbose': -1,
        'min_data_in_leaf': 1,
        'min_gain_to_split': 0.0,
        'min_sum_hessian_in_leaf': 0.0,
    },
    train_set=dataset,
    num_boost_round=50
)
print(f'{n=}, {model.num_trees()}')

n=120, 1

At the same time, DecisionTreeRegressor copes with it:

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree
from sklearn.metrics import mean_absolute_error

estimator = DecisionTreeRegressor(max_depth=50)
estimator.fit(X, y)
print(mean_absolute_error(estimator.predict(X), y))

0

And it builds the tree:

estimator = DecisionTreeRegressor(max_depth=5)
estimator.fit(X, y)
plot_tree(estimator, fontsize=9)

[image: plotted decision tree]

@FiksII
Author

FiksII commented Nov 29, 2021

I think I get it. The problem is in the splitting criterion: our data is perfectly symmetric with respect to the partition by x1 and x2, so, as I understand it, the first split can only produce 2 datasets that are equal in terms of gain. With another criterion we wouldn't have this problem, or the first split would not have to maximize the variance gain. But here, it seems this is what caused the issue.

Still, I'm surprised that LightGBM can't handle such a simple case.

@jameslamb
Collaborator

here we could split only by 2 equal (in terms of gain) datasets

Yep, exactly! If you look at the nodes produced by the first split from your DecisionTreeRegressor example, note that they both have leaf value 15.0 (exactly equal to the mean of the target). LightGBM wouldn't make such a split because it would say "such a split won't change anything about the model's predictions".

If you want to exert tighter control over the tree-building process, you can force LightGBM to make a given split by using a concept called "forcedsplits".

The code below could be used to try to force LightGBM to reproduce results similar to what you saw with DecisionTreeRegressor.

import lightgbm
import pandas
import json
from sklearn.metrics import mean_squared_error

data_url = "https://github.com/microsoft/LightGBM/files/7616028/issued_dataset_2.csv"
df_data = pandas.read_csv(data_url)

X = df_data[['x1', 'x2']]
y = df_data['y']

forced_split = {
    "feature": 1,
    "threshold": 1.5,
    "right": {
        "feature": 0,
        "threshold": 25.0,
    },
    "left": {
        "feature": 0,
        "threshold": 25.0,
    }
}

with open("forced_split.json", "w") as f:
    f.write(json.dumps(forced_split))

n = 120
dataset = lightgbm.Dataset(
    data=X.iloc[:n],
    label=y.iloc[:n],
    categorical_feature=[],
)
model = lightgbm.train(
    params={
        'boosting_type': 'gbrt',
        'boost_from_average': False,
        'objective': 'regression',
        'verbose': 1,
        'min_data_in_leaf': 1,
        'min_gain_to_split': 0.0,
        'min_sum_hessian_in_leaf': 0.0,
        'forcedsplits_filename': "forced_split.json"
    },
    train_set=dataset,
    num_boost_round=500
)
print(f'{n=}, {model.num_trees()}')

That code produces a model with num_boost_round trees, and the predictions approach those you saw with DecisionTreeRegressor: LightGBM is able to split off some observations to either 10.0 or 20.0, and it locks in on predicting close to 15.0 (the target mean) for the other observations.

mean_squared_error(
    model.predict(X[['x1', 'x2']]),
    y
)
# 7.5000000000000115

model.predict(X[['x1', 'x2']])

[image: plot of model predictions]

Please note that manually overriding LightGBM's tree-building decisions is a complex task that can be difficult to get right (see the discussion in #4591 and #4725, for examples).

Broadly speaking, for a wide range of datasets and use cases, gradient boosting should outperform non-boosted tree-based models like DecisionTreeRegressor. However, for cases like this one you might find that DecisionTreeRegressor, or even just application code with if statements, is a better choice.

@shiyu1994
Collaborator

I've tried commenting out the following line:

CHECK_GE(min_gain_to_split, 0.0);

That allows setting min_gain_to_split to a negative value, and doing so produces 50 trees with the above dataset.
So I think we may remove the check for min_gain_to_split. That would let LightGBM explore in the early stages when fitting datasets like Y = X1 XOR X2, where X1 and X2 are features taking values 0 and 1.
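To make that XOR case concrete, here's a sketch using scikit-learn's DecisionTreeRegressor, which (unlike LightGBM's current check) tolerates zero-gain splits: no single split on either feature reduces variance, yet a depth-2 tree fits the target exactly.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Y = X1 XOR X2 with binary features: every single split has zero variance
# gain, so gain-based stopping halts at the root, but a greedy CART tree
# that accepts zero-gain splits separates the classes at depth 2.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict(X))  # [0. 1. 1. 0.]
```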

@AtroXWorf

From the Parameters description:

Note: the forced split logic will be ignored, if the split makes gain worse

I would also second allowing forced splits to be applied even when they are worse than doing nothing. Sometimes you are forced to keep certain features in a model, e.g. because of business constraints, even if that makes model performance worse.
