Trained tree unexpectedly contains only root #4826
Thanks very much for your interest in LightGBM and for creating this issue. I created the following minimal, reproducible example by modifying your provided code (reading the attached CSV directly from the issue and training on progressively larger slices of the data):
import lightgbm
import pandas
data_url = "https://github.com/microsoft/LightGBM/files/7597692/issued_dataset.csv"
df_data = pandas.read_csv(data_url)
X = df_data[['X']]
y = df_data['y']
for n in range(10, len(df_data) + 1):
    dataset = lightgbm.Dataset(
        data=X.iloc[:n],
        label=y.iloc[:n],
        categorical_feature=[],
        params={"max_bin": 256},
        free_raw_data=False
    )
    model = lightgbm.train(
        params={
            'boosting_type': 'gbrt',
            'objective': 'regression',
            'verbose': -1,
            'min_data_in_leaf': 1,
        },
        train_set=dataset,
        num_boost_round=50
    )
    print(f'{n=}, {model.num_trees()}')

I can see the same behavior you've described. Using small amounts of data, LightGBM builds trees with real splits, but with the full 120 rows training stops early with a warning and the final model contains only a single root-only tree.
If you're curious, this warning comes from here: LightGBM/src/boosting/gbdt.cpp Line 443 in b0137de
I expect that adding one more observation changed the distribution of either the target or the feature (there is only one feature) in a way that triggered this stopping condition. With so few observations (120), a single sample can have a large impact. I'll investigate this a bit more.
aha! I think I figured it out! The target in this dataset only has two unique values.

import numpy as np
np.unique(y)
# array([10., 20.])

Adding the final row in the dataset changes the distribution of the target across those two groups. In your dataset:

pandas.DataFrame(
    {
        "y": y,
        "x": X.values.flatten(),
        "cumulative_sum": np.cumsum(y),
        "cumulative_mean": np.cumsum(y) / (np.array(range(120)) + 1)
    }
).groupby(["y"]).mean()

I think that in this situation, it's not possible for LightGBM to create a split with positive gain for the first tree, even with the split requirements relaxed:

import lightgbm
import pandas
data_url = "https://github.com/microsoft/LightGBM/files/7597692/issued_dataset.csv"
df_data = pandas.read_csv(data_url)
X = df_data[['X']]
y = df_data['y']
n = 120
dataset = lightgbm.Dataset(
    data=X.iloc[:n],
    label=y.iloc[:n],
    categorical_feature=[],
)
model = lightgbm.train(
    params={
        'boosting_type': 'gbrt',
        'objective': 'regression',
        'verbose': -1,
        'min_data_in_leaf': 1,
        'min_gain_to_split': 0.0,
        'min_sum_hessian_in_leaf': 0.0,
    },
    train_set=dataset,
    num_boost_round=50
)
print(f'{n=}, {model.num_trees()}')

I think this plot illustrates the problem well.

import seaborn as sns
sns.scatterplot(x=X[["X"]].values.flatten(), y=y)

In this case, there's very little that a tree-based supervised learning approach can tell you about how to predict y from X.
@jameslamb First of all, thank you for your reply. Yes, this is my fault, I made an incorrect example. We can see 2 groups:
[the two group tables are omitted from this copy]
But LightGBM returns the same "tree".
At the same time, scikit-learn's DecisionTreeRegressor coped with it and builds the tree:
[the plotted tree is omitted from this copy]
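The exact scikit-learn snippet and the plotted tree are not preserved in this copy. A minimal sketch of how that comparison might look, assuming the second dataset attached later in this thread (issued_dataset_2.csv) and a plain DecisionTreeRegressor, is:

import pandas
from sklearn.tree import DecisionTreeRegressor, export_text

data_url = "https://github.com/microsoft/LightGBM/files/7616028/issued_dataset_2.csv"
df_data = pandas.read_csv(data_url)
X = df_data[['x1', 'x2']]
y = df_data['y']

# unlike LightGBM's default behavior, a single CART tree will make a zero-gain first split
reg = DecisionTreeRegressor(min_samples_leaf=1, max_depth=3)  # depth limited only to keep the printout small
reg.fit(X, y)
print(export_text(reg, feature_names=['x1', 'x2']))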
I think I get it. The problem is in the splitting criterion: our data is absolutely symmetric with respect to the partition by x1 and x2, so, as I understand it, here we could only split into 2 equal (in terms of gain) subsets. But I'm curious: why can't it handle such a simple case?
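To make the "equal in terms of gain" point concrete, here is a small illustration with made-up numbers (not taken from the attached CSV): when the target is perfectly symmetric with respect to both features, a first split on either feature leaves each child with the same mean as the parent, so the variance-reduction gain is exactly zero.

import numpy as np

# toy, perfectly symmetric data: y behaves like an XOR of x1 and x2
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
y = np.array([10.0, 20.0, 20.0, 10.0])

def variance_gain(y, mask):
    # total squared error of the parent node minus that of the two children
    parent = np.var(y) * len(y)
    left, right = y[mask], y[~mask]
    children = np.var(left) * len(left) + np.var(right) * len(right)
    return parent - children

print(variance_gain(y, x1 == 0))  # 0.0 -> no positive-gain split on x1
print(variance_gain(y, x2 == 0))  # 0.0 -> no positive-gain split on x2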
Yep, exactly! If you look at the nodes produced by the first split from your DecisionTreeRegressor example, you'll see that that first split by itself doesn't reduce the training loss at all; its gain is 0, and by default LightGBM only makes splits with positive gain.

If you want to exert tighter control over the tree-building process, you can force LightGBM to make a given split by using a concept called "forcedsplits". The code below could be used to try to force LightGBM to reproduce results similar to what you saw with DecisionTreeRegressor.

import lightgbm
import pandas
import json
from sklearn.metrics import mean_squared_error
data_url = "https://github.com/microsoft/LightGBM/files/7616028/issued_dataset_2.csv"
df_data = pandas.read_csv(data_url)
X = df_data[['x1', 'x2']]
y = df_data['y']
forced_split = {
    "feature": 1,
    "threshold": 1.5,
    "right": {
        "feature": 0,
        "threshold": 25.0,
    },
    "left": {
        "feature": 0,
        "threshold": 25.0,
    }
}
with open("forced_split.json", "w") as f:
    f.write(json.dumps(forced_split))
n = 120
dataset = lightgbm.Dataset(
    data=X.iloc[:n],
    label=y.iloc[:n],
    categorical_feature=[],
)
model = lightgbm.train(
    params={
        'boosting_type': 'gbrt',
        'boost_from_average': False,
        'objective': 'regression',
        'verbose': 1,
        'min_data_in_leaf': 1,
        'min_gain_to_split': 0.0,
        'min_sum_hessian_in_leaf': 0.0,
        'forcedsplits_filename': "forced_split.json"
    },
    train_set=dataset,
    num_boost_round=500
)
print(f'{n=}, {model.num_trees()}')

That code produces a model with the following training error:

mean_squared_error(
    model.predict(X[['x1', 'x2']]),
    y
)
# 7.5000000000000115

You can also check the predictions themselves with model.predict(X[['x1', 'x2']]).

Please note that manually overriding LightGBM's tree-building decisions is a complex task that can be difficult to get right (see the discussion in #4591 and #4725, for examples). Broadly speaking, for a wide range of datasets and use cases, gradient boosting should outperform other non-boosted tree-based models like DecisionTreeRegressor.
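One way (not from the original discussion, just a suggestion) to sanity-check that kind of manual override is to dump the trained Booster's structure with trees_to_dataframe() and confirm that the first tree's split features and thresholds match the forced_split.json written above.

# assumes `model` is the Booster trained with forcedsplits_filename above
tree_df = model.trees_to_dataframe()
first_tree = tree_df[tree_df['tree_index'] == 0]
print(first_tree[['node_index', 'split_feature', 'threshold', 'value']])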
I've tried commenting out the following line: LightGBM/src/io/config_auto.cpp Line 403 in af5b40e
This would allow us to set min_gain_to_split to a negative value. Setting min_gain_to_split to a negative value produces 50 trees with the above dataset. So I think we may remove the check for min_gain_to_split. This would allow LightGBM to do exploration in the early stages when fitting datasets like Y = X1 XOR X2, where X1 and X2 are features taking values 0 and 1.
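As a rough sketch of that XOR case (not code from this thread): with a perfectly balanced XOR target, neither feature gives a positive-gain first split, so with the default non-negative min_gain_to_split LightGBM typically stops with a root-only tree.

import numpy as np
import lightgbm

# perfectly balanced XOR data, so every single-feature split has exactly zero gain
base = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
X = np.tile(base, (30, 1))
y = np.logical_xor(X[:, 0], X[:, 1]).astype(float)

dataset = lightgbm.Dataset(X, label=y)
model = lightgbm.train(
    params={'objective': 'regression', 'min_data_in_leaf': 1, 'verbose': -1},
    train_set=dataset,
    num_boost_round=50,
)
print(model.num_trees())  # expected: 1, a root-only tree, since no split has positive gain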
From the Parameters description:
[quoted passage omitted from this copy]
I would also second allowing for the possibility to apply forced splits even if they are worse than doing nothing. Sometimes you might be forced to include certain features in a model because of, e.g., business constraints, even if it makes the model's performance worse.
Description
There is a synthetic dataset with 1 feature and 1 target; the number of records is 120.
On the full dataset, I get just 1 trained tree without splits, only the root.
But on a dataset with fewer records, I get normal results.
Reproducible example
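The original code is not preserved in this copy. A minimal sketch consistent with the description (and with the adapted loop in the first comment above) might look like:

import lightgbm
import pandas

# issued_dataset.csv is the file attached to this issue (1 feature, 1 target, 120 records)
df_data = pandas.read_csv("issued_dataset.csv")
X = df_data[['X']]
y = df_data['y']

dataset = lightgbm.Dataset(data=X, label=y)
model = lightgbm.train(
    params={'objective': 'regression', 'min_data_in_leaf': 1, 'verbose': -1},
    train_set=dataset,
    num_boost_round=50,
)
print(model.num_trees())  # reported result on the full dataset: 1 tree with only the root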
Environment info
lightgbm version 3.3.1
python 3.8.8
Windows 10 x64
Additional Comments
Necessary file:
issued_dataset.csv