
[dask] multiclass classification gives different samples for same split #4220

Open · Tracked by #5153
jmoralez (Collaborator) opened this issue Apr 23, 2021 · 1 comment
Description

When using lgb.DaskLGBMClassifier for multiclass classification, the same split produces different numbers of samples being sent to each child.

Reproducible example

```python
import dask.array as da
import lightgbm as lgb
import numpy as np
from dask.distributed import Client
from sklearn.datasets import make_blobs

client = Client(n_workers=2, threads_per_worker=2)

X, y = make_blobs(n_samples=1_000, centers=[[-4, -4], [4, 4], [-4, 4]])
dX = da.from_array(X, chunks=(100, 2))
dy = da.from_array(y, chunks=100)
clf = lgb.DaskLGBMClassifier().fit(dX, dy)

trees_df = clf.booster_.trees_to_dataframe()
trees_df['threshold'] = trees_df['threshold'].astype(np.float64)
# find the left children of the root nodes that split on Column_0 <= 0
relevant = trees_df.loc[
    lambda x: (x.node_depth == 1) & (x.split_feature == 'Column_0') & np.isclose(x.threshold, 0),
    ['tree_index', 'left_child'],
]
relevant = relevant.rename(columns={'left_child': 'node_index'})
print(trees_df.merge(relevant)['count'].value_counts().head().to_markdown())
```
| count | frequency |
|------:|----------:|
|   485 |        54 |
|   516 |        28 |
|   523 |         2 |
|   667 |         1 |
|   635 |         1 |

Running the exact same thing using lgb.LGBMClassifier returns 667 every time (which is the number of samples with X[:, 0] <= 0).
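As a sanity check (a sketch, not part of the original report), the expected left-child count can be estimated directly from the data: with centers at [-4, -4], [4, 4], and [-4, 4], two of the three clusters lie entirely on the negative side of x0, so roughly two thirds of the 1,000 samples satisfy X[:, 0] <= 0.

```python
import numpy as np
from sklearn.datasets import make_blobs

# same data-generating call as in the reproducible example
X, y = make_blobs(n_samples=1_000, centers=[[-4, -4], [4, 4], [-4, 4]])

# number of samples that would go to the left child of a `Column_0 <= 0` split
n_left = int((X[:, 0] <= 0).sum())
print(n_left)  # ~667: two of the three clusters are centered at x0 = -4
```

With the default cluster_std=1.0 the clusters at x0 = ±4 are four standard deviations from the threshold, so the count is almost always exactly the number of samples drawn from the two left clusters.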

Environment info

LightGBM version or commit hash: 1e95cb0

Additional Comments

I believe this may be the reason why the tests for multiclass classification sometimes fail. I've been struggling with a case where a single sample seems to go the wrong way at a split and ends up with a relatively high predicted probability of belonging to another class.


jmoralez commented Apr 5, 2022

Adding some more info here. This seems to be a sync problem like the one in #4026. The example above only gets the correct number of samples where Column_0 <= 0 (667) on the first iteration, i.e.:

```python
import dask.array as da
import lightgbm as lgb
import numpy as np
from dask.distributed import Client
from sklearn.datasets import make_blobs

client = Client(n_workers=2, threads_per_worker=2)

X, y = make_blobs(n_samples=1_000, centers=[[-4, -4], [4, 4], [-4, 4]])
dX = da.from_array(X, chunks=(100, 2))
dy = da.from_array(y, chunks=100)
clf = lgb.DaskLGBMClassifier(n_estimators=5).fit(dX, dy)

trees_df = clf.booster_.trees_to_dataframe()
trees_df['threshold'] = trees_df['threshold'].astype(np.float64)
# find the left children of the root nodes that split on Column_0 <= 0
relevant = trees_df.loc[
    lambda x: (x.node_depth == 1) & (x.split_feature == 'Column_0') & np.isclose(x.threshold, 0),
    ['tree_index', 'left_child'],
]
relevant = relevant.rename(columns={'left_child': 'node_index'})
print(trees_df.merge(relevant)[['tree_index', 'count']].to_markdown())
```
|    | tree_index | count |
|---:|-----------:|------:|
|  0 |          1 |   667 |
|  1 |          4 |   638 |
|  2 |          7 |   605 |
|  3 |         10 |   587 |
|  4 |         13 |   577 |