
[dask] multiclass classification gives different samples for same split #4220

Open · Tracked by #5153
jmoralez (Collaborator) opened this issue Apr 23, 2021 · 1 comment
Description

When using lgb.DaskLGBMClassifier for multiclass classification, the same split produces different numbers of samples being sent to each child.

Reproducible example

```python
import dask.array as da
import lightgbm as lgb
import numpy as np
from dask.distributed import Client
from sklearn.datasets import make_blobs

client = Client(n_workers=2, threads_per_worker=2)

X, y = make_blobs(n_samples=1_000, centers=[[-4, -4], [4, 4], [-4, 4]])
dX = da.from_array(X, chunks=(100, 2))
dy = da.from_array(y, chunks=100)
clf = lgb.DaskLGBMClassifier().fit(dX, dy)

trees_df = clf.booster_.trees_to_dataframe()
trees_df['threshold'] = trees_df['threshold'].astype(np.float64)
# find the left children of the root nodes that split on Column_0 <= 0
relevant = trees_df.loc[
    lambda x: (x.node_depth == 1) & (x.split_feature == 'Column_0') & np.isclose(x.threshold, 0),
    ['tree_index', 'left_child'],
]
relevant = relevant.rename(columns={'left_child': 'node_index'})
print(trees_df.merge(relevant)['count'].value_counts().head().to_markdown())
```
| count | frequency |
|------:|----------:|
|   485 |        54 |
|   516 |        28 |
|   523 |         2 |
|   667 |         1 |
|   635 |         1 |

Running the exact same thing using lgb.LGBMClassifier returns 667 every time (which is the number of samples with X[:, 0] <= 0).
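As a sanity check (a sketch, not part of the original report), the expected left-child count can be estimated directly from the data: with centers at [-4, -4], [4, 4], and [-4, 4], two of the three clusters lie entirely on the negative side of x0, so roughly two thirds of the 1,000 samples satisfy X[:, 0] <= 0.

```python
import numpy as np
from sklearn.datasets import make_blobs

# same data-generating call as in the reproducible example
X, y = make_blobs(n_samples=1_000, centers=[[-4, -4], [4, 4], [-4, 4]])

# number of samples that would go to the left child of a `Column_0 <= 0` split
n_left = int((X[:, 0] <= 0).sum())
print(n_left)  # ~667: two of the three clusters are centered at x0 = -4
```

With the default cluster_std=1.0 the clusters at x0 = ±4 are four standard deviations from the threshold, so the count is almost always exactly the number of samples drawn from the two left clusters.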

Environment info

LightGBM version or commit hash: 1e95cb0

Additional Comments

I believe this may be the reason why the tests for multiclass classification sometimes fail. I've been struggling with a case where a single sample seems to go the wrong way at a split and ends up with a relatively high predicted probability of belonging to another class.


jmoralez commented Apr 5, 2022

Adding some more info here. This seems to be a sync problem like the one in #4026. The example above only gets the correct number of samples where Column_0 <= 0 (667) on the first iteration, i.e.:

```python
import dask.array as da
import lightgbm as lgb
import numpy as np
from dask.distributed import Client
from sklearn.datasets import make_blobs

client = Client(n_workers=2, threads_per_worker=2)

X, y = make_blobs(n_samples=1_000, centers=[[-4, -4], [4, 4], [-4, 4]])
dX = da.from_array(X, chunks=(100, 2))
dy = da.from_array(y, chunks=100)
clf = lgb.DaskLGBMClassifier(n_estimators=5).fit(dX, dy)

trees_df = clf.booster_.trees_to_dataframe()
trees_df['threshold'] = trees_df['threshold'].astype(np.float64)
# find the left children of the root nodes that split on Column_0 <= 0
relevant = trees_df.loc[
    lambda x: (x.node_depth == 1) & (x.split_feature == 'Column_0') & np.isclose(x.threshold, 0),
    ['tree_index', 'left_child'],
]
relevant = relevant.rename(columns={'left_child': 'node_index'})
print(trees_df.merge(relevant)[['tree_index', 'count']].to_markdown())
```
|    | tree_index | count |
|---:|-----------:|------:|
|  0 |          1 |   667 |
|  1 |          4 |   638 |
|  2 |          7 |   605 |
|  3 |         10 |   587 |
|  4 |         13 |   577 |