Incorrect array shape returned by xgboost.dask.predict on multiclass predictions #5984

jameskrach · 2020-08-05T19:14:19Z

The function xgboost.dask.predict assumes that the shape of its output is always (n_rows,), but for multiclass classification the output has shape (n_rows, n_classes).

Here's a minimal working example:

import xgboost as xgb
import dask.dataframe as dd
import dask.distributed
import pandas as pd
from sklearn.datasets import make_classification


cluster = dask.distributed.LocalCluster(n_workers=2, threads_per_worker=1)
client = dask.distributed.Client(cluster)
X, y = make_classification(n_samples=1000, n_informative=5, n_classes=3)
X_ = dd.from_array(X, chunksize=500)
y_ = dd.from_array(y, chunksize=500)
dtrain = xgb.dask.DaskDMatrix(client, data=X_, label=y_)

model = xgb.dask.train(
    client,
    {"objective": "multi:softprob", "num_class": 3},
    dtrain=dtrain
)

preds = xgb.dask.predict(client, model, dtrain)

print("Dask inferred shape: ", preds.shape)
print("Computed shape: ", preds.compute().shape)
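The mismatch between the two printed shapes can be reproduced without xgboost or a cluster. The following is a minimal sketch, assuming only dask and numpy: `fake_predict` is a hypothetical stand-in for a per-partition prediction call, not part of xgboost. It shows that `da.from_delayed` records whatever shape it is told and does not check it against the array the delayed task actually produces.

```python
import dask
import dask.array as da
import numpy as np


@dask.delayed
def fake_predict(n_rows, n_classes):
    # Hypothetical stand-in for a per-partition multiclass prediction:
    # returns one row of class scores per input row.
    return np.zeros((n_rows, n_classes), dtype=np.float32)


# Declare a 1-D shape even though the task yields a 2-D array,
# mirroring the assumption described in this issue.
declared = da.from_delayed(fake_predict(500, 3), shape=(500,), dtype=np.float32)

print("Dask inferred shape: ", declared.shape)
print("Computed shape: ", declared.compute().shape)
```

The declared shape is only metadata, so any downstream dask operation that relies on `declared.shape` before computing would be working from the wrong dimensions.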

The offending line is this one

I don't know what the maintainers would consider an idiomatic solution, but something like the following works, even if it is gross. I'm not familiar enough with the serialized configuration to know whether those keys exist for other learners, or whether some set of params I'm unfamiliar with would raise a KeyError.

import json

...

def predict():
    ...
        
    arrays = []
    cfg = json.loads(booster.save_config())
    pred_dim = int(cfg["learner"]["learner_model_param"].get("num_class", 1))

    for i, shape in enumerate(shapes):
        if pred_dim > 1:
            pred_shape = (shape[0], pred_dim)
        else:
            pred_shape = (shape[0], )

        arrays.append(da.from_delayed(results[i], shape=pred_shape,
                                      dtype=numpy.float32))
    predictions = await da.concatenate(arrays, axis=0)
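The shape logic in the snippet above can be exercised on its own. This is a rough sketch under stated assumptions, not the code that was merged: `pred_shape`, `fake_predict`, and `assemble` are hypothetical names, and `fake_predict` replaces the real per-partition prediction so the example runs with only dask and numpy.

```python
import dask
import dask.array as da
import numpy as np


def pred_shape(n_rows, num_class):
    # Hypothetical helper: 2-D output for multiclass, 1-D otherwise.
    return (n_rows, num_class) if num_class > 1 else (n_rows,)


@dask.delayed
def fake_predict(n_rows, num_class):
    # Hypothetical stand-in for the per-partition prediction call.
    return np.zeros(pred_shape(n_rows, num_class), dtype=np.float32)


def assemble(partition_rows, num_class):
    # Declare each partition's shape to match what the task returns,
    # then stack the partitions along the row axis.
    arrays = [
        da.from_delayed(
            fake_predict(n, num_class),
            shape=pred_shape(n, num_class),
            dtype=np.float32,
        )
        for n in partition_rows
    ]
    return da.concatenate(arrays, axis=0)


multi = assemble([500, 500], num_class=3)
single = assemble([500, 500], num_class=1)
print("Multiclass:", multi.shape, multi.compute().shape)
print("Single output:", single.shape, single.compute().shape)
```

With the declared shape derived from the number of classes, the inferred and computed shapes agree in both the multiclass and the single-output case.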

Happy to provide any clarifications if this isn't clear enough.


hcho3 · 2020-08-05T19:16:34Z

@trivialfis This might be a blocking issue.

trivialfis · 2020-08-05T20:13:30Z

Will fix it first thing tomorrow.

trivialfis · 2020-08-06T09:24:41Z

That's weird, we have tests on this and right now it's outputting the correct shape for me.

trivialfis · 2020-08-06T09:27:07Z

I have to increase the test size to obtain incorrect shape.

hcho3 added the Blocking label Aug 5, 2020

jameskrach mentioned this issue Aug 6, 2020

[Breaking] Fix .predict() method and add .predict_proba() in xgboost.dask.DaskXGBClassifier #5986

Merged

trivialfis added Blocking and removed Blocking labels Aug 6, 2020

trivialfis mentioned this issue Aug 6, 2020

Fix dask predict shape infer. #5989

Merged

trivialfis closed this as completed in #5989 Aug 8, 2020
