Incorrect array shape returned by xgboost.dask.predict on multiclass predictions #5984

Closed
jameskrach opened this issue Aug 5, 2020 · 4 comments · Fixed by #5989

@jameskrach
Contributor

The function xgboost.dask.predict assumes that the shape of its output is always (n_rows,), but for multiclass classification the output has shape (n_rows, n_classes).

Here's a minimum working example:

import xgboost as xgb
import dask.dataframe as dd
import dask.distributed
import pandas as pd
from sklearn.datasets import make_classification


cluster = dask.distributed.LocalCluster(n_workers=2, threads_per_worker=1)
client = dask.distributed.Client(cluster)
X, y = make_classification(n_samples=1000, n_informative=5, n_classes=3)
X_ = dd.from_array(X, chunksize=500)
y_ = dd.from_array(y, chunksize=500)
dtrain = xgb.dask.DaskDMatrix(client, data=X_, label=y_)

model = xgb.dask.train(
    client,
    {"objective": "multi:softprob", "num_class": 3},
    dtrain=dtrain
)

preds = xgb.dask.predict(client, model, dtrain)

print("Dask inferred shape: ", preds.shape)
print("Computed shape: ", preds.compute().shape)

The offending line is this one

I don't know what the maintainers would consider an idiomatic solution, but something like the following works, though it is inelegant. I'm not familiar enough with the serialized configuration to know whether these keys exist for other learners, or whether some set of parameters would raise a KeyError.

import json

...

def predict():
    ...
        
    arrays = []
    cfg = json.loads(booster.save_config())
    pred_dim = int(cfg["learner"]["learner_model_param"].get("num_class", 1))

    for i, shape in enumerate(shapes):
        if pred_dim > 1:
            pred_shape = (shape[0], pred_dim)
        else:
            pred_shape = (shape[0], )

        arrays.append(da.from_delayed(results[i], shape=pred_shape,
                                      dtype=numpy.float32))
    predictions = await da.concatenate(arrays, axis=0)
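
The shape logic in the patch above can be tried in isolation. The following is a minimal sketch, not the fix that was merged: the helper name prediction_shape is hypothetical, and the JSON strings are hand-written stand-ins for the output of booster.save_config(), containing only the keys the patch reads. Chained .get() calls replace the direct indexing so that a missing key falls back to a single output column instead of raising a KeyError.

```python
import json


def prediction_shape(n_rows, config_json):
    """Return the expected prediction shape for one partition.

    `config_json` is assumed to look like the output of
    booster.save_config(); only the keys used in the patch are read.
    """
    cfg = json.loads(config_json)
    # Chained .get() calls: a missing key yields the default
    # instead of raising KeyError.
    model_param = cfg.get("learner", {}).get("learner_model_param", {})
    num_class = int(model_param.get("num_class", 1))
    if num_class > 1:
        return (n_rows, num_class)
    return (n_rows,)


# Hypothetical, hand-written configs (not real save_config() output).
multiclass_cfg = json.dumps(
    {"learner": {"learner_model_param": {"num_class": "3"}}}
)
single_output_cfg = json.dumps(
    {"learner": {"learner_model_param": {"num_class": "0"}}}
)

print(prediction_shape(500, multiclass_cfg))
print(prediction_shape(500, single_output_cfg))
print(prediction_shape(500, "{}"))
```

Note that this sketch, like the patch, keys only on num_class. It does not account for objectives such as multi:softmax, which return one predicted label per row even when num_class is greater than 1.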


Happy to provide any clarifications if this isn't clear enough.

@hcho3
Collaborator

hcho3 commented Aug 5, 2020

@trivialfis This might be a blocking issue.

@trivialfis
Member

Will fix it first thing tomorrow.

@trivialfis
Member

That's weird, we have tests on this and right now it's outputting the correct shape for me.

@trivialfis
Member

I had to increase the test size to reproduce the incorrect shape.
