
[dask] DaskLGBMRegressor.predict() fails on DataFrame / Series input #3861

Closed
jameslamb opened this issue Jan 26, 2021 · 15 comments · Fixed by #3908
@jameslamb (Collaborator) commented Jan 26, 2021

How are you using LightGBM?

LightGBM component: Python-package

Environment info

Operating System: Ubuntu 18.04

C++ compiler version: gcc 8.3.0

CMake version: 3.13.4

Python version:

output of 'conda info'
     active environment : saturn
    active env location : /opt/conda/envs/saturn
            shell level : 0
       user config file : /home/jovyan/.condarc
 populated config files : /opt/conda/.condarc
          conda version : 4.8.2
    conda-build version : not installed
         python version : 3.7.7.final.0
       virtual packages : __glibc=2.28
       base environment : /opt/conda  (writable)
           channel URLs : https://conda.saturncloud.io/pkgs/linux-64
                          https://conda.saturncloud.io/pkgs/noarch
                          https://conda.anaconda.org/conda-forge/linux-64
                          https://conda.anaconda.org/conda-forge/noarch
                          https://repo.anaconda.com/pkgs/main/linux-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/linux-64
                          https://repo.anaconda.com/pkgs/r/noarch
          package cache : /opt/conda/pkgs
                          /home/jovyan/.conda/pkgs
       envs directories : /opt/conda/envs
                          /home/jovyan/.conda/envs
               platform : linux-64
             user-agent : conda/4.8.2 requests/2.22.0 CPython/3.7.7 Linux/4.14.203-156.332.amzn2.x86_64 debian/10 glibc/2.28
                UID:GID : 1000:100
             netrc file : None
           offline mode : False

LightGBM version or commit hash: https://github.com/microsoft/LightGBM/tree/9f70e9685dfb5c82f2ee87176a8433a6b7a4b98f

Error message and / or logs

Training with lightgbm.dask.DaskLGBMRegressor succeeds, but .predict() fails with this error:

ValueError: Metadata inference failed in `_predict_part`.

You have supplied a custom function and Dask is unable to 
determine the type of output that that function returns. 

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
TypeError('Unknown type of parameter:y, got:Series')

Traceback:
---------
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/dask/dataframe/utils.py", line 174, in raise_on_meta_error
    yield
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/dask/dataframe/core.py", line 5165, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/dask.py", line 319, in _predict_part
    **kwargs
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/sklearn.py", line 707, in predict
    pred_leaf=pred_leaf, pred_contrib=pred_contrib, **kwargs)
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/basic.py", line 3118, in predict
    predictor = self._to_predictor(deepcopy(kwargs))
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/basic.py", line 3204, in _to_predictor
    predictor = _InnerPredictor(booster_handle=self.handle, pred_parameter=pred_parameter)
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/basic.py", line 638, in __init__
    self.pred_parameter = param_dict_to_str(pred_parameter)
  File "/opt/conda/envs/saturn/lib/python3.7/site-packages/lightgbm/basic.py", line 221, in param_dict_to_str
    % (key, type(val).__name__))

Reproducible example(s)

I'll update this with a better, smaller reproducible example soon. I'm rushing right now to finish something else for work, but I wanted to document this so that search engines surface this issue for others who google that error message.

I'm training, and trying to .predict(), on a Dask DataFrame. Something like this:

import dask.dataframe as dd
import lightgbm as lgb
from dask.distributed import Client, wait

client = Client()  # the original setup used a remote (Saturn) cluster

taxi_train = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-01.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
    assume_missing=True,
).sample(frac=0.01, replace=False)

def prep_df(df: dd.DataFrame, target_col: str) -> dd.DataFrame:
    """
    Prepare a raw taxi dataframe for training.
        * computes the target ('tip_fraction')
        * adds features
        * removes unused features
    """
    numeric_feat = [
        "pickup_weekday",
        "pickup_weekofyear",
        "pickup_hour",
        "pickup_week_hour",
        "pickup_minute",
        "passenger_count",
    ]
    categorical_feat = [
        "PULocationID",
        "DOLocationID",
    ]
    features = numeric_feat + categorical_feat
    df = df[df.fare_amount > 0]  # avoid divide-by-zero
    df[target_col] = df.tip_amount / df.fare_amount

    df["pickup_weekday"] = df.tpep_pickup_datetime.dt.weekday
    df["pickup_weekofyear"] = df.tpep_pickup_datetime.dt.isocalendar().week
    df["pickup_hour"] = df.tpep_pickup_datetime.dt.hour
    df["pickup_week_hour"] = (df.pickup_weekday * 24) + df.pickup_hour
    df["pickup_minute"] = df.tpep_pickup_datetime.dt.minute
    df = df[features + [target_col]].astype(float).fillna(-1)

    return df

target_col = "tip_fraction"
taxi_train = prep_df(taxi_train, target_col)

taxi_train = taxi_train.persist()
_ = wait(taxi_train)

features = [c for c in taxi_train.columns if c != target_col]

data = taxi_train[features]
label = taxi_train[target_col]

dask_reg = lgb.dask.DaskLGBMRegressor(
    silent=False,
    max_depth=8,
    random_state=708,
    learning_rate=0.05,
    tree_learner="data",
    n_estimators=100,
    n_jobs=-1,
    categorical_feature=[6, 7]
)

dask_reg.fit(
    client=client,
    X=data,
    y=label,
)

taxi_test = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-02.csv",
    parse_dates=["tpep_pickup_datetime", "tpep_dropoff_datetime"],
    storage_options={"anon": True},
    assume_missing=True,
).sample(frac=0.01, replace=False)

taxi_test = prep_df(taxi_test, target_col=target_col)

taxi_test = taxi_test.persist()
_ = wait(taxi_test)

preds = dask_reg.predict(
    X=taxi_test[features]
)

See the output of conda env export below for versions of Dask and its dependencies.

output of 'conda env export'
name: saturn
channels:
  - https://conda.saturncloud.io/pkgs
  - conda-forge
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - argon2-cffi=20.1.0=py37h7b6447c_1
  - async_generator=1.10=py37h28b3542_0
  - attrs=20.3.0=pyhd3eb1b0_0
  - backcall=0.2.0=py_0
  - blas=1.0=mkl
  - bleach=3.2.2=pyhd3eb1b0_0
  - bokeh=2.2.3=py37_0
  - boto3=1.16.59=pyhd3eb1b0_0
  - botocore=1.19.59=pyhd3eb1b0_0
  - brotlipy=0.7.0=py37h27cfd23_1003
  - ca-certificates=2021.1.19=h06a4308_0
  - cairo=1.14.12=h8948797_3
  - certifi=2020.12.5=py37h06a4308_0
  - cffi=1.14.0=py37h2e261b9_0
  - click=7.1.2=pyhd3eb1b0_0
  - cloudpickle=1.6.0=py_0
  - cryptography=3.3.1=py37h3c74f83_0
  - cycler=0.10.0=py37_0
  - cytoolz=0.11.0=py37h7b6447c_0
  - dask-glm=0.2.0=py37_0
  - dask-ml=1.7.0=py_0
  - dbus=1.13.18=hb2f20db_0
  - decorator=4.4.2=py_0
  - defusedxml=0.6.0=py_0
  - docutils=0.15.2=py37_0
  - entrypoints=0.3=py37_0
  - expat=2.2.10=he6710b0_2
  - fastparquet=0.5.0=py37h6323ea4_1
  - fontconfig=2.13.0=h9420a91_0
  - freetype=2.10.4=h5ab3b9f_0
  - fribidi=1.0.10=h7b6447c_0
  - fsspec=0.8.3=py_0
  - glib=2.63.1=h5a9c865_0
  - graphite2=1.3.14=h23475e2_0
  - graphviz=2.40.1=h21bd128_2
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48_1
  - h5py=2.10.0=py37hd6299e0_1
  - harfbuzz=1.8.8=hffaf4a1_0
  - hdf5=1.10.6=hb1b8bf9_0
  - heapdict=1.0.1=py_0
  - icu=58.2=he6710b0_3
  - importlib-metadata=2.0.0=py_1
  - importlib_metadata=2.0.0=1
  - intel-openmp=2020.2=254
  - ipykernel=5.3.4=py37h5ca1d4c_0
  - ipython=7.19.0=py37hb070fc8_0
  - ipython_genutils=0.2.0=pyhd3eb1b0_1
  - ipywidgets=7.6.3=pyhd3eb1b0_1
  - jedi=0.18.0=py37h06a4308_1
  - jinja2=2.11.2=pyhd3eb1b0_0
  - jmespath=0.10.0=py_0
  - joblib=1.0.0=pyhd3eb1b0_0
  - jpeg=9b=h024ee3a_2
  - jsonschema=3.2.0=py_2
  - jupyter_client=6.1.7=py_0
  - jupyter_core=4.7.0=py37h06a4308_0
  - jupyterlab_pygments=0.1.2=py_0
  - jupyterlab_widgets=1.0.0=pyhd3eb1b0_1
  - kiwisolver=1.3.0=py37h2531618_0
  - lcms2=2.11=h396b838_0
  - ld_impl_linux-64=2.33.1=h53a641e_7
  - libedit=3.1.20191231=h14c3975_1
  - libffi=3.2.1=hf484d3e_1007
  - libgcc-ng=9.1.0=hdf63c60_0
  - libgfortran-ng=7.3.0=hdf63c60_0
  - libllvm10=10.0.1=hbcb73fb_5
  - libpng=1.6.37=hbc83047_0
  - libsodium=1.0.18=h7b6447c_0
  - libstdcxx-ng=9.1.0=hdf63c60_0
  - libtiff=4.1.0=h2733197_1
  - libuuid=1.0.3=h1bed415_2
  - libxcb=1.14=h7b6447c_0
  - libxml2=2.9.10=hb55368b_3
  - llvmlite=0.34.0=py37h269e1b5_4
  - locket=0.2.1=py37h06a4308_1
  - lz4-c=1.9.3=h2531618_0
  - markupsafe=1.1.1=py37h14c3975_1
  - matplotlib=3.3.2=h06a4308_0
  - matplotlib-base=3.3.2=py37h817c723_0
  - mistune=0.8.4=py37h14c3975_1001
  - mkl=2020.2=256
  - mkl-service=2.3.0=py37he8ac12f_0
  - mkl_fft=1.2.0=py37h23d657b_0
  - mkl_random=1.1.1=py37h0573a6f_0
  - msgpack-python=1.0.1=py37hff7bd54_0
  - multipledispatch=0.6.0=py37_0
  - nbclient=0.5.1=py_0
  - nbconvert=6.0.7=py37_0
  - nbformat=5.1.2=pyhd3eb1b0_1
  - ncurses=6.2=he6710b0_1
  - nest-asyncio=1.4.3=pyhd3eb1b0_0
  - notebook=6.2.0=py37h06a4308_0
  - numba=0.51.2=py37h04863e7_1
  - numpy=1.19.2=py37h54aff64_0
  - numpy-base=1.19.2=py37hfa32c7d_0
  - olefile=0.46=py37_0
  - openssl=1.1.1i=h27cfd23_0
  - packaging=20.8=pyhd3eb1b0_0
  - pandas=1.1.0=py37he6710b0_0
  - pandoc=2.11=hb0f4dca_0
  - pandocfilters=1.4.3=py37h06a4308_1
  - pango=1.42.4=h049681c_0
  - parso=0.8.1=pyhd3eb1b0_0
  - partd=1.1.0=py_0
  - pcre=8.44=he6710b0_0
  - pexpect=4.8.0=pyhd3eb1b0_3
  - pickleshare=0.7.5=pyhd3eb1b0_1003
  - pillow=8.1.0=py37he98fc37_0
  - pip=20.3.3=py37h06a4308_0
  - pixman=0.40.0=h7b6447c_0
  - prometheus_client=0.9.0=pyhd3eb1b0_0
  - prompt-toolkit=3.0.8=py_0
  - psutil=5.7.2=py37h7b6447c_0
  - ptyprocess=0.7.0=pyhd3eb1b0_2
  - pycparser=2.20=py_2
  - pygments=2.7.4=pyhd3eb1b0_0
  - pyopenssl=20.0.1=pyhd3eb1b0_1
  - pyparsing=2.4.7=pyhd3eb1b0_0
  - pyqt=5.9.2=py37h05f1152_2
  - pyrsistent=0.17.3=py37h7b6447c_0
  - pysocks=1.7.1=py37_1
  - python=3.7.7=hcf32534_0_cpython
  - python-dateutil=2.8.1=py_0
  - pytz=2020.5=pyhd3eb1b0_0
  - pyyaml=5.4.1=py37h27cfd23_1
  - pyzmq=20.0.0=py37h2531618_1
  - qt=5.9.7=h5867ecd_1
  - readline=8.0=h7b6447c_0
  - s3fs=0.4.2=py_0
  - s3transfer=0.3.4=pyhd3eb1b0_0
  - scikit-learn=0.23.2=py37h0573a6f_0
  - scipy=1.5.2=py37h0b6359f_0
  - send2trash=1.5.0=pyhd3eb1b0_1
  - setuptools=52.0.0=py37h06a4308_0
  - sip=4.19.8=py37hf484d3e_0
  - six=1.15.0=py37h06a4308_0
  - sortedcontainers=2.3.0=pyhd3eb1b0_0
  - sqlite=3.33.0=h62c20be_0
  - tbb=2020.3=hfd86e86_0
  - tblib=1.7.0=py_0
  - terminado=0.9.2=py37h06a4308_0
  - testpath=0.4.4=py_0
  - threadpoolctl=2.1.0=pyh5ca1d4c_0
  - thrift=0.11.0=py37hf484d3e_0
  - tk=8.6.10=hbc83047_0
  - toolz=0.11.1=pyhd3eb1b0_0
  - tornado=6.1=py37h27cfd23_0
  - traitlets=5.0.5=py_0
  - typing_extensions=3.7.4.3=py_0
  - urllib3=1.25.11=py_0
  - wcwidth=0.2.5=py_0
  - webencodings=0.5.1=py37_1
  - wheel=0.36.2=pyhd3eb1b0_0
  - widgetsnbextension=3.5.1=py37_0
  - xz=5.2.5=h7b6447c_0
  - yaml=0.2.5=h7b6447c_0
  - zeromq=4.3.3=he6710b0_3
  - zict=2.0.0=py_0
  - zipp=3.4.0=pyhd3eb1b0_0
  - zlib=1.2.11=h7b6447c_3
  - zstd=1.4.5=h9ceee32_0
  - pip:
    - chardet==4.0.0
    - dask==2021.1.1
    - dask-saturn==0.2.2
    - distributed==2021.1.1
    - idna==2.10
    - lightgbm==3.1.1.99
    - requests==2.25.1
prefix: /opt/conda/envs/saturn

References

I think that changing the uses of map_blocks() and map_partitions() based on this description from the Dask docs could fix this issue.

meta

An empty pd.DataFrame or pd.Series that matches the dtypes and column names of the output. This metadata is necessary for many algorithms in dask dataframe to work. For ease of use, some alternative inputs are also available. Instead of a DataFrame, a dict of {name: dtype} or iterable of (name, dtype) can be provided (note that the order of the names should match the order of the columns). Instead of a series, a tuple of (name, dtype) can be used. If not provided, dask will try to infer the metadata. This may lead to unexpected results, so providing meta is recommended. For more information, see dask.dataframe.utils.make_meta.
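As a sketch of what supplying `meta` could look like here (the `map_partitions()` call in the comment below is illustrative, not the actual fix):

```python
import numpy as np
import pandas as pd

# `meta` describes only the schema (container type and dtypes) of the mapped
# function's output, not its data. For a function that returns one float
# prediction per row, an empty float64 Series is a valid `meta`:
meta = pd.Series(dtype=np.float64)

# A hypothetical call (assuming `data` is a Dask DataFrame and `_predict_part`
# returns a pandas Series) would then look like:
#
#     data.map_partitions(_predict_part, model=model, meta=meta).values
#
# With `meta` given, Dask skips metadata inference entirely, so the emulation
# step that raises "Metadata inference failed" never runs.
print(meta.dtype)  # float64
```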

But I'm confused and concerned about this error showing up, since it does not show up in any of the tests in https://github.com/microsoft/LightGBM/blob/9f70e9685dfb5c82f2ee87176a8433a6b7a4b98f/tests/python_package_test/test_dask.py, and we test against Dask DataFrame inputs there.

For anyone new to LightGBM looking to help with this before I get to it, here's the place where we're using _predict_part() with map_partitions():

if isinstance(data, dd._Frame):
    return data.map_partitions(
        _predict_part,
        model=model,
        raw_score=raw_score,
        pred_proba=pred_proba,
        pred_leaf=pred_leaf,
        pred_contrib=pred_contrib,
        **kwargs
    ).values

@jameslamb (Collaborator, Author)

I marked this "good first issue" only because I think that for someone who's experienced with Dask, they might be able to fix this without needing too much LightGBM knowledge.

@jmoralez (Collaborator)

Hi, James. There's a .values at the end of that map_partitions() call.

@jameslamb (Collaborator, Author)

😱😱😱 Good eye! I think that behavior is inconsistent and should change, but it still doesn't explain the bug, right? Because if that method returned a Dask DataFrame, presumably you'd get this same error when calling .compute() on the result.

@jmoralez (Collaborator)

Quick question: is taxi_train != taxi_test? I saw at the start it's already defined as taxi_test, so I couldn't reproduce. I actually got the output array correctly, and was distracted from the fact that you got an error because I focused on the output being an array when I expected a DataFrame, haha.

@jameslamb (Collaborator, Author)

is taxi_train != taxi_test?

I was going too fast, sorry. This was a very hastily written issue, and it needs a better reproducible example when I can get to it. I just edited it to define taxi_train correctly.

@jameslamb (Collaborator, Author)

I think that behavior is inconsistent and should change

I take it back; now I remember why there's a .values there. It's so that .predict() in the Dask interface always returns a Dask Array regardless of input type, just like .predict() in the sklearn interface always returns a numpy array.

import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor

reg = LGBMRegressor()

num_features = 20
num_rows = 1000

X = pd.DataFrame({
    "col" + str(i): np.random.random(num_rows)
    for i in range(num_features)
})

y = np.random.random(num_rows)
reg.fit(X, y)

preds = reg.predict(X)

print(f"input type: {type(X)}, \npred type: {type(preds)}")
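The Dask-side `.values` mirrors the pandas/NumPy relationship: a pandas Series' `.values` is a numpy ndarray, and a Dask Series' `.values` is likewise a Dask Array. A pandas-only sketch of that conversion (data values are made up):

```python
import numpy as np
import pandas as pd

preds_series = pd.Series([0.1, 0.2, 0.3], name="prediction")

# `.values` drops the index and returns the underlying ndarray, which is why
# appending it to the map_partitions(...) result yields an array-like output
# instead of a DataFrame/Series.
preds_array = preds_series.values

print(type(preds_array))  # <class 'numpy.ndarray'>
```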

@jmoralez (Collaborator)

That makes sense. I couldn't reproduce the error, though; I tried on both local and remote clusters.

@jameslamb (Collaborator, Author)

I see some issues that suggest that this error might happen when an input contains NaNs:

I've also found the place where this happens (I think). There are calls inside DataFrame.map_partitions() that try to figure out the metadata of the return value of the function being mapped.

This is the internal function in Dask that raises the error in the original post here: https://github.com/dask/dask/blob/e54976954a4e983493923769c31d89b68e72fd6f/dask/dataframe/utils.py#L157

I'll try soon to create a clean reproducible example. I believe I know how to fix this, but without that repro we won't be able to test a fix.

@jmoralez (Collaborator) commented Feb 3, 2021

Not sure if it's entirely related, but .predict() also fails if there are categorical columns.

import dask
import lightgbm as lgb
from dask.distributed import Client

client = Client()
dtypes = {
    'name': 'category',
    'id': int,
    'x': float,
    'y': float
}

ddf = dask.datasets.timeseries(freq='1H', dtypes=dtypes)
X, y = ddf.drop(columns='y'), ddf.y
reg = lgb.dask.DaskLGBMRegressor().fit(X, y)
reg.predict(X)

This raises:
ValueError: Metadata inference failed in `_predict_part`.                
                                    
You have supplied a custom function and Dask is unable to                
determine the type of output that that function returns.                 
                                                                         
To resolve this please provide a meta= keyword.                          
The docstring of the Dask function you ran should have more information.                                                                          
                                                                         
Original error is below:                                                                                                                          
------------------------                                                 
ValueError("could not convert string to float: 'Alice'")                 
                                    
Traceback:                                                               
---------
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/dask/dataframe/utils.py", line 167, in raise_on_meta_error
    yield
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/dask/dataframe/core.py", line 5310, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/lightgbm/dask.py", line 352, in _predict_part
    **kwargs
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/lightgbm/sklearn.py", line 697, in predict
    X = _LGBMCheckArray(X, accept_sparse=True, force_all_finite=False)
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/sklearn/utils/validation.py", line 616, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/home/josemz/programs/anaconda3/envs/lightgbm/lib/python3.7/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
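The last two frames show the root cause: sklearn's check_array calls np.asarray(..., dtype=float) on the whole frame, and a string-backed category column cannot be coerced to float. A minimal pandas/NumPy reproduction of just that coercion (column names and values here are made up to match the example above):

```python
import numpy as np
import pandas as pd

# A frame with a string-backed category column, like the 'name' column above.
df = pd.DataFrame({
    "x": [0.5, 1.5],
    "name": pd.Series(["Alice", "Bob"], dtype="category"),
})

# This mimics what sklearn's check_array does internally: coerce everything
# to a single float ndarray. The string categories make that impossible.
try:
    np.asarray(df, dtype=float)
    failed = False
except ValueError as err:
    failed = True
    message = str(err)

assert failed and "could not convert string to float" in message
```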

If I do:

Xc = X.compute()
reg.to_local().predict(Xc)

It works as expected. I use categoricals a lot, so I would really like to see this work, and I'd like to work on this. Do you think this should be a separate issue, or is it related?

@jameslamb (Collaborator, Author) commented Feb 3, 2021

AH @jmoralez !!!! Maybe you found the secret to reproducing this!

@jameslamb (Collaborator, Author) commented Feb 3, 2021

Thank you for the nice reproducible example; this could be the issue!

I'd like to work on this

I'd love it if you could fix this. Do you have time to work on it over the next few days? Sorry for the rush, but this is one of the issues I want to fix before we do a 3.2.0 release of lightgbm: #3872 (comment). If you're not comfortable committing to working on this over the next few days, I'll make it my next priority and start on it tomorrow. If you are comfortable with the time constraints, then this is all yours.

@jmoralez (Collaborator) commented Feb 3, 2021

Oh, I meant it as: if no one's taking it, I'd like to check it out, haha. I'm not sure I'd be able to pull it off; I'd prefer to help you with findings and discussion.

@jameslamb (Collaborator, Author)

Haha ok, thanks! I actually get some dedicated time to work on LightGBM at work...so how about I try this tomorrow and open a draft PR, and maybe I'll @ you for a review and other ideas?

Now that you found a small reproducible example, it should go quickly.

jameslamb self-assigned this Feb 3, 2021
@jameslamb (Collaborator, Author) commented Feb 4, 2021

Alright, I think I have a fix for this in #3908. I wanted to post some more debugging information here.

Thanks to your huge help discovering that the category column was the issue, @jmoralez, I came up with the reproducible example below. I wanted something a little lower-level than dask.datasets.timeseries, just so we could more easily change it while experimenting.

```python
import dask
import dask.array as da
import dask.dataframe as dd
import pandas as pd
import numpy as np
import lightgbm as lgb
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=3)
client = Client(cluster)
client

def _create_data() -> pd.DataFrame:
    num_rows = 1000
    return pd.DataFrame({
        "float_col1": pd.Series(np.random.random(num_rows), dtype="float"),
        "float_col2": pd.Series(np.random.random(num_rows), dtype="float"),
        "cat_col": pd.Series(np.random.choice(["a", "b", "y", "z"], num_rows), dtype="category"),
    })

parts = [dask.delayed(_create_data)() for _ in range(5)]
ddf = dd.from_delayed(
    parts,
    meta={
        "float_col1": "float",
        "float_col2": "float",
        "cat_col": "category"
    }
)

label = da.random.random((5000, 1), (1000, 1)).to_dask_dataframe()[0]
reg = lgb.DaskLGBMRegressor()

reg.fit(X=ddf, y=label)

# this will fail
preds = reg.predict(ddf)
preds.compute()
```

.predict() will fail with the error message in #3861 (comment). However, I noticed something really important deeper in the stack trace:

  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py", line 598, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '__UNKNOWN_CATEGORIES__'

That comes from logic in dask.DataFrame.map_partitions() that tries to execute the mapped function on a small amount of stand-in data to check whether it works.
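A rough, simplified sketch of that inference step (not Dask's real implementation; Dask also substitutes a non-empty stand-in frame, which is where the '__UNKNOWN_CATEGORIES__' placeholder strings come from):

```python
import pandas as pd

def emulate_meta(func, pandas_frame):
    # Simplified stand-in for Dask's internal _emulate: run the mapped
    # function on a zero-row frame with the same schema, then inspect the
    # result's type and dtypes to infer the output metadata.
    return func(pandas_frame.head(0))

frame = pd.DataFrame({"a": [1.0], "b": [2.0]})
meta = emulate_meta(lambda d: d.sum(axis=1), frame)

# If `func` raises on the stand-in frame (as _predict_part does when a
# category column holds placeholder strings), inference fails and Dask
# surfaces the "Metadata inference failed" ValueError seen above.
print(type(meta))  # <class 'pandas.core.series.Series'>
```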

full error log
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/utils.py", line 167, in raise_on_meta_error
    yield
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 5268, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/LightGBM/python-package/lightgbm/dask.py", line 366, in _predict_part
    result = model.predict(
  File "/opt/LightGBM/python-package/lightgbm/sklearn.py", line 697, in predict
    X = _LGBMCheckArray(X, accept_sparse=True, force_all_finite=False)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py", line 598, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '__UNKNOWN_CATEGORIES__'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/LightGBM/python-package/lightgbm/dask.py", line 728, in predict
    return _predict(
  File "/opt/LightGBM/python-package/lightgbm/dask.py", line 426, in _predict
    return data.map_partitions(
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 652, in map_partitions
    return map_partitions(func, self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 5318, in map_partitions
    meta = _emulate(func, *args, udf=True, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 5268, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/conda/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/utils.py", line 188, in raise_on_meta_error
    raise ValueError(msg) from e
ValueError: Metadata inference failed in `_predict_part`.

You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
ValueError("could not convert string to float: '__UNKNOWN_CATEGORIES__'")

Traceback:
---------
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/utils.py", line 167, in raise_on_meta_error
    yield
  File "/opt/conda/lib/python3.8/site-packages/dask/dataframe/core.py", line 5268, in _emulate
    return func(*_extract_meta(args, True), **_extract_meta(kwargs, True))
  File "/opt/LightGBM/python-package/lightgbm/dask.py", line 366, in _predict_part
    result = model.predict(
  File "/opt/LightGBM/python-package/lightgbm/sklearn.py", line 697, in predict
    X = _LGBMCheckArray(X, accept_sparse=True, force_all_finite=False)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/validation.py", line 598, in check_array
    array = np.asarray(array, order=order, dtype=dtype)
  File "/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py", line 85, in asarray
    return array(a, dtype, copy=False, order=order)

StrikerRUS added a commit that referenced this issue Feb 6, 2021
…3908)

* add support for pandas categorical columns

* remove commented code

* quotes

* syntax error

* fix shape for ranker test

* Apply suggestions from code review

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

* Update tests/python_package_test/test_dask.py

* trying

* fix tests

* remove unnecessary debugging stuff

* skip accuracy checks on categorical

* use category columns as categorical features

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@github-actions (bot)

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked this as resolved and limited conversation to collaborators Aug 23, 2023