Ensure that training and testing data align #32
Thanks for opening up this issue @mrocklin. Below is an example to reproduce the issue:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
from dask_xgboost import XGBClassifier

client = Client(processes=True,
                n_workers=2,
                threads_per_worker=1,
                memory_limit='3GB')

# Create dataset
np.random.seed(2)
a = np.random.rand(100, 10)
df = pd.DataFrame(a, columns=[f'feature_{i}' for i in range(a.shape[1])])

# Chunk X and y differently so their partitions are misaligned
X = dd.from_pandas(df.iloc[:, :-1], chunksize=50)
y = dd.from_pandas(df.iloc[:, -1], chunksize=51)

# Print out the length of each partition for X and y
print(X.map_partitions(len).compute())
print(y.map_partitions(len).compute())

# Fit dask-xgboost classifier
clf = XGBClassifier()
clf.fit(X, y)
```

Running this example will output partition lengths of 50 and 50 for the X partitions, but 51 and 49 for the y partitions. The fit then fails with a non-informative error.
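Since X and y come from the same pandas DataFrame here, one way to make this example pass (a sketch, not part of the original thread) is to give both the same chunksize, or to repartition y onto X's divisions before fitting:

```python
# Option 1: use the same chunksize so the partitions match from the start
y = dd.from_pandas(df.iloc[:, -1], chunksize=50)

# Option 2: repartition y onto X's divisions (both have known divisions
# because they were created with dd.from_pandas)
y = y.repartition(divisions=X.divisions)

assert X.divisions == y.divisions
clf.fit(X, y)
```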
I'm in favor of this idea. I'll open up a PR with a proposal to balance the partitions for input data.
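A rough sketch of the kind of balancing such a proposal could perform (a hypothetical `balance_partitions` helper; the actual PR may take a different approach):

```python
def balance_partitions(X, y):
    # Hypothetical helper: repartition y so its partition boundaries
    # match X's. Assumes both collections share the same index and
    # have known divisions (true for dd.from_pandas output).
    if X.known_divisions and y.known_divisions and X.divisions != y.divisions:
        y = y.repartition(divisions=X.divisions)
    return X, y

X, y = balance_partitions(X, y)
clf.fit(X, y)
```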
Is there a workaround that will ensure training and testing data align? I am reading several CSVs into Dask DataFrames. I confirmed that the lengths are the same: 502732. Then I run this:

Then I get this error:
Currently there is no easy workaround. It would be good to provide one though. I would think that this work would happen upstream in dask.dataframe, but I don't have a concrete plan here. Help would be welcome.
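One pattern that sidesteps the problem when the features and labels live in the same CSVs (a sketch with a hypothetical file pattern and label column, not from the thread): select both from a single `dd.read_csv` result, so X and y inherit identical partitioning by construction:

```python
import dask.dataframe as dd

# Hypothetical file pattern and label column for illustration
df = dd.read_csv("data/part-*.csv")

# Column selections on the same collection preserve partitioning,
# so every partition of X lines up row-for-row with y
X = df[[c for c in df.columns if c != "label"]]
y = df["label"]

print(X.map_partitions(len).compute().tolist())
print(y.map_partitions(len).compute().tolist())  # identical lengths
```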
Currently, if you provide training and testing data that have the same number of partitions but a different number of rows per partition, the user will get a non-informative error.
Given that we need to have all the data in memory anyway, we could just fix this for the user and balance partitions for them.
cc @jrbourbeau
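Until such a fix lands, a user-side check along these lines (a minimal sketch, assuming X and y are dask collections as in the example above) at least turns the failure into an informative one:

```python
def check_partitions_aligned(X, y):
    # Compare per-partition row counts up front and fail with a clear
    # message, instead of letting the training step raise a cryptic error
    x_lens = X.map_partitions(len).compute().tolist()
    y_lens = y.map_partitions(len).compute().tolist()
    if x_lens != y_lens:
        raise ValueError(
            f"Misaligned partitions: X has {x_lens} rows per partition, "
            f"y has {y_lens}"
        )

check_partitions_aligned(X, y)
```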