
dask_cudf: scikit-learn API leads to impossible train-time: ValueError("feature_names mismatch #6268

Closed
pseudotensor opened this issue Oct 21, 2020 · 12 comments · Fixed by #6472
@pseudotensor (Contributor)

Same setup as here: #6232 (comment)

When running the same dataset, but using dask_cudf with only one column, I keep hitting this error (it happens only with dask_cudf):

task [xgboost.dask]:tcp://127.0.0.1:41213 connected to the tracker
task [xgboost.dask]:tcp://127.0.0.1:34039 connected to the tracker
task [xgboost.dask]:tcp://127.0.0.1:41213 got new rank 0
task [xgboost.dask]:tcp://127.0.0.1:34039 got new rank 1
worker tcp://127.0.0.1:34039 has an empty DMatrix.  All workers associated with this DMatrix: {'tcp://127.0.0.1:41213'}
worker tcp://127.0.0.1:41213 has an empty DMatrix.  All workers associated with this DMatrix: {'tcp://127.0.0.1:34039'}
distributed.worker - WARNING -  Compute Failed
Function:  dispatched_train
args:      ('tcp://127.0.0.1:41213')
kwargs:    {}
Exception: ValueError("feature_names mismatch: ['AGE'] ['f0']\nexpected AGE in input data\ntraining data did not have the following fields: f0",)

With the scikit-learn API, this should be an impossible situation to get into during training, yet it always happens.
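For context on the two name sets in that message: xgboost takes feature names from a DataFrame's column labels, while array-like input falls back to generated 'f0'-style defaults, so a mismatch like ['AGE'] vs ['f0'] suggests one side lost its labels along the way. A minimal local sketch (plain DMatrix, no dask involved):

import numpy as np
import pandas as pd
import xgboost as xgb

# From a DataFrame, feature names come from the column labels.
d_df = xgb.DMatrix(pd.DataFrame({"AGE": [51.0, 30.0]}), label=[0, 1])
print(d_df.feature_names)  # ['AGE']

# A plain array carries no labels, so xgboost falls back to generated
# 'f0'-style defaults (whether the property reports them or None
# varies across versions).
d_np = xgb.DMatrix(np.array([[51.0], [30.0]]), label=[0, 1])
print(d_np.feature_names)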

However, trying to reproduce this outside our application does not lead to the same error:

import pandas as pd


def fun():
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:

            import xgboost as xgb
            import dask_cudf

            # Reduce the dataset to a single feature column plus the target.
            target = "default payment next month"
            Xpd = pd.read_csv("creditcard.csv")
            Xpd = Xpd[['AGE', target]]
            # Note: to_csv writes the index by default, so the re-read
            # frame gains an extra unnamed leading column.
            Xpd.to_csv("creditcard_1.csv")

            X = dask_cudf.read_csv("creditcard_1.csv")
            y = X[target]
            X = X.drop(target, axis=1)

            kwargs_fit = {}
            kwargs_cudf_fit = kwargs_fit.copy()

            # Use the same file as the evaluation set.
            valid_X = dask_cudf.read_csv("creditcard_1.csv")
            valid_y = valid_X[target]
            valid_X = valid_X.drop(target, axis=1)
            kwargs_cudf_fit['eval_set'] = [(valid_X, valid_y)]

            params = {}  # copy.deepcopy(self.model.get_params())
            params['tree_method'] = 'gpu_hist'

            dask_model = xgb.dask.DaskXGBClassifier(**params)
            dask_model.fit(X, y, eval_set=kwargs_cudf_fit.get('eval_set'),
                           sample_weight_eval_set=kwargs_cudf_fit.get('sample_weight_eval_set'),
                           verbose=True)


if __name__ == '__main__':
    fun()

This gives:

task [xgboost.dask]:tcp://127.0.0.1:41361 connected to the tracker
task [xgboost.dask]:tcp://127.0.0.1:40851 connected to the tracker
task [xgboost.dask]:tcp://127.0.0.1:41361 got new rank 0
task [xgboost.dask]:tcp://127.0.0.1:40851 got new rank 1
worker tcp://127.0.0.1:41361 has an empty DMatrix.  All workers associated with this DMatrix: {'tcp://127.0.0.1:40851'}
worker tcp://127.0.0.1:41361 has an empty DMatrix.  All workers associated with this DMatrix: {'tcp://127.0.0.1:40851'}
/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py:4773: RuntimeWarning: coroutine 'Client._close' was never awaited
  c.close(timeout=2)
/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py:4773: RuntimeWarning: coroutine 'Client._close' was never awaited
  c.close(timeout=2)

However, I'm still reporting this, since it should clearly be impossible with the sklearn API we are using.

Inside the application where this example is used, we have many more imports, so there may be some conflict, similar to the "dill" issue I posted before. But this seems more relevant to xgboost proper.

While I try to find an MRE, do you have any advice or thoughts?

Also, this only happens on a multi-GPU machine. The exact same code in our application runs through just fine with dask_cudf on a single-GPU machine.
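If it helps narrow things down, the multi-worker aspect can be emulated on any machine with a plain LocalCluster and CPU hist; a sketch (untested against this exact bug, and assuming the creditcard_1.csv written by the script above):

import pandas as pd
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client, LocalCluster


def repro_cpu():
    # Two CPU workers stand in for the two GPUs; tree_method='hist'
    # exercises the same dask feature-name handling without CUDA.
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            target = "default payment next month"
            pdf = pd.read_csv("creditcard_1.csv")
            X = dd.from_pandas(pdf[['AGE']], npartitions=2)
            y = dd.from_pandas(pdf[target], npartitions=2)
            model = xgb.dask.DaskXGBClassifier(tree_method='hist')
            model.fit(X, y, eval_set=[(X, y)], verbose=True)


if __name__ == '__main__':
    repro_cpu()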

@teju85

@pseudotensor (Contributor Author)

FYI this problem does not occur if I just call:

dask_model.fit(X, y, verbose=True)

So it seems to be a problem with the eval_set.

However, I confirmed that, right before calling fit with eval_set, the dask_cudf frame's .columns contains the expected 'AGE' column.
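One way to double-check what each partition actually carries once materialised (a hypothetical diagnostic; for dask_cudf, build the report frame with cudf instead of pandas):

import pandas as pd

# Report the column labels and row count seen inside every partition.
def describe_partition(df):
    return pd.DataFrame({"cols": [",".join(map(str, df.columns))],
                         "rows": [len(df)]})

print(X.map_partitions(describe_partition).compute())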

@pseudotensor (Contributor Author) commented Oct 21, 2020

The other odd message is that the DMatrix is empty, and this shows up in the attempted MRE example above as well.

Neither the training data nor the eval_set is an empty frame.

Perhaps some of the early-stopping PRs submitted for 1.3.0 already fixed this kind of problem?

@pseudotensor (Contributor Author) commented Oct 21, 2020

As an aside, related to using dask_cudf, the same example sometimes fails in other ways:

Exception in thread Thread-437:
Traceback (most recent call last):
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jon/minicondadai/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/tracker.py", line 326, in run
    self.accept_slaves(nslave)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/tracker.py", line 271, in accept_slaves
    if s.cmd == 'print':
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/tracker.py", line 271, in accept_slaves
    if s.cmd == 'print':
  File "/opt/pycharm-community-2017.2.3/helpers/pydev/_pydevd_bundle/pydevd_comm.py", line 399, in process_command
    self.process_net_command(self.global_debugger_holder.global_dbg, cmd_id, seq, text)
AttributeError: 'function' object has no attribute 'process_net_command'

Socket RecvAll Error:Connection reset by peer, shutting down process

As well as:

 Traceback (most recent call last):
   File "/data/jon/miniconda3/models.py", line 100, in cudf_fit
     sample_weight_eval_set=kwargs_cudf_fit.get('sample_weight_eval_set'), verbose=True)
   File "/home/jon/miniconda3/lib/python3.6/site-packages/xgboost/dask.py", line 828, in fit
     evals=evals, verbose_eval=verbose)
   File "/home/jon/miniconda3/lib/python3.6/site-packages/xgboost/dask.py", line 459, in train
     workers=workers)
   File "/home/jon/miniconda3/lib/python3.6/site-packages/distributed/client.py", line 1779, in map
     actors=actor,
   File "/home/jon/miniconda3/lib/python3.6/site-packages/distributed/client.py", line 2590, in _graph_to_futures
     "tasks": valmap(dumps_task, dsk3),
   File "cytoolz/dicttoolz.pyx", line 181, in cytoolz.dicttoolz.valmap
   File "cytoolz/dicttoolz.pyx", line 206, in cytoolz.dicttoolz.valmap
   File "/home/jon/miniconda3/lib/python3.6/site-packages/distributed/worker.py", line 3354, in dumps_task
     return {"function": dumps_function(task[0]), "args": warn_dumps(task[1:])}
   File "/home/jon/miniconda3/lib/python3.6/site-packages/distributed/worker.py", line 3318, in dumps_function
     result = pickle.dumps(func)
   File "/home/jon/miniconda3/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 51, in dumps
     result = cloudpickle.dumps(x, **dump_kwargs)
   File "/home/jon/miniconda3/lib/python3.6/site-packages/cloudpickle/cloudpickle_fast.py", line 102, in dumps
     cp.dump(obj)
   File "/home/jon/miniconda3/lib/python3.6/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
     return Pickler.dump(self, obj)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 409, in dump
     self.save(obj)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/site-packages/cloudpickle/cloudpickle_fast.py", line 745, in save_function
     *self._dynamic_function_reduce(obj), obj=obj
   File "/home/jon/miniconda3/lib/python3.6/site-packages/cloudpickle/cloudpickle_fast.py", line 687, in _save_reduce_pickle5
     save(state)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 739, in save_tuple
     save(element)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 824, in save_dict
     self._batch_setitems(obj.items())
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 850, in _batch_setitems
     save(v)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 824, in save_dict
     self._batch_setitems(obj.items())
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 850, in _batch_setitems
     save(v)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 524, in save
     self.save_reduce(obj=obj, *rv)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 637, in save_reduce
     save(state)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 824, in save_dict
     self._batch_setitems(obj.items())
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 850, in _batch_setitems
     save(v)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 524, in save
     self.save_reduce(obj=obj, *rv)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 637, in save_reduce
     save(state)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 824, in save_dict
     self._batch_setitems(obj.items())
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 850, in _batch_setitems
     save(v)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 824, in save_dict
     self._batch_setitems(obj.items())
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 850, in _batch_setitems
     save(v)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 524, in save
     self.save_reduce(obj=obj, *rv)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 637, in save_reduce
     save(state)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 824, in save_dict
     self._batch_setitems(obj.items())
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 850, in _batch_setitems
     save(v)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 784, in save_list
     self._batch_appends(obj)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 811, in _batch_appends
     save(tmp[0])
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 524, in save
     self.save_reduce(obj=obj, *rv)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 637, in save_reduce
     save(state)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 476, in save
     f(self, obj) # Call unbound method with explicit self
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 824, in save_dict
     self._batch_setitems(obj.items())
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 850, in _batch_setitems
     save(v)
   File "/home/jon/miniconda3/lib/python3.6/pickle.py", line 497, in save
     rv = reduce(self.proto)
 TypeError: can't pickle _thread.RLock objects

The latter pickle issue was stranger. It was as if loggers from global scope were being pickled (i.e., when the dask cache was not found and dask fell back to pickling things). I traced through it in debug mode and saw it happening, but I didn't understand why dask would be doing that.

@pseudotensor (Contributor Author)

Also perhaps relevant: the empty DMatrix message is wrong, in that if one asks for the shape of the computed frame, it is some number of rows by one column, yet the _meta attribute suggests it is an empty dataframe with one column. It still has 'AGE' there; I'm just saying the claimed emptiness is wrong and could impact something.
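For what it's worth, dask's _meta is an empty placeholder frame by design: it records only the schema, never any rows, so an empty-looking _meta does not by itself mean the data is empty. A minimal sketch:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"AGE": [51.0, 30.0, 33.0]})
ddf = dd.from_pandas(pdf, npartitions=2)

print(ddf._meta)  # zero rows, but the 'AGE' column is present: schema only
print(len(ddf))   # 3 -- the actual data is not empty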

@trivialfis (Member)

> Also perhaps relevant, the empty dmatrix message is wrong in that if one asks the shape of the computed frame

The empty DMatrix just means the data is empty on that specific worker. Dask does not balance the dataset among workers perfectly, so some of them can be starved. On the latest xgboost, with regression/classification models, you can safely ignore the warning if you don't care about performance for the moment (balanced is better). With ranking/survival models the warning is real.
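One way to see (and even out) how the partitions land on the workers; a sketch, assuming X is a dask collection and client is the active Client:

from collections import Counter
from distributed import futures_of, wait

X = X.persist()
parts = futures_of(X)
wait(parts)

# Count how many partitions each worker ended up holding.
holders = client.who_has(parts)
print(Counter(w for ws in holders.values() for w in ws))

# Optionally spread the partitions more evenly before training.
client.rebalance(parts)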

@trivialfis (Member) commented Oct 22, 2020

The mismatched feature name is new to me. I will try to reproduce it on my end.

> AttributeError: 'function' object has no attribute 'process_net_command'

Looks like something is messing with Python reflection. :-( Any chance you can get the error without PyCharm?

As for the pickling, I think I need an MRE. Sometimes it's hard to reason about "why is dask pickling this and that".

Thanks for reporting the errors; this will help smooth the user experience. I understand your frustration, but could you please break the issue into separate ones? It's difficult to track with the comments mixed together.

@pseudotensor (Contributor Author)

Yes, I will break these into separate issues once I have some kind of MRE. For now I'm keeping everything in the same issue because the problems may be related, which could help in finding an MRE.

@pseudotensor (Contributor Author)

Actually, I hit the same error randomly with more than one feature:

2020-10-22 17:14:03,612 C: NA  D:  NA    M:  NA    NODE:SERVER      31536  PDEBUG | (ValueError('Long error message', "feature_names mismatch: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24', 'f25', 'f26', 'f27', 'f28', 'f29', 'f30', 'f31', 'f32', 'f33', 'f34', 'f35', 'f36', 'f37', 'f38', 'f39', 'f40', 'f41', 'f42', 'f43', 'f44', 'f45', 'f46', 'f47', 'f48', 'f49', 'f50', 'f51', 'f52', 'f53', 'f54', 'f55', 'f56', 'f57', 'f58', 'f59', 'f60', 'f61', 'f62', 'f63', 'f64', 'f65', 'f66', 'f67', 'f68', 'f69', 'f70', 'f71', 'f72', 'f73', 'f74', 'f75', 'f76', 'f77', 'f78', 'f79', 'f80', 'f81', 'f82', 'f83', 'f84', 'f85', 'f86', 'f87', 'f88', 'f89', 'f90', 'f91', 'f92', 'f93', 'f94', 'f95', 'f96', 'f97', 'f98', 'f99', 'f100', 'f101', 'f102', 'f103', 'f104', 'f105', 'f106', 'f107', 'f108', 'f109', 'f110', 'f111', 'f112', 'f113', 'f114', 'f115', 'f116', 'f117', 'f118', 'f119', 'f120', 'f121', 'f122', 'f123', 'f124', 'f125', 'f126', 'f127', 'f128', 'f129', 'f130', 'f131', 'f132', 'f133', 'f134', 'f135', 'f136', 'f137', 'f138', 'f139', 'f140', 'f141', 'f142', 'f143', 'f144', 'f145', 'f146', 'f147', 'f148', 'f149', 'f150', 'f151', 'f152', 'f153', 'f154', 'f155', 'f156', 'f157', 'f158', 'f159', 'f160', 'f161', 'f162', 'f163', 'f164', 'f165', 'f166', 'f167', 'f168', 'f169', 'f170', 'f171', 'f172', 'f173', 'f174', 'f175', 'f176', 'f177', 'f178', 'f179', 'f180', 'f181', 'f182', 'f183', 'f184', 'f185', 'f186', 'f187', 'f188', 'f189', 'f190', 'f191', 'f192', 'f193', 'f194', 'f195', 'f196', 'f197', 'f198', 'f199', 'f200', 'f201', 'f202', 'f203', 'f204', 'f205', 'f206', 'f207', 'f208', 'f209', 'f210', 'f211', 'f212', 'f213', 'f214', 'f215', 'f216', 'f217', 'f218', 'f219', 'f220', 'f221', 'f222', 'f223', 'f224', 'f225', 'f226', 'f227', 'f228', 'f229', 'f230', 'f231', 'f232', 'f233', 'f234', 'f235', 'f236', 'f237', 'f238', 'f239', 'f240', 'f241', 'f242', 'f243', 'f244', 'f245', 'f246', 'f247', 'f248', 'f249', 'f250', 'f251', 'f252', 'f253', 'f254', 'f255', 'f256', 'f257', 'f258', 'f259', 'f260', 'f261', 'f262', 'f263', 'f264', 'f265', 'f266', 'f267', 'f268', 'f269', 'f270', 'f271', 'f272', 'f273', 'f274', 'f275', 'f276', 'f277', 'f278', 'f279', 'f280', 'f281', 'f282', 'f283', 'f284', 'f285', 'f286', 'f287', 'f288', 'f289', 'f290', 'f291', 'f292', 'f293', 'f294', 'f295', 'f296', 'f297', 'f298', 'f299', 'f300', 'f301', 'f302', 'f303', 'f304', 'f305', 'f306', 'f307', 'f308'] ['0_v1', '100_v88', '101_v89', '102_v9', '103_v90', '104_v92', '105_v93', '106_v94', '107_v95', '108_v96', '109_v97', '10_v109', '110_v98', '111_v99', '113_CVTE:v107.0', '114_CVTE:v11.0', '115_CVTE:v110.0', '116_CVTE:v112.0', '117_CVTE:v113.0', '118_CVTE:v114.0', '11_v11', '122_CVTE:v125.0', '124_CVTE:v129.0', '126_CVTE:v20.0', '128_CVTE:v22.0', '129_CVTE:v24.0', '12_v111', '131_CVTE:v3.0', '132_CVTE:v30.0', '133_CVTE:v31.0', '135_CVTE:v36.0', '136_CVTE:v38.0', '138_CVTE:v41.0', '13_v114', '141_CVTE:v47.0', '144_CVTE:v52.0', '145_CVTE:v53.0', '146_CVTE:v56.0', '14_v115', '150_CVTE:v62.0', '152_CVTE:v66.0', '154_CVTE:v68.0', '155_CVTE:v70.0', '156_CVTE:v71.0', '157_CVTE:v72.0', '15_v116', '162_CVTE:v79.0', '163_CVTE:v82.0', '164_CVTE:v91.0', '16_v117', '170_TruncSVD:v115:v21:v88.0', '170_TruncSVD:v115:v21:v88.1', '171_ClusterDist4:v129:v41:v50:v54:v86.0', '171_ClusterDist4:v129:v41:v50:v54:v86.1', '171_ClusterDist4:v129:v41:v50:v54:v86.2', '171_ClusterDist4:v129:v41:v50:v54:v86.3', '172_TruncSVD:v105:v118:v34:v49:v54:v88.0', 
'173_WoE:v113:v129:v31:v47:v96.0', '174_CVTE:v29:v82.0', '175_ClusterTE:ClusterID10:v103:v51:v57.0', '176_Freq:v56', '177_CVCatNumEnc:v47:v114.min', '177_CVCatNumEnc:v47:v123.min', '177_CVCatNumEnc:v47:v19.min', '177_CVCatNumEnc:v47:v83.min', '177_CVCatNumEnc:v47:v95.min', '177_CVCatNumEnc:v47:v99.min', '178_NumToCatTE:v115:v16:v50:v69:v86.0', '179_ClusterDist8:v106:v4:v49:v89.0', '179_ClusterDist8:v106:v4:v49:v89.1', '179_ClusterDist8:v106:v4:v49:v89.2', '179_ClusterDist8:v106:v4:v49:v89.3', '179_ClusterDist8:v106:v4:v49:v89.4', '179_ClusterDist8:v106:v4:v49:v89.5', '179_ClusterDist8:v106:v4:v49:v89.6', '179_ClusterDist8:v106:v4:v49:v89.7', '17_v118', '180_NumCatTE:v110:v122:v34:v50:v57:v7:v85.0', '181_ClusterDist4:v106:v121:v14:v41:v42:v50:v58:v72.0', '181_ClusterDist4:v106:v121:v14:v41:v42:v50:v58:v72.1', '181_ClusterDist4:v106:v121:v14:v41:v42:v50:v58:v72.2', '181_ClusterDist4:v106:v121:v14:v41:v42:v50:v58:v72.3', '182_WoE:v129:v30:v38:v56:v66.0', '183_Freq:v103:v110:v31:v35:v49:v56:v59', '187_TruncSVD:v104:v109:v115:v17:v42:v50.0', '188_NumToCatTE:v114:v86.0', '189_NumToCatWoE:v6:v80.0', '18_v119', '190_TruncSVD:v10:v101:v129:v14:v18:v33.0', '190_TruncSVD:v10:v101:v129:v14:v18:v33.1', '190_TruncSVD:v10:v101:v129:v14:v18:v33.2', '191_NumToCatTE:v14:v20:v41:v50.0', '192_WoE:v129:v3:v31:v66:v70.0', '193_NumCatTE:v106:v30:v50.0', '195_NumCatTE:v112:v121:v67:v9:v97.0', '196_ClusterTE:ClusterID10:v116:v20:v35:v6:v87.0', '197_NumCatTE:v122:v47:v54:v59:v66:v69.0', '198_CVTE:v14:v31:v47:v66.0', '199_CVCatNumEnc:v82:v102.mean', '199_CVCatNumEnc:v82:v116.mean', '199_CVCatNumEnc:v82:v123.mean', '199_CVCatNumEnc:v82:v26.mean', '199_CVCatNumEnc:v82:v36.mean', '199_CVCatNumEnc:v82:v84.mean', '19_v12', '1_v10', '200_NumToCatWoE:v101:v60.0', '201_CVCatNumEnc:v12:v128.max', '201_CVCatNumEnc:v12:v14.max', '202_CVCatNumEnc:v66:v101.max', '202_CVCatNumEnc:v66:v115.max', '202_CVCatNumEnc:v66:v122.max', '202_CVCatNumEnc:v66:v13.max', '202_CVCatNumEnc:v66:v38.max', '202_CVCatNumEnc:v66:v41.max', '202_CVCatNumEnc:v66:v65.max', '202_CVCatNumEnc:v66:v70.max', '204_CVCatNumEnc:v79:v105.min', '204_CVCatNumEnc:v79:v58.min', '205_ClusterDist4:v27.0', '205_ClusterDist4:v27.1', '205_ClusterDist4:v27.2', '205_ClusterDist4:v27.3', '208_ClusterDist2:v15:v20:v50:v95.0', '208_ClusterDist2:v15:v20:v50:v95.1', '209_TruncSVD:v114:v124:v128:v5:v53:v59:v83.0', '20_v120', '210_NumToCatTE:v101:v114:v116:v27:v86:v89:v90.0', '211_CVCatNumEnc:v125:v31:v1.mean', '211_CVCatNumEnc:v125:v31:v114.mean', '211_CVCatNumEnc:v125:v31:v119.mean', '212_NumCatTE:v114:v126:v79.0', '213_NumToCatWoE:v109.0', '214_NumToCatTE:v100:v103:v2:v50:v57:v60:v7.0', '216_NumToCatTE:v108:v8.0', '218_CVCatNumEnc:v110:v20:v31:v78:v50.count', '218_CVCatNumEnc:v110:v20:v31:v78:v77.count', '218_CVCatNumEnc:v110:v20:v31:v78:v80.count', '219_ClusterTE:ClusterID10:v62:v84:v99.0', '21_v121', '220_CVTE:v110:v31:v35:v47:v68:v77.0', '221_ClusterDist6:v100:v19:v4:v44:v50:v54:v68.0', '221_ClusterDist6:v100:v19:v4:v44:v50:v54:v68.1', '221_ClusterDist6:v100:v19:v4:v44:v50:v54:v68.2', '221_ClusterDist6:v100:v19:v4:v44:v50:v54:v68.3', '221_ClusterDist6:v100:v19:v4:v44:v50:v54:v68.4', '221_ClusterDist6:v100:v19:v4:v44:v50:v54:v68.5', '222_ClusterDist4:v11:v121:v19:v25:v36:v6.0', '222_ClusterDist4:v11:v121:v19:v25:v36:v6.1', '222_ClusterDist4:v11:v121:v19:v25:v36:v6.2', '222_ClusterDist4:v11:v121:v19:v25:v36:v6.3', '223_CVCatNumEnc:v44:v47:v66:v106.sd', '224_CVCatNumEnc:v12:v79:v109.sd', '224_CVCatNumEnc:v12:v79:v129.sd', '224_CVCatNumEnc:v12:v79:v26.sd', 
'224_CVCatNumEnc:v12:v79:v34.sd', '225_ClusterTE:ClusterID5:v58.0', '226_Freq:v114:v31:v57', '228_ClusterDist4:v28:v50.0', '228_ClusterDist4:v28:v50.1', '228_ClusterDist4:v28:v50.2', '228_ClusterDist4:v28:v50.3', '229_ClusterDist8:v103:v35.0', '229_ClusterDist8:v103:v35.1', '229_ClusterDist8:v103:v35.2', '229_ClusterDist8:v103:v35.3', '229_ClusterDist8:v103:v35.4', '229_ClusterDist8:v103:v35.5', '229_ClusterDist8:v103:v35.6', '229_ClusterDist8:v103:v35.7', '22_v122', '230_NumToCatWoE:v27:v59.0', '231_ClusterDist6:v124:v48:v50:v61.0', '231_ClusterDist6:v124:v48:v50:v61.1', '231_ClusterDist6:v124:v48:v50:v61.2', '231_ClusterDist6:v124:v48:v50:v61.3', '231_ClusterDist6:v124:v48:v50:v61.4', '231_ClusterDist6:v124:v48:v50:v61.5', '232_CVCatNumEnc:v35:v61:v79:v94:v16.sd', '232_CVCatNumEnc:v35:v61:v79:v94:v38.sd', '234_TruncSVD:v48:v6:v89.0', '235_Freq:v110:v30:v42:v47:v72:v97', '238_NumToCatWoE:v34:v46:v50:v86.0', '239_ClusterDist10:v13:v84:v86.0', '239_ClusterDist10:v13:v84:v86.1', '239_ClusterDist10:v13:v84:v86.2', '239_ClusterDist10:v13:v84:v86.3', '239_ClusterDist10:v13:v84:v86.4', '239_ClusterDist10:v13:v84:v86.5', '239_ClusterDist10:v13:v84:v86.6', '239_ClusterDist10:v13:v84:v86.7', '239_ClusterDist10:v13:v84:v86.8', '239_ClusterDist10:v13:v84:v86.9', '23_v123', '240_NumToCatWoE:v7.0', '241_TruncSVD:v109:v129:v14:v50:v60:v61.0', '241_TruncSVD:v109:v129:v14:v50:v60:v61.1', '241_TruncSVD:v109:v129:v14:v50:v60:v61.2', '242_ClusterDist2:v11:v122:v130.0', '242_ClusterDist2:v11:v122:v130.1', '243_Freq:v3:v93', '244_WoE:v129:v71:v82.0', '246_TruncSVD:v34:v80:v97.0', '248_ClusterDist6:v10:v13:v15:v50.0', '248_ClusterDist6:v10:v13:v15:v50.1', '248_ClusterDist6:v10:v13:v15:v50.2', '248_ClusterDist6:v10:v13:v15:v50.3', '248_ClusterDist6:v10:v13:v15:v50.4', '248_ClusterDist6:v10:v13:v15:v50.5', '249_CVCatNumEnc:v36:v104.count', '249_CVCatNumEnc:v36:v29.count', '24_v124', '250_Freq:v31', '251_TruncSVD:v111:v23:v26:v27:v39:v42.0', '251_TruncSVD:v111:v23:v26:v27:v39:v42.1', '251_TruncSVD:v111:v23:v26:v27:v39:v42.2', '253_Freq:v48:v75:v91', '255_NumToCatTE:v105:v114:v130:v20:v33:v37:v68:v96.0', '256_NumCatTE:v114:v131:v38:v64:v82.0', '258_ClusterTE:ClusterID5:v120:v34:v50.0', '259_ClusterDist4:v1:v116:v131:v2:v85.0', '259_ClusterDist4:v1:v116:v131:v2:v85.1', '259_ClusterDist4:v1:v116:v131:v2:v85.2', '259_ClusterDist4:v1:v116:v131:v2:v85.3', '25_v126', '260_WoE:v110:v22:v96.0', '261_NumToCatTE:v122:v25.0', '26_v127', '27_v128', '28_v129', '29_v13', '2_v100', '30_v130', '31_v131', '32_v14', '33_v15', '34_v16', '35_v17', '36_v18', '37_v19', '38_v2', '39_v20', '3_v101', '40_v21', '41_v23', '42_v25', '43_v26', '44_v27', '45_v28', '46_v29', '47_v32', '48_v33', '49_v34', '4_v102', '50_v35', '51_v36', '52_v37', '53_v38', '54_v39', '55_v4', '56_v40', '57_v41', '58_v42', '59_v43', '5_v103', '60_v44', '61_v45', '62_v46', '63_v48', '64_v49', '65_v5', '66_v50', '67_v51', '68_v53', '69_v54', '6_v104', '70_v55', '71_v57', '72_v58', '73_v59', '74_v6', '75_v60', '76_v61', '77_v62', '78_v63', '79_v64', '7_v105', '80_v65'"),)

Again, there's no good reason: it doesn't normally fail, and here I'm no longer even using eval_set with dask_cudf. Somehow the feature names still get nuked.

trivialfis self-assigned this Oct 31, 2020

@trivialfis (Member)

Is it possible that a worker got a dataframe with some features gone? Just guessing.

@pseudotensor (Contributor Author) commented Dec 5, 2020

@trivialfis While trying to repro the crash, I was able to reproduce this feature-name issue.

import pandas as pd


def fun():
    import pickle

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    with LocalCUDACluster() as cluster:
        with Client(cluster) as client:
            import xgboost as xgb
            import dask.dataframe as dd

            # Restore the model and data captured from the application.
            (model, X, y, kwargs) = pickle.load(open("xgbissue6469.pkl", "rb"))

            X = dd.from_pandas(X, chunksize=5000).persist()
            y = dd.from_pandas(y, chunksize=5000).persist()

            # Rebuild the eval_set as dask collections as well.
            valid_X, valid_y = kwargs['eval_set'][0]
            valid_X = dd.from_pandas(valid_X, chunksize=5000).persist()
            valid_y = dd.from_pandas(valid_y, chunksize=5000).persist()
            kwargs['eval_set'] = [(valid_X, valid_y)]

            model.fit(X, y, **kwargs)

            print("here")


if __name__ == '__main__':
    fun()

xgbissue6268.zip

Don't let the inner file name fool you; that's the name from the other issue (the crash).

This gives:

(base) jon@mr-dl10:/data/jon/h2oai.fullcondatest3$ python dask_cudf_scitkit_issue6469.py 
[12:34:43] task [xgboost.dask]:tcp://127.0.0.1:39775 got new rank 0
[12:34:43] task [xgboost.dask]:tcp://127.0.0.1:43449 got new rank 1
worker tcp://127.0.0.1:43449 has an empty DMatrix.  
worker tcp://127.0.0.1:39775 has an empty DMatrix.  
distributed.worker - WARNING -  Compute Failed
Function:  dispatched_train
args:      ('tcp://127.0.0.1:43449', [b'DMLC_NUM_WORKER=2', b'DMLC_TRACKER_URI=127.0.0.1', b'DMLC_TRACKER_PORT=9091', b'DMLC_TASK_ID=[xgboost.dask]:tcp://127.0.0.1:43449'], {'feature_names': None, 'feature_types': None, 'meta_names': ['labels'], 'missing': nan, 'parts': None, 'is_quantile': False}, 139913663896712, [({'feature_names': None, 'feature_types': None, 'meta_names': ['labels'], 'missing': nan, 'parts': [(     0_AGE
0     51.0
1     30.0
2     33.0
3     56.0
4     38.0
..     ...
995   25.0
996   23.0
997   22.0
998   22.0
999   24.0

[1000 rows x 1 columns], 0      0.0
1      0.0
2      1.0
3      1.0
4      1.0
      ... 
995    0.0
996    0.0
997    0.0
998    0.0
999    1.0
Name: ________TARGET_________, Length: 1000, dtype: float32)], 'is_quantile': False}, 'validation_0', 139913667974144)])
kwargs:    {}
Exception: ValueError("feature_names mismatch: ['f0'] ['0_AGE']\nexpected f0 in input data\ntraining data did not have the following fields: 0_AGE",)

distributed.worker - WARNING -  Compute Failed
Function:  dispatched_train
args:      ('tcp://127.0.0.1:39775', [b'DMLC_NUM_WORKER=2', b'DMLC_TRACKER_URI=127.0.0.1', b'DMLC_TRACKER_PORT=9091', b'DMLC_TASK_ID=[xgboost.dask]:tcp://127.0.0.1:39775'], {'feature_names': None, 'feature_types': None, 'meta_names': ['labels'], 'missing': nan, 'parts': [(     0_AGE
0     29.0
1     24.0
2     37.0
3     28.0
4     35.0
..     ...
995   28.0
996   26.0
997   25.0
998   25.0
999   25.0

[1000 rows x 1 columns], 0      0.0
1      1.0
2      0.0
3      1.0
4      0.0
      ... 
995    0.0
996    0.0
997    1.0
998    0.0
999    0.0
Name: ________TARGET_________, Length: 1000, dtype: float32)], 'is_quantile': False}, 139913663896712, [({'feature_names': None, 'feature_types': None, 'meta_names': ['labels'], 'missing': nan, 'parts': None, 'is_quantile': False}, 'validation_0', 139913667974144)])
kwargs:    {}
Exception: ValueError("feature_names mismatch: ['0_AGE'] ['f0']\nexpected 0_AGE in input data\ntraining data did not have the following fields: f0",)

Traceback (most recent call last):
  File "dask_cudf_scitkit_issue6469.py", line 27, in <module>
    fun()
  File "dask_cudf_scitkit_issue6469.py", line 22, in fun
    model.fit(X, y, **kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/core.py", line 421, in inner_f
    return f(**kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/dask.py", line 1217, in fit
    verbose=verbose)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 824, in sync
    self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 339, in sync
    raise exc.with_traceback(tb)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/utils.py", line 323, in f
    result[0] = yield future
  File "/home/jon/minicondadai/lib/python3.6/site-packages/tornado/gen.py", line 735, in run
    value = future.result()
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/dask.py", line 1191, in _fit_async
    verbose_eval=verbose)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/dask.py", line 703, in _train_async
    results = await client.gather(futures)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/distributed/client.py", line 1833, in _gather
    raise exception.with_traceback(traceback)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/dask.py", line 676, in dispatched_train
    **kwargs)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/training.py", line 227, in train
    early_stopping_rounds=early_stopping_rounds)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/training.py", line 66, in _train_internal
    bst = Booster(params, [dtrain] + [d[0] for d in evals])
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/core.py", line 1008, in __init__
    self._validate_features(d)
  File "/home/jon/minicondadai/lib/python3.6/site-packages/xgboost/core.py", line 2048, in _validate_features
    data.feature_names))
ValueError: feature_names mismatch: ['f0'] ['0_AGE']
expected f0 in input data
training data did not have the following fields: 0_AGE

@pseudotensor
Copy link
Contributor Author

I think I haven't hit it lately because I made sure to always have at least one chunk on each worker. So maybe it's just a bad cascade of errors.
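A sketch of that workaround, assuming client is the active Client:

# Give every worker at least one partition before calling fit.
n_workers = len(client.scheduler_info()["workers"])
if X.npartitions < n_workers:
    X = X.repartition(npartitions=n_workers)
    y = y.repartition(npartitions=n_workers)
X, y = client.persist([X, y])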

@trivialfis
Copy link
Member

Hmm, training continuation with an empty DMatrix...
