
dask xgboost fit: munmap_chunk(): invalid pointer: 0x00007fa5380304b0 #6469

Closed
pseudotensor opened this issue Dec 5, 2020 · 39 comments · Fixed by #6475

Comments

@pseudotensor (Contributor) commented Dec 5, 2020

Using rapids 0.14, conda, Ubuntu 16/18, CUDA 10.0, a CUDA 11.1 driver, and the dask/distributed 2.17 that matches rapids 0.14.

After updating to the 1.3.0 master nightly, I'm hitting this with any dask fit. It's pervasive, so I'll probably have to go back to 1.2.1 for now unless there's an easy fix. Once it happens, the worker is restarted and xgboost hangs.

It's late here, so I'll post a repro this weekend if possible.

*** Error in `dask-worker [tcp://172.16.4.18:34761]': munmap_chunk(): invalid pointer: 0x00007fa5380304b0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7fa6385ca7f5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7fa6385d76e8]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN2dh10AllReducer4InitEi+0xc7b)[0x7fa44fdb452b]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost4tree23GPUHistMakerSpecialisedINS_6detail20GradientPairInternalIdEEE6UpdateEPNS_16HostDeviceVectorINS3_IfEEEEPNS_7DMatrixERKSt6vectorIPNS_7RegTreeESaISE_EE+0x2e8)[0x7fa44ff3e568]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost3gbm6GBTree13BoostNewTreesEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_7DMatrixEiPSt6vectorISt10unique_ptrINS_7RegTreeESt14default_deleteISC_EESaISF_EE+0x18e9)[0x7fa44fc2c559]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost3gbm6GBTree7DoBoostEPNS_7DMatrixEPNS_16HostDeviceVectorINS_6detail20GradientPairInternalIfEEEEPNS_20PredictionCacheEntryE+0x12d)[0x7fa44fc30e0d]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(_ZN7xgboost11LearnerImpl13UpdateOneIterEiSt10shared_ptrINS_7DMatrixEE+0x52b)[0x7fa44fc68c0b]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x58)[0x7fa44fb44648]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c)[0x7fa636db9630]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d)[0x7fa636db8fed]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce)[0x7fa636dcff9e]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5)[0x7fa636dd09d5]
dask-worker [tcp://172.16.4.18:34761](_PyObject_FastCallDict+0x8b)[0x55e17dd7c00b]
dask-worker [tcp://172.16.4.18:34761](+0x1a179e)[0x55e17de0a79e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](+0x16f256)[0x55e17ddd8256]
dask-worker [tcp://172.16.4.18:34761](+0x170231)[0x55e17ddd9231]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](+0x16f256)[0x55e17ddd8256]
dask-worker [tcp://172.16.4.18:34761](+0x170231)[0x55e17ddd9231]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x10c6)[0x55e17de2df46]
dask-worker [tcp://172.16.4.18:34761](PyEval_EvalCodeEx+0x329)[0x55e17dddd4f9]
dask-worker [tcp://172.16.4.18:34761](+0x175426)[0x55e17ddde426]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x1ab6)[0x55e17de2e936]
dask-worker [tcp://172.16.4.18:34761](PyEval_EvalCodeEx+0x966)[0x55e17ddddb36]
dask-worker [tcp://172.16.4.18:34761](+0x175426)[0x55e17ddde426]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x1ab6)[0x55e17de2e936]
dask-worker [tcp://172.16.4.18:34761](PyEval_EvalCodeEx+0x329)[0x55e17dddd4f9]
dask-worker [tcp://172.16.4.18:34761](+0x175426)[0x55e17ddde426]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x1ab6)[0x55e17de2e936]
dask-worker [tcp://172.16.4.18:34761](+0x16fffb)[0x55e17ddd8ffb]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](PyEval_EvalCodeEx+0x329)[0x55e17dddd4f9]
dask-worker [tcp://172.16.4.18:34761](+0x175426)[0x55e17ddde426]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x1ab6)[0x55e17de2e936]
dask-worker [tcp://172.16.4.18:34761](+0x16fffb)[0x55e17ddd8ffb]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](+0x16fffb)[0x55e17ddd8ffb]
dask-worker [tcp://172.16.4.18:34761](+0x1a1725)[0x55e17de0a725]
dask-worker [tcp://172.16.4.18:34761](_PyEval_EvalFrameDefault+0x30a)[0x55e17de2d18a]
dask-worker [tcp://172.16.4.18:34761](_PyFunction_FastCallDict+0x11b)[0x55e17ddd966b]
dask-worker [tcp://172.16.4.18:34761](_PyObject_FastCallDict+0x26f)[0x55e17dd7c1ef]
dask-worker [tcp://172.16.4.18:34761](_PyObject_Call_Prepend+0x63)[0x55e17dd80cf3]
dask-worker [tcp://172.16.4.18:34761](PyObject_Call+0x3e)[0x55e17dd7bc2e]
dask-worker [tcp://172.16.4.18:34761](+0x210c36)[0x55e17de79c36]

FYI @trivialfis

@pseudotensor (Contributor, Author) commented Dec 5, 2020

https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bd6386e45e8ed19c238ee06544dabbcbf56e02bbc-py3-none-manylinux2010_x86_64.whl

This is the wheel I tried. The problem happens every time I run the same thing, so I can probably bisect back to the hash where it started. Will try that tomorrow.

It is plausibly related to the eval_set, which should be supported now. I'm using the scikit-learn API with eval_set passed, which in the past was not supported for early stopping but I believe now is. Maybe there is a problem there.

@pseudotensor (Contributor, Author) commented Dec 5, 2020

Yes, if I don't pass eval_set to the dask fit, there is no such crash. So the new early stopping support for dask must be broken. That early stopping support was the main reason I wanted to try 1.3.0.
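
For reference, a minimal self-contained sketch of the kind of call being exercised here (illustrative only, not the actual repro: the real script loads a pickled model and data, the data below is made up, and early_stopping_rounds is assumed to be supported by the 1.3.0 dask wrapper as discussed above; gpu_hist requires GPUs on the workers):

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
from xgboost.dask import DaskXGBClassifier

def main():
    # Connect to an existing cluster, as in the repro scripts later in this thread.
    with Client(scheduler_file="dask_cuda_scheduler.json") as client:
        X = dd.from_pandas(pd.DataFrame({"x": np.arange(1000.0)}), chunksize=250)
        y = dd.from_pandas(pd.Series(np.random.randint(0, 2, 1000)), chunksize=250)
        model = DaskXGBClassifier(tree_method="gpu_hist")
        # Passing eval_set is what triggers the crash described above.
        model.fit(X, y, eval_set=[(X, y)], early_stopping_rounds=5)

if __name__ == "__main__":
    main()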

@trivialfis (Member) commented Dec 5, 2020

@pseudotensor So far it doesn't show up in our tests. Will look into it once we have an MRE (minimal reproducible example).

> Yes, if I don't pass eval_set to the dask fit, there is no such crash. So the new early stopping support for dask must be broken.

Sometimes the error happens in a weird way. For example, one of the workers dies due to an OOM/UCX failure or similar, then gets restarted by dask, resulting in a weird state in the underlying MPI-style (rabit) synchronization.

@trivialfis (Member) commented:

Just ran it with higgs, seems fine.

@pseudotensor (Contributor, Author) commented Dec 5, 2020

@trivialfis Here is a repro:

  1. Launch a 2-node dask-cuda-worker + scheduler cluster with 2 GPUs each, like we have discussed/done before in other issues. I've attached my scheduler.json so you know how I set up the cluster.

dask_cuda_scheduler.json.zip

  2. Run the script below on the main scheduler node (FYI, this is the same setup as #6268 (comment), "dask_cudf: scikit-learn API leads to impossible train-time: ValueError('feature_names mismatch')", but with a reasonable chunksize):

xgbissue6469.pkl.zip

import pandas as pd
def fun():
    from dask.distributed import Client, wait

    if True:
        with Client(scheduler_file="dask_cuda_scheduler.json") as client:
            import xgboost as xgb
            import dask_cudf

            import pickle
            (model, X, y, kwargs) = pickle.load(open("xgbissue6469.pkl", "rb"))
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=250).persist()
            y = dd.from_pandas(y, chunksize=250).persist()
            valid_X = kwargs['eval_set'][0][0]
            valid_y = kwargs['eval_set'][0][1]
            valid_X = dd.from_pandas(valid_X, chunksize=250).persist()
            valid_y = dd.from_pandas(valid_y, chunksize=250).persist()
            kwargs['eval_set'] = [(valid_X, valid_y)]
            model.fit(X, y, **kwargs)

            print("here")

if __name__ == '__main__':
    fun()

The client side appears to just hang, but at least one of the workers hits the error mentioned:

*** Error in `dask-worker [tcp://172.16.4.18:36115]': munmap_chunk(): invalid pointer: 0x00007f52d8029ef0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7f53d4dbf7f5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7f53d4dcc6e8]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x3074ba)[0x7f52bbed04ba]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x4a04c4)[0x7f52bc0694c4]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x179e25)[0x7f52bbd42e25]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x17ba77)[0x7f52bbd44a77]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x1a4919)[0x7f52bbd6d919]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x58)[0x7f52bbc49308]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c)[0x7f53d35ae630]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d)[0x7f53d35adfed]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce)[0x7f53d35c4f9e]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5)[0x7f53d35c59d5]
dask-worker [tcp://172.16.4.18:36115](_PyObject_FastCallDict+0x8b)[0x55cc93d0600b]
dask-worker [tcp://172.16.4.18:36115](+0x1a179e)[0x55cc93d9479e]
dask-worker [tcp://172.16.4.18:36115](_PyEval_EvalFrameDefault+0x30a)[0x55cc93db718a]
dask-worker [tcp://172.16.4.18:36115](+0x16f256)[0x55cc93d62256]
dask-worker [tcp://172.16.4.18:36115](+0x170231)[0x55cc93d63231]
dask-worker [tcp://172.16.4.18:36115](+0x1a1725)[0x55cc93d94725]
dask-worker [tcp://172.16.4.18:36115](_PyEval_EvalFrameDefault+0x30a)[0x55cc93db718a]
dask-worker [tcp://172.16.4.18:36115](+0x16f256)[0x55cc93d62256]
dask-worker [tcp://172.16.4.18:36115](+0x170231)[0x55cc93d63231]
dask-worker [tcp://172.16.4.18:36115](+0x1a1725)[0x55cc93d94725]
dask-worker [tcp://172.16.4.18:36115](_PyEval_EvalFrameDefault+0x10c6)[0x55cc93db7f46]
dask-worker [tcp://172.16.4.18:36115](PyEval_EvalCodeEx+0x329)[0x55cc93d674f9]
dask-worker [tcp://172.16.4.18:36115](+0x175426)[0x55cc93d68426]
dask-worker [tcp://172.16.4.18:36115](PyObject_Call+0x3e)[0x55cc93d05c2e]
dask-worker [tcp://172.16.4.18:36115](_PyEval_EvalFrameDefault+0x1ab6)[0x55cc93db8936]
dask-worker [tcp://172.16.4.18:36115](PyEval_EvalCodeEx+0x966)[0x55cc93d67b36]
[backtrace truncated]

A local CUDA cluster on the 2-GPU system did not crash like this. Removing the eval_set avoids the crash as well.

@trivialfis (Member) commented:

I will look into it tomorrow. That's pretty new. Hopefully I can get the sanitizer to work on distributed workers.

@pseudotensor (Contributor, Author) commented Dec 6, 2020

Odd that the issue was closed remotely even though I only referenced it. Bad GitHub.

@pseudotensor (Contributor, Author) commented Dec 6, 2020

@trivialfis Actually, the problem is more pervasive. I hit it even when the eval_set is empty. It is exactly the same case as above, but with this pickle:

dask_model_398f86ad-bd64-4503-954d-6f7c486a2b11.pickle.zip

The error is the same, but I'm sharing the full logs to avoid confusion:

dask-cuda-worker.stderr.log.zip

This is the full worker log for the single remote worker with 2 GPUs. The scheduler node has workers for its own 2 GPUs as well, but no such errors appear there.

Note that this particular case is a super trivial fit with 1 column and 1000 rows. So compared to 1.2.1, the 1.3.0 RC is much less stable for dask. This is still just using the native nightly xgboost-1.3.0_SNAPSHOT+d6386e45e8ed19c238ee06544dabbcbf56e02bbc-py3-none-manylinux2010_x86_64.whl directly, with no changes.

Of course, the script run is a tiny bit different:

import pandas as pd
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    if True:
        with Client(scheduler_file="tmp/dask_cuda_scheduler.json") as client:
            import xgboost as xgb
            import dask_cudf

            import pickle
            #file1 = "xgbissue6469.pkl"
            file1 = "dask_model_398f86ad-bd64-4503-954d-6f7c486a2b11.pickle"
            (model, X, y, kwargs) = pickle.load(open(file1, "rb"))
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=250).persist()
            y = dd.from_pandas(y, chunksize=250).persist()
            if 'eval_set' in kwargs:
                valid_X = kwargs['eval_set'][0][0]
                valid_y = kwargs['eval_set'][0][1]
                valid_X = dd.from_pandas(valid_X, chunksize=250).persist()
                valid_y = dd.from_pandas(valid_y, chunksize=250).persist()
                kwargs['eval_set'] = [(valid_X, valid_y)]
            model.fit(X, y, **kwargs)

            print("here")

if __name__ == '__main__':
    fun()

I've tried various changes, none of which made a difference:

  1. chunksize=50
  2. ignoring original model, just setting
            params = dict(tree_method='gpu_hist', booster='gbtree')
            model = model.__class__(**params)

Also, even though the dask worker seems to come back up, if I Ctrl-C the python script and try again, everything just hangs without any new errors. I have to restart the dask scheduler/workers.

@pseudotensor (Contributor, Author) commented Dec 6, 2020

Bisecting with this reduced, simplified version of the script:

import pandas as pd
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    if True:
        with Client(scheduler_file="tmp/dask_cuda_scheduler.json") as client:
            import xgboost as xgb
            import dask_cudf

            import pickle
            #file1 = "xgbissue6469.pkl"
            file1 = "dask_model_398f86ad-bd64-4503-954d-6f7c486a2b11.pickle"
            (model, X, y, kwargs) = pickle.load(open(file1, "rb"))
            params = dict(tree_method='gpu_hist', booster='gbtree')
            model = model.__class__(**params)
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=50).persist()
            y = dd.from_pandas(y, chunksize=50).persist()
            kwargs.pop('sample_weight', None)
            if 'eval_set' in kwargs:
                valid_X = kwargs['eval_set'][0][0]
                valid_y = kwargs['eval_set'][0][1]
                valid_X = dd.from_pandas(valid_X, chunksize=250).persist()
                valid_y = dd.from_pandas(valid_y, chunksize=250).persist()
                kwargs['eval_set'] = [(valid_X, valid_y)]
            print(model.get_params())
            model.fit(X, y, **kwargs)
            preds = model.predict(X)
            print(preds.compute()[0])

            print("here")

if __name__ == '__main__':
    fun()

Results by nightly wheel, in case it helps, since it only takes about 5 seconds to fail or pass:

(here FAILED means the illegal/invalid-pointer error on the remote worker plus a hang, same as on master)

FAILED: https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bd6386e45e8ed19c238ee06544dabbcbf56e02bbc-py3-none-manylinux2010_x86_64.whl

PASSED: https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B9a4e8b1d8196076098cccfac53bcfe0cc60e1d54-py3-none-manylinux2010_x86_64.whl

PASSED: https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bcc767247628e0c81956515292ba824b89432e9e8-py3-none-manylinux2010_x86_64.whl

PASSED: https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B3ac173fc8b68940416bab134dd019195b16519f3-py3-none-manylinux2010_x86_64.whl

PASSED: https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bf3a4253984f1b76cc542fb313c3186264c2cdba0-py3-none-manylinux2010_x86_64.whl

FINISHED, BUT with the same failure on the remote worker. The script completed, but the remote worker still had the illegal access. No idea how the script still ran, unless dask somehow ignored that worker entirely. The remote worker had rabit failures and then followed up with the same illegal-pointer output:

[00:05:37] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[00:05:37] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[00:05:39] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[00:05:39] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[00:05:43] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[00:05:43] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[00:05:49] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[00:05:49] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[00:05:57] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[00:05:57] WARNING: ../rabit/src/allreduce_base.cc:446: failed in ReconnectLink [00:05:57] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused
[00:05:57] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[00:05:57] WARNING: ../rabit/src/allreduce_base.cc:446: failed in ReconnectLink [00:05:57] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused
*** Error in `dask-worker [tcp://172.16.4.18:38213]': munmap_chunk(): invalid pointer: 0x00007f2a0c02d420 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7f2b037c87f5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7f2b037d56e8]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x2fe42a)[0x7f29ebf1542a]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x497a74)[0x7f29ec0aea74]

That run used: https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B12d27f43ff117462d607d0949fb89bccbc405a49-py3-none-manylinux2010_x86_64.whl

PASSED:
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B8a17610666dd1286fbeb20a96f6aab92ef63a994-py3-none-manylinux2010_x86_64.whl

FINISHED, BUT with the same illegal error and the same sequence of rabit messages shown before:
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B6d711d648cb83409da00ff48e1890e1e9e386856-py3-none-manylinux2010_x86_64.whl

So this diff is bad:

https://github.com/dmlc/xgboost/compare/8a17610666dd1286fbeb20a96f6aab92ef63a994..6d711d648cb83409da00ff48e1890e1e9e386856

@pseudotensor (Contributor, Author) commented Dec 6, 2020

In that "diff" I see changes to rabit, which probably explains why the rabit stuff shows up. I guess the rabit changes broke dask in this case, or perhaps those changes were trying to fix something that otherwise didn't work, etc. You guys would know better.

To be clear, at some later commit the rabit messages go away, but one is left with the same illegal error and hang instead.

As an aside, it would be nice if rabit in xgboost were such that dask fitting/predicting was robust to workers going down. Is that how it is supposed to work? It has never been that way for me; any worker crash hangs everything, as I've shared before. It's still the same here on the 1.3.0 head of master.

I'm not sure what one should expect, but being fragile to a single worker going down is definitely bad. The illegal failure and hang are also super bad.

@trivialfis (Member) commented:

Thanks for bisecting the issue. I haven't been able to reproduce it at the moment; I'm trying out a remote cluster right now. Looking at the diff you found, it seems the change to rabit is just logging. Maybe the log got redirected to the tracker, which is causing the issue. I will continue looking into it.

> I'm not sure what one should expect, but being fragile to a single worker going down is definitely bad. The illegal failure and hang are also super bad.

XGBoost is robust to exceptions, but not to hard crashes.

@pseudotensor (Contributor, Author) commented Dec 6, 2020

I'll try to bisect the point where there is no more rabit failure, just the illegal error + hang:

illegal only: https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bd6386e45e8ed19c238ee06544dabbcbf56e02bbc-py3-none-manylinux2010_x86_64.whl

rabit + illegal (kept trying and going, failing multiple times and script failed): https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B12d27f43ff117462d607d0949fb89bccbc405a49-py3-none-manylinux2010_x86_64.whl

rabit + illegal (kept trying and going, failing multiple times and script failed):
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bcc581b3b6b49958a56b440de2f5d4769b8fdd193-py3-none-manylinux2010_x86_64.whl

illegal only:
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bfb56da5e8b95f3fa0a797a4b2bd073d13e28ed3b-py3-none-manylinux2010_x86_64.whl

rabit + illegal:
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B956beead705118fd2513da79af9ccfe9cef59a0d-py3-none-manylinux2010_x86_64.whl

rabit + illegal:
https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2Bc2ba4fb95742c9fe927ee2c5f180ef81a20d33d1-py3-none-manylinux2010_x86_64.whl

This is the change that dropped the rabit allreduce message but kept the illegal error + hang:

https://github.com/dmlc/xgboost/compare/c2ba4fb95742c9fe927ee2c5f180ef81a20d33d1..fb56da5e8b95f3fa0a797a4b2bd073d13e28ed3b

Again some logging changes, but also some actual non-test dask changes, including some global-config work.

So yes, I'd be suspicious of the logging + dask changes, since both appeared in the original range where the illegal access was first hit. Most likely it is the dask changes, however.

FYI, that type of error is typically caused by trying to free a pointer that did not come from malloc (i.e. you don't own the pointer): https://stackoverflow.com/questions/6199729/how-to-solve-munmap-chunk-invalid-pointer-error-in-c
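
For illustration only, a deliberately crashing sketch of that error class (an assumption: a glibc-based Linux system, using ctypes rather than the C shown in the Stack Overflow link): handing glibc's free() an address it never allocated typically aborts the process with a message such as "free(): invalid pointer" or "munmap_chunk(): invalid pointer", the same family of abort seen in the worker backtraces above.

import ctypes

libc = ctypes.CDLL("libc.so.6")

# A buffer owned by Python's allocator; an address 64 bytes into it is
# not something glibc's malloc ever handed out.
buf = ctypes.create_string_buffer(1 << 20)
bogus = ctypes.addressof(buf) + 64

# glibc's heap checks see a garbage chunk header and abort the process,
# printing an error like the munmap_chunk()/free() invalid pointer above.
libc.free(ctypes.c_void_p(bogus))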

@pseudotensor (Contributor, Author) commented Dec 6, 2020

So overall, the illegal error probably comes from the python-package/xgboost/dask.py changes here: https://github.com/dmlc/xgboost/compare/8a17610666dd1286fbeb20a96f6aab92ef63a994..6d711d648cb83409da00ff48e1890e1e9e386856

Of course, it's also possible that the changes just exposed an existing issue etc.

@pseudotensor (Contributor, Author) commented:

If you guys can make a debug build of any particular hash, I'm happy to run it if that would give more useful information.

@pseudotensor (Contributor, Author) commented:

FYI, if I run the script without a GPU, I get the exact same kind of error:

CUDA_VISIBLE_DEVICES= python dask_cudf_scitkit_issue6469.py
/home/jon/minicondadai/lib/python3.6/site-packages/cudf/utils/gpu_utils.py:120: UserWarning: No NVIDIA GPU detected

i.e. just letting it switch over and connect to normal dask workers, I see:

distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
[01:11:12] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[01:11:14] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[01:11:18] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[01:11:24] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[01:11:32] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[01:11:32] WARNING: ../rabit/src/allreduce_base.cc:446: failed in ReconnectLink [01:11:32] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused
*** Error in `dask-worker [tcp://172.16.4.18:40547]': munmap_chunk(): invalid pointer: 0x00007f2981372500 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7f2a651217f5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7f2a6512e6e8]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x3050ba)[0x7f29b685f0ba]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x49de04)[0x7f29b69f7e04]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x178a55)[0x7f29b66d2a55]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x17a6a7)[0x7f29b66d46a7]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(+0x1a2d59)[0x7f29b66fcd59]
/home/jenkins/minicondadai/lib/python3.6/site-packages/xgboost/lib/libxgboost.so(XGBoosterUpdateOneIter+0x58)[0x7f29b65da0c8]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c)[0x7f2a63a1b630]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/../../libffi.so.6(ffi_call+0x22d)[0x7f2a63a1afed]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce)[0x7f2a63a31f9e]
/home/jenkins/minicondadai/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so(+0x139d5)[0x7f2a63a329d5]

Of course, it could just be ignoring the client setting and still doing GPU work on the dask-workers.

If I run with 'hist' as well, I see:

distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.4.18:33701'
distributed.worker - INFO -       Start worker at:    tcp://172.16.4.18:43191
distributed.worker - INFO -          Listening to:    tcp://172.16.4.18:43191
distributed.worker - INFO -          dashboard at:          172.16.4.18:45455
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.2.210:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    2.74 GB
distributed.worker - INFO -       Local Directory: /home/ops/h2oai/tmp/dask_worker_files/worker-4lvbgi4k
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://172.16.2.210:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
[01:13:44] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[01:13:46] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[01:13:50] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[01:13:56] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[01:14:04] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[01:14:04] WARNING: ../rabit/src/allreduce_base.cc:446: failed in ReconnectLink [01:14:04] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused
[01:14:04] WARNING: ../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[01:14:04] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[01:14:06] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[01:14:10] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[01:14:16] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[01:14:24] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[01:14:24] WARNING: ../rabit/src/allreduce_base.cc:139: Failed to shutdown due to[01:14:24] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused

It hangs, but no illegal errors are printed. So it probably failed in a related way, just not caught the same way.

But the hang is already bad either way.

@trivialfis (Member) commented Dec 6, 2020

> Maybe the log got redirected to the tracker, which is causing the issue.

Nope. The log is not redirected.

I still can't reproduce it on a remote cluster. My setup is 2 nodes with 4-8 GPUs each, plus a scheduler on a third, different node that is CPU-only.

@hcho3 We have seen a similar log in #6469 (comment), but for the Spark package (the "failed to shutdown" message).

@pseudotensor (Contributor, Author) commented Dec 6, 2020

Going back to the latest build:

https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B703c2d06aa7c5649672506fb3acc37f5f72481a4-py3-none-manylinux2010_x86_64.whl

And trying just CPU mode:

import pandas as pd
def fun():
    from dask.distributed import Client, wait
    from dask_cuda import LocalCUDACluster

    if True:
        #with Client(scheduler_file="tmp/dask_cuda_scheduler.json") as client:
        with Client(scheduler_file="tmp/dask_scheduler.json") as client:
            import xgboost as xgb
            import dask_cudf

            import pickle
            #file1 = "xgbissue6469.pkl"
            file1 = "dask_model_398f86ad-bd64-4503-954d-6f7c486a2b11.pickle"
            (model, X, y, kwargs) = pickle.load(open(file1, "rb"))
            params = dict(tree_method='hist', booster='gbtree')
            model = model.__class__(**params)
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=50).persist()
            y = dd.from_pandas(y, chunksize=50).persist()
            kwargs.pop('sample_weight', None)
            if 'eval_set' in kwargs:
                valid_X = kwargs['eval_set'][0][0]
                valid_y = kwargs['eval_set'][0][1]
                valid_X = dd.from_pandas(valid_X, chunksize=250).persist()
                valid_y = dd.from_pandas(valid_y, chunksize=250).persist()
                kwargs['eval_set'] = [(valid_X, valid_y)]
            print(model.get_params())
            model.fit(X, y, **kwargs)
            preds = model.predict(X)
            print(preds.compute()[0])

            print("here")

if __name__ == '__main__':
    fun()

CUDA_VISIBLE_DEVICES= python dask_cudf_scitkit_issue6469.py

Of course, the nightly was built to run on GPU, but I assume that won't matter.

Error hit:

distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.4.18:42031'
distributed.worker - INFO -       Start worker at:    tcp://172.16.4.18:42583
distributed.worker - INFO -          Listening to:    tcp://172.16.4.18:42583
distributed.worker - INFO -          dashboard at:          172.16.4.18:36047
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.2.210:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    2.74 GB
distributed.worker - INFO -       Local Directory: /home/ops/h2oai/tmp/dask_worker_files/worker-g3uicahi
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://172.16.2.210:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
[01:21:17] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[01:21:19] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[01:21:23] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[01:21:29] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[01:21:37] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[01:21:37] WARNING: ../rabit/src/allreduce_base.cc:446: failed in ReconnectLink [01:21:37] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused
[01:21:37] WARNING: ../src/learner.cc:1067: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[01:21:37] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[01:21:39] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[01:21:43] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[01:21:49] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[01:21:57] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[01:21:57] WARNING: ../rabit/src/allreduce_base.cc:139: Failed to shutdown due to[01:21:57] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused

It stalls a lot before moving on. The whole script should only take about 1 second.

And then it hangs. Still no illegal error printed, though. I guess it is a related problem, just not GPU-specific.

@trivialfis (Member) commented:

@pseudotensor How's the scheduler doing?

@pseudotensor (Contributor, Author) commented:

FYI:

(base) jon@mr-dl10:/data/jon/h2oai.fullcondatest3$ cat tmp/dask_scheduler.json 
{
  "type": "Scheduler",
  "id": "Scheduler-0eb3a385-a1bd-4992-9c70-9b506e6ca3ea",
  "address": "tcp://172.16.2.210:8786",
  "services": {
    "dashboard": 8787
  },
  "workers": {}
}
(base) jon@mr-dl10:/data/jon/h2oai.fullcondatest3$

Scheduler logs:
(base) jon@mr-dl10:/data/jon/h2oai.fullcondatest3$ cat tmp/dask-scheduler.stderr.log
distributed.scheduler - INFO - -----------------------------------------------
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:   tcp://172.16.2.210:8786
distributed.scheduler - INFO -   dashboard at:         172.16.2.210:8787
distributed.scheduler - INFO - Register worker <Worker 'tcp://172.16.4.18:42583', name: tcp://172.16.4.18:42583, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.16.4.18:42583
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://172.16.2.210:44095', name: tcp://172.16.2.210:44095, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://172.16.2.210:44095
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-6311405e-37a4-11eb-b8ae-0cc47adb058f
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Run out-of-band function '_start_tracker'
distributed.scheduler - INFO - Remove client Client-6311405e-37a4-11eb-b8ae-0cc47adb058f
distributed.scheduler - INFO - Close client connection: Client-6311405e-37a4-11eb-b8ae-0cc47adb058f
distributed.scheduler - INFO - Remove worker <Worker 'tcp://172.16.2.210:44095', name: tcp://172.16.2.210:44095, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://172.16.2.210:44095
(base) jon@mr-dl10:/data/jon/h2oai.fullcondatest3$ 

@pseudotensor (Contributor, Author) commented:

And the local worker with the scheduler never sees errors.

@trivialfis (Member) commented Dec 6, 2020

Will be back in 20 minutes. @pseudotensor Is there a way to reach you offline via voice call? I'm a bit overwhelmed by all the information you have provided and might be more productive if we can talk directly.

@pseudotensor (Contributor, Author) commented Dec 6, 2020

Ya, happy to chat in 20 minutes; email me at pseudotensor@gmail.com.

@pseudotensor (Contributor, Author) commented:

Going back to the PASSED hash from just before the problems started also passes and works fine in CPU-only mode.

@pseudotensor (Contributor, Author) commented Dec 6, 2020

So to summarize, even in CPU mode it hangs hard and shows various errors, even if it doesn't hit the same illegal error.

Steps:

System1)

dask-scheduler --scheduler-file dask_scheduler.json&
dask-worker --scheduler-file dask_scheduler.json&

System2)

dask-worker --scheduler-file dask_scheduler.json&

System1 again)

CUDA_VISIBLE_DEVICES= python dask_cudf_scitkit_issue6469_simple.py

with script:

def fun():
    from dask.distributed import Client

    if True:
        with Client(scheduler_file="dask_scheduler.json") as client:
            import xgboost as xgb

            import pickle
            file1 = "dask_model_398f86ad-bd64-4503-954d-6f7c486a2b11.pickle"
            (model, X, y, kwargs) = pickle.load(open(file1, "rb"))
            params = dict(tree_method='hist', booster='gbtree')
            model = model.__class__(**params)
            import dask.dataframe as dd
            X = dd.from_pandas(X, chunksize=50).persist()
            y = dd.from_pandas(y, chunksize=50).persist()
            model.fit(X, y, **kwargs)
            print("here")

if __name__ == '__main__':
    fun()

and the remote worker hits:

(base) ops@mr-dl18:~/h2oai$ distributed.nanny - INFO -         Start Nanny at: 'tcp://172.16.4.18:37571'
distributed.worker - INFO -       Start worker at:    tcp://172.16.4.18:42723
distributed.worker - INFO -          Listening to:    tcp://172.16.4.18:42723
distributed.worker - INFO -          dashboard at:          172.16.4.18:44161
distributed.worker - INFO - Waiting to connect to:    tcp://172.16.2.210:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                         40
distributed.worker - INFO -                Memory:                  109.69 GB
distributed.worker - INFO -       Local Directory: /home/ops/h2oai/dask-worker-space/worker-pnuctsxo
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://172.16.2.210:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
[01:35:39] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[01:35:41] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[01:35:45] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[01:35:51] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[01:35:59] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[01:35:59] WARNING: ../rabit/src/allreduce_base.cc:446: failed in ReconnectLink [01:35:59] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused
[01:35:59] WARNING: ../src/learner.cc:1067: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
[01:35:59] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]

[01:36:01] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 2): [127.0.0.1]

[01:36:05] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 3): [127.0.0.1]

[01:36:11] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 4): [127.0.0.1]

[01:36:19] WARNING: ../rabit/src/allreduce_base.cc:237: Connect to (failed): [127.0.0.1]

[01:36:19] WARNING: ../rabit/src/allreduce_base.cc:139: Failed to shutdown due to[01:36:19] ../rabit/include/rabit/internal/utils.h:114: Socket Connect Error:Connection refused

and the script never gets past "fit"; it just hangs.

(base) jon@mr-dl10:/data/jon/h2oai.fullcondatest3$ CUDA_VISIBLE_DEVICES= python dask_cudf_scitkit_issue6469_simple.py
/home/jon/minicondadai/lib/python3.6/site-packages/cudf/utils/gpu_utils.py:120: UserWarning: No NVIDIA GPU detected
  warnings.warn("No NVIDIA GPU detected")

@pseudotensor (Contributor, Author) commented:

I didn't show you the dask dashboard when the hang occurs; here it is:

[dask dashboard screenshot]

Incomplete dispatched_train.

@trivialfis (Member) commented:

Hmm, one of the workers is not doing well; I assume that would be the local worker.

@pseudotensor (Contributor, Author) commented:

The remote worker is always the one that has the connection failures.

@trivialfis (Member) commented:

I can't reproduce your network environment. I need to ask for some help from others who are more familiar with network administration.

@trivialfis (Member) commented Dec 6, 2020

Just thinking out loud, feel free to ignore me. There are two issues related to the failure: the first is a rabit worker refusing to connect to the tracker, and the second is that it crashes after that connection failure. There are two systems hosting the scheduler, workers, and client; the two workers are allocated on these two different systems, and one of them is local to the scheduler. The failing rabit worker is trying to connect to localhost to reach the tracker, which is apparently wrong. The tracker address is obtained by the client process running a _start_tracker function on the scheduler, which uses socket.gethostbyname(socket.gethostname()) to resolve the address.

The last part is the most suspicious; it was done to support a reverse-proxy environment and still obtain a correct IP address. Zooming in on the possible issue: what if the client is local to the scheduler, the hostname resolution yields localhost, and that address is then passed to the remote worker residing on a different machine?
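
For what it's worth, here is a quick check of that hypothesis (a sketch assuming a running distributed client; the helper name resolved_tracker_host is made up). It prints what the scheduler's hostname resolves to, which is exactly the value _start_tracker would advertise to rabit workers. If this prints 127.0.0.1 or 127.0.1.1 (common when /etc/hosts maps the machine's hostname to a loopback address), remote workers will be told to connect to their own localhost and fail, matching the "Retry connect to ip ... [127.0.0.1]" messages above.

import socket
from dask.distributed import Client

def resolved_tracker_host():
    # The same resolution _start_tracker performs on the scheduler.
    return socket.gethostbyname(socket.gethostname())

if __name__ == '__main__':
    with Client(scheduler_file="dask_scheduler.json") as client:
        # Run the resolution on the scheduler itself, where the tracker starts.
        print(client.run_on_scheduler(resolved_tracker_host))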

@pseudotensor (Contributor, Author) commented:

Yes, there now seem to be 2 issues, perhaps related, unsure:

  1. A CPU dask cluster fails with rabit "[02:24:05] WARNING: ../rabit/src/allreduce_base.cc:240: Retry connect to ip(retry time 1): [127.0.0.1]" 4 times, then a failure. Then it tries 4 more times and eventually fails with connection refused. The client hangs.

The same number of workers on the same host has no such issue. The systems have no special setup; I see no reason it should be trying to connect to localhost (127.0.0.1).

  2. A similar script run for the GPU case with dask-cuda-workers hits the same rabit message on a broad range of commits, but for roughly the last month of commits there is no rabit message in the GPU case (even though CPU still hits it) and instead there is an invalid pointer error.

@pseudotensor (Contributor, Author) commented Dec 7, 2020

FYI, even if I just have 1 scheduler on system A and 1 worker on system B, the same problem happens. Having a local worker does not matter.

I also tried a different remote system. It also hangs, but differently:

distributed.core - INFO - Event loop was unresponsive in Worker for 5.47s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.

@pseudotensor (Contributor, Author) commented Dec 7, 2020

It still feels race-related. If I install from scratch and try again, I don't see the same rabit error (same systems used), but instead it is super slow to do the same script; there is no reason it should take 40 seconds. Also, the remote worker never shows any logs for the task. It's as if dask just ends up using the local client to run things.

Dask looks like this:

[dask dashboard screenshot]

The dispatched_train task stays at 0/1 for a long time.

But I've tried multiple remote workers against the one (and only) scheduler system, and it's always the same for that script: it completes, but takes ridiculously long.

I have a separate system I put a remote worker on, also with the same fresh install, and that one hangs.

@pseudotensor (Contributor, Author) commented:

FYI, even though the errors/hangs/timings are not consistent, the bisection is consistent.

That is, if I go back to a PASSED case:

https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B8a17610666dd1286fbeb20a96f6aab92ef63a994-py3-none-manylinux2010_x86_64.whl

Then that same scheduler/worker setup on any of the systems always works and is always quick.

So I still think it has to be the dask.py changes I pointed to, @trivialfis. The problems have been too consistently major since that change.

@trivialfis (Member) commented:

I still couldn't reproduce it. It runs pretty well on my setups.

@trivialfis (Member) commented Dec 7, 2020

Could you please try applying this patch and see if it fixes the connection issue? It reverts some of the changes used to support k8s.

diff --git a/python-package/xgboost/dask.py b/python-package/xgboost/dask.py
index 602f7b26..5b8cadcc 100644
--- a/python-package/xgboost/dask.py
+++ b/python-package/xgboost/dask.py
@@ -67,11 +67,9 @@ distributed = LazyLoader('distributed', globals(), 'dask.distributed')
 LOGGER = logging.getLogger('[xgboost.dask]')
 
 
-def _start_tracker(n_workers):
+def _start_tracker(host, n_workers):
     """Start Rabit tracker """
     env = {'DMLC_NUM_WORKER': n_workers}
-    import socket
-    host = socket.gethostbyname(socket.gethostname())
     rabit_context = RabitTracker(hostIP=host, nslave=n_workers)
     env.update(rabit_context.slave_envs())
 
@@ -603,7 +601,9 @@ def _dmatrix_from_list_of_parts(is_quantile, **kwargs):
 
 async def _get_rabit_args(n_workers: int, client):
     '''Get rabit context arguments from data distribution in DaskDMatrix.'''
-    env = await client.run_on_scheduler(_start_tracker, n_workers)
+    host = distributed.comm.get_address_host(client.scheduler.address)
+    env = await client.run_on_scheduler(
+        _start_tracker, host.strip('/:'), n_workers)
     rabit_args = [('%s=%s' % item).encode() for item in env.items()]
     return rabit_args
 
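For context, a small sanity check of what the patched code would use as the tracker host (assuming distributed is installed; the address below is the scheduler address from the logs in this thread). It takes the host part of the scheduler address as seen by the client, rather than whatever the scheduler's own hostname happens to resolve to.

from distributed.comm import get_address_host

# The scheduler address reported in the logs above.
scheduler_address = "tcp://172.16.2.210:8786"

# The patch passes this value to _start_tracker instead of relying on
# socket.gethostbyname(socket.gethostname()) on the scheduler.
print(get_address_host(scheduler_address))  # -> 172.16.2.210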

@trivialfis (Member) commented:

Finally reproduced.

@trivialfis (Member) commented:

The above patch should fix it. But on the other hand, it will break k8s and reverse-proxy environments.

@pseudotensor (Contributor, Author) commented:

Hi @trivialfis, glad you were able to repro. Sorry, I got a bit busy and wasn't yet able to try the patch you suggested. Is there any need to try it anymore, or is it all figured out? Thanks!

How about the invalid pointer error? That only showed up in the GPU case.

@trivialfis (Member) commented:

It's fixed in xgboost, no need to try the patch now. I haven't been able to get the invalid pointer error, as it's a secondary error caused by the one that is now fixed. Right now I don't have another GPU device to form an SSH cluster; I used a laptop and a desktop to reproduce your environment after failing to do so with a virtual machine, and my laptop is CPU-only.
