hang in dask xgboost on CPU #6604

Closed

pseudotensor opened this issue Jan 14, 2021 · 6 comments

@pseudotensor (Contributor) commented Jan 14, 2021:

rapids 0.14
xgboost: dmlc master (also tried 1.3.0)
dask cluster with 2 nodes, using normal dask-scheduler and dask-worker processes
2 EC2 nodes

Hi @trivialfis, how do I diagnose this problem? I understand I'm not giving a repro, but it's the same kind of thing we have discussed before. The main point is that the logs from the dask scheduler, the dask workers, and xgboost show no problems; it just hangs.

xgboost's dask interface can work for several fits, but hangs at arbitrary times.

xgboost is stuck here:

Current thread 0x0000148bdd769700 (most recent call first):
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/python/lib/python3.6/threading.py", line 299 in wait
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/python/lib/python3.6/threading.py", line 551 in wait
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/cpu-only/lib/python3.6/site-packages/distributed/utils.py", line 336 in sync
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/cpu-only/lib/python3.6/site-packages/distributed/client.py", line 832 in sync
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/cpu-only/lib/python3.6/site-packages/xgboost/dask.py", line 1322 in fit
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/cpu-only/lib/python3.6/site-packages/xgboost/core.py", line 422 in inner_f

Dask is stuck here:
[screenshot]

I've tried playing with which IP is used, since EC2 exposes two addresses (public and private), but both hit the same problem.

This seems similar to the earlier rabit problems where the wrong IP was used and things got stuck, but no errors appear in this case.

Also, if I try to reuse the dask scheduler/workers from a separate Python interpreter, that hangs immediately too, as if the original hang is blocking everything.

FYI, for my version of xgboost (1.4.0), line 1322 of dask.py from the traceback above is:

        return self.client.sync(self._fit_async,
                                X=X,
                                y=y,
                                sample_weight=sample_weight,
                                base_margin=base_margin,
                                eval_set=eval_set,
                                eval_metric=eval_metric,
                                sample_weight_eval_set=sample_weight_eval_set,
                                early_stopping_rounds=early_stopping_rounds,
                                verbose=verbose,
                                feature_weights=feature_weights,
                                callbacks=callbacks)
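
For context, what ends up blocking in that client.sync call is essentially a loop like the following (a minimal sketch with made-up random data, not my actual pipeline; the scheduler address is the one from the scheduler file shared later in this thread):

    import dask.array as da
    from dask.distributed import Client
    from xgboost import dask as dxgb

    # Connect to the already-running scheduler.
    client = Client("tcp://10.10.4.103:8786")

    # Small random dataset, chunked so it spreads across both workers.
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

    clf = dxgb.DaskXGBClassifier(n_estimators=50, tree_method="hist")
    for i in range(20):
        clf.fit(X, y)                     # blocks inside client.sync(self._fit_async, ...)
        preds = clf.predict(X).compute()
        print("fit/predict", i, "done")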

@trivialfis (Member) commented:

Is there any warning message? Or could you share your full log?

@pseudotensor (Contributor, Author) commented:

Yes, will share soon.

FYI, when I say I tried two IPs, I mean that these were the compute nodes:

ec2-52-71-252-183.compute-1.amazonaws.com — master
ec2-3-91-224-37.compute-1.amazonaws.com — worker node

and I tried the public IP of 52.71.252.183 as the scheduler address, but hit this kind of error:

https://stackoverflow.com/questions/7640619/cannot-assign-requested-address-possible-causes

I noticed ifconfig says:

ubuntu@ip-10-10-4-103:~/dai-1.9.1-linux-x86_64$ ifconfig 
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001 
       inet 10.10.4.103 netmask 255.255.255.0 broadcast 10.10.4.255 
       inet6 fe80::cb3:fff:fe72:c349 prefixlen 64 scopeid 0x20<link> 
       ether 0e:b3:0f:72:c3:49 txqueuelen 1000 (Ethernet) 
       RX packets 88070104 bytes 112586690259 (112.5 GB) 
       RX errors 0 dropped 3989 overruns 0 frame 0 
       TX packets 72647075 bytes 88196273944 (88.1 GB) 
       TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

So I tried the nslookup-resolved (private) IP instead. But that leads to hangs, sometimes.
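
In other words, the public 52.x address is NAT-ed and not bound to any local interface on the instance, so the scheduler can only listen on the private 10.x address (or on all interfaces). Roughly (illustrative command lines, not my exact launch invocation):

    # Bind to the private address that ifconfig reports:
    dask-scheduler --host 10.10.4.103 --port 8786

    # or listen on every interface and let clients connect via whichever address routes:
    dask-scheduler --host 0.0.0.0 --port 8786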

I'll provide logs soon.

@pseudotensor (Contributor, Author) commented Jan 14, 2021:

Here are the logs/info for the last attempt, which uses the nslookup-resolved IPs:

scheduler file:

ubuntu@ip-10-10-4-103:~/dai-1.9.1-linux-x86_64/tmp$ cat dai_dask_scheduler.json
{
  "type": "Scheduler",
  "id": "Scheduler-17bf0219-73d9-4e6b-8f58-454c8f934bb8",
  "address": "tcp://10.10.4.103:8786",
  "services": {
    "dashboard": 8787
  },
  "workers": {}

Logs for node with scheduler and worker:

dask-scheduler_worker.zip

Logs for node with just worker connecting to scheduler:

dask-worker.zip

Note that I launch the scheduler/workers via Popen, running the dask-scheduler or dask-worker CLI with unique stdout/stderr files; that's why there are date-time-stamped stdout/stderr files (a rough sketch of this launch pattern is below). There are multiple files because I was trying to see whether I could avoid the hang by restarting the workers, among other things, but it never worked.
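
Roughly, the launch looks like this (a simplified sketch, not my exact code; paths and names are illustrative):

    import datetime
    import subprocess

    stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H_%M_%S.%f")
    stdout = open(f"./tmp/dask_worker_{stamp}.stdout", "w")
    stderr = open(f"./tmp/dask_worker_{stamp}.stderr", "w")

    # Same CLI as in the ps listing below, with per-launch stdout/stderr files.
    proc = subprocess.Popen(
        ["dask-worker", "tcp://10.10.4.103:8786",
         "--pid-file", f"./tmp/dai_dask_worker_{stamp}.pid",
         "--nthreads", "1", "--nprocs", "1", "--protocol", "tcp",
         "--resources", "n_jobs=1",
         "--local-directory", "./tmp/dask_worker_files"],
        stdout=stdout, stderr=stderr)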

The hang is not consistent. E.g. once I switched from the 52.x (public) IP to the 10.x (private) IP, it seemed to work for about 20 fits/predicts, but a later sequence hung after 2-3 fits/predicts, and now roughly every trial seems to hang after 2-3 fits/predicts.

Once xgboost does this, it seems to hang up all of dask, because the work is never finished and is not cancelled. The scheduler then has to be restarted, which makes for a bad experience.

Not sure it matters, but here is the command line that launched it, taken from the ps listing:

/home/ubuntu/dai-1.9.1-linux-x86_64/python/bin/python3.6 /home/ubuntu/dai-1.9.1-linux-x86_64/python/bin/dask-worker tcp://10.10.4.103:8786 --pid-file ./tmp/dai_dask_worker_n_jobs_2021-01-14_02_02_11.519475.pid --nthreads 1 --nprocs 1 --protocol tcp --resources n_jobs=1 --local-directory ./tmp/dask_worker_files

It is the same on both nodes apart from the .pid file name.

FYI, I only mention CPU here because I haven't tried GPU on this setup yet. I've been playing with GPU on a non-EC2 setup (just a local cluster), and the only problem I have there is the other issue of empty data when passing a dask frame; using a dask_cudf frame seems to lower the occurrence, but that has the problem of consuming GPU memory in the client process (not just the dask workers), so it is not a good use of GPU memory.

I'm happy to help diagnose; the problem here is that (unlike the earlier rabit problems) there are no error messages.

I can try going back to before that old commit we know about, to see whether things worked prior to it.

@trivialfis (Member) commented:

How many booster instances are you training?

@trivialfis (Member) commented:

Opened an upstream issue: dask/distributed#4485

@trivialfis (Member) commented:

MultiLock is now used in XGBoost. Please reopen if the issue is still reproducible.
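
(For reference, distributed's MultiLock acquires all of the named locks together, so two concurrent trainings can no longer each hold part of the worker set and deadlock. Illustrative usage only, assuming a recent dask.distributed; this is not the exact XGBoost code:)

    from distributed import Client, MultiLock

    client = Client("tcp://10.10.4.103:8786")

    # Either all named locks are acquired or we wait; partial acquisition,
    # which is what can deadlock two overlapping trainings, cannot happen.
    lock = MultiLock(names=["worker-a", "worker-b"])  # names are illustrative
    lock.acquire()
    try:
        pass  # run the training here
    finally:
        lock.release()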
