hang in dask xgboost on CPU #6604

Closed

pseudotensor opened this issue Jan 14, 2021 · 6 comments

@pseudotensor (Contributor) commented Jan 14, 2021:

rapids 0.14
xgboost: dmlc master (also tried 1.3.0)
dask cluster with 2 nodes, using normal dask-scheduler and dask-worker processes
2 EC2 nodes

Hi @trivialfis, how do I diagnose this problem? I understand I'm not giving a repro, but it's the same kind of thing we have discussed before. The main point is that the logs from the dask scheduler, the dask workers, and xgboost show no problems; it just hangs.

xgboost's dask interface can work for several fits, but hangs at arbitrary times.

xgboost is stuck here:

Current thread 0x0000148bdd769700 (most recent call first):
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/python/lib/python3.6/threading.py", line 299 in wait
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/python/lib/python3.6/threading.py", line 551 in wait
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/cpu-only/lib/python3.6/site-packages/distributed/utils.py", line 336 in sync
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/cpu-only/lib/python3.6/site-packages/distributed/client.py", line 832 in sync
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/cpu-only/lib/python3.6/site-packages/xgboost/dask.py", line 1322 in fit
  File "/home/ubuntu/dai-1.9.1-linux-x86_64/cpu-only/lib/python3.6/site-packages/xgboost/core.py", line 422 in inner_f

Dask is stuck here:
[screenshot]

I've tried playing with which IP is used, since EC2 exposes two addresses (public and private), but both hit the same problem.

This seems similar to the earlier rabit problems where the wrong IP was used and things got stuck, but no errors appear in this case.

Also, if I try to reuse the dask scheduler/workers from a separate Python interpreter, that hangs immediately too, as if the original hang is blocking everything.

FYI, for my version of xgboost (1.4.0), line 1322 of dask.py from the traceback above is:

        return self.client.sync(self._fit_async,
                                X=X,
                                y=y,
                                sample_weight=sample_weight,
                                base_margin=base_margin,
                                eval_set=eval_set,
                                eval_metric=eval_metric,
                                sample_weight_eval_set=sample_weight_eval_set,
                                early_stopping_rounds=early_stopping_rounds,
                                verbose=verbose,
                                feature_weights=feature_weights,
                                callbacks=callbacks)
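
For context, what ends up blocking in that client.sync call is essentially a loop like the following (a minimal sketch with made-up random data, not my actual pipeline; the scheduler address is the one from the scheduler file shared later in this thread):

    import dask.array as da
    from dask.distributed import Client
    from xgboost import dask as dxgb

    # Connect to the already-running scheduler.
    client = Client("tcp://10.10.4.103:8786")

    # Small random dataset, chunked so it spreads across both workers.
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.randint(0, 2, size=(100_000,), chunks=(10_000,))

    clf = dxgb.DaskXGBClassifier(n_estimators=50, tree_method="hist")
    for i in range(20):
        clf.fit(X, y)                     # blocks inside client.sync(self._fit_async, ...)
        preds = clf.predict(X).compute()
        print("fit/predict", i, "done")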

@trivialfis (Member) commented:

Is there any warning message? Or could you share your full log?

@pseudotensor (Contributor, Author) commented:

Yes, will share soon.

FYI, when I say I tried two IPs, I mean that these were the compute nodes:

ec2-52-71-252-183.compute-1.amazonaws.com — master
ec2-3-91-224-37.compute-1.amazonaws.com — worker node

and I tried the public IP of 52.71.252.183 as the scheduler address, but hit this kind of error:

https://stackoverflow.com/questions/7640619/cannot-assign-requested-address-possible-causes

I noticed ifconfig says:

ubuntu@ip-10-10-4-103:~/dai-1.9.1-linux-x86_64$ ifconfig 
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9001 
       inet 10.10.4.103 netmask 255.255.255.0 broadcast 10.10.4.255 
       inet6 fe80::cb3:fff:fe72:c349 prefixlen 64 scopeid 0x20<link> 
       ether 0e:b3:0f:72:c3:49 txqueuelen 1000 (Ethernet) 
       RX packets 88070104 bytes 112586690259 (112.5 GB) 
       RX errors 0 dropped 3989 overruns 0 frame 0 
       TX packets 72647075 bytes 88196273944 (88.1 GB) 
       TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

So I tried the nslookup-resolved (private) IP instead. But that leads to hangs, sometimes.
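
In other words, the public 52.x address is NAT-ed and not bound to any local interface on the instance, so the scheduler can only listen on the private 10.x address (or on all interfaces). Roughly (illustrative command lines, not my exact launch invocation):

    # Bind to the private address that ifconfig reports:
    dask-scheduler --host 10.10.4.103 --port 8786

    # or listen on every interface and let clients connect via whichever address routes:
    dask-scheduler --host 0.0.0.0 --port 8786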

I'll provide logs soon.

@pseudotensor (Contributor, Author) commented Jan 14, 2021:

Here are the logs/info for the last attempt, which uses the nslookup-resolved IPs:

scheduler file:

ubuntu@ip-10-10-4-103:~/dai-1.9.1-linux-x86_64/tmp$ cat dai_dask_scheduler.json
{
  "type": "Scheduler",
  "id": "Scheduler-17bf0219-73d9-4e6b-8f58-454c8f934bb8",
  "address": "tcp://10.10.4.103:8786",
  "services": {
    "dashboard": 8787
  },
  "workers": {}

Logs for node with scheduler and worker:

dask-scheduler_worker.zip

Logs for node with just worker connecting to scheduler:

dask-worker.zip

Note that I launch the scheduler/workers via Popen, running the dask-scheduler or dask-worker CLI with unique stdout/stderr files; that's why there are date-time-stamped stdout/stderr files (a rough sketch of this launch pattern is below). There are multiple files because I was trying to see whether I could avoid the hang by restarting the workers, among other things, but it never worked.
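
Roughly, the launch looks like this (a simplified sketch, not my exact code; paths and names are illustrative):

    import datetime
    import subprocess

    stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H_%M_%S.%f")
    stdout = open(f"./tmp/dask_worker_{stamp}.stdout", "w")
    stderr = open(f"./tmp/dask_worker_{stamp}.stderr", "w")

    # Same CLI as in the ps listing below, with per-launch stdout/stderr files.
    proc = subprocess.Popen(
        ["dask-worker", "tcp://10.10.4.103:8786",
         "--pid-file", f"./tmp/dai_dask_worker_{stamp}.pid",
         "--nthreads", "1", "--nprocs", "1", "--protocol", "tcp",
         "--resources", "n_jobs=1",
         "--local-directory", "./tmp/dask_worker_files"],
        stdout=stdout, stderr=stderr)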

The hang is not consistent. E.g. once I switched from the 52.x (public) IP to the 10.x (private) IP, it seemed to work for about 20 fits/predicts, but a later sequence hung after 2-3 fits/predicts, and now roughly every trial seems to hang after 2-3 fits/predicts.

Once xgboost does this, it seems to hang up all of dask, because the work is never finished and is not cancelled. The scheduler then has to be restarted, which makes for a bad experience.

Not sure it matters, but here is the command line that launched it, taken from the ps listing:

/home/ubuntu/dai-1.9.1-linux-x86_64/python/bin/python3.6 /home/ubuntu/dai-1.9.1-linux-x86_64/python/bin/dask-worker tcp://10.10.4.103:8786 --pid-file ./tmp/dai_dask_worker_n_jobs_2021-01-14_02_02_11.519475.pid --nthreads 1 --nprocs 1 --protocol tcp --resources n_jobs=1 --local-directory ./tmp/dask_worker_files

It is the same on both nodes apart from the .pid file name.

FYI, I only mention CPU here because I haven't tried GPU on this setup yet. I've been playing with GPU on a non-EC2 setup (just a local cluster), and the only problem I have there is the other issue of empty data when passing a dask frame; using a dask_cudf frame seems to lower the occurrence, but that has the problem of consuming GPU memory in the client process (not just the dask workers), so it is not a good use of GPU memory.

I'm happy to help diagnose; the problem here is that (unlike the earlier rabit problems) there are no error messages.

I can try going back to before that old commit we know about, to see whether things worked prior to it.

@trivialfis (Member) commented:

How many booster instances are you training?

@trivialfis (Member) commented:

Opened an upstream issue: dask/distributed#4485

@trivialfis (Member) commented:

MultiLock is now used in XGBoost. Please reopen if the issue is still reproducible.
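
(For reference, distributed's MultiLock acquires all of the named locks together, so two concurrent trainings can no longer each hold part of the worker set and deadlock. Illustrative usage only, assuming a recent dask.distributed; this is not the exact XGBoost code:)

    from distributed import Client, MultiLock

    client = Client("tcp://10.10.4.103:8786")

    # Either all named locks are acquired or we wait; partial acquisition,
    # which is what can deadlock two overlapping trainings, cannot happen.
    lock = MultiLock(names=["worker-a", "worker-b"])  # names are illustrative
    lock.acquire()
    try:
        pass  # run the training here
    finally:
        lock.release()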
