hang in dask xgboost on CPU #6604
Comments
Is there any warning message? Or could you share your full log?
Yes, will share soon. FYI, when I say I tried 2 IPs, I mean the compute nodes were: ec2-52-71-252-183.compute-1.amazonaws.com (master), and I tried the normal IP of 52.71.252.183 as the scheduler address, but hit this kind of error: https://stackoverflow.com/questions/7640619/cannot-assign-requested-address-possible-causes
I noticed ifconfig says:
So I tried the nslookup-based IP instead, but that sometimes leads to hangs. I'll provide logs soon.
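For reference, a minimal, hypothetical sketch of the check behind that error (the private 10.0.0.5 address is a placeholder; 52.71.252.183 is the public IP from above). On EC2 the public IP is NAT-mapped and not assigned to any local interface, so binding to it fails with "Cannot assign requested address", while the private/ifconfig address binds fine:

```python
import socket

# Hypothetical illustration: the public EC2 IP is NATed, so it cannot be bound
# locally; only the private interface address (here a placeholder 10.0.0.5) can.
for addr in ["52.71.252.183", "10.0.0.5"]:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((addr, 8786))  # 8786 = default dask-scheduler port
        print(addr, "-> bindable")
    except OSError as exc:
        print(addr, "->", exc)  # e.g. [Errno 99] Cannot assign requested address
    finally:
        s.close()
```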
Here are the logs/info for the last attempt, which used the nslookup-resolved IPs. Scheduler file:
Logs for the node running both the scheduler and a worker:
Logs for the node running just a worker connecting to the scheduler:
Note that I launch the scheduler/worker via popen running the CLI dask-scheduler or dask-worker, with unique stdout/stderr files per launch; that is why there are date-time-stamped stdout/stderr files. There are multiple files because I was trying to see whether I could avoid the hang by restarting the workers, among other things, but that never worked. The hang does not always happen: once I switched from the internal IP of 52.. to 10.., it seemed to work for about 20 fits/predicts, but a later sequence hung after 2-3 fits/predicts, and roughly every trial seems to hang after 2-3 fits/predicts. Once xgboost does this, it seems to hang up all of dask, because the work is never finished and is not cancelled, so even the scheduler has to be restarted; it makes for a bad experience. Not sure it matters, but here is the command line that launched it, from the ps listing:
Same on both nodes apart from the .pid file name. FYI, I only mention CPU here because I haven't tried GPU yet. I've been playing with GPU on a non-ec2 setup (just a local cluster), and the only thing I have problems with there is the other issue of empty data when passing a dask frame; using a dask_cudf frame seems to lower the occurrence, but that has the problem of eating GPU memory in the client process (not just the dask worker), so it is not a good use of GPU memory. I'm happy to help diagnose; the problem here is that (unlike the earlier rabit problems) there are no error messages. I can try going back to before that old commit we know about to see if things worked previously.
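For clarity, a rough sketch of the launch pattern described above; the flags, file names, and scheduler-file path are placeholders, not the literal command from the ps listing:

```python
import datetime
import subprocess

# Rough sketch of the popen-based launch with per-launch stdout/stderr files
# (placeholder names/flags, not the literal command from the ps listing).
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

def launch(cmd, name):
    out = open(f"{name}.{stamp}.stdout", "w")
    err = open(f"{name}.{stamp}.stderr", "w")
    return subprocess.Popen(cmd, stdout=out, stderr=err)

# On the master node:
scheduler = launch(["dask-scheduler", "--scheduler-file", "scheduler.json"],
                   "dask-scheduler")
# On every node (master included):
worker = launch(["dask-worker", "--scheduler-file", "scheduler.json"],
                "dask-worker")
```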
How many booster instances are you training?
Opened an issue in dask/distributed#4485
Multi-lock is used in XGBoost. Please reopen if the issue is still reproducible.
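For anyone hitting this later: the fix relies on dask.distributed's MultiLock (added via the issue linked above), which lets a client acquire a set of named locks (for example, the workers a training session will use) atomically, so two concurrent training sessions cannot each grab part of the other's workers and deadlock. A minimal sketch of the primitive, with placeholder scheduler and worker addresses, not XGBoost's internal code:

```python
from distributed import Client, MultiLock

client = Client("tcp://10.0.0.5:8786")  # placeholder scheduler address

# Lock every worker a training session will touch, all-or-nothing, so that
# two concurrent sessions cannot each hold part of the other's workers.
workers = ["tcp://10.0.0.5:40001", "tcp://10.0.0.6:40001"]  # placeholder workers
with MultiLock(names=workers):
    # run the blocking collective work (e.g. one training call) here
    pass
```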
rapids 0.14
xgboost: dmlc master, and also tried 1.3.0
dask cluster with 2 nodes, with a normal dask-scheduler and dask-workers
2 ec2 nodes
Hi @trivialfis, how do I diagnose this problem? I understand I'm not providing a repro, but it's the same kind of thing we have discussed before. The main issue is that the logs from the dask scheduler, the dask workers, and xgboost show no problems; it just hangs.
xgboost dask can work for several fits, but hangs at arbitrary times (a minimal sketch of the kind of loop involved is included below this list)
xgboost is stuck here:
Dask is stuck here:
I've tried playing with which IP is used, since ec2 nodes have 2 addresses, but both hit the same problem
Seems somewhat similar to the earlier rabit problems where the wrong IP was used and things got stuck, but no errors appear in this case.
Also, if I try to re-use the dask scheduler/workers from a separate Python interpreter, that hangs immediately too, as if the original hang is blocking things.
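To make the pattern concrete, here is a minimal sketch of the kind of fit/predict loop that hangs, with synthetic data and a placeholder scheduler address (my real pipeline is more involved):

```python
import dask.array as da
import xgboost as xgb
from distributed import Client

# Minimal sketch of the fit/predict loop that hangs (synthetic data and a
# placeholder scheduler address, not my actual pipeline).
client = Client("tcp://10.0.0.5:8786")

X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = da.random.random(100_000, chunks=10_000)
dtrain = xgb.dask.DaskDMatrix(client, X, y)

for i in range(10):
    # Usually the first 2-3 iterations complete; a later one blocks forever
    # with nothing in the scheduler, worker, or xgboost logs.
    output = xgb.dask.train(client, {"tree_method": "hist"}, dtrain,
                            num_boost_round=10)
    preds = xgb.dask.predict(client, output["booster"], X)
    print(i, float(preds.mean().compute()))
```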
FYI, for my version of xgboost (1.4.0), the line mentioned above (dask.py line 1322) is: