KeyError and Worker already exists #169
See #117, this error is seen when workers die and are restarted, often due to out-of-memory errors. The message should be corrected on the master branch, but the underlying issue coming from your dask process will remain. Try increasing the memory per worker, and use the dashboard to monitor your worker processes and see if you spot something wrong.
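One way to raise the memory per worker with dask-jobqueue is through its YAML configuration file; a minimal sketch for the LSF backend, with illustrative resource values (adjust to your cluster):

```yaml
# ~/.config/dask/jobqueue.yaml — per-worker resources for dask-jobqueue
jobqueue:
  lsf:
    cores: 4
    memory: 16GB    # memory requested per worker job
    walltime: "01:00"
```

The same values can also be passed directly as keyword arguments, e.g. `LSFCluster(cores=4, memory="16GB")`.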
Thank you! Yes, I see the memory increasing a bit too much; I'll try increasing the limit in a few hours and hope it solves my problem. However, if a node unexpectedly encounters a problem that causes the worker to restart, shouldn't dask restart the computation as well? And shouldn't the Cluster object consider the jobs as running again?
Dask provides a mechanism that relaunches your failed tasks. I believe it tries to launch each task three times, but I'm not sure exactly in which cases. However, the memory problem in your case will probably show up every time, and eventually your computation should fail. Glad the problem is identified here; I will close this issue, but feel free to raise something upstream in dask or distributed if you believe there is a problem in the task retry mechanism!
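Besides the scheduler's automatic handling of dead workers, distributed also exposes a per-task `retries` keyword on `Client.submit`. A minimal sketch (assuming a recent `dask.distributed`), with a deterministic function standing in for a real workload that might die from OOM:

```python
from dask.distributed import Client

# In-process cluster, just for illustration.
client = Client(processes=False)

def task(x):
    # A real workload might be killed mid-run; this stand-in always succeeds.
    return x * 2

# retries=3 asks the scheduler to re-run the task up to three more times
# if it fails before a result is returned.
future = client.submit(task, 21, retries=3)
result = future.result()
print(result)  # 42
client.close()
```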
Yes, you were right, the used memory grows a lot before the KeyError happens. Is there a way to signal that there is a memory error? Not a message but an exception or a special return type. A good test to identify those issues is:
In my case it shows some
and exactly 4, so I think you were right about the three restarts. What is funny is that it only crashes after the 4 attempts, while executing other tasks in the meantime. Thank you a lot for your help!
A pleasure to help!
You should try to ask this upstream in distributed; I imagine some thought has gone into this behavior.
Yes, I can confirm it happens as well with a LocalCluster, so nothing to do with dask-jobqueue!
A reproducible example (perhaps using one of the sklearn make_dataset functions) that replicates this with a LocalCluster would be a great contribution if you have time.
A reproducible example with tpot that runs fast enough involves restricting the operators to force it to run PolynomialFeatures. I should be able to produce an example that causes an error in dask_ml. Maybe I should open an issue there? However, I think the problem comes from dask.distributed, as the error also happens with the joblib backend provided by dask (see EpistasisLab/tpot#779).
I think I need some help to produce a reproducible example that runs without a cluster. I am not sure how to trigger a memory error on my laptop; I fear that it will either use the swap memory or restart the computer. I produced an error with a LocalCluster on a notebook started with LSF bsub with a soft memory limit. Maybe using ulimit will work?
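One way to get a controlled MemoryError on a laptop is to lower the process's address-space limit with the stdlib `resource` module (the programmatic equivalent of `ulimit -v`). A sketch, assuming Linux, where `RLIMIT_AS` is enforced (it is not on macOS):

```python
import resource

# Lower the soft address-space limit to ~1 GB so an oversized allocation
# raises MemoryError instead of swapping or freezing the machine.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
resource.setrlimit(resource.RLIMIT_AS, (1 * 1024**3, hard))

caught = False
try:
    blob = bytearray(2 * 1024**3)  # ~2 GB, exceeds the 1 GB cap
except MemoryError:
    caught = True
print("MemoryError raised:", caught)
```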
Probably the error doesn't strictly require a memory error, but perhaps just needs a worker to remove itself during computation?
Alternatively, if you think you can identify the failure and submit a fix without creating a minimal reproducer, that would also be useful.
I think that starting a dask-worker with the --memory-limit option will do the trick. ulimit doesn't work at all on macOS and doesn't effectively limit memory on Linux.
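For reference, a sketch of such a worker invocation (the scheduler address is hypothetical):

```shell
# Spill to disk, pause, and eventually restart the worker process
# as it approaches the 4 GB limit.
dask-worker tcp://127.0.0.1:8786 --memory-limit 4GB --nthreads 1
```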
I'm trying to set up dask with tpot.
My code looks like this:
and I keep getting those annoying errors:
I think there might be a problem with LSFCluster, because it puts a lot of workers in
cluster.finished_jobs
that are still running according to bjobs and even to the dask.distributed web interface.