Error when scaling a function over a wider range of parameters (LSFCluster) #139
We've seen this error in other cases, see #117 for example. Currently, we believe it is due to workers restarting for some reason, which reveals some bug in the dask-jobqueue code, see #138. So the KeyError is probably not the root cause of your problem, just a consequence. It would be nice if you could check your worker logs and see if you see any restarts. This is often due to an out-of-memory problem, but apparently that is not your case.
I searched for useful logging information; this is what I found so far:
So there were indeed restarts. Hope this helps...
What is your configuration of LSFCluster (either in jobqueue.yaml, or your constructor parameters)? From what you say, I suspect you use a shared space for the worker local-directory.
I haven't changed the jobqueue.yaml defaults. In my constructor, I use:
Regarding a shared space for the worker local-directory - can you elaborate on this? I'm not sure I understand the problem (or how to try and fix it).
See https://dask-jobqueue.readthedocs.io/en/latest/configuration-setup.html#local-storage. It is highly recommended to use space local to the computing nodes for the local-directory kwarg: /tmp, /scratch or anything you've got on your cluster. I don't know if this is what is causing your workers to restart, though. Could you print the log file around
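For illustration, a minimal sketch of what that could look like as a constructor argument - queue, cores, memory and walltime below are placeholders; local_directory is the relevant part, and the same setting can live in jobqueue.yaml as local-directory:

```python
from dask_jobqueue import LSFCluster

# Sketch only: queue/cores/memory/walltime are placeholder values.
# local_directory should point at node-local disk (/tmp, /scratch, ...),
# not at a shared filesystem, so workers don't fight over lock files.
cluster = LSFCluster(
    queue="normal",
    cores=4,
    memory="8GB",
    walltime="01:00",
    local_directory="/tmp",
)
cluster.scale(10)  # ask for 10 worker jobs
```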
Does your job scheduler kill specific processes when they grow beyond some requested resource threshold?
Most of these warnings were accompanied by variants of:
And yes - LSF kills jobs that reach their maximum allocated memory. But like you said - I don't think this is what happens here... ADD: General question - do
Regarding the local space - I think I've found the corresponding environment variable in LSF - aptly named LSF_TMPDIR. :-) However, it's not clear to me how to pass this to the cluster.
But I got an error. Is there a better way to configure the LSF cluster to use LSF_TMPDIR as the local directory?
Looks like the variable isn't straightforward to use here. Otherwise, just try to use some local disk like /tmp or /scratch.
The nodes have both /tmp and /scratch.
I'm using PBSCluster, which does not define common .err or .out files. However, if I did this, they would be created for each new job I launched, i.e. every time a single dask-worker job ends. This may not be a good default setting... With the default PBSCluster, I get one output file per worker job. I would recommend adding the %J meta character in the template as a PR (see https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_command_ref/bsub.e.1.html), or simply removing this from the template.
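If it helps, a rough sketch of how %J could be set from the LSFCluster side, assuming a dask-jobqueue version that exposes a job_extra argument (the file names are made up; %J is expanded by LSF to the job id):

```python
from dask_jobqueue import LSFCluster

# Sketch only: assumes job_extra is available and that each entry ends up
# as a "#BSUB ..." line in the generated job script. %J is the LSF job id,
# so each dask-worker job writes to its own stdout/stderr files.
cluster = LSFCluster(
    cores=4,
    memory="8GB",
    job_extra=[
        "-o dask_worker.%J.out",
        "-e dask_worker.%J.err",
    ],
)
print(cluster.job_script())  # inspect the generated #BSUB header
```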
You should not see any more dask-worker-space locks or things like that. But apparently this is not the root cause of the problem. Could you:
Ideally, we would need a small reproducible example here.
Sure - here's a toy script that reproduces the problem:
NEST is a neural simulator kernel with a Python interface. As far as I understand, it needs to run in its own dedicated folder, hence all the directory handling in the script. When I run this with
Hm, hard to reproduce as it needs the installation of NEST; I will see if I can easily install it or not. Do you reproduce the behavior without NEST? Could it be related to it?
I couldn't reproduce the behaviour without NEST, so it's probably related. I'll try to reproduce it locally (on a single multicore machine with dask and NEST) and will post an update.
I can confirm that the error is not NEST related - I get the same error.
Could you provide a reproducible example of this problem with numpy only?
Sure - it's a simple function that takes a list of "event times" and "event senders" and returns a binarized raster of size n_senders x n_time_bins:
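For reference, such a function can look roughly like the sketch below - the bin width, dtype, and argument names are illustrative, not the exact ones used:

```python
import numpy as np

def binarize_raster(event_times, event_senders, n_senders, t_max, bin_size=0.001):
    """Return an (n_senders, n_time_bins) array with 1 where a sender
    emitted at least one event in a time bin (sketch only)."""
    n_time_bins = int(np.ceil(t_max / bin_size))
    raster = np.zeros((n_senders, n_time_bins), dtype=np.uint8)
    times = np.asarray(event_times, dtype=float)
    senders = np.asarray(event_senders, dtype=int)      # assumed 0-based ids
    bins = np.minimum((times / bin_size).astype(int), n_time_bins - 1)
    raster[senders, bins] = 1
    return raster
```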
When I do:
It works fine. But when I do:
I get the above-mentioned errors.
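For context, the submission pattern described here is presumably along these lines - the cluster options, file names, .npz keys and the process_file wrapper are all illustrative; the working and failing cases only differ in how many files are mapped over:

```python
import numpy as np
from dask.distributed import Client
from dask_jobqueue import LSFCluster

def process_file(path):
    # Hypothetical wrapper: load recorded events and binarize them
    # using the binarize_raster sketch above (the .npz keys are made up).
    data = np.load(path)
    return binarize_raster(data["times"], data["senders"],
                           n_senders=100, t_max=10.0)

cluster = LSFCluster(cores=4, memory="8GB", local_directory="/tmp")  # placeholders
cluster.scale(20)
client = Client(cluster)

few_files = ["events_0.npz", "events_1.npz"]            # small run: works
many_files = [f"events_{i}.npz" for i in range(200)]    # large run: errors appear

futures = client.map(process_file, many_files)
results = client.gather(futures)
```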
I'm missing your recording objects here.
Another question: did you try to use the Dask dashboard to diagnose any potential problem (memory, CPU, or any contention)?
I'm attaching an example of such a file - you can copy it a few times with different suffixes to reproduce what I did above...
Hm, it seems I still need nest with your example:
I again recommend looking at the Dask dashboard to see if anything weird (memory or CPU issues?) arises.
@adamhaber any update on this? Otherwise I'm going to close it; it seems really specific to your environment or computation.
@adamhaber I'm closing this one. Right now we don't have enough information to help here. Feel free to try the latest dask-jobqueue and dask versions and see if you see some improvement.
Hi,
I'm trying to compute some function over a wide range of parameters using LSFCluster. The general outline is as follows:
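A sketch of this kind of setup (not the actual script - the cluster options are placeholders, and f stands in for the real computation):

```python
from itertools import product

from dask.distributed import Client
from dask_jobqueue import LSFCluster

def f(x, y):
    # Stand-in for the real computation, which also writes temporary files.
    return x + y

cluster = LSFCluster(cores=1, memory="4GB")   # placeholder resources
cluster.scale(50)
client = Client(cluster)

X = Y = range(20)                             # 20x20 grid works; range(100) fails
futures = [client.submit(f, x, y) for x, y in product(X, Y)]
results = client.gather(futures)
```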
When I try to compute with X=Y=20, everything goes smoothly. However, when I increase the range of parameters over which I'm computing f(x,y) (for example X=Y=100), I get an error message I don't understand:
Just to be sure, I ran bjobs -r and indeed I have a job running with job id 856313. I get similar error messages for many other workers.
Some more info which might be relevant:
- When I set f to be some really simple function (f(x,y)=x+y), the problem disappears.
- f writes temporary files to a tmp directory - each f(x,y) creates its own temp directory - perhaps this involves too many disk operations?
Any help would be much appreciated!