AssertionError (assert count > 0) in SLURMCluster._adapt #222
So looking at https://github.com/dask/distributed/blob/master/distributed/scheduler.py#L2644, it looks like the assertion comes from the scheduler itself.
@mrocklin can you confirm or refute this? The solutions I see: remove every reference to data you don't need, or just use ...
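For illustration, a sketch of what the first suggested workaround (removing every reference to data you don't need) can look like; the Client() call and the array are made up for the example and are not from the original session:

```python
from dask.distributed import Client
import dask.array as da

client = Client()   # throwaway local cluster, just for the example
x = client.persist(da.ones((1000, 1000), chunks=(100, 100)))

# Drop the client-side reference so that no worker still needs to keep the
# underlying results alive when adaptive scaling decides to close it.
del x

# client.cancel(...) is another existing way to release specific futures.
```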
@guillaumeeb's interpretation is fundamentally correct, with a small clarification:
There are no longer any workers that still have this data. It has been lost somehow. There may still be workers, or those workers may have died from some other cause than adapt. This seems like a bug to me. I don't have thoughts on how it might arise. It would be good to resolve.
I think I got this error in a session where cluster.adapt(minimum=0, maximum=10). But my code is eager at retrieving the results and I think the cluster should not destroy the last worker as long as there is still living data on it.
I agree that it shouldn't be doing this. The adaptive code takes into
account the presence of data when determining if it can kill a worker.
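To make that concrete, the information the adaptive logic cares about (which workers still hold data) can also be inspected from the client side; has_what() is an existing Client method, and the Client() call below just starts a throwaway local cluster for the example:

```python
from dask.distributed import Client

client = Client()   # or connect to your existing scheduler instead
for worker, keys in client.has_what().items():
    print(worker, "holds", len(keys), "keys in memory")
```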
Hm, so this would be an adaptive bug. We would need to find a small example which reproduces this, ideally with LocalCluster if possible. But it might come from how ...
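For reference, a sketch of the kind of small LocalCluster-based example being asked for; the sizes and adapt bounds are arbitrary, and this particular snippet is not known to actually trigger the assertion:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=0)      # start empty, let adapt add workers
cluster.adapt(minimum=0, maximum=4)
client = Client(cluster)

x = client.persist(da.random.random((10000, 1000), chunks=(1000, 1000)))
print(x.sum().compute())                 # retrieve results while adapt may scale down
```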
@ogrisel do you still encounter this bug?
I can confirm this issue is still present (when used alongside distributed version 1.28.1), but I haven't been able to deterministically reproduce it unfortunately.
Can anyone give feedback on this one after the big rewrite #306 that just occurred?
@guillaumeeb I've been using the new #306 release quite a bit and haven't seen this once so far, so it seems like this has possibly been fixed.
I am no longer using dask on SLURM clusters so feel free to close if nobody else has seen this issue in a while.
Thanks @SimonBoothroyd and @ogrisel for the updates. Let's close this then.
Hey guys, I'm seeing this exact same error using SLURMCluster. I've tried both Conda environments ...
Here is the sequence of code I'm running, followed by the error, which is basically the same as @ogrisel's error.
Error
I know it is a rather big ask, but without a stand-alone snippet reproducing the problem it is not going to be easy to debug ...
@lesteve Sorry for the likely naive question, but can you explain what you mean by a "stand-alone snippet"?
No worries! It means some code that reproduces the problem and that anyone can run on their computer. You can read https://stackoverflow.com/help/mcve for more details.
Thanks! I'll have to think about how to make this feasible. The biggest hurdle in my specific case is that the data I'm using are proprietary, so I won't be able to hand that over.
I am also having this issue with an LSF cluster. The data has been persisted to the cluster in my case, but it seems that the workers get closed before they can transfer the operands, leading to recomputation of some pretty expensive operations. If I prevent the workers from being closed via adapt parameters the issue is not seen. I believe I can reproduce this case with generic data, but I don't believe it would be possible to do with a local cluster. Is an MRE which relies on a jobqueue cluster useful?
That would be very useful indeed! This is likely a distributed issue.
Alright, here's what I wound up with. Clearly you'll need to fill out the cluster startup for your own system. I think the results may be dependent on network and worker performance; that can probably be minimized by configuring adapt to be more aggressive about closing workers. Perhaps:

import dask.dataframe as dd
import dask_ml.datasets
from dask.distributed import Client
import dask
def start_cluster():
    # Replace these two lines with your HPC cluster startup.
    # I used 20*2**30 for worker memory.
    from dask_jobqueue import LSFCluster
    cluster = LSFCluster()
    cluster.adapt(minimum=0, maximum=4)
    client = Client(cluster)
    return client

client = start_cluster()
n_features = 250000
X = dask_ml.datasets.make_blobs(n_samples=10000, n_features=n_features, chunks=(250, n_features))
dfx = dd.from_dask_array(X[0])
dfx.npartitions
dfx = client.persist(dfx)
dfx['id'] = 1
dfx['id'] = dfx['id'].cumsum()
dfx.set_index('id', sorted=True)
dfx.rename(columns=str).to_parquet('mre.parq')
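For what it's worth, one possible way to make adapt more aggressive about closing workers, as suggested above; interval and wait_count are existing distributed.Adaptive options, the values are only illustrative, and this reuses the cluster object from the snippet:

```python
# Re-evaluate the cluster size every second and retire a worker after a single
# "not needed" observation instead of waiting for several consecutive ones.
cluster.adapt(minimum=0, maximum=4, interval="1s", wait_count=1)
```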
Thanks for the snippet! For future reference, it would be very nice if you could edit your message and add the following info: the full stack trace and the package versions you are using.
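For reference, a sketch of how that information can be gathered; get_versions is an existing Client method, and the Client() call is only a stand-in for connecting to your own cluster:

```python
from dask.distributed import Client

client = Client()                             # or connect to your existing scheduler
versions = client.get_versions(check=True)    # check=True flags client/scheduler/worker mismatches
print(versions["client"]["packages"])         # package versions seen by the client
```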
Oh geeze, sorry, that would be helpful, wouldn't it. Stack trace and versions are below, although the details tag seems to have negatively impacted the formatting :( This example takes less than 10 minutes to run with four machines on my cluster, and it has reproduced the error every time (n=3). Please do let me know if I can provide anything else that would be helpful!

Stack Trace
client.get_versions()
dask_jobqueue.__version__ is 0.7.0
Great thanks, I have edited your message (formatting inside <details> ...).
I am going to reopen this. I am not sure I will have much time in the short term, but I'll try to reproduce.
I am seeing this same error (as pbrady32). I am using adaptive scaling with an SGECluster. I had it running without any issues before - I can't be sure, but the only new aspect added is writing to Parquet. Perhaps something related to both adaptive scaling and Parquet?
Here are more details from my situation. It seems to be generic to jobqueue or distributed (since I am using SGECluster)?
Let me know if I can offer more details to help.

My versions
Error stacktrace
This is very likely a distributed issue. Just in case, can you try with the latest distributed version?
I also hit this error in some cases when using the ConductorCluster from #472 in adaptive mode. Similar to previous commenters, the error does not occur in all cases and happens sporadically. Re-running the cell typically is successful. Environment:
Thanks @jennalc for the update. Are you also using Parquet, or is it unrelated to that?
@guillaumeeb - No, I was not using parquet, but was working with a large CSV.
Some operations, e.g. ...
OK, thanks both. Closing this again as it is quite probably linked to a wrong behaviour of some functions when using adaptive clusters, not related to Dask-jobqueue specifically. But I'd say that this is probably very visible with it, as adaptive workers can come and go very fast in a cluster with some free resources!
I ran into this also with my cluster implementation (mostly HTCondorCluster, but using the htcondor bindings directly and implementing async) and am wondering if the root cause may also explain why I saw some cases where the cluster lost a future and had to recompute rather large chunks of the task graph. Is it possible the adaptive deployment is somehow force-closing the workers before the futures have a chance to migrate? I've been struggling to understand exactly when and how the async ...
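Not an answer to the async close question, but a hedged sketch of one way to guard an important result against a single worker being retired while this is investigated; replicate, who_has and wait are existing distributed APIs, and the example data is made up:

```python
import dask.array as da
from dask.distributed import Client, wait

client = Client()                        # illustrative local client
x = da.random.random((2000, 2000), chunks=(500, 500))
fut = client.compute(x.sum())            # a result worth protecting

wait(fut)                                # make sure it is in memory somewhere
client.replicate([fut], n=2)             # keep copies on at least two workers
print(client.who_has([fut]))             # see which workers hold the key
```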
Adaptive clusters are a tricky set-up. There might still be some edge cases where it fails to close things properly. But I think you'd better ask this question on the distributed issue tracker, though not before having some code that reproduces things a bit ;). I know, this is sometimes almost impossible...
When I run a SLURM cluster with adapt() I sometimes get the following crash (but this is not deterministic and I have not identified a way to trigger it more often).
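For completeness, a minimal sketch of the kind of setup described in this report (a SLURMCluster in adaptive mode); the resource arguments are placeholders, not values from the original environment:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")  # placeholder resources
cluster.adapt(minimum=0, maximum=10)   # let the cluster scale between 0 and 10 jobs
client = Client(cluster)
```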