setting distributed.scheduler.allowed-failures to 0 does not always work #6078
Comments
I wrapped something similar to the above code in a unit test. It works fine locally (the number of reruns is not deterministic, but it finishes). When it runs on CircleCI, however, it seems to get stuck retrying (far more retries than expected) and eventually fails with a different exception.
It appears that your function finishes before the worker can be killed. The scheduler receives notice about the finished function and tries to gather the result (this is why you see the messages about "cannot gather key"). Only once it is gathering does the worker die, and the scheduler needs to reschedule everything. Since the task was already successful, the scheduler does not correlate the worker failure with the execution of the task but thinks something else must have killed it. If you change your example to

```python
from time import sleep
import numpy as np
import pandas as pd
from dask import delayed

@delayed
def f1():
    print("running f1")
    df = pd.DataFrame(dict(row_id=np.zeros(10000000)))
    sleep(5)  # keep the task running so the worker can be killed mid-execution
    print("done running f1")
    return df
```

you should see it terminate deterministically. The CommClosedError is a different issue. We're seeing similar failures on CI lately and are investigating.
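For completeness, here is a rough, self-contained harness around that modified example. It is a sketch rather than code from this issue: the cluster sizing, the memory limit, and the explicit allowed-failures setting are assumptions chosen so that the worker's memory monitor fires while f1 is still running.

```python
# Hedged sketch, not the reporter's script: run the modified f1 on a single
# memory-limited worker so it is killed mid-task; values may need tuning.
from time import sleep

import numpy as np
import pandas as pd
import dask
from dask import delayed
from distributed import Client, LocalCluster

dask.config.set({"distributed.scheduler.allowed-failures": 0})

@delayed
def f1():
    print("running f1")
    df = pd.DataFrame(dict(row_id=np.zeros(10_000_000)))  # ~80 MB of float64 zeros
    sleep(5)  # keep the task alive so the memory monitor kills the worker mid-task
    print("done running f1")
    return df

if __name__ == "__main__":
    # One single-threaded worker with a deliberately tight memory limit
    # (assumed value) so the allocation pushes it past the terminate threshold.
    with Client(LocalCluster(n_workers=1, threads_per_worker=1,
                             memory_limit="200MiB")):
        f1().compute()  # expect one failed attempt rather than repeated retries
```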
Hi @fjetter, thanks for the help! I don't get the part where you say "your function finishes already before the worker can be killed". When any exception is thrown from a delayed function, the function "exits"; it isn't finishing successfully, so why does it send a notice to the scheduler saying that it has a result ready for gathering? Sending a success notice when an error actually happened sounds like a bug to me. I see the change you made is to move the …
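The distinction at issue is between a task that actually raises and one that returns successfully before the worker is killed. As a minimal illustration (not code from this thread), an exception raised inside a delayed function is re-raised on the client rather than being reported as a finished result:

```python
# Minimal illustration, not from this thread: an exception raised inside a
# delayed function propagates to compute() instead of producing a result.
from dask import delayed

@delayed
def boom():
    raise MemoryError("simulated allocation failure")

try:
    boom().compute()
except MemoryError as exc:
    print("task error surfaced on the client:", exc)
```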
For the time being though, the temporary workaround is to add something like …
More on the CI side: I pushed the workaround, but it just keeps restarting the worker constantly.
It does not trigger any exception for me; it finishes successfully. The memory monitor then kicks in a bit later and kills the worker. The only way to ensure this raises directly is to allocate so much memory at once that the kernel kills your process immediately, but from what I can see in your logs, this is clearly not the case.
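For context, the behaviour described here is driven by the worker memory-monitor thresholds, which are configurable. A sketch of how they could be set is below; the values shown are the documented defaults (expressed as fractions of the worker's memory limit), not settings taken from this issue.

```python
# Sketch of the worker memory-monitor thresholds behind this behaviour; the
# values are the documented defaults, shown here only for illustration.
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause accepting new tasks
    "distributed.worker.memory.terminate": 0.95,  # nanny restarts the worker
})
```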
So, I think you are right @fjetter, and I was right about …
but still no direct exception is thrown from inside f1(). I think this is still a valid scenario to be handled more gracefully by Dask:
And let me describe the problem more generally:
If 4.1 happens before 4.2, it's the case I'm seeing. Even if 4.2 happens before 4.1, I don't know how much further it could go.
More on the CircleCI front: I realized from the log that memory usage didn't rise to 95% while inside the function f1, but it got there after exiting that function, which caused the restart of the worker. I increased the size 100 times (I guess memory allocation for np arrays differs between CircleCI/Linux and macOS) and now it fails in time. This is good news for me; in the meantime, it shows another example of how the memory monitoring/restart mechanism can break.
Logs for the previously failed case: …
This sounds like another one that might be solved by #6177. I believe with an OS memory limit, trying to allocate the large array would just fail. |
Given that there has been no follow-up for a few years, I think it's safe to assume #6177 fixed this.
What happened:
Set the configuration distributed.scheduler.allowed-failures to 0 and trigger a worker restart by filling up the memory. Sometimes (yes, you may need to run the sample code several times, if you are lucky, to see it), Dask seems to ignore that config and retries the delayed function several times.
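For reference, one way to apply that setting is shown below; how it was set for the original report is not shown in this export.

```python
# One way to apply the setting referenced above; how the reporter actually
# applied it is not shown in this export.
import dask

dask.config.set({"distributed.scheduler.allowed-failures": 0})
```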
What you expected to happen:
When the property is set to 0, there should not be retries.
Minimal Complete Verifiable Example:
Anything else we need to know?:
logs showing that f1 is executed 3 times
Environment:
Cluster Dump State: