AMM `ReduceReplicas` will cause increased failure rate of worker data fetching (#6038)

Comments

We might want to adjust the logic in `distributed/worker.py`, lines 3086 to 3113 (at a8a9a3f).

Yes, and we'd also need to make the scheduler proactively call [...]. I've also opened #6056 to discuss the other side of this problem.

I just spotted this ticket 1 month late - please @ me next time!! This is an issue that `ReduceReplicas` currently doesn't prevent, but mitigates: see `distributed/active_memory_manager.py`, lines 495 to 501 (at baf05c0).

Given the above, the AMM may cause an occasional failure in `gather_dep`, but you will never reach the point where you need to rely on `find_missing` - short of losing workers. Enabling the AMM while you compute [...]
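For context, here is a hedged sketch of how the AMM and its `ReduceReplicas` policy are typically switched on through the dask config; the key names and the interval value are taken from the documented active-memory-manager settings, not from this thread, so treat them as assumptions:

```python
import dask

# Hedged sketch: enable the Active Memory Manager with the ReduceReplicas
# policy via dask's config. The config keys below are assumptions based on
# the documented active-memory-manager settings, not quoted from this issue.
dask.config.set({
    "distributed.scheduler.active-memory-manager.start": True,
    "distributed.scheduler.active-memory-manager.interval": "2s",
    "distributed.scheduler.active-memory-manager.policies": [
        {"class": "distributed.active_memory_manager.ReduceReplicas"},
    ],
})
```
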
Done in #6342. As explained above, I don't believe the AMM actually increases the failure rate by a significant amount.

Original issue description

When the scheduler assigns a task to a worker, it sends it a list of other workers to fetch the dependencies of that task from. At a later point, the worker will actually pick one of those peers at random and ask it for the data.
Before AMM, so long as the peer worker wasn't dead, it would be extremely likely (guaranteed?) that the particular key was still there.
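To make the race concrete, here is a minimal, hypothetical sketch of that fetch pattern (the function and variable names are illustrative, not the real `Worker` internals): the peer is chosen from a `who_has` snapshot taken at assignment time, which may be stale by the time the fetch actually happens.

```python
import random

# Illustrative sketch only (hypothetical names, not the actual Worker code):
# the worker picks a peer at random from the who_has snapshot the scheduler
# sent when the task was assigned, and asks that peer for the key.
def pick_fetch_candidate(who_has_snapshot: dict, key: str) -> str:
    """Return a peer address to fetch `key` from, using the (possibly stale)
    snapshot received at task-assignment time."""
    candidates = sorted(who_has_snapshot.get(key, ()))
    if not candidates:
        raise RuntimeError(f"no known replicas for {key!r}")
    return random.choice(candidates)

# Example: the snapshot lists two peers, but by the time gather_dep runs,
# ReduceReplicas may already have dropped either (or both) of those copies.
snapshot = {"x-123": {"tcp://10.0.0.1:40000", "tcp://10.0.0.2:40000"}}
print(pick_fetch_candidate(snapshot, "x-123"))
```
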
With the AMM `ReduceReplicas` policy active, it's quite possible (likely, even) that the peers which originally had the data have since deleted their copy by the time the worker gets around to calling `gather_dep`. It's even possible that none of the peers on the original list have the data anymore (if the key ended up on a new worker in the interim, and `ReduceReplicas` happened to delete copies from all the old workers and leave the new worker with the only replica).

This is handled by sending a `missing-data` message to the scheduler (which doesn't do anything to get a more up-to-date list to the worker that needs it), then trying to fetch again from a different worker on the list the scheduler originally sent. Only once every worker in the original list has been tried (and failed) will the `Worker.find_missing` callback kick in and ask the scheduler who actually has the data needed.
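A rough sketch of that retry flow, with hypothetical helper callables standing in for the real worker/scheduler machinery (this is not the actual `Worker.gather_dep` implementation):

```python
# Minimal sketch of the fetch/retry flow described above (hypothetical helper
# names, not the real Worker methods). The key point: the worker keeps drawing
# from the *original* who_has snapshot and only asks the scheduler for a fresh
# list once every peer in that snapshot has failed.
import random


def fetch_with_stale_snapshot(key, snapshot_peers, fetch_from_peer,
                              notify_missing_data, refresh_who_has):
    remaining = list(snapshot_peers)
    random.shuffle(remaining)
    for peer in remaining:
        data = fetch_from_peer(peer, key)   # may fail: peer already dropped its replica
        if data is not None:
            return data
        notify_missing_data(key, peer)      # scheduler is told, but the worker's
                                            # snapshot is not refreshed
    # Only now, with every snapshot peer exhausted, does the worker ask the
    # scheduler who actually holds the key -- the role find_missing plays.
    for peer in refresh_who_has(key):
        data = fetch_from_peer(peer, key)
        if data is not None:
            return data
    raise RuntimeError(f"could not locate a replica of {key!r}")
```
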
So as long as `find_missing` works correctly, things should eventually sort themselves out. But it may make the data fetch process a lot slower than it needs to be.

For example, in one cluster dump I happened to look at, a worker's `who_has` list for a particular key had 82 entries. But on the scheduler, `who_has` for that key had only 1 worker, which was not even among the 82 (because `ReduceReplicas` had gotten rid of all the copies in the time since). That means `gather_dep` would have to fail 82 times before round-tripping to the scheduler and finally getting the data.
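A toy illustration of that worst case (the entry count mirrors the cluster dump above; the addresses and the simulation itself are made up):

```python
# Toy illustration: a snapshot of 82 peers, none of which still holds the key,
# so every attempt fails before the worker finally asks the scheduler for the
# current holder. Addresses are hypothetical.
stale_snapshot = [f"tcp://10.0.0.{i}:40000" for i in range(1, 83)]   # 82 stale entries
actual_holders = {"tcp://10.0.1.99:40000"}                           # not in the snapshot

attempts_before_refresh = sum(1 for peer in stale_snapshot if peer not in actual_holders)
print(attempts_before_refresh)  # 82 failed fetches before the scheduler round-trip
```
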
I don't think this is a correctness issue right now (again, assuming `Worker.find_missing` works reliably), but I wonder how much it degrades performance, especially for widely-used keys like this (where the original `who_has` list can be long).

cc @fjetter