Making AMM ReduceReplicas less aggressive towards widely-shared dependencies #6056
Labels: discussion, enhancement, memory, performance
Corollary to #6038. In that issue, I described a situation where workers thought a key (which most tasks depended on) had 82 replicas, but in reality it only had 1.
This issue is about the fact that `ReduceReplicas` maybe shouldn't try to delete copies of that critical key so aggressively. In this case `x` and `y` are going to be reused by every task, so they will end up with replicas on most workers. Constantly deleting them is inefficient: as soon as you delete one, the next task that wants to run on that worker will have to transfer it back again.

(Of course, once most of the `*` tasks are done, then you should start reducing replicas. But while the cluster is fully saturated with `*` tasks, there's no benefit to doing this.)
I'm not sure what metric to use for this. Ideas explored in #4967, #5325, #5326 could be interesting here.

Really, this issue is just about how to calculate a smarter target for this `desired_replicas` count automatically, based on the task's `waiters`, the number of current workers, etc.:

distributed/distributed/active_memory_manager.py, line 477 in 4b3e0c2
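As a purely illustrative sketch (the function name, signature, and heuristic below are assumptions, not the existing ReduceReplicas logic), the target could scale with the number of remaining waiters and be capped by the cluster size, only falling back to the current minimum once the waiters drain:

```python
def desired_replicas(n_waiters: int, n_workers: int, minimum: int = 1) -> int:
    """Hypothetical heuristic; not the actual ReduceReplicas implementation.

    While a key still has many waiters, keep roughly one replica per worker
    that is likely to need it soon; once the waiters are drained, fall back
    to the configured minimum so replicas can be reduced as they are today.
    """
    if n_workers <= 0 or n_waiters == 0:
        # No workers, or nothing left to consume this key: reduce aggressively.
        return minimum
    # Rough guess: each waiter will run on some worker, so keep at most one
    # replica per worker and never fewer than the configured minimum.
    return max(minimum, min(n_waiters, n_workers))
```

With something like this, a key with thousands of pending `*` waiters on a 100-worker cluster would be allowed up to 100 replicas, and would only be squeezed back down to `minimum` once the waiters drain.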