Making AMM ReduceReplicas less aggressive towards widely-shared dependencies #6056
Labels: discussion, enhancement, memory, performance
Corollary to #6038. In that issue, I described a situation where workers thought a key (which most tasks depended on) had 82 replicas, but in reality it only had 1.
This issue is about the fact that `ReduceReplicas` maybe shouldn't try to delete copies of that critical key so aggressively. In this case `x` and `y` are going to be reused by every task, so they will end up with replicas on most workers. Constantly deleting them is inefficient: as soon as you delete one, the next task that wants to run on that worker will have to transfer it back again.

(Of course, once most of the `*` tasks are done, then you should start reducing replicas. But while the cluster is fully saturated with `*` tasks, there's no benefit to doing this.)
I'm not sure what metric to use for this. Ideas explored in #4967, #5325, #5326 could be interesting here.

Really, this issue is just about how to calculate a smarter target for this `desired_replicas` count automatically, based on the task's `waiters`, the number of current workers, etc.:

distributed/distributed/active_memory_manager.py, line 477 in 4b3e0c2
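As a purely illustrative sketch (the function name, signature, and heuristic below are assumptions, not the existing ReduceReplicas logic), the target could scale with the number of remaining waiters and be capped by the cluster size, only falling back to the current minimum once the waiters drain:

```python
def desired_replicas(n_waiters: int, n_workers: int, minimum: int = 1) -> int:
    """Hypothetical heuristic; not the actual ReduceReplicas implementation.

    While a key still has many waiters, keep roughly one replica per worker
    that is likely to need it soon; once the waiters are drained, fall back
    to the configured minimum so replicas can be reduced as they are today.
    """
    if n_workers <= 0 or n_waiters == 0:
        # No workers, or nothing left to consume this key: reduce aggressively.
        return minimum
    # Rough guess: each waiter will run on some worker, so keep at most one
    # replica per worker and never fewer than the configured minimum.
    return max(minimum, min(n_waiters, n_workers))
```

With something like this, a key with thousands of pending `*` waiters on a 100-worker cluster would be allowed up to 100 replicas, and would only be squeezed back down to `minimum` once the waiters drain.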