25% performance regression in merges #7052

Closed
wence- opened this issue Sep 21, 2022 · 6 comments
Comments

@wence- (Contributor) commented Sep 21, 2022

Our weekly multi-node benchmarking (working on making this publicly visible) shows a performance regression in simple dataframe merges, which I can pinpoint to #6975. (This was briefly reverted in #6994 and then reintroduced in #7007).

[benchmark plot: visualization-3]

More specifically, #6975 changes the decision making in _select_keys_for_gather:

```python
if (
    # When there is no other traffic, the top-priority task is fetched
    # regardless of its size to ensure progress
    self.transfer_incoming_bytes
    or to_gather
) and total_nbytes + ts.get_nbytes() > bytes_left_to_fetch:
    break
for worker in ts.who_has:
    # This also effectively pops from available
    self.data_needed[worker].remove(ts)
to_gather.append(ts)
total_nbytes += ts.get_nbytes()
```

Prior to this change, the logic was:

```python
# The top-priority task is fetched regardless of its size
if (
    to_gather
    and total_nbytes + ts.get_nbytes() > self.transfer_message_target_bytes
):
    break
for worker in ts.who_has:
    # This also effectively pops from available
    self.data_needed[worker].remove(ts)
to_gather.append(ts)
total_nbytes += ts.get_nbytes()
```

Note the difference in whether we fetch the top-priority task. If I remove the part of the decision-making logic that looks at self.transfer_incoming_bytes:

```python
if (
    to_gather
    and total_nbytes + ts.get_nbytes() > bytes_left_to_fetch
):
```

then performance goes back to where it was previously.
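
To make the difference concrete, here is a small standalone sketch (not distributed's actual code) that reduces the two break conditions above to pure functions, with a 50MB constant standing in for transfer_message_target_bytes / bytes_left_to_fetch:

```python
# Illustrative only: the break conditions from the two snippets above,
# reduced to pure functions so the behavioral difference is easy to see.
MESSAGE_TARGET = 50_000_000  # stand-in for the hard-coded transfer_message_target_bytes


def old_should_break(n_gathered: int, total_nbytes: int, task_nbytes: int) -> bool:
    # Pre-#6975: the top-priority task is always fetched, because the
    # condition can only trigger once something is already in to_gather.
    return n_gathered > 0 and total_nbytes + task_nbytes > MESSAGE_TARGET


def new_should_break(
    incoming_bytes: int,
    n_gathered: int,
    total_nbytes: int,
    task_nbytes: int,
    bytes_left_to_fetch: int,
) -> bool:
    # Post-#6975: if any transfer is already incoming, even the top-priority
    # task has to fit into the remaining budget.
    return (incoming_bytes > 0 or n_gathered > 0) and (
        total_nbytes + task_nbytes > bytes_left_to_fetch
    )


# A 60MB top-priority task while 10MB is already in flight:
print(old_should_break(0, 0, 60_000_000))                              # False -> fetch it
print(new_should_break(10_000_000, 0, 0, 60_000_000, MESSAGE_TARGET))  # True  -> defer it
```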

I'm not sure of the correct way to square this circle. I don't understand how the change in _select_keys_for_gather interacts with the PR's intention to throttle data transfer.

cc @hendrikmakait (as author of #6975)

@jrbourbeau (Member)

Thanks for reporting and even identifying the relevant code change, @wence-! @hendrikmakait do you have some time to look into this?

@hendrikmakait (Member)

Hi @wence-, thanks for bringing this to my attention! #6975 was by design going to hurt some workloads: we're essentially trading fewer out-of-memory scenarios (which might have fatal consequences on a worker) for increased runtime of some workloads. I had not noticed any performance hits in the integration tests executed within coiled/coiled-runtime, which suggested that we had hit a sweet spot for the threshold set via distributed.worker.memory.transfer, one that keeps the oom-killer away while rarely impacting runtime.

Do you have an example workload and cluster configuration (e.g. cluster size, available RAM, # of workers) that I could try and replicate?

If you have time to investigate further, would you mind exploring the effect of increasing distributed.worker.memory.transfer on the runtime of the impacted workloads?

It might be that we have to revisit the default setting or this imperfect approach to limiting memory load caused by data transfer altogether.

@wence- (Contributor, Author) commented Sep 26, 2022

> If you have time to investigate further, would you mind exploring the effect of increasing distributed.worker.memory.transfer on the runtime of the impacted workloads?

Setting export DASK_DISTRIBUTED__WORKER__MEMORY__TRANSFER=1 (which I think is the maximum value) doesn't improve things. In fact, it appears that setting this value doesn't really have an effect at all for this workload (I get effectively the same throughput with export DASK_DISTRIBUTED__WORKER__MEMORY__TRANSFER=0.00000001).
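
For completeness, the same knob expressed through dask's config API rather than the environment variable (a sketch; note the setting has to be in effect in the worker processes, not just the client):

```python
import dask

# Sketch: equivalent of DASK_DISTRIBUTED__WORKER__MEMORY__TRANSFER=1, using
# the distributed.worker.memory.transfer key introduced in #6975.
dask.config.set({"distributed.worker.memory.transfer": 1.0})
```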

Inspecting the values of self.transfer_incoming_bytes_limit, self.transfer_incoming_bytes, and self.transfer_message_target_bytes, it appears that the limit on bytes_left_to_fetch is always coming from self.transfer_message_target_bytes (which is hard-coded at 50MB).
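
As a simplified sketch of how I read those values (not the exact code in _select_keys_for_gather), the per-gather budget is effectively the smaller of the memory-based headroom and the fixed 50MB message target, and on these workers the headroom is always orders of magnitude larger:

```python
# Simplified sketch of the budget arithmetic; the worker size and fraction
# below are made-up example numbers.
transfer_message_target_bytes = 50_000_000        # hard-coded 50MB target
transfer_incoming_bytes_limit = 0.1 * 64 * 2**30  # e.g. a fraction of a 64 GiB worker
transfer_incoming_bytes = 25_000_000              # bytes currently in flight

bytes_left_to_fetch = min(
    transfer_incoming_bytes_limit - transfer_incoming_bytes,  # gigabytes of headroom
    transfer_message_target_bytes,                            # 50MB
)
print(bytes_left_to_fetch)  # 50000000 -- the 50MB target always wins here
```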

These benchmarks are running on a high-performance network (between 12 and 45 GiB/s of uni-directional bandwidth, depending on the worker pairing), so the default that limits grabbing multiple "small" messages from a single worker to 50MB total is getting in the way (I can send multiple GiBs of data in less than a second).
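
Some back-of-the-envelope arithmetic on those bandwidth figures:

```python
# Rough arithmetic only: wire time for a single 50MB message at the quoted
# bandwidths, ignoring latency and (de)serialization.
message_bytes = 50_000_000
for gib_per_s in (12, 45):
    seconds = message_bytes / (gib_per_s * 2**30)
    print(f"{gib_per_s} GiB/s: {seconds * 1e3:.1f} ms per 50MB message")
# ~3.9 ms and ~1.0 ms on the wire, i.e. a 50MB cap per gather is tiny
# relative to what this network can move per second.
```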

I think what is happening is that previously there might have been two messages in flight between any given pair of workers at any one time, whereas now the changed logic means we limit to a single message.

So I think that #6975 fixed the logic for limiting with respect to transfer_message_target_bytes, but this turns out to be bad in some settings. One way to fix this is to add configuration for transfer_message_target_bytes, I suppose.
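
Purely as an illustration of what such a knob might look like (the config key name below is hypothetical, just to show the shape of the change):

```python
import dask

# Hypothetical config key, for illustration only: raise the per-gather
# message target well above the current hard-coded 50MB on fast networks.
dask.config.set({"distributed.worker.transfer-message-target-bytes": "500MB"})
```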

@hendrikmakait (Member) commented Sep 26, 2022

> Inspecting the values of self.transfer_incoming_bytes_limit, self.transfer_incoming_bytes, and self.transfer_message_target_bytes, it appears that the limit on bytes_left_to_fetch is always coming from self.transfer_message_target_bytes (which is hard-coded at 50MB).

I think there might be a problem with the related logic; let me take a closer look at the implementation.

@hendrikmakait (Member)

> One way to fix this is to add configuration for transfer_message_target_bytes, I suppose.

This feels like a good idea regardless of the problem at hand; I'll put together a PR.

@hendrikmakait (Member)

Fixed by #7071
