Optimize transitions #4451
Merged: 22 commits merged on Feb 2, 2021

Conversation

@jakirkham (Member) commented Jan 22, 2021

Closes #4454
Requires #4452 (merged! 🎉)

This more thoroughly optimizes the higher-level transition and transitions functions. It does this by annotating the variables used, avoiding contains checks where a retrieval with a fallback works (like dict.get(...)), and removing unneeded copies where possible.
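
As an illustrative sketch (not the actual scheduler code), this is the kind of change meant by avoiding contains checks: a single retrieval with a fallback replaces a membership test plus a second lookup.

# Hypothetical helper, for illustration only; the real changes are in
# distributed/scheduler.py.

def nbytes_before(nbytes: dict, key: str) -> int:
    if key in nbytes:          # first hash lookup
        return nbytes[key]     # second hash lookup
    return 0

def nbytes_after(nbytes: dict, key: str) -> int:
    return nbytes.get(key, 0)  # one lookup with a fallback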

This also collects all messages to send to workers and clients from transitions and defers sending them until the end of the transition, where it lumps multiple messages together.

Note: Still need to move communication calls from transition to transitions.
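
A rough sketch of the shape described above (illustrative names and signatures, not the PR's actual code): transitions accumulates per-worker and per-client messages while it loops over recommendations and sends everything in one batch at the end.

from collections import defaultdict

def transitions_sketch(scheduler, recommendations: dict) -> None:
    worker_msgs = defaultdict(list)   # worker address -> list of messages
    client_msgs = defaultdict(list)   # client id -> list of messages

    while recommendations:
        key, finish = recommendations.popitem()
        # Assumed illustrative signature: each transition returns new
        # recommendations plus the messages it would otherwise send immediately.
        new_recs, w_msgs, c_msgs = scheduler._transition(key, finish)
        recommendations.update(new_recs)
        for addr, msgs in w_msgs.items():
            worker_msgs[addr].extend(msgs)
        for client, msgs in c_msgs.items():
            client_msgs[client].extend(msgs)

    # One batched send per recipient at the end of the transition cycle.
    for addr, msgs in worker_msgs.items():
        scheduler.stream_comms[addr].send(*msgs)
    for client, msgs in client_msgs.items():
        scheduler.client_comms[client].send(*msgs)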

@jakirkham marked this pull request as draft on January 22, 2021 23:20
@jakirkham force-pushed the opt_transition branch 7 times, most recently from 89a84e5 to 7d2a080 on January 24, 2021 02:06
@jakirkham force-pushed the opt_transition branch 8 times, most recently from 389bd83 to 7c02a74 on January 26, 2021 07:00
@jakirkham force-pushed the opt_transition branch 3 times, most recently from ed70ecf to be4e152 on January 26, 2021 19:50

Comment on lines -130 to +139:

-    def send(self, msg):
+    def send(self, *msgs):
         """Schedule a message for sending to the other side

         This completes quickly and synchronously
         """
         if self.comm is not None and self.comm.closed():
             raise CommClosedError

-        self.message_count += 1
-        self.buffer.append(msg)
+        self.message_count += len(msgs)
+        self.buffer.extend(msgs)
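
For illustration, a hedged sketch of how a caller can use the new variadic signature (the message contents and the batched_send name are made up here, not taken from the PR):

# Hypothetical messages, for illustration only.
worker_msgs = [
    {"op": "compute-task", "key": "x-1"},
    {"op": "compute-task", "key": "x-2"},
]

# Before this PR: one call (and one buffer append) per message.
for msg in worker_msgs:
    batched_send.send(msg)

# After this PR: a single call extends the buffer with all messages at once.
batched_send.send(*worker_msgs)
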
A reviewer (Member) commented:

Ah, this is cleaner than I expected :)

@mrocklin (Member) commented:

If you have time I encourage you to try out viztracer to see the results. I would be happy to help walk you through first use of it if you want to get together some time tomorrow.

@jakirkham (Member, Author) commented:

Yeah I did run with cProfile last night and looked at the call graph. Though that was before the changes to BatchedSend.

Trying to root out a remaining bug. Probably just a small typo I'm overlooking.

@mrocklin (Member) commented:

> Yeah I did run with cProfile last night and looked at the call graph

I think that it's worth adding viztracer to your bag of tricks. I think that you especially might really find value from it.

  • Neither of these statements should raise a `KeyError`, so just drop the `try...except...`.
  • This avoids building a `list`, which makes it easier for Cython to optimize.
  • This should simplify the C code Cython generates to unpack the `tuple`: it no longer needs to check whether the value might be a `list` or some other sequence to unpack, and can use the `tuple` unpacking logic directly (see the sketch after this list).
  • This allows us to batch all worker and client sends into a single function.
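
A minimal sketch of the list-vs-tuple point above (an illustrative function, not the scheduler's actual code): returning a fixed-size tuple lets Cython emit its direct tuple-unpacking path instead of generic sequence unpacking.

# Illustrative only; the real transition functions live in distributed/scheduler.py.

def transition_result_as_list(recommendations, worker_msgs, client_msgs):
    # A list forces generic sequence handling when the caller unpacks it.
    return [recommendations, worker_msgs, client_msgs]

def transition_result_as_tuple(recommendations, worker_msgs, client_msgs):
    # A fixed-size tuple lets Cython use its fast tuple-unpacking logic.
    return recommendations, worker_msgs, client_msgs

recs, w_msgs, c_msgs = transition_result_as_tuple({}, {}, {})
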
@jakirkham (Member, Author) commented:

I should add that I'm not really seeing anything socket-related (except maybe socket.close) in the first 21 items sorted by self time when running on my Mac.

[screenshot: profiler output sorted by self time, 2021-01-29 8:58 AM]

@mrocklin (Member) commented:

I suspect that you've lost history of the intense part of the shuffle. viztracer only captures the last N events, so if you don't shut things down relatively quickly you miss the fun part of the computation. I'm guessing this because of the focus on transitions to the forgotten state.

Also, when selecting the region of interest, I recommend aiming low so that you don't pick up events like Loop.run_forever. This should help focus on only the interesting part.

@jakirkham (Member, Author) commented:

Yeah, I'll admit I'm not that familiar with this tool. Also, I was only able to view the results in Chrome (no other browser worked), and it was pretty slow to navigate through. I'd encourage others to play with this if they have something in mind that they would like to see.

@mrocklin (Member) commented:

If you're interested I'd be very happy to give you a brief tour. I think that pairing briefly here would be worth the time.

@jakirkham (Member, Author) commented:

Ok, I gave this one last try, though honestly I don't plan to spend more time with this tool (I'm just finding it to be too slow; it was causing my computer to freeze).

I'm seeing socket.send around the 20th item. send from BatchedSend is similarly low on the list. It does still show time spent in things like decide_worker and stealing, which seems to suggest it is still doing work. Transitions and serialization are higher in the list.

[screenshot: profiler output sorted by self time, 2021-01-29 10:50 AM]

@jakirkham (Member, Author) commented:

Here's the call graph I get from this change. It can be compared to the recent nightly benchmark ( quasiben/dask-scheduler-performance#98 ). This seems to cut a decent chunk of time from _background_send relative to what it was before. There is a little bit of time cut from write, etc., though it looks like serialization is the main bottleneck there.

[image: prof_97676 pstats call graph]

@jakirkham (Member, Author) commented:

Ok, one last time with viztracer (still struggling with the sluggish UI). I just killed the Scheduler roughly 20-30s into the workload. I'm guessing that is somewhere in the middle, somewhat near the end, which also lines up with the logging messages seen there and some of the transitions popping up here.

I'm seeing socket.send as the 25th item in the list ordered by self time. It seems more time is actually spent reading than writing, though that may just be where I happened to terminate the job, so I'm not sure whether it is indicative of anything.

On the reading point, since that came up here, there may be changes that can be made in Tornado that would help. In particular, relying on asyncio for sock_recv_into, which in turn could use uvloop's sock_recv_into when uvloop is enabled in Distributed, may help. See comment ( #4443 (comment) ) and the linked issues for more details.
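
For reference, a minimal sketch of the asyncio API in question (this is not Tornado's or Distributed's actual read path; with uvloop installed, the same call would be served by uvloop's faster implementation):

import asyncio
import socket

async def read_exactly(sock: socket.socket, nbytes: int) -> bytes:
    # Assumes `sock` is already connected and set non-blocking
    # (sock.setblocking(False)), as the asyncio socket methods require.
    loop = asyncio.get_running_loop()
    buf = bytearray(nbytes)
    view = memoryview(buf)
    received = 0
    while received < nbytes:
        n = await loop.sock_recv_into(sock, view[received:])
        if n == 0:
            raise EOFError("socket closed before the frame was complete")
        received += n
    return bytes(buf)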

Also, time is being spent deserializing messages, though this goes hand-in-hand with reading and is not something addressed in this PR (maybe we can look at that after we merge this one).

[screenshot: viztracer output sorted by self time, 2021-01-29 11:57 AM]

@quasiben (Member) commented Jan 29, 2021

@jakirkham asked me to run this PR and compare it with the latest master under py-spy, using the following code:

# Imports and client setup are assumed; only the snippet below was shared in the thread.
from dask.datasets import timeseries
from dask.dataframe.shuffle import shuffle
from distributed import Client, wait

client = Client()  # assumed: connect to the cluster used for the benchmark

ddf_h = timeseries(start='2000-01-01', end='2000-02-01', partition_freq='5min')
ddf_h = ddf_h.map_partitions(lambda df: df.head(0))
ddf_h = ddf_h.persist()
print(ddf_h)
_ = wait(ddf_h)
result = shuffle(ddf_h, "id", shuffle="tasks")
ddf = client.persist(result)
_ = wait(ddf)

Total tasks: 648270
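
The exact py-spy invocation isn't shown here; a typical way to sample a running scheduler process (the PID placeholder and output path are hypothetical) would be:

py-spy top --pid <scheduler-pid>
py-spy record --pid <scheduler-pid> -o scheduler-profile.svg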

Latest master (2021-01-29)

  • write: 7.88%
  • to_frames: 7.81%
  • extract_serialize: 6.07%
  • transitions: 3.24%

[screenshot: py-spy profile of latest master, 2021-01-29 4:23 PM]

This PR

  • write: 10.04%
  • to_frames: 9.96%
  • extract_serialize: 7.77%
  • transitions: 2.17%

[screenshot: py-spy profile of this PR, 2021-01-29 4:21 PM]

@jakirkham (Member, Author) commented Jan 29, 2021

To summarize the above results, we have moved a good chunk of time out of transitions and into tcp's write. Most of the time in write is actually just spent serializing messages, which we already knew. So that seems like the next thing to work on once this is in.

@jakirkham (Member, Author) commented:

Planning on merging tomorrow if no comments

@gaogaotiantian commented:

@jakirkham I happened to see this thread about sluggish viztracer :)

I'm aware that with the default size of the circular buffer, if you fill the whole buffer you can experience performance issues if your RAM is not large enough. One way to address this is to reduce the buffer size with --tracer_entries 100000 (for example). I know it would be good to always have a smooth experience with the tool, but it's also important to log as many entries as possible, so I picked a larger number.
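
For example, assembling the option mentioned above into a full command (your_script.py stands for whatever is being profiled):

viztracer --tracer_entries 100000 your_script.py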

However, the latest viztracer now supports Perfetto, which performs much better than the Chrome Trace Viewer, which was developed years ago and has since been deprecated.

You can use viztracer with Perfetto like this:

# You need to output to json in order to use perfetto
viztracer -o result.json --open your_script.py

This will automatically open your browser for the report.

Or you can log to a JSON file and open it with vizviewer:

viztracer -o result.json your_script.py
vizviewer result.json

I'm sorry that viztracer did not perform as you expected, but if you want to give it another try, let me know how the latest UI works :)

Successfully merging this pull request may close these issues: Batch transition messages to workers/clients