Optimize transitions #4451
Merged: 22 commits merged on Feb 2, 2021

Conversation

@jakirkham (Member) commented Jan 22, 2021

Closes #4454
Requires #4452 (merged! 🎉)

This more thoroughly optimizes the higher-level transition and transitions functions. It does this by annotating the variables used, avoiding contains checks where a retrieval with a fallback works (like dict.get(...)), and removing unneeded copies where possible.
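
As an illustrative sketch (not the actual scheduler code), this is the kind of change meant by avoiding contains checks: a single retrieval with a fallback replaces a membership test plus a second lookup.

# Hypothetical helper, for illustration only; the real changes are in
# distributed/scheduler.py.

def nbytes_before(nbytes: dict, key: str) -> int:
    if key in nbytes:          # first hash lookup
        return nbytes[key]     # second hash lookup
    return 0

def nbytes_after(nbytes: dict, key: str) -> int:
    return nbytes.get(key, 0)  # one lookup with a fallback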

This also collects all messages to send to workers and clients from transitions and defers sending them until the end of the transition, where it lumps multiple messages together.

Note: Still need to move communication calls from transition to transitions.
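
A rough sketch of the shape described above (illustrative names and signatures, not the PR's actual code): transitions accumulates per-worker and per-client messages while it loops over recommendations and sends everything in one batch at the end.

from collections import defaultdict

def transitions_sketch(scheduler, recommendations: dict) -> None:
    worker_msgs = defaultdict(list)   # worker address -> list of messages
    client_msgs = defaultdict(list)   # client id -> list of messages

    while recommendations:
        key, finish = recommendations.popitem()
        # Assumed illustrative signature: each transition returns new
        # recommendations plus the messages it would otherwise send immediately.
        new_recs, w_msgs, c_msgs = scheduler._transition(key, finish)
        recommendations.update(new_recs)
        for addr, msgs in w_msgs.items():
            worker_msgs[addr].extend(msgs)
        for client, msgs in c_msgs.items():
            client_msgs[client].extend(msgs)

    # One batched send per recipient at the end of the transition cycle.
    for addr, msgs in worker_msgs.items():
        scheduler.stream_comms[addr].send(*msgs)
    for client, msgs in client_msgs.items():
        scheduler.client_comms[client].send(*msgs)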

@jakirkham marked this pull request as draft on January 22, 2021 23:20
@jakirkham force-pushed the opt_transition branch 7 times, most recently from 89a84e5 to 7d2a080 on January 24, 2021 02:06
@jakirkham force-pushed the opt_transition branch 8 times, most recently from 389bd83 to 7c02a74 on January 26, 2021 07:00
@jakirkham force-pushed the opt_transition branch 3 times, most recently from ed70ecf to be4e152 on January 26, 2021 19:50

Comment on lines -130 to +139:

-    def send(self, msg):
+    def send(self, *msgs):
         """Schedule a message for sending to the other side

         This completes quickly and synchronously
         """
         if self.comm is not None and self.comm.closed():
             raise CommClosedError

-        self.message_count += 1
-        self.buffer.append(msg)
+        self.message_count += len(msgs)
+        self.buffer.extend(msgs)
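
For illustration, a hedged sketch of how a caller can use the new variadic signature (the message contents and the batched_send name are made up here, not taken from the PR):

# Hypothetical messages, for illustration only.
worker_msgs = [
    {"op": "compute-task", "key": "x-1"},
    {"op": "compute-task", "key": "x-2"},
]

# Before this PR: one call (and one buffer append) per message.
for msg in worker_msgs:
    batched_send.send(msg)

# After this PR: a single call extends the buffer with all messages at once.
batched_send.send(*worker_msgs)
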
A reviewer (Member) commented:

Ah, this is cleaner than I expected :)

@mrocklin (Member) commented:

If you have time I encourage you to try out viztracer to see the results. I would be happy to help walk you through first use of it if you want to get together some time tomorrow.

@jakirkham (Member, Author) commented:

Yeah I did run with cProfile last night and looked at the call graph. Though that was before the changes to BatchedSend.

Trying to root out a remaining bug. Probably just a small typo I'm overlooking.

@mrocklin (Member) commented:

> Yeah I did run with cProfile last night and looked at the call graph

I think that it's worth adding viztracer to your bag of tricks. I think that you especially might really find value from it.

  • Neither of these statements should raise a `KeyError`, so just drop the `try...except...`.
  • This avoids building a `list`, which makes it easier for Cython to optimize.
  • This should simplify the C code Cython generates to unpack the `tuple`: it no longer needs to check whether the value might be a `list` or some other sequence to unpack, and can use the `tuple` unpacking logic directly (see the sketch after this list).
  • This allows us to batch all worker and client sends into a single function.
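
A minimal sketch of the list-vs-tuple point above (an illustrative function, not the scheduler's actual code): returning a fixed-size tuple lets Cython emit its direct tuple-unpacking path instead of generic sequence unpacking.

# Illustrative only; the real transition functions live in distributed/scheduler.py.

def transition_result_as_list(recommendations, worker_msgs, client_msgs):
    # A list forces generic sequence handling when the caller unpacks it.
    return [recommendations, worker_msgs, client_msgs]

def transition_result_as_tuple(recommendations, worker_msgs, client_msgs):
    # A fixed-size tuple lets Cython use its fast tuple-unpacking logic.
    return recommendations, worker_msgs, client_msgs

recs, w_msgs, c_msgs = transition_result_as_tuple({}, {}, {})
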
@jakirkham (Member, Author) commented:

I should add that I'm not really seeing anything socket-related (except maybe socket.close) in the first 21 items sorted by self time when running on my Mac.

[screenshot: profiler output sorted by self time, 2021-01-29 8:58 AM]

@mrocklin (Member) commented:

I suspect that you've lost history of the intense part of the shuffle. viztracer only captures the last N events, so if you don't shut things down relatively quickly you miss the fun part of the computation. I'm guessing this because of the focus on transitions to the forgotten state.

Also, when selecting the region of interest, I recommend aiming low so that you don't pick up events like Loop.run_forever. This should help focus on only the interesting part.

@jakirkham (Member, Author) commented:

Yeah, I'll admit I'm not that familiar with this tool. Also, I was only able to view the results in Chrome (no other browser worked), and it was pretty slow to navigate through. I'd encourage others to play with this if they have something in mind that they would like to see.

@mrocklin (Member) commented:

If you're interested I'd be very happy to give you a brief tour. I think that pairing briefly here would be worth the time.

@jakirkham (Member, Author) commented:

Ok, I gave this one last try, though honestly I don't plan to spend more time with this tool (I'm just finding it to be too slow; it was causing my computer to freeze).

I'm seeing socket.send around the 20th item. send from BatchedSend is similarly low on the list. It does still show time spent in things like decide_worker and stealing, which seems to suggest it is still doing work. Transitions and serialization are higher in the list.

[screenshot: profiler output sorted by self time, 2021-01-29 10:50 AM]

@jakirkham (Member, Author) commented:

Here's the call graph I get from this change. It can be compared to the recent nightly benchmark ( quasiben/dask-scheduler-performance#98 ). This seems to cut a decent chunk of time from _background_send relative to what it was before. There is a little bit of time cut from write, etc., though it looks like serialization is the main bottleneck there.

[image: prof_97676 pstats call graph]

@jakirkham (Member, Author) commented:

Ok, one last time with viztracer (still struggling with the sluggish UI). I just killed the Scheduler roughly 20-30s into the workload. I'm guessing that is somewhere in the middle, somewhat near the end, which also lines up with the logging messages seen there and some of the transitions popping up here.

I'm seeing socket.send as the 25th item in the list ordered by self time. It seems more time is actually spent reading than writing, though that may just be where I happened to terminate the job, so I'm not sure whether it is indicative of anything.

On the reading point, since that came up here, there may be changes that can be made in Tornado that would help. In particular, relying on asyncio for sock_recv_into, which in turn could use uvloop's sock_recv_into when uvloop is enabled in Distributed, may help. See comment ( #4443 (comment) ) and the linked issues for more details.
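
For reference, a minimal sketch of the asyncio API in question (this is not Tornado's or Distributed's actual read path; with uvloop installed, the same call would be served by uvloop's faster implementation):

import asyncio
import socket

async def read_exactly(sock: socket.socket, nbytes: int) -> bytes:
    # Assumes `sock` is already connected and set non-blocking
    # (sock.setblocking(False)), as the asyncio socket methods require.
    loop = asyncio.get_running_loop()
    buf = bytearray(nbytes)
    view = memoryview(buf)
    received = 0
    while received < nbytes:
        n = await loop.sock_recv_into(sock, view[received:])
        if n == 0:
            raise EOFError("socket closed before the frame was complete")
        received += n
    return bytes(buf)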

Also, time is being spent deserializing messages, though this goes hand-in-hand with reading and is not something addressed in this PR (maybe we can look at that after we merge this one).

[screenshot: viztracer output sorted by self time, 2021-01-29 11:57 AM]

@quasiben (Member) commented Jan 29, 2021

@jakirkham asked me to run this PR and compare it with the latest master under py-spy, using the following code:

# Imports and client setup are assumed; only the snippet below was shared in the thread.
from dask.datasets import timeseries
from dask.dataframe.shuffle import shuffle
from distributed import Client, wait

client = Client()  # assumed: connect to the cluster used for the benchmark

ddf_h = timeseries(start='2000-01-01', end='2000-02-01', partition_freq='5min')
ddf_h = ddf_h.map_partitions(lambda df: df.head(0))
ddf_h = ddf_h.persist()
print(ddf_h)
_ = wait(ddf_h)
result = shuffle(ddf_h, "id", shuffle="tasks")
ddf = client.persist(result)
_ = wait(ddf)

Total tasks: 648270
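
The exact py-spy invocation isn't shown here; a typical way to sample a running scheduler process (the PID placeholder and output path are hypothetical) would be:

py-spy top --pid <scheduler-pid>
py-spy record --pid <scheduler-pid> -o scheduler-profile.svg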

Latest master (2021-01-29)

  • write: 7.88%
  • to_frames: 7.81%
  • extract_serialize: 6.07%
  • transitions: 3.24%

[screenshot: py-spy profile of latest master, 2021-01-29 4:23 PM]

This PR

  • write: 10.04%
  • to_frames: 9.96%
  • extract_serialize: 7.77%
  • transitions: 2.17%

[screenshot: py-spy profile of this PR, 2021-01-29 4:21 PM]

@jakirkham (Member, Author) commented Jan 29, 2021

To summarize the above results, we have moved a good chunk of time out of transitions and into tcp's write. Most of the time in write is actually just spent serializing messages, which we already knew. So that seems like the next thing to work on once this is in.

@jakirkham (Member, Author) commented:

Planning on merging tomorrow if no comments

@gaogaotiantian commented:

@jakirkham I happened to see this thread about sluggish viztracer :)

I'm aware that with the default size of the circular buffer, if you fill the whole buffer you can experience performance issues if your RAM is not large enough. One way to address this is to reduce the buffer size with --tracer_entries 100000 (for example). I know it would be good to always have a smooth experience with the tool, but it's also important to log as many entries as possible, so I picked a larger number.
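
For example, assembling the option mentioned above into a full command (your_script.py stands for whatever is being profiled):

viztracer --tracer_entries 100000 your_script.py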

However, the latest viztracer now supports Perfetto, which performs much better than the Chrome Trace Viewer, which was developed years ago and has since been deprecated.

You can use viztracer with Perfetto like this:

# You need to output to json in order to use perfetto
viztracer -o result.json --open your_script.py

This will automatically open your browser for the report.

Or you can log to a JSON file and open it with vizviewer:

viztracer -o result.json your_script.py
vizviewer result.json

I'm sorry that viztracer did not perform as you expected, but if you want to give it another try, let me know how the latest UI works :)

Successfully merging this pull request may close these issues: Batch transition messages to workers/clients