Most use cases are much slower on 4-CPU workers #551
Comments
What's the goal? Improve dask performance under various circumstances, provide better guidance on good infrastructure choices, something else? |
The goal is that dask gets roughly 40% better memory performance and fewer network transfers on fewer, larger hosts, but in order to benefit from that we need to deal with GIL contention first. |
The context is that we've had an argument about the performance of workers with many threads. The benchmarks above use the same total number of CPUs and the same total RAM on the cluster, but the distribution is different. Under the assumption that GIL contention is not an issue, the "few larger VMs/workers" configuration should perform better, at least w.r.t. memory usage. We can see that this is true: memory usage is reduced.
The goal is a combination of things. This would likely translate into better performance for everyone, a better default setting for Coiled, and possible future scenarios where we deploy things a bit differently (e.g. multiple workers on the same, larger VM).
@crusaderky Why compare AMD vs Intel? I don't see how this is related.
Re: required actions, let's talk about these points first before going down the rabbit hole. |
FYI for a while we were unintentionally giving customers AMD rather than Intel (we weren't excluding AMD and we were filtering by cost). Multiple customers said "why did my workloads get slower?" |
We did not have a measure of how much slower AMD instances are. |
FWIW I think the platform team would be happy to hear evidence about common workloads where it'd be helpful to have multiple worker processes per instance. That's the sort of data that would help us prioritize allowing that in the platform. (It's possible there'd also be dask help we want for implementing it.) My interpretation of Florian's comment is that we're not really sure yet, but maybe I'm reading it wrong. Anyway let's stay in touch! |
Yes, not sure yet. The one benefit this would have is that transfers between these workers are typically faster, because we don't actually need to go over the network and the kernel can sometimes even apply some optimizations. The common workload here is again network-heavy. I can see this being relevant both for "very large data transfers (higher throughput)" and for "many, many small data transfers (lower latency)". I don't think we have any reliable data to know how impactful this is in which scenario.
I'm wondering how difficult it would be for Coiled to support this, even just for an experiment. There is also a way to achieve the same/similar from within dask. I'm interested in running this experiment but I don't know what the cheapest approach is. |
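For illustration (not part of the thread's actual benchmark setup), here is a minimal sketch of the "from within dask" route on a single large machine: `dask.distributed.LocalCluster` lets you trade worker processes for threads while keeping total parallelism fixed, which is the same knob being discussed for multi-worker VMs. The worker and thread counts below are illustrative assumptions.

```python
# Minimal sketch, assuming dask.distributed is installed. On one large host,
# keep total parallelism at 8 threads but vary how many worker processes
# share it: fewer processes means more threads contending on each GIL,
# more processes means inter-worker transfers that stay on the loopback
# interface instead of the real network.
from dask.distributed import Client, LocalCluster

# Option A: one big worker with 8 threads (maximum GIL sharing).
# cluster = LocalCluster(n_workers=1, threads_per_worker=8, processes=True)

# Option B: four smaller workers with 2 threads each (less GIL contention,
# intra-machine transfers between processes).
cluster = LocalCluster(n_workers=4, threads_per_worker=2, processes=True)

client = Client(cluster)
print(client)
```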
FYI snakebench already supports profiling each test with py-spy in native mode: https://github.com/gjoseph92/snakebench#profiling. This makes it very easy to see how much GIL contention (or if not that, something else) is affecting us. See the dask demo day video for an example of profiling the GIL in worker threads with py-spy: https://youtu.be/_x7oaSEJDjA?t=2762 |
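Separately from snakebench's built-in profiling, a minimal sketch of attaching py-spy to an already-running worker and keeping only GIL-holding samples; the PID and output filename are placeholders.

```python
# Minimal sketch, assuming py-spy is installed and you know the PID of a
# dask worker process. `--gil` keeps only samples where the sampled thread
# holds the GIL; `--native` also unwinds C/Cython frames.
import subprocess

worker_pid = 12345  # hypothetical: PID of a running dask worker

subprocess.run(
    [
        "py-spy", "record",
        "--pid", str(worker_pid),
        "--native",
        "--gil",
        "--duration", "60",            # sample for 60 seconds
        "--output", "worker-gil.svg",  # flamegraph output
    ],
    check=True,
)
```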
I mean, we could test this without coiled. Manually start a bunch of EC2 instances, manually deploy conda on all of them, manually start the cluster. It's a substantial amount of legwork though. |
I see three options
|
FWIW I have doubts about the benefit of faster intra-machine network communication being significant. I would be curious to hear perspectives on this comment: https://github.com/coiled/product/issues/7#issuecomment-1328076454
I'm curious about GIL-boundedness. I might suggest looking at this problem also in the small with a tool like https://github.com/jcrist/ptime and playing with data types. My hypothesis is that we can make this problem go away somewhat by increasingly relying on Arrow data types in pandas. I don't know though. |
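As a rough, self-contained stand-in for the %ptime experiment suggested above (assuming pandas and pyarrow are installed; sizes and dtypes are illustrative), run the same groupby in several threads at once and compare against a single run. If the operation releases the GIL, the threaded wall time stays close to the serial one; if it holds the GIL, it degrades toward N× slower.

```python
# Minimal sketch of measuring GIL-boundedness of a pandas groupby with
# plain threads, comparing object-dtype strings to Arrow-backed strings.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def make_frame(string_dtype, n=2_000_000):
    keys = np.random.randint(0, 1_000, n).astype(str)
    return pd.DataFrame({
        "key": pd.Series(keys, dtype=string_dtype),
        "value": np.random.random(n),
    })


def slowdown(df, n_threads=4):
    op = lambda: df.groupby("key")["value"].mean()
    t0 = time.perf_counter()
    op()
    serial = time.perf_counter() - t0
    t0 = time.perf_counter()
    with ThreadPoolExecutor(n_threads) as pool:
        list(pool.map(lambda _: op(), range(n_threads)))
    threaded = time.perf_counter() - t0
    # ~1.0 means the threads ran in parallel; ~n_threads means fully GIL-bound.
    return threaded / serial


print("object strings: ", slowdown(make_frame("object")))
print("arrow strings:  ", slowdown(make_frame("string[pyarrow]")))
```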
Why don't we just profile it with py-spy and see how much GIL contention there is #551 (comment)? You just do |
I ran the tests under py-spy for 2-vcpu and 4-vcpu workers. Benchmark results: https://observablehq.com/@gjoseph92/snakebench?commits=3b8fe8e&commits=6c9b5e1&measure=duration&groupby=branch&show=passed
It's only 1 repeat and some workers were running py-spy, so I wouldn't put too much weight on the results, but we are seeing most tests 50% slower and many 100-200% slower by trading workers for threads.
The py-spy profiles are GitHub artifacts which can be downloaded here:
- 2cpu: https://github.com/gjoseph92/snakebench/actions/runs/3576809575
- 4cpu: https://github.com/gjoseph92/snakebench/actions/runs/3576826130 |
Sorry Gabe, can you catch me up on what you're comparing? Is this a cluster with many (how many?) EC2 instances, some with 4 CPUs and processes, others with 2 CPUs and threads? How are you running the cluster given that Coiled doesn't do what Guido wants?
|
If you go to the benchmark results on observable and click through to the commits being compared, you can see the state of everything. But this is the only difference between the two cases (gjoseph92/snakebench@6c9b5e1): switching from 10
The primary point of this was to record py-spy profiles, which snakebench makes easy to do. |
OK. Thanks. So the lesson is... GIL contention is a common problem and you should frequently use more small workers instead of fewer big ones? And that big worker instances with more worker processes could be nice, but as far as we know now, Matt's right (in https://github.com/coiled/product/issues/7#issuecomment-1328076454) that it's not a big deal, and you're basically as well off with many small workers as you would be if Coiled implemented separate worker processes on an instance? (This experiment isn't really meant to address Matt's point there, right?) |
No, it's not. It's intended to try to get a little more of a sense of the "why". I'm thinking though that it might also motivate how much we want to work on https://github.com/coiled/product/issues/7. For instance, if profiling indicates "there are a couple of obvious things we could change in dask or upstream to alleviate this", then multiple workers per machine on Coiled becomes less important. If it indicates "this is a pervasive issue with the GIL we're unlikely to solve easily", then we just have to answer:
|
I'm not going to do extensive analysis on these (though I hope someone else does), but here are a few interesting things I see:
Some thoughts:
|
Profiles only included frames holding the GIL: |
Here is a small video measuring parallel performance of groupby aggregations on a 64-(virtual)-core machine with an increasing number of cores (using ptime) and different data types. I use groupby-aggregations because I feel that they're an OK stand-in for a "moderately complex pandas workflow" where the GIL is concerned. https://drive.google.com/file/d/14CoYhheEKo9KXvOesE9olO_VlsRJwINM/view?usp=share_link (I'll ask @mrdanjlund77 to make it pretty and put it on youtube for future viewing) Basic results are as follows:
Now of course, as has been mentioned, the queries that we're looking at aren't really exercising groupby-aggregation performance; many of them are almost entirely parquet performance. The experiment that I ran here wasn't really that representative of these benchmarks (but neither are the benchmarks necessarily representative of a wide range of situations). Hopefully, however, this experiment gives some (but not enough) evidence that we can use significantly larger machines (16 cores seems pretty safe) if we're using good data types. |
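A minimal sketch of the "good data types" point, assuming pandas >= 2.0 with pyarrow installed; the file path and column names are hypothetical:

```python
# Read parquet straight into Arrow-backed dtypes instead of NumPy object
# strings, then run a groupby-aggregation on the Arrow-backed frame.
import pandas as pd

df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")  # hypothetical file

# Or convert an existing frame:
df_arrow = df.convert_dtypes(dtype_backend="pyarrow")

result = df_arrow.groupby("category")["value"].mean()  # hypothetical columns
```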
I ran the following A/B test:
https://github.com/coiled/coiled-runtime/actions/runs/3534162161
dask is 2022.11.1 everywhere.
As you can see, the cluster-wide number of CPUs and the memory capacity are the same everywhere.
The cost per hour of the AB_4x is exactly the same as the baseline, whereas the AMD instances are 10% cheaper.
Insights
Required actions