test_basic_sum occasionally takes 340% time and 160% memory to complete #315
Comments
Very interesting. Have you tried reproducing it? Just by counting, it looks like ~5% of all runs are affected (assuming there is no infrastructure issue).
Not yet.
If stealing is not the culprit, my next guess would be reevaluate_occupancy, which recomputes occupancies on a round-robin basis but may be skipped if CPU load on the scheduler is too high. See also dask/distributed#6573 (comment). At this point this is merely guessing about the non-deterministic parts of our scheduling logic.
[Charts of the slow and fast runs omitted.]
Do we have a way to find the clusters these ran on? Or other data about these runs beyond just the wall-clock times?
So the short answer to both is no. (We have peak and average memory use as well as wall-clock time, but in general nothing with more granularity.)
This is mostly true, but note that we are also tracking compute, transfer, and disk-spill time; it's just not visualized at the moment. So if the compute time stayed roughly constant while the wall-clock time spiked, I would suspect something went wrong with scheduling.
FYI @hendrikmakait got his hands on a performance report of a slow run. The task stream shows wide white gaps, and we see that the scheduler event loop is stuck for a while (one time up to ~46s). No GC warnings. I'm not sure, but I highly suspect this aligns with the white gaps. There are a couple of "Connection from tls://77.20.250.112:30608 closed before handshake completed" messages following this 46s tick; I suspect this is a heartbeat? I can't find any corresponding logs on any of the workers. It happened on https://cloud.coiled.io/dask-engineering/clusters/68284/details. @ntabris, do you know why we're seeing different IP addresses here? Should this concern us?
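(For context, capturing a performance report like the one mentioned above is straightforward; the sketch below is illustrative only, with a placeholder scheduler address and an arbitrary workload rather than the actual benchmark code.)

```python
import dask.array as da
from dask.distributed import Client, performance_report

# Placeholder address; in CI this would point at the Coiled cluster under test.
client = Client("tls://scheduler-address:8786")

# Everything computed inside this block is recorded in the HTML report:
# task stream, scheduler "tick" health, transfers, and worker profiles.
with performance_report(filename="test_basic_sum-report.html"):
    da.random.random((20_000, 20_000), chunks=(2_000, 2_000)).sum().compute()
```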
77.20.250.112 is Vodafone Germany, so presumably the client IP. I'll take a look at the logs later this morning and see if I can make anything of them.
Thanks. That should already help clear things up. I don't think you'll find anything useful in the logs. I think this is our problem ;)
If the above are client-side connection attempts, this may be related to us trying to fetch performance reports, etc. If nothing failed client-side, I suspect something like…
Do we have a measure of CPU seconds used by the scheduler process?
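(One way to sample that from a live cluster, sketched here with a placeholder address; how this would be wired into the benchmark harness is an open question.)

```python
import resource

from dask.distributed import Client


def scheduler_cpu_seconds():
    # Executed inside the scheduler process: user + system CPU time
    # consumed so far, in seconds.
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


client = Client("tls://scheduler-address:8786")  # placeholder address
print(client.run_on_scheduler(scheduler_cpu_seconds))
```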
While root-causing #316, we discovered a bug in…
Looks like it's not a bug in… I'm looking into it and will open an issue over there.
Not clear if this is the only thing causing the variation, but it's certainly not helping. |
From the raw db dump, I think I'm reading that, in the "bad" runs, there are many more network transfers. It looks like co-assignment is occasionally and randomly falling apart for some reason?
Healthy run dump: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_array-c2d95249/benchmarks.test_array.py.test_basic_sum.msgpack.gz
Bad run dump: s3://coiled-runtime-ci/test-scratch/cluster_dumps/test_array-c6668c2c/benchmarks.test_array.py.test_basic_sum.msgpack.gz
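(For anyone poking at those dumps locally: they are gzipped msgpack files. A sketch of loading one after downloading it from S3; the top-level layout described in the comments is an assumption and may need adjusting.)

```python
import gzip

import msgpack

# Load a cluster dump downloaded from one of the S3 paths above.
with gzip.open("benchmarks.test_array.py.test_basic_sum.msgpack.gz", "rb") as f:
    dump = msgpack.unpack(f, strict_map_key=False)

# Assumed layout: a scheduler section (tasks, transition log, events) and a
# workers section keyed by worker address. Comparing, e.g., the number of
# transfer-related transitions between the healthy and bad dumps is one way
# to quantify the extra network traffic.
print(dump.keys())
```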
Good to confirm that it seems like co-assignment is going wrong. Can we see if a worker has left at any point? This seems unlikely but would definitely cause co-assignment to be thrown off. Or is it possible the initial task assignment happens before all workers have arrived? Otherwise, I'd suspect work stealing.
On Thu, Oct 20, 2022 at 3:54 AM, crusaderky wrote:
Healthy run: https://user-images.githubusercontent.com/6213168/196837094-323c2360-8d75-40ad-99f9-f463a5e292d3.png
Bad run: https://user-images.githubusercontent.com/6213168/196837143-6df4ca2f-2ca7-42c7-ab0a-4d1f67ab697a.png
Holy work stealing, Batman! In the good run, 13% of the tasks end up stolen (which already feels quite high).
No workers left.
And no, I see that the first task transition happens 2 seconds after the last worker joined the cluster.
Task durations: it's interesting to see here how the bad run, which is spilling a lot more, has much longer "flares" of outliers. Those are all moments where the worker's event loop was busy spilling; this impacted the measured durations, which in turn might have caused improper stealing choices.
If work stealing is under investigation, it's worth looking into the worker idleness detection. Work stealing should only affect workers that are flagged as idle; if this doesn't work properly, work stealing can cause weird things. This should be more reliable in the latest releases, but I still wouldn't be surprised to see bad things happening. In later versions, stealing uses the worker_objective to determine a good thief, but this still breaks co-assignment (we'd need something like dask/distributed#7141 to not break co-assignment).
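(One quick way to test the stealing hypothesis would be to disable it for a run via the standard distributed config key; a sketch, with the caveat that the setting has to be in effect before the scheduler starts.)

```python
import dask

# Turn off work stealing entirely so co-assignment decisions are left intact;
# if the slow runs disappear with this set, stealing is implicated.
dask.config.set({"distributed.scheduler.work-stealing": False})
```

When the scheduler runs remotely (as on Coiled), this typically needs to be injected into the scheduler's environment, e.g. via the DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING environment variable, rather than set client-side.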
Good to see that the recent changes to work stealing seem to have removed the erratic behavior. Some of the issues that were fixed (including work stealing going overboard and stealing way too much) could explain the observed behavior.
Original issue description:
benchmarks/test_array.py::test_basic_sum
usually runs in ~80s wall clock time, ~21GiB average memory, and 27~33GiB peak memory. Once in a while, however, it takes ~270s wall clock time, 32~35GiB average memory, and ~46GiB peak memory.
Both sets of measures are internally very consistent - it's almost exactly always one or the other.
I can't imagine what could possibly happen to trigger a "bad" run.
Both the test and the algorithm being tested are extremely simple.
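(For readers without the repo at hand, the test boils down to something like the sketch below; the array shape, chunking, and the client fixture name are illustrative assumptions, not copied from the benchmark suite.)

```python
import dask.array as da
from dask.distributed import Client, wait


def test_basic_sum(small_client: Client):
    # `small_client` stands in for the benchmark's cluster fixture.
    # Shape and chunks are chosen only to illustrate the idea: a large
    # random array reduced to a single scalar.
    data = da.random.random((50_000, 50_000), chunks=(2_500, 2_500))  # ~20 GB total, ~50 MB chunks
    total = data.sum()
    # The harness measures wall clock time and memory around this compute.
    wait(small_client.compute(total))
```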
Time measures start when all workers are up and running and stop before shutting them down.
There should not be any spilling involved; network transfers should be very mild.
Even in the event of a CPU and/or network slowdown, there should not be an increase in memory usage.
The screenshots are from coiled 0.1.0 (dask 2022.6.0), but I've observed the same behaviour on 2022.8.1 as well.