Rebalance during a moving cluster #4906
In a side-conversation @crusaderky said the following: I'm working on running rebalance() while a computation is in progress.
The fundamental idea behind it is that it would be prohibitive to think about and deterministically preempt all the use cases that could trigger a race condition. Such an approach would lead to a very poor cost/benefit ratio. If however we can gracefully deal with an occasional key that is moved where it shouldn't, when it shouldn't, and just make the computation bring it back to its rightful place, things will be a lot easier and more robust overall.

As for 1 (coarse preemption), I am inclined to simply blacklist from rebalancing all keys that are an input of a queued or running task. It is a very greedy strategy, and I appreciate that people will come up with use cases where a key should be moved even if it's part of a computation, but I really want to KISS (Keep It Simple & Stupid) in the initial delivery, push it through, let people play with it, and then analyse in real-life scenarios where it falls short and potentially add iterative refinements over time. Also, such a conservative approach would guarantee that we can quickly deliver something that is better than nothing, as opposed to a regression which actively hampers computations.
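To make the blacklist idea concrete, here is a minimal sketch of the kind of filter it implies. The attribute names (`tasks`, `state`, `dependents`) mirror distributed's scheduler-side task state, but the helper itself is hypothetical, not the actual implementation:

```python
# Hypothetical sketch of the "coarse preemption" blacklist described above.
# Attribute names mirror distributed's scheduler state; this helper is
# illustrative only, not real code from the project.

def rebalance_candidates(scheduler):
    """Yield keys that are in memory and are not an input of any
    queued or running task."""
    for key, ts in scheduler.tasks.items():
        if ts.state != "memory":
            continue
        # Greedy blacklist: skip any key with a dependent that may still run
        if any(dts.state in ("waiting", "processing") for dts in ts.dependents):
            continue
        yield key
```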
In general I'm supportive of simple 90% solutions :) However, I also want to propose an alternate way of looking at things. We're never going to find safe data to move around. For example, even if a task isn't currently processing, it might end up processing in the next cycle (because a worker just went down, for example). Trying to segment data into safe and unsafe may not be a possible way of looking at the world. Instead, what we did for work stealing (which may or may not be a good example) is that we developed a protocol/handshake to move things around in a safe way. An example of such a handshake is given in the two collapsed threads from the original comment ("Example conversation" and "Example conversation with break", not reproduced here).

I think that if every transfer we do passes through a safe cycle like this, and if we never actually need that transfer to occur (because, for example, we call rebalance every second or so) then we're good. I think that generally it's useful to solve resiliency problems in the small, where we think about how to do a small step safely, rather than trying to think about resiliency in the large, where we try to keep a global view of things.
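A rough sketch of what such a confirm-before-delete handshake could look like follows. Every method name here is invented for illustration; this is not distributed's real scheduler or worker API, just the shape of the cycle being described:

```python
# Illustrative only: a safe cycle for moving one key's data between workers.
# None of these method names exist in distributed.

async def move_key_safely(scheduler, key, donor, recipient):
    # 1. The recipient copies the data while the donor keeps its replica,
    #    so any computation can still read the key from the donor.
    ok = await recipient.acquire_replica(key, from_worker=donor)
    if not ok:
        return  # recipient refused (e.g. memory pressure); nothing changed

    # 2. The scheduler records the new replica before anything is deleted.
    scheduler.add_replica(key, recipient)

    # 3. Only now may the donor drop its copy, and only if the scheduler
    #    still agrees that the donor's replica is redundant.
    if scheduler.replica_count(key) > 1:
        await donor.release_replica(key)
        scheduler.remove_replica(key, donor)
```

The key property is that at every intermediate step the cluster is in a consistent state, so an interrupted or rejected transfer degrades to a no-op rather than to lost data.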
From a discussion we had yesterday, a few notes I still have in mind. @mrocklin @crusaderky feel free to add on in case I missed something.
Came upon this issue when I was seeing some behavior that I think(?) might be related. I was using the secede/rejoin workflow to submit tasks from tasks. In some cases, a worker was not doing any work after seceding, and I realized it was because there were no other tasks in its list to process. These were long-running tasks, and there were only a few left. For example, my workers looked something like this:

Worker 1: 1 task processing

It was Worker 1 that had seceded when running its one task, but it did not have any more work to do, so it was sitting idle. I think the ideal situation would be to pull in the not-yet-started tasks from the other workers, but that would require some rebalancing on-the-fly. I'm not sure if that is related to this discussion - it could also just be that dask determined that the movement of data required to shift tasks to Worker 1 was not worth it, or something like that (I'm not super familiar with the inner scheduler logic). If it's unrelated to this discussion feel free to ignore, but I just wanted to point out a potentially relevant use case.
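For readers unfamiliar with the workflow mentioned above, a minimal example of submitting tasks from tasks with secede/rejoin (following the pattern in the distributed documentation) looks roughly like this:

```python
from distributed import Client, get_client, secede, rejoin

def fib(n):
    # A task that submits more tasks from inside a worker.
    if n < 2:
        return n
    client = get_client()
    a = client.submit(fib, n - 1)
    b = client.submit(fib, n - 2)
    secede()   # step out of the worker's thread pool while blocked on children
    results = client.gather([a, b])
    rejoin()   # re-acquire a slot in the thread pool before doing more work
    return sum(results)

if __name__ == "__main__":
    client = Client()  # local cluster, for demonstration
    print(client.submit(fib, 10).result())
```

A worker that has seceded from its only task, as described above, ends up idle unless work stealing moves queued tasks onto it.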
@bolliger32 what you are describing sounds like a work-stealing issue. Work stealing is the process of moving not-yet-executed tasks to other workers if those workers have more capacity for computation. In this specific issue, we're discussing already-computed task results and the resulting data replicas.
ahh thanks for the clarification @fjetter ... carry on :) I'll do some more investigation to see why the work was not stolen in this situation when (I'm pretty sure) it would have helped.
Status update: the Active Memory Manager core machinery is live and is currently being used for worker retirement and (optionally) for replica reduction. The effort required to build rebalancing on top of it is relatively mild - a lot of cut-and-paste and rewriting some tests.
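As a point of reference, this is roughly how the AMM machinery and the optional replica-reduction policy can be switched on via dask config. The keys below follow the distributed documentation at the time of writing; a Rebalance policy does not exist yet and is the missing piece this issue tracks:

```python
# Enable the Active Memory Manager with the ReduceReplicas policy
# (the optional replica reduction mentioned above).
import dask
from distributed import Client

dask.config.set({
    "distributed.scheduler.active-memory-manager.start": True,
    "distributed.scheduler.active-memory-manager.interval": "2s",
    "distributed.scheduler.active-memory-manager.policies": [
        {"class": "distributed.active_memory_manager.ReduceReplicas"},
    ],
})
client = Client()  # the scheduler picks up the config on start
```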
Evidence of potential benefits of AMM Rebalance

I looked at the coiled-runtime benchmarks for evidence of severe memory imbalances between workers, which rebalancing the cluster every 2 seconds would potentially smooth out. Below is an analysis of use cases in coiled-runtime CI that would either benefit straight away (as they are currently spilling on 1-2 workers), would benefit if one increased the size of the dataset, or would not benefit at all. All the use cases that would benefit show a large gap between the worst worker (or 4th quartile) and the mean. AMM Rebalance would take memory away from the worst-case workers (max) and move it to the best-case ones (min). I've omitted the test cases that generate trivial amounts of data. All tests were run on 10 workers with 2 threads per worker and 8 GiB of RAM each, which start spilling at 4.8 GiB worth of managed memory.

- benchmarks/test_array.py::test_anom_mean: Data is already balanced; AMM Rebalance would do nothing.
- benchmarks/test_array.py::test_basic_sum: Looks like test_anom_mean. No benefit.
- benchmarks/test_array.py::test_dot_product: Mild imbalance and no spilling. AMM Rebalance would not improve the runtime of the test; the same test on a larger dataset could receive a minor runtime improvement.
- benchmarks/test_array.py::test_double_diff: Substantial imbalance and mild spilling. AMM Rebalance would most likely cause improvements for larger datasets.
- benchmarks/test_array.py::test_map_overlap_sample: The test runs in 7s end to end, while Grafana has a sample period of 5s. Inconclusive data.
- benchmarks/test_array.py::test_vorticity: Substantial imbalance and spilling in the second half of the run. The use case would most likely benefit from AMM Rebalance.
- benchmarks/test_csv.py::test_csv_basic: No imbalance and no spilling.
- benchmarks/test_custom.py::test_jobqueue: No imbalance and no spilling.
- benchmarks/test_dataframe.py::test_dataframe_align: Major imbalance and spilling. To my understanding this is a shuffle - interaction with AMM Rebalance is somewhat less predictable here.
- benchmarks/test_dataframe.py::test_shuffle: No imbalance, no spilling, no benefit. I don't understand why the graph looks so different from test_dataframe_align; I need to investigate further.
- benchmarks/test_join.py::test_join_big[0.1]: No imbalance, no spilling, no benefit.
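For context on what "rebalancing the cluster every 2 seconds" would mean mechanically: AMM policies are small plugins whose run() method yields drop/replicate suggestions on every AMM interval. The toy policy below is entirely hypothetical - only the ActiveMemoryManagerPolicy base class and the (command, task, candidates) yield pattern come from distributed's AMM documentation, and the worker-state attribute access may differ across versions:

```python
from distributed.active_memory_manager import ActiveMemoryManagerPolicy

class ToyRebalance(ActiveMemoryManagerPolicy):
    """Hypothetical sketch: suggest replicating a key away from the
    most-loaded worker. Not the actual AMM Rebalance design; the
    threshold below is made up."""

    def run(self):
        workers = sorted(
            self.manager.scheduler.workers.values(),
            key=lambda ws: ws.memory.managed,
        )
        if len(workers) < 2:
            return
        donor = workers[-1]
        if donor.memory.managed < 2 * workers[0].memory.managed:
            return  # made-up threshold: cluster is balanced enough
        for ts in list(donor.has_what):
            # Suggest one extra replica; the AMM picks the recipient when
            # candidates is None. A later cycle could drop the donor's copy.
            yield "replicate", ts, None
            break  # be gentle: one key per iteration
```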
@crusaderky in #4906 (comment) you mentioned the effort for this would be "relatively mild". Can you try to be a bit more specific? Hours? Days? Weeks? Months?
5 to 8 days.
This is part of Active Memory Management
We would like to be able to intentionally move data around the cluster. Today we do this with methods like rebalance and replicate; however, these do not work well while the cluster is active. We would like to improve robustness here.
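For concreteness, this is the data-movement API that exists today. Both are real Client methods, but they assume a quiescent cluster, which is exactly the limitation this issue targets (the scheduler address below is a placeholder):

```python
from distributed import Client

client = Client("tcp://scheduler:8786")  # example address
futures = client.map(lambda x: x ** 2, range(1000))
client.gather(futures)

client.rebalance()              # even out memory usage across workers
client.replicate(futures, n=2)  # keep two replicas of each result
```

Both calls take a snapshot of cluster state and move data accordingly, so results computed or released while they run can race with the transfer - hence the handshake discussion above.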