Cluster Manager Task Throttling #479
Comments
Breaking the changes into multiple PRs:
Hi @dhwanilpatel, are you actively working on this? Could you please provide an update?
There was an attempt in #553 to implement this that hasn't been finished. Please feel free to pick it up where it was left!
Hello, I am going to pick this up again to take these changes to completion. The major feedback on the last PR (#554) was to break the changes into multiple PRs for ease of review. Below is the plan for how I will break the changes into multiple PRs.
Below is the list of items for future follow-up checks:
@dblock can you please help create the feature branch for this issue, against which we can raise multiple PRs?
Sorry for the late reply - I think @CEHENKLE has a process for feature branches. You don't need to wait on me, raise a PR and we can redirect it to a feature branch when it's ready, too.
@dhwanilpatel Hey, how's it going? Can we help at all?
@CEHENKLE so far it is going as per plan. The Data Node and Master Node side changes are in review. After those PRs, the upcoming PRs should be straightforward. Thanks to the reviewers for their valuable feedback.
@dhwanilpatel Cool beans. LMK if we can help :) /C
@dhwanilpatel need a task in here for integ tests as well.
@dhwanilpatel I wanted to confirm if this is on track for the 2.3 release. If so, pls add the v2.3.0 label to this issue. Thanks!
Created a follow-up issue for exposing the throttling exception to the user: #4724
@dhwanilpatel given we are calling the feature "Cluster Manager Task Throttling" can we rename this issue "Cluster Manager Task Throttling"?
…t#479) Signed-off-by: Vacha Shah <vachshah@amazon.com>
Is your feature request related to a problem? Please describe.
For many cluster activities, data nodes submit tasks to the master node, for example put-mapping, create-index, and shard-started. Sometimes, due to a bug or other issue, data nodes flood the master node with too many tasks, and as a result we see spikes in pending tasks in the master queue. This can degrade the master's performance, which can in turn affect the availability of the whole cluster.
We should increase the master's resiliency against such a high number of pending tasks.
Describe the solution you'd like
We can make the master more resilient by adding task throttling on the master node. The master will reject tasks submitted from data nodes based on throttling limits. Throttling should work on a per-task-type basis, so that throttling one task type won't affect the submission of tasks of other types.
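To make the per-task-type idea concrete, here is a minimal sketch in plain Java. It is not the OpenSearch implementation: the class `TaskThrottler`, its methods, and the use of a pending-count limit as the throttling criterion are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: none of these names come from the OpenSearch code base.
class TaskThrottler {
    // Assumed criterion: maximum number of pending tasks allowed per task type.
    private final Map<String, Integer> limits = new ConcurrentHashMap<>();
    // Current number of pending tasks per task type.
    private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    void setLimit(String taskType, int maxPending) {
        limits.put(taskType, maxPending);
    }

    // Returns true if the task is accepted, false if it is throttled (rejected).
    boolean tryAcquire(String taskType) {
        AtomicInteger count = pending.computeIfAbsent(taskType, k -> new AtomicInteger());
        Integer limit = limits.get(taskType);
        while (true) {
            int current = count.get();
            if (limit != null && current >= limit) {
                return false; // only this task type is over its limit
            }
            if (count.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    // Called once a previously accepted task has been executed.
    void release(String taskType) {
        AtomicInteger count = pending.get(taskType);
        if (count != null) {
            count.updateAndGet(c -> Math.max(0, c - 1));
        }
    }

    int pendingCount(String taskType) {
        AtomicInteger count = pending.get(taskType);
        return count == null ? 0 : count.get();
    }
}
```

Because each task type has its own counter and its own limit, a flood of put-mapping tasks would be rejected without blocking, say, create-index submissions. Task types with no configured limit are never throttled in this sketch.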
Once the master rejects a task based on the throttling logic, the data node will retry submitting the task to the master node with exponential backoff.
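The retry behaviour on the data node side could look roughly like the sketch below. The class `BackoffRetry`, the doubling schedule, the delay cap, and the bounded number of attempts are assumptions for illustration; the issue only states that retries use exponential backoff.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a data node retrying a throttled submission.
class BackoffRetry {
    // Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = Math.min(baseMillis, maxMillis);
        for (int i = 0; i < attempt; i++) {
            if (delay > maxMillis / 2) {
                return maxMillis; // doubling would exceed the cap (also avoids overflow)
            }
            delay *= 2;
        }
        return delay;
    }

    // `submit` returns true when the master accepts the task, false when it is throttled.
    // Returns true if the task was eventually accepted within maxAttempts.
    static boolean submitWithRetry(BooleanSupplier submit, int maxAttempts,
                                   long baseMillis, long maxMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (submit.getAsBoolean()) {
                return true;
            }
            if (attempt == maxAttempts - 1) {
                break; // no point sleeping after the final attempt
            }
            try {
                Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return false;
            }
        }
        return false;
    }
}
```

With a base of 100 ms and a cap of 1000 ms, the delays in this sketch would be 100, 200, 400, 800, and then 1000 ms for every later retry. A real implementation might also add random jitter so that many data nodes do not retry at the same instant.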
We should add a dynamic setting for enabling and disabling throttling on the master, and we should also be able to provide the throttling configuration for each task type through dynamic settings.
This framework will help when there is a bug or issue in the cluster: we can enable throttling to make the master resilient against a high number of tasks, and disable it once the underlying bug or issue is resolved.
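As a rough illustration of what "dynamic" could mean here, the sketch below keeps the enable flag and the per-task-type limits in a holder that can be re-applied at runtime without a restart. The setting keys `throttling.enabled` and `throttling.limit.<task-type>` and the class `ThrottlingConfig` are hypothetical and are not actual OpenSearch setting names or APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the keys below are invented for illustration only.
class ThrottlingConfig {
    static final String ENABLED_KEY = "throttling.enabled";
    static final String LIMIT_PREFIX = "throttling.limit.";

    // volatile so that an update is visible to threads checking limits.
    private volatile boolean enabled = false;
    private volatile Map<String, Integer> limits = Map.of();

    // Applies a full snapshot of settings; nothing changes if validation fails.
    void apply(Map<String, String> settings) {
        Map<String, Integer> parsed = new HashMap<>();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            if (e.getKey().startsWith(LIMIT_PREFIX)) {
                int limit = Integer.parseInt(e.getValue());
                if (limit < 0) {
                    throw new IllegalArgumentException("limit must be >= 0: " + e.getKey());
                }
                parsed.put(e.getKey().substring(LIMIT_PREFIX.length()), limit);
            }
        }
        limits = Map.copyOf(parsed);
        enabled = Boolean.parseBoolean(settings.getOrDefault(ENABLED_KEY, "false"));
    }

    // True if a new task of this type should be rejected given the current pending count.
    boolean shouldThrottle(String taskType, int pendingCount) {
        if (!enabled) {
            return false; // throttling switched off: accept everything
        }
        Integer limit = limits.get(taskType);
        return limit != null && pendingCount >= limit;
    }
}
```

In this sketch an operator would apply `throttling.enabled=true` together with, for example, `throttling.limit.put-mapping=50` while an incident is ongoing, and apply `throttling.enabled=false` once the underlying bug is fixed.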
Describe alternatives you've considered
De-duplication of tasks: We already have a de-duplication framework that prevents duplicate tasks from being submitted to the master node, but it won't help in all cases. Data nodes can submit distinct tasks and still flood the master, or the master can be flooded by customer-driven activity where the tasks are not duplicates. We want to make the master resilient against a high number of pending tasks, and de-duplication alone won't achieve that.
Additional context
The master batches tasks, so it iterates over all the tasks queued in the master queue to check whether they can be batched. Such tasks also remain in the queue until they are executed, so they consume memory as well (the amount depends on the task type).
So a high number of pending tasks in the master queue can affect the CPU and JVM of the master node, and thereby the availability of the whole cluster.