Cluster Manager Task Throttling #479

dhwanilpatel · 2021-04-01T10:25:12Z

Is your feature request related to a problem? Please describe.

For many cluster activities, such as put-mapping, create-index, and shard-started, data nodes submit tasks to the master node. Sometimes, because of a bug or other issue, data nodes flood the master node with too many tasks, which shows up as spikes in the master's pending task queue. This can degrade the master's performance, which in turn can affect the availability of the whole cluster.

We should increase the master's resiliency against such spikes in pending tasks.

Describe the solution you'd like

We can make the master more resilient by adding task throttling on the master node. The master will reject tasks submitted by data nodes based on throttling limits. Throttling should work per task type, so that throttling one task type won't affect the submission of other task types.
Once the master rejects a task based on the throttling logic, the data node will retry submitting it to the master node with exponential backoff.
We should add a dynamic setting for enabling and disabling throttling on the master, and the throttling configuration for each task type should also be provided through dynamic settings.
This framework will help when there is a bug or issue in the cluster: we can enable throttling to keep the master resilient against a high volume of tasks, and disable it once the underlying bug or issue is resolved.
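As a rough illustration of the per-task-type rejection described above, here is a minimal sketch. It assumes a simple count-based limit on pending tasks per task type; the class name `TaskThrottlerSketch`, its methods, and `ThrottledException` are hypothetical and are not the OpenSearch implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of per-task-type throttling; not the actual OpenSearch code. */
final class TaskThrottlerSketch {

    /** Signals that a task was rejected because its type is over the limit. */
    static final class ThrottledException extends RuntimeException {
        ThrottledException(String taskType, int limit) {
            super("task type [" + taskType + "] throttled, limit=" + limit);
        }
    }

    // Stands in for the dynamic enable/disable setting.
    private volatile boolean enabled = true;
    // Stands in for the dynamic per-task-type limits; a missing entry means "not throttled".
    private final Map<String, Integer> limits = new ConcurrentHashMap<>();
    // Number of pending tasks per task type.
    private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<>();

    void setEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    void setLimit(String taskType, int limit) {
        limits.put(taskType, limit);
    }

    /** Called when a task is submitted; throws if this task type is over its limit. */
    void onSubmit(String taskType) {
        AtomicInteger count = pending.computeIfAbsent(taskType, k -> new AtomicInteger());
        Integer limit = limits.get(taskType);
        if (!enabled || limit == null) {
            count.incrementAndGet();
            return;
        }
        // Compare-and-set loop so concurrent submitters cannot exceed the limit.
        while (true) {
            int current = count.get();
            if (current >= limit) {
                throw new ThrottledException(taskType, limit);
            }
            if (count.compareAndSet(current, current + 1)) {
                return;
            }
        }
    }

    /** Called when a task finishes executing and leaves the queue. */
    void onComplete(String taskType) {
        AtomicInteger count = pending.get(taskType);
        if (count != null) {
            count.updateAndGet(c -> Math.max(0, c - 1));
        }
    }

    int pendingCount(String taskType) {
        AtomicInteger count = pending.get(taskType);
        return count == null ? 0 : count.get();
    }
}
```

Because the counters are keyed by task type, hitting the limit for put-mapping leaves create-index submissions untouched, and flipping `enabled` off lets everything through again without a restart.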

Describe alternatives you've considered

De-duplication of tasks: We also have a de-duplication framework that prevents duplicate tasks from being submitted to the master node, but it won't help in all cases. Data nodes can submit distinct tasks and still flood the master, and the master can also be flooded by customer-driven activity where the tasks are not duplicates. We want to make the master resilient against a high number of pending tasks in general, so de-duplication alone won't achieve that.

Additional context

The master batches tasks, so it iterates over all tasks queued in the master queue to see whether they can be batched. Such tasks also remain in the queue until they are executed, so they consume memory as well (how much depends on the particular task type).
As a result, a high number of pending tasks in the master queue can affect the CPU and JVM of the master node, and therefore the availability of the whole cluster.

dhwanilpatel · 2021-04-20T13:17:10Z

Breaking the changes into multiple PRs:

Add master task throttling changes in data/master nodes (Add master task throttling #553)
Add throttling stats to the Stats API.
Add documentation for the new settings.

anasalkouz · 2021-11-08T21:55:52Z

Hi @dhwanilpatel, are you actively working on this? Could you please provide some updates?

dblock · 2022-03-07T16:58:43Z

There was an attempt in #553 to implement this that hasn't been finished. Please feel free to pick it up where it was left!

dhwanilpatel · 2022-05-31T11:59:35Z

Hello,

I am going to pick this up again and take these changes to completion. The major feedback on the last PR (#554) was to break the changes into multiple PRs for ease of review. Below is the plan for how I will break the changes into multiple PRs.

Basic throttler framework / basic exponential backoff policy. (Add basic thorttler/exponential backoff policy for retry/Defination o… #3527)
Changes required in the master node to perform throttling. (Master node changes for master task throttling #3882)
Changes required in the data node to retry on throttling. (Data node changes for master task throttling #4204)
Provide support for all task types in the throttling framework. (Onboarding of few task types to throttling #4542)
Integration tests. (Fix timeout exception and Add Integ test for Master task throttling #4588)
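For readers unfamiliar with the retry side of this plan, the following is a minimal sketch of exponential backoff on rejection. It assumes the delay doubles on each attempt up to a cap and uses `RejectedExecutionException` as a stand-in for the throttling exception; `BackoffRetrySketch` and its method names are hypothetical and do not reflect the code in the PRs above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.RejectedExecutionException;

/** Hypothetical sketch of retry with exponential backoff; not the actual OpenSearch code. */
final class BackoffRetrySketch {

    /** Delay before retry number {@code attempt} (0-based): base * 2^attempt, capped at maxMillis. */
    static long delayMillis(long baseMillis, long maxMillis, int attempt) {
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            // Jump straight to the cap instead of doubling past it (also avoids overflow).
            delay = delay > maxMillis / 2 ? maxMillis : delay * 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * Runs the call, retrying when it is rejected. RejectedExecutionException is only
     * a stand-in here for whatever exception the master uses to signal throttling.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (RejectedExecutionException e) {
                if (attempt + 1 >= maxAttempts) {
                    throw e; // out of attempts, surface the rejection to the caller
                }
                Thread.sleep(delayMillis(baseMillis, maxMillis, attempt));
            }
        }
    }
}
```

The sketch blocks the calling thread with `Thread.sleep` for simplicity; a real node would more likely schedule the retry on a timer, and might add random jitter so that many data nodes do not retry at the same instant.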

Below is the list of items for future follow-up:

Documentation regarding new settings.
Throttling stats in Stats API.

@dblock can you please help in creating the feature branch for this issue, against which we can raise multiple PRs.

dblock · 2022-06-14T15:11:53Z

Sorry for the late reply - I think @CEHENKLE has a process for feature branches.

You don't need to wait on me, raise a PR and we can redirect it to a feature branch when it's ready, too.

CEHENKLE · 2022-07-21T15:57:32Z

@dhwanilpatel Hey, how's it going? Can we help at all?

dhwanilpatel · 2022-08-12T13:39:53Z

@CEHENKLE So far it is going as per plan. The data node and master node changes are in review. After those PRs, the upcoming PRs should be straightforward.

Thanks to the reviewers for their valuable feedback.

CEHENKLE · 2022-08-23T17:23:58Z

@dhwanilpatel Cool beans. LMK if we can help :)

/C

shwetathareja · 2022-08-30T08:19:55Z

@dhwanilpatel need a task in here for integ tests as well.

elfisher · 2022-08-31T18:37:48Z

@dhwanilpatel I wanted to confirm if this is on track for the 2.3 release. If so, pls add the v2.3.0 label to this issue. Thanks!

dhwanilpatel · 2022-10-10T10:28:32Z

Created a follow-up issue for exposing the throttling exception to the user: #4724

elfisher · 2022-11-07T20:29:44Z

@dhwanilpatel given we are calling the feature "Cluster Manager Task Throttling" can we rename this issue "Cluster Manager Task Throttling"?

dhwanilpatel added the enhancement Enhancement or improvement to existing feature or request label Apr 1, 2021

dhwanilpatel mentioned this issue Apr 2, 2021

Add master task throttling #484

Closed

6 tasks

dhwanilpatel mentioned this issue Apr 14, 2021

Add master task throttling #553

Closed

6 tasks

AmiStrn mentioned this issue Apr 27, 2021

[Proposal] Query Profiling #539

Closed

minalsha added the untriaged label May 31, 2022

dhwanilpatel mentioned this issue Jun 7, 2022

Add basic thorttler/exponential backoff policy for retry/Defination o… #3527

Merged

5 tasks

dhwanilpatel mentioned this issue Jul 13, 2022

Master node changes for master task throttling #3882

Merged

5 tasks

CEHENKLE removed the untriaged label Jul 21, 2022

dhwanilpatel mentioned this issue Aug 12, 2022

Data node changes for master task throttling #4204

Merged

5 tasks

elfisher added the roadmap label Aug 31, 2022

dhwanilpatel mentioned this issue Sep 19, 2022

Onboarding of few task types to throttling #4542

Merged

6 tasks

dhwanilpatel mentioned this issue Sep 26, 2022

Fix timeout exception and Add Integ test for Master task throttling #4588

Merged

6 tasks

dhwanilpatel mentioned this issue Oct 10, 2022

[Master Task Throttling] Expose throttling exception to cx #4724

Open

Rishikesh1159 added the distributed framework label Oct 10, 2022

dhwanilpatel mentioned this issue Oct 20, 2022

Complete TODO for version change and removed unused classes #4846

Merged

6 tasks

This was referenced Oct 31, 2022

[Feature]Cluster manager task throttling feature #4986

Merged

[Backport 2.x] Cluster Manager task throttling #5041

Merged

dhwanilpatel mentioned this issue Nov 4, 2022

[Backport 2.4] Cluster manager task throttling feature #5071

Merged

6 tasks

This was referenced Dec 12, 2022

Minor bug fix related to cluster manager throttling settings #5524

Merged

[Backport 2.x] Minor bug fix related to cluster manager throttling settings #5598

Merged

dhwanilpatel mentioned this issue Dec 22, 2022

Change version check for cluster manager throttling setting to 2.5 #5617

Merged

6 tasks

andrross changed the title ~~Master Task Throttling~~ Cluster Manager Task Throttling Dec 22, 2022

dhwanilpatel mentioned this issue Dec 26, 2022

Add version check during task submission for bwc for static threshold setting #5633

Merged

6 tasks

Bukhtawar mentioned this issue Dec 31, 2022

Adding @gbbafna to OpenSearch maintainers #5668

Merged

2 tasks

This was referenced Jan 10, 2023

Add cluster manager task throttling stats in nodes stats API #5790

Merged

[Backport 2.x] Add cluster manager throttling task stats in nodes stats API #5871

Merged

anasalkouz added Migration:Backlog and removed Migration:Backlog labels Mar 17, 2023

This was referenced Apr 5, 2023

Make cluster manager throttling retry delay as dynamic setting #6997

Closed

Make cluster manager throttling retry delay as dynamic setting #6998

Merged

dhwanilpatel mentioned this issue Aug 11, 2023

Cluster manager throttling retry improvements #9258

Open

shwetathareja closed this as completed Aug 24, 2023

gbbafna mentioned this issue Sep 15, 2023

[Resiliency] Enable guard rails by default #10063

Open

khushbr mentioned this issue Oct 10, 2023

Remove redundant ClusterManagerThrottlingMetricsCollector opensearch-project/performance-analyzer#582

Merged

6 tasks

ritty27 pushed a commit to ritty27/OpenSearch that referenced this issue May 12, 2024

Adding szczepanczykd to Maintainers and Codeowners (opensearch-projec…

e027ae8

…t#479) Signed-off-by: Vacha Shah <vachshah@amazon.com>
