-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Meta] Shard level Indexing Back-Pressure #478
Comments
… based on key performance thresholds. (#478) Signed-off-by: Saurabh Singh <sisurab@amazon.com>
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
…ration. (#478) Signed-off-by: Saurabh Singh <sisurab@amazon.com>
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
Breaking the changes further into more granular PRs as below. This is to logically help the reviewers to visit the changes.
Below are the initial PRs for IndexingPressure changes, which are now being broken and merged into feature branch. Keeping here for discussion traceability.
Below are List of Items for Future Follow Ups on the backpressure changes:
|
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
…nsearch-project#478) Signed-off-by: Saurabh Singh <sisurab@amazon.com>
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
Signed-off-by: Saurabh Singh <sisurab@amazon.com>
* Add Shard Indexing Pressure Store (#478) Signed-off-by: Saurabh Singh <sisurab@amazon.com> * Added comments and shard allocation based on compute in hot store. Signed-off-by: Saurabh Singh <sisurab@amazon.com> Co-authored-by: Saurabh Singh <sisurab@amazon.com>
It introduces a Memory Manager for Shard Indexing Pressure. It is responsible for increasing and decreasing the allocated shard limit based on incoming requests, and validate the current values against the thresholds. Signed-off-by: Saurabh Singh <sisurab@amazon.com>
* Add Shard Indexing Pressure Store (#478) Signed-off-by: Saurabh Singh <sisurab@amazon.com> * Added comments and shard allocation based on compute in hot store. Signed-off-by: Saurabh Singh <sisurab@amazon.com> Co-authored-by: Saurabh Singh <sisurab@amazon.com>
It introduces a Memory Manager for Shard Indexing Pressure. It is responsible for increasing and decreasing the allocated shard limit based on incoming requests, and validate the current values against the thresholds. Signed-off-by: Saurabh Singh <sisurab@amazon.com>
…earch-project#716) Signed-off-by: Saurabh Singh <sisurab@amazon.com>
opensearch-project#718) Signed-off-by: Saurabh Singh <sisurab@amazon.com>
…arch-project#717) Signed-off-by: Saurabh Singh <sisurab@amazon.com>
…h-project#838) * Add Shard Indexing Pressure Store (opensearch-project#478) Signed-off-by: Saurabh Singh <sisurab@amazon.com> * Added comments and shard allocation based on compute in hot store. Signed-off-by: Saurabh Singh <sisurab@amazon.com> Co-authored-by: Saurabh Singh <sisurab@amazon.com>
…pensearch-project#945) It introduces a Memory Manager for Shard Indexing Pressure. It is responsible for increasing and decreasing the allocated shard limit based on incoming requests, and validate the current values against the thresholds. Signed-off-by: Saurabh Singh <sisurab@amazon.com>
Closing this issue as changes are already merged as part of #1336 |
Is your feature request related to a problem? Please describe.
Elastic search today provides few gating mechanism to protect a node when under a duress, via the concepts of queue rejections and circuit breakers. However, queue sizes are fixed, isolated and do not effectively represent the total work required to be done. Similarly, Circuit breakers acts as a last line of defence, are mostly too late to act upon, and do not offer fairness. Some of these gaps create availability issue with the cluster, when under duress due to hardware failures, node performance degradation or traffic bursts.
Indexing Pressure today tries to address this to some extent by rejecting indexing requests based on some hard-coded limits on nodes. However, there is a need for smarter rejection mechanism at shard level, when there are too many stuck/slow indexing requests, breaching key performance thresholds (such as throughput). This can prevent the cluster from running into cascading effects of failures.
Describe the solution you'd like
With shard level indexing pressure we want to improve the current Indexing Pressure framework which performs memory accounting at node level and rejects the requests. We aim to take a step further to have rejections based on the memory accounting at shard level along with other key performance factors like throughput and last successful requests. This can be called as ShardIndexingPressure.
Key features to be covered
Additional context
Shard Indexing Pressure will be available in two phases ie Shadow & Enforced, via dynamic ES settings as below:
Enabled - To turn the shard indexing pressure feature turn on and off (default off initially).
Enforced - To run the feature in Shadow/Enforced mode if Enabled. In Shadow mode there will be no rejections but metrics will be published. In Enforced mode there will be actual rejection happening in the cluster.
Rejection Criteria - Three broad criteria for rejections as below:
Node Limit - Acts as the defence line if the utilisation by all the shards reaches the node limit assigned. i.e. 10% of heap. This effectively indicates a shard is unable to take in more traffic on the node.
Throughput degradation - This is to detect any hardware/software issue resulting into performance degradation. When node level occupancy is already breaching its soft-limit, and there exists a constant deterioration in the request turnaround at a shard level, additional requests are rejected until the system recovers.
Last Successful request - This is to address stuck requests or black hole scenarios. When node level occupancy is already breaching the soft-limit, and shard has multiple outstanding requests whole no request are turning around, then beyond a threshold additional requests will be actively discarded until the system recovers.
All the thresholds and key parameters will be available in for of dynamic ES settings for real time tuning.
Feature Branch: 478_indexBackPressure
The text was updated successfully, but these errors were encountered: