[FEATURE] Error Prevention in ISM #432
Overview
An ISM validation service that pre-checks actions before they are executed and notifies users of potential failures.
Motivation
Index State Management (ISM) is a plugin for OpenSearch that automates recurring index lifecycle operations and manages metadata on specific indices through user-defined policies. Every ISM operation (documented here) is managed by a policy that defines the action states and transitions. There are generally three reasons why an action will fail: an unmet action execution prerequisite, an invalid policy configuration, or a transient failure such as a timeout or circuit breaker exception. For the first two kinds of failure, a user can redefine their policy or make a cluster change and retry the action to see if it passes. However, this process is not clearly defined and can take a long time to complete. Additionally, it is often clear beforehand, sometimes as early as policy creation, that the defined policy action will fail.
Problem Statement
ISM actions can fail with no good way for users to understand why the failures happen or how to handle them. The causes of these errors need to be investigated and analyzed as a whole in order to understand how to prevent them in the future.
To manage and prevent catchable errors, an error validation and prevention structure must be created that lets users check in advance whether the next action is in danger of failing, with an explanation of the cause of failure. With this structure, users will be able to manage errors more easily and fix action failures quickly, reducing the operational burden of ISM.
User Story
Tenets
Challenges
In Scope
Out of Scope
Error Analysis
A variety of failures may occur when executing an action, but a few common errors emerge when these action failures are analyzed.
In general, action failures can be categorized into four groups:
Validation Framework
High Level Design
To mitigate action failures, a validation service will be created. This framework will be implemented in two steps: (1) it validates the potential action failure before the action is performed and determines the best course of action depending on the error cause, and (2) it informs the user clearly of the potential failure, along with suggested solutions or documentation on how to fix the error, through the Explain API. The framework will be enabled by default but can be turned off by the user in the cluster settings. In the Explain API, however, it is not enabled by default; it can be invoked by setting the query parameter "validate_action=true".
Validating action failure
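The enablement rules described in the High Level Design can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: `Caller`, `parseQuery`, and `shouldValidate` are hypothetical names, and the sketch assumes that an Explain API call validates only when the cluster setting is on and `validate_action=true` is passed. Only the `validate_action` parameter itself comes from the proposal.

```kotlin
// Hypothetical sketch: every name here except the "validate_action"
// query parameter is an assumption made for illustration.

/** Where a validation request originates. */
enum class Caller { MANAGED_INDEX_RUNNER, EXPLAIN_API }

/** Parses a raw query string such as "validate_action=true&pretty" into key/value pairs. */
fun parseQuery(query: String): Map<String, String> =
    query.split("&")
        .filter { it.isNotBlank() }
        .associate { part ->
            // A bare flag like "pretty" maps to an empty value.
            part.substringBefore("=") to part.substringAfter("=", missingDelimiterValue = "")
        }

/**
 * Decides whether validation should run.
 * - The cluster setting defaults to enabled and switches the framework off entirely.
 * - The runner validates whenever the setting is enabled.
 * - The Explain API (assumed here) additionally requires validate_action=true.
 */
fun shouldValidate(
    caller: Caller,
    clusterSettingEnabled: Boolean = true,
    query: Map<String, String> = emptyMap()
): Boolean {
    if (!clusterSettingEnabled) return false
    return when (caller) {
        Caller.MANAGED_INDEX_RUNNER -> true
        Caller.EXPLAIN_API -> query["validate_action"] == "true"
    }
}
```

For example, `shouldValidate(Caller.EXPLAIN_API, query = parseQuery("validate_action=true"))` returns `true`, while the same call without the query parameter returns `false`.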
Notifying action failure
Pros:
Cons:
Workflow:
Architectural Design
Design Alternatives
Validation service not enabled by default
If there is an error, fail all actions forever and allow users to manually re-validate instead of automatically re-validating and retrying the action.
Implementing self-healing for certain actions by fixing or reconfiguring the cluster.
Implementation Details
ValidationService.kt
Validate.kt
Validate.kt abstract class and will implement the necessary functions.
Demo
A demo of the validation framework using the missing rollover alias example. In the demo, the framework is called through the Explain API, and the results are shown with the service both enabled and disabled. It is also called through the Managed Index Runner, which then notifies the user through Amazon Chime.
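The missing rollover alias check used in the demo might look something like the sketch below. It assumes an abstract `Validate` base class that each action-specific validation extends, as the Implementation Details suggest; the member names, `IndexInfo`, `ValidationResult`, `ValidateRollover`, and the message wording are all hypothetical and do not reflect the plugin's actual signatures.

```kotlin
// Hypothetical sketch of an action validation; types and messages are
// assumptions for illustration, not the plugin's real API.

enum class ValidationStatus { PASSED, FAILED }

/** Outcome of a pre-check, with a message suitable for showing to the user. */
data class ValidationResult(val status: ValidationStatus, val message: String)

/** Simplified view of the index metadata this check needs. */
data class IndexInfo(val name: String, val rolloverAlias: String?, val aliases: Set<String>)

/** Base class that each action-specific validation extends. */
abstract class Validate {
    abstract fun execute(index: IndexInfo): ValidationResult
}

/** Pre-checks the rollover action: the rollover alias must be set and must exist on the index. */
class ValidateRollover : Validate() {
    override fun execute(index: IndexInfo): ValidationResult {
        val alias = index.rolloverAlias
        if (alias == null || alias.isBlank()) {
            return ValidationResult(
                ValidationStatus.FAILED,
                "Missing rollover alias setting on index [${index.name}]"
            )
        }
        if (alias !in index.aliases) {
            return ValidationResult(
                ValidationStatus.FAILED,
                "Rollover alias [$alias] does not point to index [${index.name}]"
            )
        }
        return ValidationResult(ValidationStatus.PASSED, "Rollover validation passed for [${index.name}]")
    }
}
```

Calling `ValidateRollover().execute(IndexInfo("logs-000001", null, emptySet()))` yields a `FAILED` result whose message could be surfaced through the Explain API or a notification channel.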
Validation.Framework.Demo.mp4
Testing
Limitations
Appendix
Terminology
Policy - a user defined set of rules that describe how to run certain OpenSearch operations on an index and manage them through the use of states and transitions.
Action - steps that the policy sequentially executes upon entering a specific state.
Step - individual jobs broken down from the action that execute transition conditions or an action itself.
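As a rough illustration of how these terms relate, the sketch below models a policy as a list of states, each holding ordered actions (made up of steps) and transitions. The class and field names are assumptions chosen for readability and are not the plugin's actual model classes.

```kotlin
// Hypothetical data model mirroring the terminology above.

/** An individual job that an action is broken down into. */
data class Step(val name: String)

/** An operation the policy runs upon entering a state, made up of steps. */
data class Action(val type: String, val steps: List<Step>)

/** A move to another state, optionally guarded by a condition. */
data class Transition(val stateName: String, val condition: String? = null)

/** A named stage of the policy with its ordered actions and outgoing transitions. */
data class State(
    val name: String,
    val actions: List<Action>,
    val transitions: List<Transition> = emptyList()
)

/** A user-defined set of rules expressed as states and transitions. */
data class Policy(val id: String, val defaultState: String, val states: List<State>)

/** Returns the action types of the given state in execution order, or an empty list if the state is unknown. */
fun Policy.actionsIn(stateName: String): List<String> =
    states.firstOrNull { it.name == stateName }?.actions?.map { it.type } ?: emptyList()
```

A validation framework along the lines proposed here would look up the next action of the current state in such a structure and pre-check it before the action runs.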
Related issue(s): #27