[FEATURE] Error Prevention in ISM #432

jowg-amazon · 2022-07-26T22:02:29Z

Overview

An ISM validation service that pre-checks actions and notifies users of likely failures before the actions are executed.

Motivation

Index State Management (ISM) is an OpenSearch plugin that automates recurring index lifecycle operations and manages metadata on specific indices through user-defined policies. Every ISM operation (documented here) is driven by a policy that defines the states, actions, and transitions. An action generally fails for one of three reasons: an unmet execution prerequisite, an invalid policy configuration, or a transient failure such as a timeout or circuit breaker exception. For the first two, a user can redefine the policy or make a cluster change and then retry the action to see whether it passes. However, this process is not clearly defined and can take a long time to complete. In addition, it is often evident beforehand, sometimes as early as policy creation, that a defined policy action will fail.

Problem Statement

ISM actions can fail with no good way for users to understand why the failures happen or how to handle them. The causes of these errors need to be investigated and analyzed as a whole to understand how to prevent them in the future.

To manage and prevent catchable errors, an error validation and prevention structure must be created that lets users check ahead of time whether the next action is in danger of failing, along with an explanation of the likely cause. With this structure, users will be able to manage errors more easily and fix action failures quickly, reducing the operational burden of ISM.

User Story

As a user I want to mitigate as many errors as possible with the least amount of manual work.
As a user I want to be able to apply a policy to an index and immediately check if the next action to be executed is in danger of failing without having to wait for the condition(s) to be met.
As a user I want to have the ability to check if my action is projected to fail at any point.
As a user I want sufficient information about the errors and suggested solutions I receive so that I am able to troubleshoot them quickly and easily.

Tenets

User experience is of the utmost importance and must be held to high standards.
Design choices must be driven by data and research.
Performance of ISM is a major consideration and must not be affected.

Challenges

Action failures encompass around 100 unique causes, and not all errors are preventable, so the existing errors need to be differentiated and categorized.
Validation code must not impact the overall performance of the ISM lifecycle.
Notifications must be user-friendly and contain enough information for the user to understand and fix the error.

In Scope

Implement a validation structure that will check if the next action is in danger of failing and provide notification to users if an action error is projected to occur.
Refactor existing validation code to adhere to the new validation structure.
Implement validation checks and unit tests for all new validation functions.

Out of Scope

The validation structure will not try to fix the error; this must be done manually by the user.
Users will not be able to manually call and check for action failures.
Users will not be able to indicate specific indices they would like the validation structure to be performed on.

Error Analysis

There are a variety of failures that may occur when executing an action, but when analyzing these action failures a few common errors begin to arise.

In general, action failures fall into four groups:

1. Preventable errors. These can be validated ahead of time and fixed by the user through a configuration change.
2. Preventable errors caught by an API call exception. These cannot be validated ahead of time and can only be detected by making the API call.
3. Preventable errors that may cause the cluster state to worsen. These may worsen the cluster state if ISM continues to run.
4. Error messages that still need investigation.

Validation Framework

High Level Design

To mitigate action failures, a validation service will be created. The framework works in two steps: (1) it validates the action before the action is performed, detecting a potential failure and determining the best course of action based on the cause, and (2) it informs the user of the potential failure through the Explain API, in a clear way, along with suggested solutions or documentation on how to fix the error. The framework will be enabled by default but can be turned off in the cluster settings. In the Explain API, however, validation is not enabled by default; it is requested by setting the query parameter "validate_action=true".
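As a rough illustration of the Explain API flag, the sketch below builds the request URI with and without validation. The `_plugins/_ism/explain/<index>` path is the ISM Explain API path and `validate_action=true` comes from this proposal; the helper name `explainUri`, the base URL, and the index name are hypothetical.

```kotlin
import java.net.URI
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Builds the Explain API URI for one index (hypothetical helper, not part of the plugin).
// Validation is only requested when validateAction is true, mirroring the
// "not enabled by default in the Explain API" behaviour described above.
fun explainUri(baseUrl: String, index: String, validateAction: Boolean = false): URI {
    val encodedIndex = URLEncoder.encode(index, StandardCharsets.UTF_8)
    val query = if (validateAction) "?validate_action=true" else ""
    return URI.create("${baseUrl.trimEnd('/')}/_plugins/_ism/explain/$encodedIndex$query")
}

fun main() {
    // Plain explain call, no validation requested.
    println(explainUri("http://localhost:9200", "log-000001"))
    // Explain call that also asks the validation framework to check the next action.
    println(explainUri("http://localhost:9200", "log-000001", validateAction = true))
}
```

Keeping the flag off by default means existing Explain API callers see no change in behaviour or latency unless they opt in.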

Validating action failure
- Validation logic will be implemented in a separate validation class for each action and implementation will be tailored to the cause of action failure. This will be called from the managed index runner and checked before an action is supposed to be executed.
- Depending on the error cause, some APIs need to be called to catch errors that cannot be validated ahead of time.
- If the validation logic finds that the action would fail, the action returns with an error indication and, depending on the cause, validation is either retried at the next job scheduler interval or the action is failed permanently.
Notifying action failure
- Notification happens after the validation logic from step 1 and is exposed through a flag in the Explain API. This lets users validate upcoming actions while ISM is running and find potential action errors.
- The notification will describe how to prevent the action failure: either simple steps to fix or reconfigure the problem area, or a link to documentation that points users to a solution.
  - By providing a link to the documentation, users will be able to find the actual solution.
  - By decoupling the solution from the codebase, the document itself may be updated instead of the codebase.
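The retry-or-fail behaviour described in the validation steps can be sketched as a small decision function. This is a minimal, self-contained sketch: the three statuses mirror the ones in this proposal, while `RunnerDecision`, `decide`, and the simplified `ValidationResult` are hypothetical names used only for illustration, not the managed index runner's actual code.

```kotlin
// Outcome of a pre-execution check, mirroring the three statuses in this proposal.
enum class ValidationStatus { PASSED, RE_VALIDATE, FAILED }

// Simplified stand-in for the result returned by the validation service.
data class ValidationResult(val message: String, val status: ValidationStatus)

// What the runner does with the action on this scheduler run (hypothetical names).
enum class RunnerDecision { EXECUTE_ACTION, RETRY_NEXT_INTERVAL, FAIL_ACTION }

// Maps a validation result to the runner's next move:
// PASSED      -> run the action now
// RE_VALIDATE -> skip the action and validate again at the next job scheduler interval
// FAILED      -> fail the action until the user intervenes
fun decide(result: ValidationResult): RunnerDecision = when (result.status) {
    ValidationStatus.PASSED -> RunnerDecision.EXECUTE_ACTION
    ValidationStatus.RE_VALIDATE -> RunnerDecision.RETRY_NEXT_INTERVAL
    ValidationStatus.FAILED -> RunnerDecision.FAIL_ACTION
}

fun main() {
    val result = ValidationResult("Missing rollover alias on index log-000001", ValidationStatus.RE_VALIDATE)
    println("${decide(result)}: ${result.message}")
}
```

Using an exhaustive `when` over the enum means the compiler flags any status added later that the runner does not yet handle.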

Pros:

Validation is only executed right before an action is set to take place, which prevents unnecessary checks.
If validation fails, the action is automatically retried or failed, so less manual work is needed.
Because validation is enabled by default, users don’t need to proactively check and query for errors when running the cluster.

Cons:

Execution is limited to the job scheduler interval (currently at 5 minutes) so once a user implements a fix, the update will not take place until the next interval.
There may be additional overhead and a slowdown in performance, depending on the validation implementation for each action failure.

Workflow:

Architectural Design

Design Alternatives

Validation service not enabled by default
1. Pros:
  1. ISM performance would not be affected until the user decides to enable the validation framework.
  2. Users may not always need this validation framework when running ISM.
2. Cons:
  1. Users need to manually enable the validation which can lead to underutilization of the validation service.
If there is an error, fail all actions forever and allow users to manually re-validate instead of automatically re-validating and retrying the action.
1. Pros:
  1. No unnecessary retries if the action failure is not yet fixed by the user.
2. Cons:
  1. Some actions don’t need to be failed and re-validating and retrying the action wouldn’t harm the cluster state.
  2. Less automation and more manual work for the users to perform.
Implementing self-healing for certain actions by fixing or reconfiguring the cluster.
1. Pros:
  1. If users don’t have to deal with cluster-level issues, it would provide a better user experience and more automation.
  2. Users may not know how to fix the cluster-level problems themselves.
2. Cons:
  1. It is risky for ISM to reconfigure the cluster directly, and the responsibility for fixing the error should lie with the user, not ISM.

Implementation Details

ValidationService.kt

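import org.opensearch.cluster.service.ClusterService
import org.opensearch.common.settings.Settings
import org.opensearch.monitor.jvm.JvmService

// ValidationResult and the Validate* classes are defined elsewhere in the plugin.
class ValidationService(
    val settings: Settings,
    val clusterService: ClusterService,
    val jvmService: JvmService // assumed type; forwarded to every validator
) {

    // Runs the validator mapped to actionName against indexName.
    fun validate(actionName: String, indexName: String): ValidationResult {
        // Map the action name to its validation class.
        val validation = when (actionName) {
            "rollover" -> ValidateRollover(settings, clusterService, jvmService).execute(indexName)
            "delete" -> ValidateDelete(settings, clusterService, jvmService).execute(indexName)
            "force_merge" -> ValidateForceMerge(settings, clusterService, jvmService).execute(indexName)
            // Temporary fallback until all actions are mapped.
            else -> ValidateNothing(settings, clusterService, jvmService).execute(indexName)
        }
        return ValidationResult(validation.validationMessage.toString(), validation.validationStatus)
    }
}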

Validate.kt

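// Package paths below are those of OpenSearch 2.x at the time of this issue.
import org.opensearch.cluster.service.ClusterService
import org.opensearch.common.io.stream.StreamOutput
import org.opensearch.common.io.stream.Writeable
import org.opensearch.common.settings.Settings
import org.opensearch.monitor.jvm.JvmService

abstract class Validate(
    val settings: Settings,
    val clusterService: ClusterService,
    val jvmService: JvmService // assumed type, matching the three-argument calls in ValidationService
) {
    var validationStatus = ValidationStatus.PASSED
    // Explanation of the outcome; ValidationService reads it when building the result.
    var validationMessage: String? = null

    // Validates the action for the given index and returns this instance with the status set.
    abstract fun execute(indexName: String): Validate

    enum class ValidationStatus(val status: String) : Writeable {
        PASSED("passed"),
        RE_VALIDATE("re_validate"),
        FAILED("failed");

        override fun writeTo(out: StreamOutput) {
            out.writeString(status)
        }
    }
}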

Every action to be validated will have a validation class that extends the Validate abstract class and implements its abstract functions.
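To make the pattern concrete, here is a minimal sketch of what one validation class could look like, using the missing rollover alias case from the demo. It is deliberately self-contained: `ValidateSketch` is a simplified stand-in for the Validate base class without the OpenSearch dependencies, and the `rolloverAliases` map is a hypothetical substitute for reading the alias from index settings. Returning FAILED for a missing alias is also an assumption; the real class may choose a different status.

```kotlin
enum class ValidationStatus { PASSED, RE_VALIDATE, FAILED }

// Simplified stand-in for the Validate base class (no Settings, ClusterService, or JvmService).
abstract class ValidateSketch {
    var validationStatus = ValidationStatus.PASSED
    var validationMessage: String? = null

    abstract fun execute(indexName: String): ValidateSketch
}

// Hypothetical rollover check. rolloverAliases maps an index name to its configured
// rollover alias; a missing or blank entry means the rollover action would fail.
class ValidateRolloverSketch(private val rolloverAliases: Map<String, String?>) : ValidateSketch() {
    override fun execute(indexName: String): ValidateSketch {
        val alias = rolloverAliases[indexName]
        if (alias.isNullOrBlank()) {
            // Assumption: a missing alias needs a user fix, so the check fails outright.
            validationStatus = ValidationStatus.FAILED
            validationMessage = "Missing rollover alias for index [$indexName]"
        } else {
            validationStatus = ValidationStatus.PASSED
            validationMessage = "Rollover alias [$alias] found for index [$indexName]"
        }
        return this
    }
}

fun main() {
    val validator = ValidateRolloverSketch(mapOf("log-000001" to "log", "metrics-000001" to null))
    for (index in listOf("log-000001", "metrics-000001")) {
        val outcome = validator.execute(index)
        println("$index -> ${outcome.validationStatus}: ${outcome.validationMessage}")
    }
}
```

Because `execute` returns the validator itself, the caller can read both the status and the message from a single object, which is how ValidationService builds its ValidationResult.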

Demo

Validation framework demo using a missing rollover alias as the example. In the demo, the framework is called through the Explain API, and the results are shown with the service both enabled and disabled. It is also called through the Managed Index Runner, which then notifies the user through Amazon Chime.

Validation.Framework.Demo.mp4

Testing

Integration testing for each action
Unit testing on validation logic when appropriate

Limitations

Not all action errors are preventable or can be caught before runtime.

Appendix

Terminology

Policy - a user-defined set of rules that describes how to run certain OpenSearch operations on an index and manages them through the use of states and transitions.

Action - steps that the policy sequentially executes upon entering a specific state.

Step - individual jobs broken down from the action that execute transition conditions or an action itself.

Related issue(s): #27

jowg-amazon added enhancement New request untriaged labels Jul 26, 2022

bowenlan-amzn added feature and removed enhancement New request untriaged labels Jul 26, 2022

bowenlan-amzn mentioned this issue Aug 3, 2022

Action Validation framework and Explain API integration #441

Merged

1 task

Angie-Zhang mentioned this issue Oct 27, 2022

Error prevention stage 1 #579

Merged

2 tasks

praveensameneni closed this as completed Nov 4, 2022

bowenlan-amzn mentioned this issue Nov 5, 2022

[META FEATURE] Error Prevention Enhancement #587

Open

12 tasks
