[FEATURE] Error Prevention in ISM #432

Closed
jowg-amazon opened this issue Jul 26, 2022 · 0 comments

Overview

An ISM validation service that pre-checks actions and notifies users of likely failures before the actions are executed.

Motivation

Index State Management (ISM) is an OpenSearch plugin that automates recurring index lifecycle operations and manages metadata on specific indices through user-defined policies. Every ISM operation (documented here) is managed by a policy that defines the states, actions, and transitions. An action generally fails for one of three reasons: an unmet execution prerequisite, an invalid policy configuration, or a transient failure such as a timeout or circuit breaker exception. For the first two, a user can redefine the policy or make a cluster change and retry the action to see whether it passes. However, this process is not clearly defined and can take a long time to complete. In many cases it is also clear beforehand, sometimes as early as policy creation, that the defined action will fail.

Problem Statement

ISM actions can fail, and there is currently no good way to understand why a failure happened or how to handle it. The causes of these errors need to be investigated and analyzed as a whole to understand how to prevent them in the future.

To manage and prevent catchable errors, ISM needs an error validation and prevention structure that lets users check in advance whether the next action is in danger of failing, with an explanation of the likely cause. With this structure, users will be able to manage errors more easily and fix action failures sooner, reducing the operational burden of ISM.

User Story

  • As a user I want to mitigate as many errors as possible with the least amount of manual work.
  • As a user I want to be able to apply a policy to an index and immediately check if the next action to be executed is in danger of failing without having to wait for the condition(s) to be met.
  • As a user I want to have the ability to check if my action is projected to fail at any point.
  • As a user I want sufficient information about the errors I receive, along with suggested solutions, so that I am able to troubleshoot them quickly and easily.

Tenets

  • User experience is of the utmost importance and must be held to high standards.
  • Design choices must be driven by data and research.
  • Performance of ISM is a major consideration and must not be affected.

Challenges

  • Action failures encompass around 100 unique causes, and not all errors may be preventable, so the existing errors need to be differentiated and categorized.
  • Validation code must not impact the overall performance of the ISM lifecycle.
  • Notifications must be user friendly and contain enough information for the user to understand and fix the error.

In Scope

  • Implement a validation structure that will check if the next action is in danger of failing and provide notification to users if an action error is projected to occur.
  • Refactor existing validation code to adhere to the new validation structure.
  • Implement validation checks and unit tests for all new validation functions.

Out of Scope

  • The validation structure will not try to fix the error; the user will need to do this manually.
  • Users will not be able to manually call and check for action failures.
  • Users will not be able to indicate specific indices they would like the validation structure to be performed on.

Error Analysis

Many kinds of failure can occur when executing an action, but analysis of these failures shows that a few common errors recur.

In general, action failures can be categorized into four groups:

  1. Preventable errors
    1. This type of error may be validated ahead of time and fixed by the user after changing some type of configuration.
  2. Preventable errors caught by an API call exception
    1. This type of error may only be checked by an API call instead of validated ahead of time.
  3. Preventable errors that may cause the cluster state to worsen
    1. This type of error may worsen the cluster state if ISM continues to run.
  4. Error messages that still need investigation
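
As a rough illustration, the four groups above could be modelled as an enum so that each known error cause can be tagged with its group. This is only a sketch: the names `ErrorCategory` and `knownPreventable` are hypothetical and do not come from the ISM codebase.

```kotlin
// Hypothetical names: the issue describes these four groups but does not name them in code.
enum class ErrorCategory(val description: String, val knownPreventable: Boolean) {
    PREVENTABLE(
        "Can be validated ahead of time and fixed by the user through a configuration change",
        true
    ),
    PREVENTABLE_VIA_API_EXCEPTION(
        "Can only be checked by making the API call and catching its exception",
        true
    ),
    PREVENTABLE_MAY_WORSEN_CLUSTER(
        "May worsen the cluster state if ISM continues to run",
        true
    ),
    NEEDS_INVESTIGATION(
        "Error message has not been analyzed yet, so preventability is unknown",
        false
    )
}

// Convenience helper: the groups whose errors are already known to be preventable.
fun knownPreventableCategories(): List<ErrorCategory> =
    ErrorCategory.values().filter { it.knownPreventable }
```

Tagging each of the roughly 100 causes with one of these values would make it easy to see which ones the validation framework can cover first.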

Validation Framework

High Level Design

To mitigate action failures, a validation service will be created. The framework works in two steps: (1) it validates a potential action failure before the action is performed and determines the best course of action based on the cause of the error, and (2) it informs the user of the potential failure in a clear way through the Explain API, along with suggested solutions or documentation on how to fix the error. Validation in the managed index runner will be enabled by default, but the user can turn it off in the cluster settings. In the Explain API, validation is not enabled by default; it is requested by setting the query parameter "validate_action=true".

  1. Validating action failure

    • Validation logic will be implemented in a separate validation class for each action and implementation will be tailored to the cause of action failure. This will be called from the managed index runner and checked before an action is supposed to be executed.
    • Depending on the error cause, some APIs need to be called to catch errors rather than validating them ahead of time.
    • Once the validation logic finishes and the checks confirm that the action would fail, the action in question returns with an error indication, and either validation is retried at the next job scheduler interval or the action fails permanently.
  2. Notifying action failure

    • Notification will be called after the validation logic from part 1 is performed through a flag in the Explain API. This will allow users to validate future actions while ISM is running and find potential action errors.
    • The notification will provide a remedy for preventing these action failures: either simple steps to fix or reconfigure the problem area, or a link to documentation pointing users to a solution.
      • By providing a link to the documentation, users will be able to find the actual solution.
      • By decoupling the solution from the codebase, the document itself may be updated instead of the codebase.
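
The validate-then-act flow described in the two steps above can be sketched roughly as follows. This is a simplified illustration, not the plugin's runner code: `RunnerDecision`, `decide`, and the `validationEnabled` flag are hypothetical stand-ins, and the status values mirror the `ValidationStatus` enum proposed in this issue.

```kotlin
// Simplified stand-ins for illustration; the real plugin types differ.
enum class ValidationStatus { PASSED, RE_VALIDATE, FAILED }

data class ValidationResult(val message: String, val status: ValidationStatus)

// What a (hypothetical) runner would do with the action on this scheduler run.
enum class RunnerDecision { EXECUTE_ACTION, RETRY_NEXT_INTERVAL, FAIL_ACTION }

// Returns the decision plus the message to surface to the user, if any.
fun decide(
    validationEnabled: Boolean,
    validate: () -> ValidationResult
): Pair<RunnerDecision, String?> {
    // With validation turned off in the cluster settings, the action runs unchecked.
    if (!validationEnabled) return RunnerDecision.EXECUTE_ACTION to null

    val result = validate()
    return when (result.status) {
        // No projected failure: run the action as usual.
        ValidationStatus.PASSED -> RunnerDecision.EXECUTE_ACTION to null
        // Projected failure the user may still fix: check again at the next interval.
        ValidationStatus.RE_VALIDATE -> RunnerDecision.RETRY_NEXT_INTERVAL to result.message
        // Projected failure that should not be retried: fail the action.
        ValidationStatus.FAILED -> RunnerDecision.FAIL_ACTION to result.message
    }
}
```

Passing the validation as a lambda keeps the check lazy, so no validation work is done when the framework is disabled.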

Pros:

  • Validation is only executed right before an action is set to take place, preventing unnecessary checks.
  • If validation fails, the action is automatically retried or failed, so less manual work is needed.
  • Because validation is enabled by default, users don’t need to proactively check and query for errors when running the cluster.

Cons:

  • Execution is limited to the job scheduler interval (currently 5 minutes), so once a user implements a fix, it will not take effect until the next interval.
  • There may be additional overhead and a slowdown in performance, depending on the validation implementation for each action failure.

Workflow:

[Figure: Error Prevention Workflow]

Architectural Design

[Figure: Error Validation Framework]

Design Alternatives

  1. Validation service not enabled by default

    1. Pros:
      1. ISM performance would not be affected until the user decides to enable the validation framework.
      2. Users may not always need this validation framework when running ISM.
    2. Cons:
      1. Users need to manually enable the validation which can lead to underutilization of the validation service.
  2. If there is an error, fail all actions forever and allow users to manually re-validate instead of automatically re-validating and retrying the action.

    1. Pros:
      1. No unnecessary retries if the action failure is not yet fixed by the user.
    2. Cons:
      1. Some actions don’t need to be failed and re-validating and retrying the action wouldn’t harm the cluster state.
      2. Less automation and more manual work for the users to perform.
  3. Implementing self-healing for certain actions by fixing or reconfiguring the cluster.

    1. Pros:
      1. If users don’t have to deal with cluster-level issues, they get a better user experience and more automation.
      2. Users may not know how to fix cluster-level problems themselves.
    2. Cons:
      1. It is risky for ISM to configure the cluster directly, and the responsibility for fixing the error should be on the user, not ISM.

Implementation Details

ValidationService.kt

import org.opensearch.cluster.service.ClusterService
import org.opensearch.common.settings.Settings
import org.opensearch.monitor.jvm.JvmService

class ValidationService(
    val settings: Settings,
    val clusterService: ClusterService,
    val jvmService: JvmService // passed to every validation class below
) {

    fun validate(actionName: String, indexName: String): ValidationResult {
        // Map the action name to its validation class and run it.
        val validation = when (actionName) {
            "rollover" -> ValidateRollover(settings, clusterService, jvmService).execute(indexName)
            "delete" -> ValidateDelete(settings, clusterService, jvmService).execute(indexName)
            "force_merge" -> ValidateForceMerge(settings, clusterService, jvmService).execute(indexName)
            else -> {
                // Temporary no-op validation until all actions are mapped.
                ValidateNothing(settings, clusterService, jvmService).execute(indexName)
            }
        }
        return ValidationResult(validation.validationMessage.toString(), validation.validationStatus)
    }
}

Validate.kt

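import org.opensearch.cluster.service.ClusterService
import org.opensearch.common.io.stream.StreamOutput
import org.opensearch.common.io.stream.Writeable
import org.opensearch.common.settings.Settings
import org.opensearch.monitor.jvm.JvmService

abstract class Validate(
    val settings: Settings,
    val clusterService: ClusterService,
    val jvmService: JvmService
) {

    // Read by ValidationService after execute() returns.
    var validationStatus = ValidationStatus.PASSED
    var validationMessage: String? = null

    // Takes the index name, matching the call sites in ValidationService.
    abstract fun execute(indexName: String): Validate

    enum class ValidationStatus(val status: String) : Writeable {
        PASSED("passed"),
        RE_VALIDATE("re_validate"),
        FAILED("failed");

        override fun writeTo(out: StreamOutput) {
            out.writeString(status)
        }
    }
}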
  • Every action to be validated will adhere to the Validate.kt abstract class and will implement the necessary functions.
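
To make the contract concrete, here is a minimal sketch of what one validation class might look like, using the missing rollover alias case from the demo. It is deliberately simplified and self-contained: the `Validate` base class below drops the OpenSearch dependencies, `IndexSettingsLookup` is a hypothetical stand-in for reading index settings from the cluster state, and mapping a missing alias to `FAILED` (rather than `RE_VALIDATE`) is an assumption made for the example.

```kotlin
enum class ValidationStatus { PASSED, RE_VALIDATE, FAILED }

// Simplified stand-in for the abstract class: no Settings/ClusterService dependencies.
abstract class Validate {
    var validationStatus = ValidationStatus.PASSED
    var validationMessage: String? = null
    abstract fun execute(indexName: String): Validate
}

// Hypothetical lookup: index name -> that index's settings as a flat map.
typealias IndexSettingsLookup = (String) -> Map<String, String>

// ISM's rollover alias index setting, as written in index settings and templates.
const val ROLLOVER_ALIAS_SETTING = "plugins.index_state_management.rollover_alias"

class ValidateRollover(private val settingsOf: IndexSettingsLookup) : Validate() {
    override fun execute(indexName: String): Validate {
        val alias = settingsOf(indexName)[ROLLOVER_ALIAS_SETTING]
        if (alias.isNullOrBlank()) {
            // Assumption for this sketch: a missing alias fails the action outright.
            validationStatus = ValidationStatus.FAILED
            validationMessage =
                "Missing rollover alias on index [$indexName]; set $ROLLOVER_ALIAS_SETTING"
        } else {
            validationStatus = ValidationStatus.PASSED
            validationMessage = "Rollover validation passed for index [$indexName]"
        }
        return this
    }
}
```

For example, `ValidateRollover { emptyMap() }.execute("logs-000001")` ends with status `FAILED` and a message naming the index and the setting to fix, which is the kind of information the notification step needs.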

Demo

This demo shows the validation framework using a missing rollover alias as the example. The framework is called through the Explain API, and the results are shown with the service both enabled and disabled. It is also called through the Managed Index Runner, which then notifies the user through Amazon Chime.

Validation.Framework.Demo.mp4
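
For readers who want to try the Explain API call shown in the demo, the request could be built roughly like this. It is a sketch under stated assumptions: `explainRequest` is a hypothetical helper, the base URL is whatever your cluster uses, the index name is assumed to be URL-safe, and the path is the standard ISM explain endpoint with the "validate_action=true" query parameter described above. The request is only constructed here, not sent.

```kotlin
import java.net.URI
import java.net.http.HttpRequest

// Hypothetical helper: builds (but does not send) an ISM Explain API request.
// Assumes `index` contains only URL-safe characters.
fun explainRequest(baseUrl: String, index: String, validateAction: Boolean): HttpRequest {
    // Validation is opt-in for the Explain API, so the flag is only added when requested.
    val query = if (validateAction) "?validate_action=true" else ""
    val uri = URI.create("${baseUrl.trimEnd('/')}/_plugins/_ism/explain/$index$query")
    return HttpRequest.newBuilder(uri).GET().build()
}
```

The resulting request can be sent with `java.net.http.HttpClient` or translated into an equivalent call in any HTTP client.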

Testing

  • Integration testing for each action
  • Unit testing on validation logic when appropriate

Limitations

  • Not all action errors are preventable or can be caught before runtime.

Appendix

Terminology

Policy - a user defined set of rules that describe how to run certain OpenSearch operations on an index and manage them through the use of states and transitions.

Action - steps that the policy sequentially executes upon entering a specific state.

Step - individual jobs broken down from the action that execute transition conditions or an action itself.

Related issue(s): #27
