Persist migration errors and automatically retry on failure #26144

Closed
spalger opened this issue Nov 25, 2018 · 2 comments
Labels
Team:Operations

Comments

@spalger
Contributor

spalger commented Nov 25, 2018

We need to start persisting the progress of the migration as it is running so that other Kibana instances have a way to know that a migration is not actually in progress but has failed. There are a number of ways we can do this, but I wanted to write down what I think we should do:

The current migration process uses the ability to create the index as a sort of "lock" guaranteeing that it is the only Kibana instance currently migrating the index. I don't think this is sufficient if we want to track progress, do retries, and avoid relying on a single shared clock. I think we now need to store the migration progress in a new temporary index:

PUT /_template/.kibana_migration_progress
{
  "index_patterns": [".kibana_migration_*"],
  "settings": {
    "number_of_shards": 1,
    "auto_expand_replicas": "0-1"
  }, 
  "mappings": {
    "_doc": {
      "properties": {
        "node_id": {
          "type": "keyword"
        },
        "error": {
          "type": "object",
          "enabled": false
        },
        "attempt": {
          "type": "integer"
        }
      }
    }
  }
}

PUT /.kibana_migration_{hash(kibana.index)}/_doc/progress
{
  "node_id": "xyz",
  "attempt": 1
}
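
The liveness checks described below key off the document's _version, which Elasticsearch returns with every read. A fetch might look like this (abc123 standing in for hash(kibana.index), all values hypothetical):

GET /.kibana_migration_abc123/_doc/progress?preference=_primary

{
  "_index": ".kibana_migration_abc123",
  "_type": "_doc",
  "_id": "progress",
  "_version": 3,
  "found": true,
  "_source": {
    "node_id": "xyz",
    "attempt": 1
  }
}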

Rather than using the creation of the migration target to identify which Kibana instance will run the migration, we could use the .kibana_migration_{hash(kibana.index)}/_doc/progress document as our "lock" (a sketch of the underlying requests follows this list). Each instance will:

  1. Check if a migration is necessary at startup
  2. If no migration is necessary, continue with startup
  3. Else, put the progress index template
  4. Attempt to create the progress document (using op_type=create) with its own node_id and an attempt count of 1
  5. If the create was successful, this instance will run the migration
  6. If the document already exists:
    1. Read the progress document (using preference=_primary)
    2. If the document includes an error:
      1. log the error
      2. if the attempt count is 10 (configurable?)
        1. log an error instructing the user to delete the .kibana_migration_{hash(kibana.index)} index to resume
        2. abort startup
      3. else, attempt to update the document (using the version param) with our node_id and an incremented attempt count
      4. the instance that successfully updates the document runs the migration
    3. If the document DOES NOT include an error:
      1. wait 30 seconds and fetch the document again (using preference=_primary)
      2. if the version is greater than before, go to step 6.3.1
      3. if the version is not greater than before:
        1. log an error describing the node_id that failed to show signs of life
        2. go to step 6.2.2
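
For illustration, steps 4 and 6.2.3 map onto requests roughly like the following (the abc123 suffix stands in for hash(kibana.index), and the node ids and version number are made up):

# step 4: try to take the "lock"; op_type=create fails with a 409 if the document already exists
PUT /.kibana_migration_abc123/_doc/progress?op_type=create
{
  "node_id": "node-a",
  "attempt": 1
}

# step 6.2.3: take over after a failure; the version param causes a conflict error
# if another instance updated the document first
PUT /.kibana_migration_abc123/_doc/progress?version=7
{
  "node_id": "node-b",
  "attempt": 2
}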

The Kibana instance running the migration will (a sketch of the corresponding requests follows this list):

  1. Re-index the progress document every 15 seconds, causing its version to increment (this is the "sign of life" the other instances watch for)
  2. Re-check if a migration is necessary
    1. If a migration is no longer necessary, another instance probably completed it between our initial check and acquiring the progress document, so skip straight to step 5
    2. If a migration is necessary, delete old versions of the target index
    3. Recreate the target index
    4. Migrate the documents
  3. Before each write:
    1. fetch the progress document and verify the node_id
    2. if the node_id does not match, abort startup with an error
  4. If an error occurs during the migration:
    1. fetch the progress document and verify the node_id
    2. update the progress document (using the version param) to include the error and unset the node_id
    3. abort startup
  5. When the migration is complete and the target index has been refreshed, delete the progress index
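
And a sketch of the runner's side, with the same made-up index suffix and an illustrative error payload (the mapping above stores error unindexed, so its shape is up to us):

# step 1: re-indexing the document, even unchanged, increments _version (the heartbeat)
PUT /.kibana_migration_abc123/_doc/progress
{
  "node_id": "node-a",
  "attempt": 1
}

# step 4.2: record the failure and unset node_id, guarded by the version param
PUT /.kibana_migration_abc123/_doc/progress?version=12
{
  "attempt": 1,
  "error": {
    "message": "example failure message",
    "stack": "example stack trace"
  }
}

# step 5: migration complete, drop the temporary index
DELETE /.kibana_migration_abc123
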
@spalger added the Team:Operations label Nov 25, 2018
@elasticmachine
Contributor

Pinging @elastic/kibana-operations

@rudolf
Contributor

rudolf commented Oct 1, 2020

Closing in favour of #66056

@rudolf closed this as completed Oct 1, 2020