Persist migration errors and automatically retry on failure #26144

Closed
spalger opened this issue Nov 25, 2018 · 2 comments
Labels
Team:Operations

Comments

@spalger
Contributor

spalger commented Nov 25, 2018

We need to start persisting the progress of the migration as it is running so that other Kibana instances have a way to know that a migration is not actually in progress but has failed. There are a number of ways we can do this, but I wanted to write down what I think we should do:

The current migration process uses the ability to create the index as a sort of "lock" guaranteeing that it is the only Kibana instance currently migrating the index. I don't think this is sufficient if we want to track progress, do retries, and avoid relying on a single shared clock. I think we now need to store the migration progress in a new temporary index:

PUT /_template/.kibana_migration_progress
{
  "index_patterns": [".kibana_migration_*"],
  "settings": {
    "number_of_shards": 1,
    "auto_expand_replicas": "0-1"
  }, 
  "mappings": {
    "_doc": {
      "properties": {
        "node_id": {
          "type": "keyword"
        },
        "error": {
          "type": "object",
          "enabled": false
        },
        "attempt": {
          "type": "integer"
        }
      }
    }
  }
}

PUT /.kibana_migration_{hash(kibana.index)}/_doc/progress
{
  "node_id": "xyz",
  "attempt": 1
}
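
The liveness checks described below key off the document's _version, which Elasticsearch returns with every read. A fetch might look like this (abc123 standing in for hash(kibana.index), all values hypothetical):

GET /.kibana_migration_abc123/_doc/progress?preference=_primary

{
  "_index": ".kibana_migration_abc123",
  "_type": "_doc",
  "_id": "progress",
  "_version": 3,
  "found": true,
  "_source": {
    "node_id": "xyz",
    "attempt": 1
  }
}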

Rather than using the creation of the migration target to identify which Kibana instance will run the migration, we could use the .kibana_migration_{hash(kibana.index)}/_doc/progress document as our "lock" (a sketch of the underlying requests follows this list). Each instance will:

  1. Check if a migration is necessary at startup
  2. If no migration is necessary, continue with startup
  3. Else, put the progress index template
  4. Attempt to create the progress document (using op_type=create) with its own node_id and an attempt count of 1
  5. If the create was successful, this instance will run the migration
  6. If the document already exists:
    1. Read the progress document (using preference=_primary)
    2. If the document includes an error:
      1. log the error
      2. if the attempt count is 10 (configurable?)
        1. log an error instructing the user to delete the .kibana_migration_{hash(kibana.index)} index to resume
        2. abort startup
      3. else, attempt to update the document (using the version param) with our node_id and an incremented attempt count
      4. the instance that successfully updates the document runs the migration
    3. If the document DOES NOT include an error:
      1. wait 30 seconds and fetch the document again (using preference=_primary)
      2. if the version is greater than before, go to step 6.3.1
      3. if the version is not greater than before:
        1. log an error describing the node_id that failed to show signs of life
        2. go to step 6.2.2
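
For illustration, steps 4 and 6.2.3 map onto requests roughly like the following (the abc123 suffix stands in for hash(kibana.index), and the node ids and version number are made up):

# step 4: try to take the "lock"; op_type=create fails with a 409 if the document already exists
PUT /.kibana_migration_abc123/_doc/progress?op_type=create
{
  "node_id": "node-a",
  "attempt": 1
}

# step 6.2.3: take over after a failure; the version param causes a conflict error
# if another instance updated the document first
PUT /.kibana_migration_abc123/_doc/progress?version=7
{
  "node_id": "node-b",
  "attempt": 2
}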

The Kibana instance running the migration will (a sketch of the corresponding requests follows this list):

  1. Re-index the progress document every 15 seconds, causing its version to increment (this is the "sign of life" the other instances watch for)
  2. Re-check if a migration is necessary
    1. If a migration is no longer necessary, another instance probably completed it between our initial check and acquiring the progress document, so skip straight to step 5
    2. If a migration is necessary, delete old versions of the target index
    3. Recreate the target index
    4. Migrate the documents
  3. Before each write:
    1. fetch the progress document and verify the node_id
    2. if the node_id does not match, abort startup with an error
  4. If an error occurs during the migration:
    1. fetch the progress document and verify the node_id
    2. update the progress document (using the version param) to include the error and unset the node_id
    3. abort startup
  5. When the migration is complete and the target index has been refreshed, delete the progress index
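
And a sketch of the runner's side, with the same made-up index suffix and an illustrative error payload (the mapping above stores error unindexed, so its shape is up to us):

# step 1: re-indexing the document, even unchanged, increments _version (the heartbeat)
PUT /.kibana_migration_abc123/_doc/progress
{
  "node_id": "node-a",
  "attempt": 1
}

# step 4.2: record the failure and unset node_id, guarded by the version param
PUT /.kibana_migration_abc123/_doc/progress?version=12
{
  "attempt": 1,
  "error": {
    "message": "example failure message",
    "stack": "example stack trace"
  }
}

# step 5: migration complete, drop the temporary index
DELETE /.kibana_migration_abc123
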
@spalger added the Team:Operations label Nov 25, 2018
@elasticmachine
Contributor

Pinging @elastic/kibana-operations

@rudolf
Contributor

rudolf commented Oct 1, 2020

Closing in favour of #66056

@rudolf closed this as completed Oct 1, 2020