
migrations fail with "Not enough active copies to meet shard count of [ALL] (have 1, needed 2)" #127136

Closed
Tracked by #129016
rudolf opened this issue Mar 8, 2022 · 7 comments · Fixed by #136605
Assignees: lukeelmers
Labels: Feature:Migrations, impact:needs-assessment (Product and/or Engineering needs to evaluate the impact of the change), loe:medium (Medium Level of Effort), project:ResilientSavedObjectMigrations (Reduce Kibana upgrade failures by making saved object migrations more resilient), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.)

Comments

@rudolf (Contributor) commented Mar 8, 2022

Part of #129016

We've observed that some Kibana upgrades to 8.0+ on a single-node Elasticsearch cluster can fail with:

[{"type":"unavailable_shards_exception","reason":"[.kibana_task_manager_7.17.1_reindex_temp][0] Not enough active copies to meet shard count of [ALL] (have 1, needed 2). Timeout: [1m], request: [BulkShardRequest [[.kibana_task_manager_7.17.1_reindex_temp][0]] containing [32] requests]"},{"type":"unavailable_shards_exception","reason":"[.kibana_task_manager_7.17.1_reindex_temp][0] Not enough active copies to meet shard count of [ALL] (have 1, needed 2). Timeout: [1m]

This issue has been observed in Elasticsearch Service (ESS) and Elastic Cloud Enterprise (ECE). There is an orchestration issue where the shutdown metadata is not cleared (cf. the Get shutdown API). This usually happens after some failed configuration changes.

Related to the above, the number of replicas (set by index.auto_expand_replicas) may be incorrect in a single-node cluster (cf. elastic/elasticsearch#84788).
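
To confirm whether either condition applies, you can inspect the cluster directly. Below is a minimal diagnostic sketch, assuming the 8.x @elastic/elasticsearch Node.js client; the endpoint is a placeholder and the index name is taken from the error above.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint

async function diagnose(): Promise<void> {
  // Check for stale node shutdown metadata (Get shutdown API). Entries that
  // remain after a configuration change has finished point to the
  // orchestration issue described above.
  const shutdown = await client.shutdown.getNode();
  console.log('Shutdown metadata:', JSON.stringify(shutdown, null, 2));

  // Check the replica settings of the temporary migration index. On a
  // single-node cluster, index.auto_expand_replicas should keep the replica
  // count at 0 so the index can reach a green status.
  const settings = await client.indices.getSettings({
    index: '.kibana_task_manager_7.17.1_reindex_temp', // index name from the error above
  });
  console.log('Index settings:', JSON.stringify(settings, null, 2));
}

diagnose().catch(console.error);
```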

Workaround

a) Delete the shutdown metadata (cf. the Delete shutdown API). This will allow the shards to be allocated and the Kibana saved objects migration to complete (see the sketch after this list).

b) Wait for the next restart of Kibana (this happens automatically in ECE/ESS in this particular scenario).

c) Check that Kibana is healthy and accessible. Kibana logs should reveal that the saved objects migration successfully completed. For example:

[.kibana] Migration completed after 124282ms
...
[.kibana_task_manager] Migration completed after 81877ms
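
For workaround (a), here is a minimal sketch of clearing the stale shutdown metadata and waiting for the saved object indices to recover, again assuming the 8.x @elastic/elasticsearch client (the endpoint is a placeholder; the same calls can also be made against the Get/Delete shutdown REST APIs directly):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint

async function clearShutdownMetadata(): Promise<void> {
  // List nodes that still have shutdown metadata registered.
  const { nodes } = await client.shutdown.getNode();

  for (const node of nodes) {
    // Delete the stale shutdown record (Delete shutdown API) so replica
    // shards can be allocated again.
    await client.shutdown.deleteNode({ node_id: node.node_id });
    console.log(`Removed shutdown metadata for node ${node.node_id}`);
  }

  // Wait for the saved object indices to recover before retrying the
  // upgrade (or simply restart Kibana as in workaround b).
  await client.cluster.health({
    index: '.kibana*',
    wait_for_status: 'green',
    timeout: '60s',
  });
}

clearShutdownMetadata().catch(console.error);
```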
rudolf added the Team:Core, project:ResilientSavedObjectMigrations, and Feature:Migrations labels Mar 8, 2022
@elasticmachine (Contributor) commented: Pinging @elastic/kibana-core (Team:Core)

@romain-chanu commented Mar 10, 2022

After further investigation, we found out that this issue is caused by the following:

a) An orchestration issue in ESS/ECE where the shutdown metadata is not cleared (cf. the Get shutdown API). This usually happens after some failed configuration changes. If so, the workaround here is to delete the shutdown metadata (cf. the Delete shutdown API). This will allow the shards to be allocated and the Kibana saved objects migration to complete.

b) Related to the above, the number of replicas (set by index.auto_expand_replicas) may be incorrect in a single-node cluster (cf. elastic/elasticsearch#84788).

@pgayvallet (Contributor) commented

From #129016:

We want to:

In any case:

  • Be able to identify it, and to assign a unique error code to it
  • Add online documentation describing how to fix, or work around, the failure
    • it can either be one page per failure or one page listing all the failures, TBD
  • Surface the error code, and the link to the documentation, in the failure's log

When the failure's cause can be predetermined:

  • fail-fast during the migration
  • surface the problem in Upgrade Assistant
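
As a rough illustration of the last two bullet points (a hypothetical sketch only, not Kibana's actual error codes, documentation pages, or logging API):

```ts
// Hypothetical shapes only; real Kibana error codes, doc URLs, and logger APIs may differ.
interface MigrationFailure {
  code: string;    // unique error code identifying this class of failure
  docLink: string; // online documentation describing the fix or workaround
  message: string; // the underlying Elasticsearch error
}

function logMigrationFailure(failure: MigrationFailure): void {
  console.error(
    `[${failure.code}] Saved objects migration failed: ${failure.message}. ` +
      `See ${failure.docLink} for how to resolve it.`
  );
}

logMigrationFailure({
  code: 'MIGRATION_UNAVAILABLE_SHARDS', // hypothetical code
  docLink: 'https://www.elastic.co/guide/...', // hypothetical documentation page
  message: 'Not enough active copies to meet shard count of [ALL] (have 1, needed 2)',
});
```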

exalate-issue-sync bot added the impact:needs-assessment and loe:medium labels Mar 31, 2022
lukeelmers self-assigned this May 12, 2022
@rudolf (Contributor, Author) commented May 18, 2022

We believe the root cause of this problem will be fixed in 8.3 by elastic/elasticsearch#86047.

@rudolf (Contributor, Author) commented May 23, 2022

elastic/elasticsearch#85277 also seems to address part of this problem; although it was backported to 7.17.2, we still saw occurrences of this in 7.17.2.

@rudolf (Contributor, Author) commented Jul 13, 2022

This error occurs under the following circumstances:

  1. Kibana creates the temporary index, but not all shards are ready within the timeout.
  2. Because the shards weren't ready, we poll for the index status to become yellow, meaning the primary shard has been started.
  3. Once the status is yellow, we start writing batches of documents into the temporary index with wait_for_active_shards="all". If, at this point, the replica shard hasn't been started and the index status is still yellow, the batch indexing request will fail and crash the Kibana process (see the sketch after this list).
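
A condensed sketch of that sequence, assuming the 8.x @elastic/elasticsearch client (the endpoint, index name, and document content are placeholders, not the actual migration code):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint
const tempIndex = '.kibana_task_manager_7.17.1_reindex_temp';

async function showFailureMode(): Promise<void> {
  // Step 2: wait only until the index is "yellow", i.e. the primary shard is started.
  await client.cluster.health({ index: tempIndex, wait_for_status: 'yellow', timeout: '30s' });

  // Step 3: write a batch that requires ALL shard copies to be active.
  // If a replica is expected but not yet started (index still yellow),
  // this request fails with unavailable_shards_exception after the timeout
  // and crashes the migration.
  await client.bulk({
    index: tempIndex,
    wait_for_active_shards: 'all',
    timeout: '1m',
    operations: [{ index: { _id: 'sample:1' } }, { type: 'sample', sample: {} }],
  });
}

showFailureMode().catch(console.error);
```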

Apart from the referenced bugs for single-node clusters, this is almost always a temporary problem: if Kibana gets restarted, the index status usually becomes green eventually and the migration can complete successfully without intervention.

To fix this we should wait for the temporary index status to become "green". We initially chose a "yellow" index status because that's all we need for reading from the source index, but when writing to an index we always need a "green" status because we use wait_for_active_shards="all" to ensure durable writes.

So WAIT_FOR_YELLOW_SOURCE should continue to wait for a "yellow" status, but the createIndex action should wait for a "green" status.
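
A minimal sketch of what that change amounts to, assuming the 8.x @elastic/elasticsearch client (this is not the actual createIndex action; names and parameters are illustrative):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint

// Create the temporary index and only return once every shard copy is started,
// so that later bulk writes with wait_for_active_shards="all" cannot hit
// unavailable_shards_exception.
async function createIndexAndWaitForGreen(index: string): Promise<void> {
  const res = await client.indices.create({
    index,
    wait_for_active_shards: 'all',
    timeout: '60s',
  });

  if (!res.shards_acknowledged) {
    // Not all copies started within the timeout: poll cluster health until
    // the index reaches "green" (primary and replica shards started).
    await client.cluster.health({ index, wait_for_status: 'green', timeout: '60s' });
  }
}
```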

@asus4you commented Apr 23, 2024

Has this been resolved? We have an ES cluster in k8s and it seems like we have a similar issue, but since Kibana is down we won't be able to perform these steps. How do we get this working for our cluster? What is the resolution to this?
Any help is appreciated.
