
migrations fail with "Not enough active copies to meet shard count of [ALL] (have 1, needed 2)" #127136

Closed
Tracked by #129016
rudolf opened this issue Mar 8, 2022 · 7 comments · Fixed by #136605
Assignees: lukeelmers
Labels: Feature:Migrations, impact:needs-assessment (Product and/or Engineering needs to evaluate the impact of the change), loe:medium (Medium Level of Effort), project:ResilientSavedObjectMigrations (Reduce Kibana upgrade failures by making saved object migrations more resilient), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc.)

Comments

@rudolf (Contributor) commented Mar 8, 2022

Part of #129016

We've observed that some Kibana upgrades to 8.0+ on a single-node Elasticsearch cluster can fail with:

[{"type":"unavailable_shards_exception","reason":"[.kibana_task_manager_7.17.1_reindex_temp][0] Not enough active copies to meet shard count of [ALL] (have 1, needed 2). Timeout: [1m], request: [BulkShardRequest [[.kibana_task_manager_7.17.1_reindex_temp][0]] containing [32] requests]"},{"type":"unavailable_shards_exception","reason":"[.kibana_task_manager_7.17.1_reindex_temp][0] Not enough active copies to meet shard count of [ALL] (have 1, needed 2). Timeout: [1m]

This issue has been observed in Elasticsearch Service (ESS) and Elastic Cloud Enterprise (ECE). There is an orchestration issue where the shutdown metadata is not cleared (cf. the Get shutdown API). This usually happens after some failed configuration changes.

Related to the above, the number of replicas (set by index.auto_expand_replicas) may be incorrect in a single-node cluster (cf. elastic/elasticsearch#84788).
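
To confirm whether either condition applies, you can inspect the cluster directly. Below is a minimal diagnostic sketch, assuming the 8.x @elastic/elasticsearch Node.js client; the endpoint is a placeholder and the index name is taken from the error above.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint

async function diagnose(): Promise<void> {
  // Check for stale node shutdown metadata (Get shutdown API). Entries that
  // remain after a configuration change has finished point to the
  // orchestration issue described above.
  const shutdown = await client.shutdown.getNode();
  console.log('Shutdown metadata:', JSON.stringify(shutdown, null, 2));

  // Check the replica settings of the temporary migration index. On a
  // single-node cluster, index.auto_expand_replicas should keep the replica
  // count at 0 so the index can reach a green status.
  const settings = await client.indices.getSettings({
    index: '.kibana_task_manager_7.17.1_reindex_temp', // index name from the error above
  });
  console.log('Index settings:', JSON.stringify(settings, null, 2));
}

diagnose().catch(console.error);
```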

Workaround

a) Delete the shutdown metadata (cf. the Delete shutdown API). This will allow the shards to be allocated and the Kibana saved objects migration to complete (see the sketch after this list).

b) Wait for the next restart of Kibana (this happens automatically in ECE/ESS in this particular scenario).

c) Check that Kibana is healthy and accessible. Kibana logs should reveal that the saved objects migration successfully completed. For example:

[.kibana] Migration completed after 124282ms
...
[.kibana_task_manager] Migration completed after 81877ms
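
For workaround (a), here is a minimal sketch of clearing the stale shutdown metadata and waiting for the saved object indices to recover, again assuming the 8.x @elastic/elasticsearch client (the endpoint is a placeholder; the same calls can also be made against the Get/Delete shutdown REST APIs directly):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint

async function clearShutdownMetadata(): Promise<void> {
  // List nodes that still have shutdown metadata registered.
  const { nodes } = await client.shutdown.getNode();

  for (const node of nodes) {
    // Delete the stale shutdown record (Delete shutdown API) so replica
    // shards can be allocated again.
    await client.shutdown.deleteNode({ node_id: node.node_id });
    console.log(`Removed shutdown metadata for node ${node.node_id}`);
  }

  // Wait for the saved object indices to recover before retrying the
  // upgrade (or simply restart Kibana as in workaround b).
  await client.cluster.health({
    index: '.kibana*',
    wait_for_status: 'green',
    timeout: '60s',
  });
}

clearShutdownMetadata().catch(console.error);
```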
rudolf added the Team:Core, project:ResilientSavedObjectMigrations, and Feature:Migrations labels Mar 8, 2022
@elasticmachine (Contributor) commented: Pinging @elastic/kibana-core (Team:Core)

@romain-chanu commented Mar 10, 2022

After further investigation, we found out that this issue is caused by the following:

a) An orchestration issue in ESS/ECE where the shutdown metadata is not cleared (cf. the Get shutdown API). This usually happens after some failed configuration changes. If so, the workaround here is to delete the shutdown metadata (cf. the Delete shutdown API). This will allow the shards to be allocated and the Kibana saved objects migration to complete.

b) Related to the above, the number of replicas (set by index.auto_expand_replicas) may be incorrect in a single-node cluster (cf. elastic/elasticsearch#84788).

@pgayvallet (Contributor) commented

From #129016:

We want to:

In any case:

  • Be able to identify it, and to assign a unique error code to it
  • Add online documentation describing how to fix, or work around, the failure
    • it can either be one page per failure or one page listing all the failures, TBD
  • Surface the error code, and the link to the documentation, in the failure's log

When the failure's cause can be predetermined:

  • fail-fast during the migration
  • surface the problem in Upgrade Assistant
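
As a rough illustration of the last two bullet points (a hypothetical sketch only, not Kibana's actual error codes, documentation pages, or logging API):

```ts
// Hypothetical shapes only; real Kibana error codes, doc URLs, and logger APIs may differ.
interface MigrationFailure {
  code: string;    // unique error code identifying this class of failure
  docLink: string; // online documentation describing the fix or workaround
  message: string; // the underlying Elasticsearch error
}

function logMigrationFailure(failure: MigrationFailure): void {
  console.error(
    `[${failure.code}] Saved objects migration failed: ${failure.message}. ` +
      `See ${failure.docLink} for how to resolve it.`
  );
}

logMigrationFailure({
  code: 'MIGRATION_UNAVAILABLE_SHARDS', // hypothetical code
  docLink: 'https://www.elastic.co/guide/...', // hypothetical documentation page
  message: 'Not enough active copies to meet shard count of [ALL] (have 1, needed 2)',
});
```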

exalate-issue-sync bot added the impact:needs-assessment and loe:medium labels Mar 31, 2022
lukeelmers self-assigned this May 12, 2022
@rudolf (Contributor, Author) commented May 18, 2022

We believe the root cause of this problem will be fixed in 8.3 by elastic/elasticsearch#86047.

@rudolf (Contributor, Author) commented May 23, 2022

elastic/elasticsearch#85277 also seems to address part of this problem; although it was backported to 7.17.2, we still saw occurrences of this in 7.17.2.

@rudolf (Contributor, Author) commented Jul 13, 2022

This error occurs under the following circumstances:

  1. Kibana creates the temporary index, but not all shards are ready within the timeout.
  2. Because the shards weren't ready, we poll for the index status to become yellow, meaning the primary shard has been started.
  3. Once the status is yellow, we start writing batches of documents into the temporary index with wait_for_active_shards="all". If, at this point, the replica shard hasn't been started and the index status is still yellow, the batch indexing request will fail and crash the Kibana process (see the sketch after this list).
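
A condensed sketch of that sequence, assuming the 8.x @elastic/elasticsearch client (the endpoint, index name, and document content are placeholders, not the actual migration code):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint
const tempIndex = '.kibana_task_manager_7.17.1_reindex_temp';

async function showFailureMode(): Promise<void> {
  // Step 2: wait only until the index is "yellow", i.e. the primary shard is started.
  await client.cluster.health({ index: tempIndex, wait_for_status: 'yellow', timeout: '30s' });

  // Step 3: write a batch that requires ALL shard copies to be active.
  // If a replica is expected but not yet started (index still yellow),
  // this request fails with unavailable_shards_exception after the timeout
  // and crashes the migration.
  await client.bulk({
    index: tempIndex,
    wait_for_active_shards: 'all',
    timeout: '1m',
    operations: [{ index: { _id: 'sample:1' } }, { type: 'sample', sample: {} }],
  });
}

showFailureMode().catch(console.error);
```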

Apart from the referenced bugs for single-node clusters, this is almost always a temporary problem: if Kibana gets restarted, the index status usually becomes green eventually and the migration can complete successfully without intervention.

To fix this we should wait for the temporary index status to become "green". We initially chose a "yellow" index status because that's all we need for reading from the source index, but when writing to an index we always need a "green" status because we use wait_for_active_shards="all" to ensure durable writes.

So WAIT_FOR_YELLOW_SOURCE should continue to wait for a "yellow" status, but the createIndex action should wait for a "green" status.
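
A minimal sketch of what that change amounts to, assuming the 8.x @elastic/elasticsearch client (this is not the actual createIndex action; names and parameters are illustrative):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'https://localhost:9200' }); // placeholder endpoint

// Create the temporary index and only return once every shard copy is started,
// so that later bulk writes with wait_for_active_shards="all" cannot hit
// unavailable_shards_exception.
async function createIndexAndWaitForGreen(index: string): Promise<void> {
  const res = await client.indices.create({
    index,
    wait_for_active_shards: 'all',
    timeout: '60s',
  });

  if (!res.shards_acknowledged) {
    // Not all copies started within the timeout: poll cluster health until
    // the index reaches "green" (primary and replica shards started).
    await client.cluster.health({ index, wait_for_status: 'green', timeout: '60s' });
  }
}
```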

@asus4you commented Apr 23, 2024

Has this been resolved? We have an ES cluster in k8s and it seems like we have a similar issue, but since Kibana is down we won't be able to perform these steps. How do we get this working for our cluster? What is the resolution to this?
Any help is appreciated.
