migrations fail with "this action would add [2] shards, but this cluster currently has [X]/[Y] maximum normal shards open" #128578

Closed
Tracked by #129016
pgayvallet opened this issue Mar 28, 2022 · 3 comments · Fixed by #132072
Assignees
Labels
Feature:Migrations impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:medium Medium Level of Effort project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@pgayvallet
Contributor

pgayvallet commented Mar 28, 2022

Part of #129016

We've observed that some Kibana upgrades to 7.17+ can fail with:

[.kibana_task_manager] Unexpected Elasticsearch ResponseError: statusCode: 400, method: PUT, url: /.kibana_task_manager_7.17.1_reindex_temp?wait_for_active_shards=all&timeout=60s 
error: [illegal_argument_exception]: Validation Failed: 1: this action would add [2] shards, 
but this cluster currently has [2999]/[3000] maximum normal shards open;,

Even though the root cause is the state of the ES cluster, this issue is here to track the problem and discuss potential solutions or workarounds, either within Kibana or upstream.

Potential solutions could be:

  • Have (Kibana?) system indices not count toward the cluster's shard limit
  • Add a warning in Upgrade Assistant when the cluster doesn't meet the shard requirement for a migration
  • Fail the migration early if the cluster is in a state that will not allow the migration to complete
  • Surface the problem in the health API
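
A minimal sketch of what the "fail early" check could look like, assuming the limit is `cluster.max_shards_per_node` multiplied by the number of data nodes. The function and field names below are hypothetical and do not come from Kibana's code base; the 3 nodes × 1000 split is only an assumption consistent with the `[3000]` limit in the error above.

```typescript
// Hypothetical input shape for illustration only.
interface ShardCapacityInput {
  maxShardsPerNode: number; // value of the cluster.max_shards_per_node setting
  dataNodeCount: number; // data nodes counted towards the limit (assumed)
  openShards: number; // shards currently open in the cluster
  shardsToAdd: number; // shards the migration is about to create
}

interface ShardCapacityResult {
  ok: boolean; // true when the new shards fit under the limit
  limit: number; // cluster-wide shard limit
  remaining: number; // shards that can still be opened
}

// Returns whether the cluster has room for the shards the migration needs.
function checkShardCapacity(input: ShardCapacityInput): ShardCapacityResult {
  const limit = input.maxShardsPerNode * input.dataNodeCount;
  const remaining = limit - input.openShards;
  return { ok: input.shardsToAdd <= remaining, limit, remaining };
}

// Numbers taken from the error above: 2999 of 3000 shards open, 2 to add.
const capacity = checkShardCapacity({
  maxShardsPerNode: 1000,
  dataNodeCount: 3,
  openShards: 2999,
  shardsToAdd: 2,
});

if (!capacity.ok) {
  console.log(
    `Not enough shard capacity: ${capacity.remaining} remaining of ${capacity.limit}`
  );
}
```

Running a check like this before creating the temporary index would let the migration stop with a clear message instead of failing on the `PUT` request.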

Links:

@pgayvallet pgayvallet added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc project:ResilientSavedObjectMigrations Reduce Kibana upgrade failures by making saved object migrations more resilient Feature:Migrations labels Mar 28, 2022
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@Bamieh
Member

Bamieh commented Mar 28, 2022

Add a warning in Upgrade Assistant when the cluster doesn't meet the shard requirement for a migration

Do you think this warning should come from Kibana's side or ES's?


Fail the migration early if the cluster is in a state that will not allow the migration to complete

IMO this would be the most feasible and impactful of the proposed solutions.

We can also provide a short link explaining how to fix the issue, similar to the link we show when users hit performance issues in Kibana:

Average event loop delay threshold exceeded ${warnThreshold}ms. Received ${meanDurationMs}ms.
Check https://ela.st/event-loop-delay-considerations for more information about scaling Kibana.

@pgayvallet
Contributor Author

From #129016:

We want to:

In any case:

  • Be able to identify it, and to assign a unique error code to it
  • Add online documentation describing how to fix, or work around, the failure
    • This can be either one page per failure or one page listing all the failures (TBD)
  • Surface the error code, and the link to the documentation, in the failure's log

When the failure's cause can be predetermined:

  • Fail fast during the migration
  • Surface the problem in Upgrade Assistant
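
As a rough illustration of the "identify it, assign a unique error code, and surface the doc link in the log" goals, here is a sketch that recognizes the shard-limit message quoted in this issue. The error code, the documentation URL, and all function names are hypothetical placeholders, not existing Kibana identifiers, and the pattern only covers the exact wording shown in this issue.

```typescript
// Hypothetical error code and documentation URL; neither exists in Kibana.
const SHARD_LIMIT_ERROR_CODE = "MIGRATION_SHARD_LIMIT_EXCEEDED";
const SHARD_LIMIT_DOC_LINK = "https://example.com/migration-failures#shard-limit";

interface ShardLimitFailure {
  code: string;
  docLink: string;
  shardsToAdd: number;
  openShards: number;
  maxShards: number;
}

// Matches the wording of the Elasticsearch message quoted in this issue.
// \s* tolerates the line break that appears after "shards," in the log.
const SHARD_LIMIT_PATTERN =
  /this action would add \[(\d+)\] shards,\s*but this cluster currently has \[(\d+)\]\/\[(\d+)\] maximum normal shards open/;

// Returns structured details when the message is a shard-limit failure,
// or undefined for any other error message.
function identifyShardLimitFailure(message: string): ShardLimitFailure | undefined {
  const match = SHARD_LIMIT_PATTERN.exec(message);
  if (!match) {
    return undefined;
  }
  return {
    code: SHARD_LIMIT_ERROR_CODE,
    docLink: SHARD_LIMIT_DOC_LINK,
    shardsToAdd: Number(match[1]),
    openShards: Number(match[2]),
    maxShards: Number(match[3]),
  };
}

// Builds a log line carrying both the error code and the documentation link.
function formatFailureLog(failure: ShardLimitFailure): string {
  return (
    `[${failure.code}] Migration cannot add ${failure.shardsToAdd} shards: ` +
    `${failure.openShards}/${failure.maxShards} shards are already open. ` +
    `See ${failure.docLink} for how to resolve this.`
  );
}
```

Matching on the message text is brittle if Elasticsearch changes its wording, so a real implementation would probably prefer a structured signal from Elasticsearch if one is available.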

@exalate-issue-sync exalate-issue-sync bot added impact:needs-assessment Product and/or Engineering needs to evaluate the impact of the change. loe:medium Medium Level of Effort labels Mar 31, 2022

4 participants