Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore metadata of deleted indices at start #48918

Merged

Conversation

DaveCTurner
Copy link
Contributor

Today in 6.x it is possible to add an index tombstone to the graveyard without
deleting the corresponding index metadata, because the deletion is slightly
deferred. If you shut down the node and upgrade to 7.x when in this state then
the node will fail to apply any cluster states, reporting

java.lang.IllegalStateException: Cannot delete index [...], it is still part of the cluster state.

This commit addresses this situation by skipping over any index metadata with a
corresponding tombstone, allowing this metadata to be cleaned up by the 7.x
node.

Today in 6.x it is possible to add an index tombstone to the graveyard without
deleting the corresponding index metadata, because the deletion is slightly
deferred. If you shut down the node and upgrade to 7.x when in this state then
the node will fail to apply any cluster states, reporting

    java.lang.IllegalStateException: Cannot delete index [...], it is still part of the cluster state.

This commit addresses this situation by skipping over any index metadata with a
corresponding tombstone, allowing this metadata to be cleaned up by the 7.x
node.
@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.5.0 v7.6.0 labels Nov 9, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@DaveCTurner
Copy link
Contributor Author

@DaveCTurner
Copy link
Contributor Author

This needs more work because it doesn't address the case where you're in this state in a rolling upgrade.

@DaveCTurner
Copy link
Contributor Author

This needs more work because it doesn't address the case where you're in this state in a rolling upgrade.

Discussed this and decided it's ok for a rolling upgrade to fall back on a full cluster restart if it happens to be in this state, and this PR will allow the full cluster restart to proceed.

Copy link
Contributor

@andrershov andrershov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I've left one request to rename the function name


final MetaData metaData = internalCluster().getInstance(ClusterService.class).state().metaData();
final Path[] paths = internalCluster().getInstance(NodeEnvironment.class).nodeDataPaths();
writeBrokenMeta(metaStateService -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not your change, but the name "writeBrokenMeta" looks invalid for two reasons - in this test we write well-formed metadata, this method performs full-cluster restart and I think this should be reflected in the method name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A more substantial change to this method (including fixing its name) is incoming in https://github.com/elastic/elasticsearch/pull/48733/files#diff-a53ee618ca95b1bde55d7f5508a03d6aR511.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However the metadata written here is well-formed but still broken - it contains a tombstone for an index that is not properly deleted.

Copy link
Contributor

@andrershov andrershov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jimczi jimczi removed the v7.5.0 label Nov 12, 2019
@DaveCTurner DaveCTurner merged commit 13170a7 into elastic:master Nov 12, 2019
@DaveCTurner DaveCTurner deleted the 2019-11-08-half-deleted-index-import branch November 12, 2019 11:02
DaveCTurner added a commit that referenced this pull request Nov 12, 2019
Today in 6.x it is possible to add an index tombstone to the graveyard without
deleting the corresponding index metadata, because the deletion is slightly
deferred. If you shut down the node and upgrade to 7.x when in this state then
the node will fail to apply any cluster states, reporting

    java.lang.IllegalStateException: Cannot delete index [...], it is still part of the cluster state.

This commit addresses this situation by skipping over any index metadata with a
corresponding tombstone, allowing this metadata to be cleaned up by the 7.x
node.
DaveCTurner added a commit that referenced this pull request Nov 12, 2019
Today in 6.x it is possible to add an index tombstone to the graveyard without
deleting the corresponding index metadata, because the deletion is slightly
deferred. If you shut down the node and upgrade to 7.x when in this state then
the node will fail to apply any cluster states, reporting

    java.lang.IllegalStateException: Cannot delete index [...], it is still part of the cluster state.

This commit addresses this situation by skipping over any index metadata with a
corresponding tombstone, allowing this metadata to be cleaned up by the 7.x
node.
@dpeddi
Copy link

dpeddi commented Jan 28, 2021

I got this error while upgrading from 6.7 to 7.0.1.
After reading this I tried another upgrade upgrade to 7.10.2 but the error is still present.

Just for other users/readers, I was able to restart my cluster by manually removing all the state with the latest lucene/luke after reading the "Cannot delete index [...], it is still part of the cluster state." message for each node start

@DaveCTurner
Copy link
Contributor Author

Just for other users/readers, I was able to restart my cluster by manually removing all the state with the latest lucene/luke

For the sake of other users/readers, this advice is extremely dangerous and we do not recommend following it. Editing the contents of the data path using a tool like Luke can result in arbitrary and silent data loss.

@dpeddi the actual solution is not to upgrade to 7.0.1, this version is already long past EOL. The current recommendation is to upgrade to the latest 6.8 and then the latest 7.x (7.10.2 at time of writing).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v7.5.0 v7.6.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants