
Prevent import of dangling indices from a later version #34264

Closed
Bukhtawar opened this issue Oct 3, 2018 · 8 comments · Fixed by #48652
Labels
>bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) good first issue low hanging fruit help wanted adoptme

Comments

@Bukhtawar
Contributor

An index of a higher version (6.3.2) can be restored onto a cluster whose nodes are on version 6.3.1. However, when additional nodes then try to join the cluster, their membership is rejected by compatibility checks which ensure that no index in the cluster metadata was created with a newer version of Elasticsearch, and that all indices are at or above the minimum index compatibility version (based on Version#minimumIndexCompatibilityVersion).
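
For context, here is a minimal sketch, not the actual Elasticsearch source, of the join-time index compatibility check described above; the class name and exception messages are illustrative, and it assumes the Elasticsearch 6.x classes Version, MetaData and IndexMetaData are on the classpath.

// Illustrative sketch of the join-time index compatibility check:
// reject a joining node if any index in the cluster metadata was created on a
// newer version than the joining node, or is older than the node's minimum
// index compatibility version.
import org.elasticsearch.Version;
import org.elasticsearch.cluster.metadata.IndexMetaData;
import org.elasticsearch.cluster.metadata.MetaData;

final class JoinIndexCompatibility {

    static void ensureIndexCompatibility(Version joiningNodeVersion, MetaData metaData) {
        Version minSupported = joiningNodeVersion.minimumIndexCompatibilityVersion();
        for (IndexMetaData index : metaData) {            // MetaData iterates over its indices
            Version created = index.getCreationVersion();
            if (created.after(joiningNodeVersion)) {
                throw new IllegalStateException("index " + index.getIndex()
                    + " version not supported: " + created
                    + " the node version is: " + joiningNodeVersion);
            }
            if (created.before(minSupported)) {
                throw new IllegalStateException("index " + index.getIndex()
                    + " version not supported: " + created
                    + " minimum compatible index version is: " + minSupported);
            }
        }
    }
}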

Elasticsearch version (bin/elasticsearch --version): 6.3.1

Plugins installed: []

JVM version (java -version): JDK 10

OS version (uname -a if on a Unix-like system): Linux ip-10-212-18-25 4.9.70-25.242.amzn1.x86_64

Description of the problem including expected versus actual behavior:

Steps to reproduce:

  1. Restore an index of version 6.3.2 onto a cluster whose data nodes are on version 6.3.1
  2. Spin up a new data node with version 6.3.1
  3. Notice the new node is unable to join the cluster

Provide logs (if relevant):
[2018-10-03T00:00:10,030][INFO ][o.e.d.z.ZenDiscovery ] [wowvdMi] failed to send join request to master [{5GZT6AM}{5GZT6AMsSSy2x5vCdDLFXA}{1aYFoazmRXWt2OKoo3Slfg}{10.xx.xxx.xxx}{10.xx.xxx.xxx:9300}], reason [RemoteTransportException[[5GZT6AM][10.xx.xx.xxx:9300][internal:discovery/zen/join]]; nested: IllegalStateException[index [index-docs_2018092113/fKi49vf3TzKDshgg9ydzaQ] version not supported: 6.3.2 the node version is: 6.3.1]; ]

@ywelsch ywelsch added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Oct 3, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@original-brownbear original-brownbear self-assigned this Oct 17, 2018
@original-brownbear
Member

I have reproduced this => fixing it now so that restoring a newer index version onto an older data node version is not allowed.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 21, 2018
* Restore should check minimum version in the cluster and not
the current master node's version for compatibility
* Closes elastic#34264
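
A minimal sketch of the check that commit proposes, comparing the snapshot version against the lowest node version in the cluster rather than only the master's own version; the helper class below is illustrative and not the actual change, and it assumes Elasticsearch's ClusterState and SnapshotInfo types.

// Illustrative sketch: fail the restore if the snapshot was created on a
// version newer than the oldest node currently in the cluster.
import org.elasticsearch.Version;
import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.snapshots.SnapshotInfo;

final class RestoreVersionCheck {

    static void validateRestoreVersion(ClusterState state, SnapshotInfo snapshotInfo) {
        // Lowest version among all nodes currently in the cluster.
        Version minNodeVersion = state.nodes().getMinNodeVersion();
        // version() may be null for very old snapshots; assume non-null here for brevity.
        if (snapshotInfo.version() != null && snapshotInfo.version().after(minNodeVersion)) {
            throw new IllegalStateException("snapshot was created with version ["
                + snapshotInfo.version() + "] which is higher than the minimum node version ["
                + minNodeVersion + "] in this cluster");
        }
    }
}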
@original-brownbear
Member

I talked about this with @DaveCTurner and the behavior here is not necessarily a bug.

We are not actually restoring any data to the old data node in a mixed cluster; all we're doing is acknowledging the restore, which puts the newer index into the cluster state and prevents old data nodes from joining.

With the change I suggested in #34676 we would prevent starting the restore during a rolling upgrade, which may be less confusing but also reduces functionality during the rolling upgrade (restores seem to work fine if you have new-version data nodes present).

=> maybe the behavior is ok as is?

@DaveCTurner
Contributor

I'd like more information from @Bukhtawar because I tried the steps given to reproduce this and got the expected error when trying to restore a snapshot taken on a 6.4.2 cluster into a 6.4.1 cluster:

POST /_snapshot/my_backup/snapshot_1/_restore
# 500 Internal Server Error
# {
#   "status": 500,
#   "error": {
#     "reason": "[my_backup:snapshot_1/qHahtNWXQ3elpe8rYqHEkA] the snapshot was created with Elasticsearch version [6.4.2] which is higher than the version of this node [6.4.1]",
#     "root_cause": [
#       {
#         "reason": "[my_backup:snapshot_1/qHahtNWXQ3elpe8rYqHEkA] the snapshot was created with Elasticsearch version [6.4.2] which is higher than the version of this node [6.4.1]",
#         "type": "snapshot_restore_exception"
#       }
#     ],
#     "type": "snapshot_restore_exception"
#   }
# }

Reproducing this required a cluster comprising a mix of 6.4.1 and 6.4.2 nodes. We only expect a cluster of mixed versions to occur during a rolling upgrade, and in this situation we don't expect more 6.4.1 nodes to be joining the cluster. In fact there are other things you can do during a rolling upgrade that will block the older nodes from joining the cluster, such as simply creating an index, and the solution in all cases is to upgrade the node that was removed. I think perhaps the docs could be clearer on this subject, but I can't see that we should change the behaviour here.

Did this occur during a rolling upgrade, and if so, why were there more 6.4.1 nodes joining the cluster?

@Bukhtawar
Contributor Author

@DaveCTurner Thanks for taking a look.

We needed to urgently resize our clusters by adding additional capacity, which didn't work because the restore of a higher-version index from another snapshot repository failed the membership checks and subsequently forced us to delete that index.
We did try to restore a 6.3.2 index onto a 6.3.1 node, which worked without the need for a mixed cluster.

@DaveCTurner
Contributor

We did try to restore a 6.3.2 index onto a 6.3.1 node, which worked without the need for a mixed cluster

I cannot currently see how to reproduce this in a single-version cluster. My attempt, described above, failed with an exception. Can you explain how to reproduce this?

@Bukhtawar
Contributor Author

  • Start up a single-node ES 6.2.4 cluster (say N1). Create an index (say idx-624) on it and add some documents. The index will have version 6.2.4 in its metadata.
  • Start up another single-node ES 6.2.3 cluster (say N2). Allow N1 to join this cluster. Ensure that the master of the cluster is still N2, the older-version ES node. The combined cluster will now also show the index idx-624.
  • Initiate a snapshot on this cluster (say S1). The S3 file structure will look like the below:
    • Root-level metadata will contain the master version, which is 6.2.3
    • Index metadata for idx-624 will contain the index version, which is 6.2.4

Root
|- snap-.dat       -> root-level metadata file, contains the master ES version 6.2.3
|- indexes
   |-
      |- meta-.dat -> contains the index version 6.2.4

  • Restore S1 onto an ES cluster running version 6.2.3. During restore, ES validates only the root-level metadata, so the restore succeeds.
  • The ES cluster now contains an index, idx-624, of a higher version than the current ES version (see the sketch below).
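
For illustration, here is a minimal sketch of the kind of per-index check that the root-level validation in the steps above does not perform. The class and method structure are hypothetical; it assumes IndexMetaData#getCreationVersion reads the index's index.version.created setting, as in Elasticsearch 6.x.

// Illustrative sketch: the root-level check compares only the snapshot's
// (master) version, so a per-index check on the index creation version is
// also needed to catch the idx-624 case described above.
import org.elasticsearch.Version;
import org.elasticsearch.cluster.metadata.IndexMetaData;

final class PerIndexRestoreCheck {

    static void ensureIndexRestorable(IndexMetaData indexMetaData, Version minNodeVersion) {
        Version indexCreated = indexMetaData.getCreationVersion(); // e.g. 6.2.4 for idx-624
        if (indexCreated.after(minNodeVersion)) {                  // e.g. 6.2.3 cluster
            throw new IllegalStateException("index [" + indexMetaData.getIndex()
                + "] was created with version [" + indexCreated
                + "] which is newer than the minimum node version [" + minNodeVersion + "]");
        }
    }
}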

@DaveCTurner
Contributor

DaveCTurner commented Oct 27, 2018

Aha, thanks, that helps. This is a very strange sequence of operations - you are essentially merging two distinct clusters together, which only works today because of the lenience that dangling indices provide. Indeed when I try this I see the vital log message as the later-versioned index is brought into the earlier-versioned cluster:

[2018-10-27T09:37:36,225][INFO ][o.e.g.LocalAllocateDangledIndices] [VNUG9FA] auto importing dangled indices [[i/aU903bhuQ_-V1_9t-XC99A]/OPEN] from [{u13ixhL}{u13ixhL1SPK-fcYmr3AVQg}{yetE4T05SjeCbUFlxdyxBw}{127.0.0.1}{127.0.0.1:9301}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]

I think we should not import the dangling 6.4.2 index in this case, because as soon as that has happened no more 6.4.1 nodes can join the cluster - there is no need to snapshot and restore anything. Good catch.

@DaveCTurner DaveCTurner added >bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) good first issue low hanging fruit and removed :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs feedback_needed labels Oct 27, 2018
@DaveCTurner DaveCTurner added the help wanted adoptme label Oct 27, 2018
@DaveCTurner DaveCTurner changed the title A higher version of an index can be restored in a cluster but when nodes try to join the cluster, membership fails Prevent import of dangling indices from a later version Oct 27, 2018
DaveCTurner pushed a commit that referenced this issue Oct 31, 2019
Today it is possible that we import a dangling index that was created in a
newer version than one or more of the nodes in the cluster. Such an index would
prevent the older node(s) from rejoining the cluster if they were to briefly
leave it for some reason. This commit prevents the import of such dangling
indices.

Fixes #34264
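
A minimal sketch of the kind of filtering this commit describes, assuming the dangling-index importer can see the minimum node version in the cluster; the class and method names are illustrative and not the actual implementation in #48652.

// Illustrative sketch: skip dangling indices created on a version newer than
// the oldest node currently in the cluster, so that importing them cannot
// block that node from rejoining.
import java.util.List;
import java.util.stream.Collectors;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.metadata.IndexMetaData;

final class DanglingIndexFilter {

    static List<IndexMetaData> importableDanglingIndices(List<IndexMetaData> dangling,
                                                         Version minNodeVersion) {
        return dangling.stream()
            .filter(index -> index.getCreationVersion().onOrBefore(minNodeVersion))
            .collect(Collectors.toList());
    }
}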