Adjusting the number of replicas can remove valid shard copies from the in-sync set #21719

Closed
ywelsch opened this issue Nov 21, 2016 · 3 comments
Labels: >bug, :Distributed Indexing/Distributed

ywelsch (Contributor) commented Nov 21, 2016

Assume a 3-node cluster (1 master, 2 data nodes) with an index that has 1 primary and 1 replica. Decommission the node holding the replica shard by shutting it down and wiping its state. The master moves the replica shard to unassigned, but keeps its allocation id in the in-sync set as long as no replication operations happen on the primary. Now decrease the number of replicas to zero. This does not remove the allocation id of the replica from the in-sync set. Shut down the node with the primary. This moves the primary shard to unassigned AND updates the in-sync set:
Logic in IndexMetaDataUpdater comes into play that limits the number of in-sync entries to the maximum number of shards that can be active (namely 1, as we have decreased the number of replicas to 0). The set of in-sync allocation ids contains the allocation ids of both the primary and the replica. The algorithm in IndexMetaDataUpdater has no way to choose which id to eliminate from the in-sync set and eliminates one at random. Assume the primary's id is eliminated. After restarting the node with the primary, the primary cannot be automatically allocated, as its shard copy no longer matches any entry in the in-sync set.
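A minimal sketch of how to observe this on a local cluster (the index name test, host, and port are illustrative, not taken from the description above); the in-sync allocation ids are part of the index metadata in the cluster state:

# reduce the number of replicas to 0 while the replica copy is unassigned
curl -XPUT 'http://localhost:9200/test/_settings' -d '{
  "index.number_of_replicas": 0
}'

# the in_sync_allocations section of the index metadata may still list
# both the primary and the stale replica id
curl -XGET 'http://localhost:9200/_cluster/state/metadata/test?pretty'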

vanga commented Nov 30, 2016

I am not sure if this is the right place to put this, but I ran into something like this twice now.

I did the following operations (I don't remember the exact order):

- Added a new node
- Disabled allocation
- Added more nodes
- Excluded a few nodes from shard allocation
- Changed the number of replicas to 0

I also changed node awareness in between (disabled it by setting the awareness attribute to null).
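For reference, these operations correspond roughly to settings calls like the following (a sketch only; the exact values, node filters, and order are assumptions, not taken from the report):

# disable shard allocation
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# exclude a node from shard allocation (IP is a placeholder)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.exclude._ip": "10.0.0.1" }
}'

# drop the number of replicas to 0 (example index name)
curl -XPUT 'http://localhost:9200/test/_settings' -d '{
  "index.number_of_replicas": 0
}'

# disable allocation awareness by clearing the awareness attribute
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.awareness.attributes": null }
}'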

I ended up with unassigned shards even after reverting all the cluster updates I did before. I looked at the disk and I could see that the shard data is there. The routing table in the cluster state gives NODE_LEFT as the reason.

I waited for almost an hour and some shards still didn't get allocated. I manually called _cluster/reroute, which didn't make any difference.

I had to restart one of the nodes where the unassigned shard data is present on disk. Even then, some indices are still in an unassigned state.

Routing information for the shard that didn't get allocated even after restarting

               "3": [
                  {
                     "state": "UNASSIGNED",
                     "primary": true,
                     "node": null,
                     "relocating_node": null,
                     "shard": 3,
                     "index": "test",
                     "recovery_source": {
                        "type": "EXISTING_STORE"
                     },
                     "unassigned_info": {
                        "reason": "NODE_LEFT",
                        "at": "2016-11-30T09:05:10.817Z",
                        "delayed": false,
                        "details": "node_left[fWo-O27hRW-egfvtpQO-BQ]",
                        "allocation_status": "no_valid_shard_copy"
                     }
                  },

curl -XGET 'http://localhost:9200/_shard_stores?pretty'

{
  "indices" : {
    "test" : {
      "shards" : {
        "3" : {
          "stores" : [
            {
              "fWo-O27hRW-egfvtpQO-BQ" : {
                "name" : "data-primary-255684",
                "ephemeral_id" : "aJV_xqKqR6i_QLC1lWortg",
                "transport_address" : "10.0.34.17:9300",
                "attributes" : {
                  "rack_id" : "primary"
                }
              },
              "allocation_id" : "45790dx0S9Sb2T7nHQnbGw",
              "allocation" : "unused"
            }
          ]
        }
      }
    }
  }
}

ywelsch (Contributor, Author) commented Nov 30, 2016

Can you also provide the output of

curl -XGET 'http://localhost:9200/_cluster/allocation/explain' -d'{
  "index": "test",
  "shard": 3,
  "primary": true
}'

As a workaround, if you're absolutely sure that only this one node has data for this shard (i.e. all nodes were available while running the _shard_stores command), you can use the allocate_stale_primary command (which is part of the reroute command, see https://www.elastic.co/guide/en/elasticsearch/reference/master/cluster-reroute.html) to explicitly tell the system to choose the shard data on node data-primary-255684 as the primary copy. The risk with running allocate_stale_primary is that if the data on that node is not the most recent (there is a newer copy of the data somewhere else), you will experience data loss, as the newer shard copy will be discarded.
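A sketch of what that reroute call could look like for the shard above (index name, shard number, and node name are taken from the earlier output; allocate_stale_primary requires accept_data_loss to be set explicitly):

curl -XPOST 'http://localhost:9200/_cluster/reroute' -d'{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "test",
        "shard": 3,
        "node": "data-primary-255684",
        "accept_data_loss": true
      }
    }
  ]
}'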

vanga commented Nov 30, 2016

I am making some modifications to the cluster, so this output may have some irrelevant info:
https://gist.github.com/vanga/7ff86fc51127e2be42670cd4c78cd9ec

The main index was fixed by restarting the node; I'm not sure why this shard didn't get allocated even though I can see the corresponding files for this shard on disk.

you can use the allocate_stale_primary command (which is part of the reroute command,

After doing this manually, it's now allocated.
Thanks.

ywelsch closed this as completed in 13e1a6f on Dec 7, 2016
ywelsch added a commit that referenced this issue Dec 7, 2016
This commit makes two changes to how the in-sync allocations set is updated:

- the set is only trimmed when it grows. This prevents trimming too eagerly when the number of replicas was decreased while shards were unassigned.
- the allocation id of an active primary that failed is only removed from the in-sync set if another replica gets promoted to primary. This prevents the situation where the only available shard copy in the cluster gets removed from the in-sync set.

Closes #21719
lcawl added the :Distributed Indexing/Distributed label and removed the :Allocation label on Feb 13, 2018