Adjusting the number of replicas can remove valid shard copies from the in-sync set #21719

Closed
ywelsch opened this issue Nov 21, 2016 · 3 comments
Labels: >bug, :Distributed Indexing/Distributed

ywelsch (Contributor) commented Nov 21, 2016

Assume a 3-node cluster (1 master, 2 data nodes) with an index that has 1 primary and 1 replica. Decommission the node holding the replica shard by shutting it down and wiping its state. The master moves the replica shard to unassigned, but keeps its allocation id in the in-sync set as long as no replication operations happen on the primary. Now decrease the number of replicas to zero. This does not remove the allocation id of the replica from the in-sync set. Shut down the node with the primary. This moves the primary shard to unassigned AND updates the in-sync set:
Logic in IndexMetaDataUpdater comes into play that limits the number of in-sync entries to the maximum number of shards that can be active (namely 1, as we have decreased the number of replicas to 0). The set of in-sync allocation ids contains the allocation ids of both the primary and the replica. The algorithm in IndexMetaDataUpdater has no way to choose which id to eliminate from the in-sync set and eliminates one at random. Assume the primary's id is eliminated. After restarting the node with the primary, the primary cannot be automatically allocated, as its shard copy no longer matches any entry in the in-sync set.
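A minimal sketch of how to observe this on a local cluster (the index name test, host, and port are illustrative, not taken from the description above); the in-sync allocation ids are part of the index metadata in the cluster state:

# reduce the number of replicas to 0 while the replica copy is unassigned
curl -XPUT 'http://localhost:9200/test/_settings' -d '{
  "index.number_of_replicas": 0
}'

# the in_sync_allocations section of the index metadata may still list
# both the primary and the stale replica id
curl -XGET 'http://localhost:9200/_cluster/state/metadata/test?pretty'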

vanga commented Nov 30, 2016

I am not sure if this is the right place to put this, but I ran into something like this twice now.

I did the following operations (I don't remember the exact order):

- Added a new node
- Disabled allocation
- Added more nodes
- Excluded a few nodes from shard allocation
- Changed the number of replicas to 0

I also changed node awareness in between (disabled it by setting the awareness attribute to null).
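For reference, these operations correspond roughly to settings calls like the following (a sketch only; the exact values, node filters, and order are assumptions, not taken from the report):

# disable shard allocation
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# exclude a node from shard allocation (IP is a placeholder)
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.exclude._ip": "10.0.0.1" }
}'

# drop the number of replicas to 0 (example index name)
curl -XPUT 'http://localhost:9200/test/_settings' -d '{
  "index.number_of_replicas": 0
}'

# disable allocation awareness by clearing the awareness attribute
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.awareness.attributes": null }
}'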

I ended up with unassigned shards even after reverting all the cluster updates I did before. I looked at the disk and I could see that the shard data is there. The routing table in the cluster state gives NODE_LEFT as the reason.

I waited for almost an hour and some shards still didn't get allocated. I manually called _cluster/reroute, which didn't make any difference.

I had to restart one of the nodes where the unassigned shard data is present on disk. Even then, some indices are still in an unassigned state.

Routing information for the shard that didn't get allocated even after restarting

               "3": [
                  {
                     "state": "UNASSIGNED",
                     "primary": true,
                     "node": null,
                     "relocating_node": null,
                     "shard": 3,
                     "index": "test",
                     "recovery_source": {
                        "type": "EXISTING_STORE"
                     },
                     "unassigned_info": {
                        "reason": "NODE_LEFT",
                        "at": "2016-11-30T09:05:10.817Z",
                        "delayed": false,
                        "details": "node_left[fWo-O27hRW-egfvtpQO-BQ]",
                        "allocation_status": "no_valid_shard_copy"
                     }
                  },

curl -XGET 'http://localhost:9200/_shard_stores?pretty'

{
  "indices" : {
    "test" : {
      "shards" : {
        "3" : {
          "stores" : [
            {
              "fWo-O27hRW-egfvtpQO-BQ" : {
                "name" : "data-primary-255684",
                "ephemeral_id" : "aJV_xqKqR6i_QLC1lWortg",
                "transport_address" : "10.0.34.17:9300",
                "attributes" : {
                  "rack_id" : "primary"
                }
              },
              "allocation_id" : "45790dx0S9Sb2T7nHQnbGw",
              "allocation" : "unused"
            }
          ]
        }
      }
    }
  }
}

ywelsch (Contributor, Author) commented Nov 30, 2016

Can you also provide the output of

curl -XGET 'http://localhost:9200/_cluster/allocation/explain' -d'{
  "index": "test",
  "shard": 3,
  "primary": true
}'

As a workaround, if you're absolutely sure that only this one node has data for this shard (i.e. all nodes were available while running the _shard_stores command), you can use the allocate_stale_primary command (which is part of the reroute command, see https://www.elastic.co/guide/en/elasticsearch/reference/master/cluster-reroute.html) to explicitly tell the system to choose the shard data on node data-primary-255684 as the primary copy. The risk with running allocate_stale_primary is that if the data on that node is not the most recent (there is a newer copy of the data somewhere else), you will experience data loss, as the newer shard copy will be discarded.
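A sketch of what that reroute call could look like for the shard above (index name, shard number, and node name are taken from the earlier output; allocate_stale_primary requires accept_data_loss to be set explicitly):

curl -XPOST 'http://localhost:9200/_cluster/reroute' -d'{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "test",
        "shard": 3,
        "node": "data-primary-255684",
        "accept_data_loss": true
      }
    }
  ]
}'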

vanga commented Nov 30, 2016

I am making some modifications to the cluster, so this output may have some irrelevant info:
https://gist.github.com/vanga/7ff86fc51127e2be42670cd4c78cd9ec

The main index was fixed by restarting the node; I'm not sure why this shard didn't get allocated even though I can see the corresponding files for this shard on disk.

you can use the allocate_stale_primary command (which is part of the reroute command,

After doing this manually, it's now allocated.
Thanks.

ywelsch closed this as completed in 13e1a6f on Dec 7, 2016
ywelsch added a commit that referenced this issue Dec 7, 2016
This commit makes two changes to how the in-sync allocations set is updated:

- the set is only trimmed when it grows. This prevents trimming too eagerly when the number of replicas was decreased while shards were unassigned.
- the allocation id of an active primary that failed is only removed from the in-sync set if another replica gets promoted to primary. This prevents the situation where the only available shard copy in the cluster gets removed from the in-sync set.

Closes #21719
lcawl added the :Distributed Indexing/Distributed label and removed the :Allocation label on Feb 13, 2018