[BUG][Remote store] All Indices not getting recovered after quorum loss when cluster manager nodes are replaced, failing with TranslogCorruptedException
#10790
Describe the bug
On clusters with remote store enabled, all indices are not getting recovered after quorum loss when cluster manager nodes are replaced
Scenario: Multi-node domain with dedicated cluster manager nodes -> index data -> terminate all cluster manager nodes and replace them -> fix quorum loss by setting cluster.initial_master_nodes to the new cluster manager node IPs on one of the cluster manager nodes and restarting that node -> fix detached data nodes -> shard recovery fails with TranslogCorruptedException and NoSuchFileException
Logs
[2023-10-20T15:28:42,174][INFO ][c.a.c.e.logger ] [349d19b6dc461975ef0515e0f5224587] GET /_cat/master h=node 200 OK 33 1
[2023-10-20T15:28:42,607][DEBUG][o.o.a.a.c.a.TransportClusterAllocationExplainAction] [349d19b6dc461975ef0515e0f5224587] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[test-2sdhve7ec7][0], node[null], [P], recovery_source[remote store recovery [gVl21sdOSBq8YIEddlPKNw]], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-10-20T14:21:09.795Z], failed_attempts[5], failed_nodes[[9FWkpDa3RnWWcdZuTvs66g, gnxcaMOIQiar3G6hqACpzg]], delayed=false, details[failed shard on node [9FWkpDa3RnWWcdZuTvs66g]: failed recovery, failure RecoveryFailedException[[test-2sdhve7ec7][0]: Recovery failed on {ae0618cb89019b62790046bef5909871}{9FWkpDa3RnWWcdZuTvs66g}{egtY9SR5TEGIB13UuJVOYw}{<redacted>}{dir}
...
REDACTED
...
; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp] is corrupted]; nested: NoSuchFileException[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp]; ], allocation_status[deciders_no]]]
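For triage, the nested exception chain can be pulled out of an allocation-explain failure string like the one above. The following is only a minimal sketch: the function name `nested_exceptions` and the regular expression are illustrative assumptions, not part of OpenSearch, and the sample string is the tail of the log line quoted in this issue.

```python
import re
from typing import List

# Hypothetical helper: matches the exception class name that follows each
# "nested:" marker in a failure-details string.
_NESTED = re.compile(r"nested:\s*([A-Za-z_][\w.]*(?:Exception|Error))\[")


def nested_exceptions(details: str) -> List[str]:
    """Return the exception class names after 'nested:' markers, in order."""
    return _NESTED.findall(details)


# Tail of the failure details from the log above (paths already redacted there).
sample = (
    "; nested: IndexShardRecoveryException[failed recovery]; "
    "nested: TranslogCorruptedException[translog from source "
    "[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp] "
    "is corrupted]; "
    "nested: NoSuchFileException"
    "[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp]; "
    "], allocation_status[deciders_no]]]"
)

print(nested_exceptions(sample))
```

Here the innermost entry is the NoSuchFileException for translog.ckp, which is what the TranslogCorruptedException wraps.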
rahulkarajgikar
changed the title
[BUG][Remote store] All Indices not getting recovered after quorum loss, failing with TranslogCorruptedException
[BUG][Remote store] All Indices not getting recovered after quorum loss when master nodes are replaced, failing with TranslogCorruptedException
Oct 20, 2023
rahulkarajgikar
changed the title
[BUG][Remote store] All Indices not getting recovered after quorum loss when master nodes are replaced, failing with TranslogCorruptedException
[BUG][Remote store] All Indices not getting recovered after quorum loss when cluster manager nodes are replaced, failing with TranslogCorruptedException
Oct 20, 2023
Additional context
Meta issue: #10523