[BUG][Remote store] All Indices not getting recovered after quorum loss when cluster manager nodes are replaced, failing with TranslogCorruptedException #10790

rahulkarajgikar · 2023-10-20T15:59:49Z

Describe the bug
On clusters with remote store enabled, all indices are not getting recovered after quorum loss when cluster manager nodes are replaced.

Scenario:
1. Start with a multi-node domain with dedicated cluster manager nodes.
2. Index data.
3. Terminate all cluster manager nodes and replace them.
4. Fix the quorum loss by setting cluster.initial_master_nodes with the new cluster manager node IPs on one of the cluster manager nodes and restarting that node.
5. Fix the detached data nodes.
6. Result: shard recovery fails with TranslogCorruptedException and NoSuchFileException.

Logs

[2023-10-20T15:28:42,174][INFO ][c.a.c.e.logger           ] [349d19b6dc461975ef0515e0f5224587] GET /_cat/master h=node 200 OK 33 1
[2023-10-20T15:28:42,607][DEBUG][o.o.a.a.c.a.TransportClusterAllocationExplainAction] [349d19b6dc461975ef0515e0f5224587] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[test-2sdhve7ec7][0], node[null], [P], recovery_source[remote store recovery [gVl21sdOSBq8YIEddlPKNw]], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-10-20T14:21:09.795Z], failed_attempts[5], failed_nodes[[9FWkpDa3RnWWcdZuTvs66g, gnxcaMOIQiar3G6hqACpzg]], delayed=false, details[failed shard on node [9FWkpDa3RnWWcdZuTvs66g]: failed recovery, failure RecoveryFailedException[[test-2sdhve7ec7][0]: Recovery failed on {ae0618cb89019b62790046bef5909871}{9FWkpDa3RnWWcdZuTvs66g}{egtY9SR5TEGIB13UuJVOYw}{<redacted>}{dir}
...
REDACTED
...
; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp] is corrupted]; nested: NoSuchFileException[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp]; ], allocation_status[deciders_no]]]
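For anyone triaging similar failures, the nested exception chain in the allocation-explain output above can be pulled out with a few lines of Python. This is only a sketch: `nested_exceptions` and `root_cause` are hypothetical helper names, not part of OpenSearch, and the regular expression assumes the `nested: ExceptionName[` layout seen in this particular log line. `SAMPLE` is the tail of the log entry above, copied with its redactions intact.

```python
import re
from typing import List, Optional

# Tail of the allocation-explain log line above, redactions left as-is.
SAMPLE = (
    "; nested: IndexShardRecoveryException[failed recovery]; "
    "nested: TranslogCorruptedException[translog from source "
    "[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp] "
    "is corrupted]; nested: NoSuchFileException"
    "[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp]; ], "
    "allocation_status[deciders_no]]]"
)

# Assumption: each nested cause is rendered as "nested: <ExceptionName>[".
NESTED = re.compile(r"nested: (\w+)\[")


def nested_exceptions(details: str) -> List[str]:
    """Return the nested exception class names in the order they appear."""
    return NESTED.findall(details)


def root_cause(details: str) -> Optional[str]:
    """Return the innermost (last) nested exception, or None if there is none."""
    chain = nested_exceptions(details)
    return chain[-1] if chain else None


if __name__ == "__main__":
    print(nested_exceptions(SAMPLE))
    print(root_cause(SAMPLE))
```

Read this way, the innermost cause in the log is the missing translog.ckp file (NoSuchFileException), which is then reported as a TranslogCorruptedException.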

Additional context
Meta issue: #10523


shwetathareja · 2023-10-30T08:49:28Z

@sachinpkale / @gbbafna can you check whether this is already fixed and, if so, resolve it?

rahulkarajgikar added the bug and untriaged labels Oct 20, 2023

psychbot added the Storage:Remote label Oct 20, 2023

shwetathareja removed the untriaged label Oct 30, 2023

linuxpi closed this as completed Oct 30, 2023
