[BUG][Remote store] All Indices not getting recovered after quorum loss when cluster manager nodes are replaced, failing with TranslogCorruptedException #10790

Closed
rahulkarajgikar opened this issue Oct 20, 2023 · 1 comment
Labels
bug Something isn't working Storage:Remote

Comments

@rahulkarajgikar
Contributor

rahulkarajgikar commented Oct 20, 2023

Describe the bug
On clusters with remote store enabled, indices fail to recover after quorum loss when the cluster manager nodes are replaced.

Scenario:
1. Start with a multi-node domain with dedicated cluster manager nodes.
2. Index data.
3. Terminate all cluster manager nodes and replace them.
4. Fix the quorum loss by setting cluster.initial_master_nodes to the new cluster manager node IPs on one of the cluster manager nodes and restarting that node.
5. Fix the detached data nodes.
6. Shard recovery fails with TranslogCorruptedException and NoSuchFileException.

Logs

[2023-10-20T15:28:42,174][INFO ][c.a.c.e.logger           ] [349d19b6dc461975ef0515e0f5224587] GET /_cat/master h=node 200 OK 33 1
[2023-10-20T15:28:42,607][DEBUG][o.o.a.a.c.a.TransportClusterAllocationExplainAction] [349d19b6dc461975ef0515e0f5224587] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[test-2sdhve7ec7][0], node[null], [P], recovery_source[remote store recovery [gVl21sdOSBq8YIEddlPKNw]], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-10-20T14:21:09.795Z], failed_attempts[5], failed_nodes[[9FWkpDa3RnWWcdZuTvs66g, gnxcaMOIQiar3G6hqACpzg]], delayed=false, details[failed shard on node [9FWkpDa3RnWWcdZuTvs66g]: failed recovery, failure RecoveryFailedException[[test-2sdhve7ec7][0]: Recovery failed on {ae0618cb89019b62790046bef5909871}{9FWkpDa3RnWWcdZuTvs66g}{egtY9SR5TEGIB13UuJVOYw}{<redacted>}{dir}
...
REDACTED
...
; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp] is corrupted]; nested: NoSuchFileException[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp]; ], allocation_status[deciders_no]]]
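
The failure details above chain several exceptions behind `nested:` markers, and the last one in the chain (here `NoSuchFileException` for `translog.ckp`) is the most specific cause reported. As a reading aid only, here is a minimal Python sketch that pulls the exception class names out of such a string. The helper `nested_exception_chain` is hypothetical and not part of OpenSearch, and it assumes the `nested: ClassName[...]` layout seen in this log.

```python
import re
from typing import List

# Hypothetical helper, not part of OpenSearch: lists the exception class names
# that follow each "nested:" marker in a failure-details string.
NESTED_EXCEPTION = re.compile(r"nested: (\w+)\[")


def nested_exception_chain(details: str) -> List[str]:
    """Return nested exception class names in the order they appear."""
    return NESTED_EXCEPTION.findall(details)


# Excerpt copied from the log above; redacted paths are left as-is.
details = (
    "; nested: IndexShardRecoveryException[failed recovery]; "
    "nested: TranslogCorruptedException[translog from source "
    "[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp] "
    "is corrupted]; "
    "nested: NoSuchFileException"
    "[<redacted>/data/nodes/0/indices/00c68yx9h1TqyFrM-rxWlFRw/0/translog/translog.ckp]; ]"
)

chain = nested_exception_chain(details)
# The last entry is the innermost exception in the chain.
root_cause = chain[-1] if chain else None
print(chain)
print(root_cause)
```

For this log the chain ends at `NoSuchFileException`, which suggests the translog checkpoint file was missing on the node's local disk when recovery was attempted.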

Additional context
Meta issue: #10523

@rahulkarajgikar rahulkarajgikar added bug Something isn't working untriaged labels Oct 20, 2023
@rahulkarajgikar rahulkarajgikar changed the title [BUG][Remote store] All Indices not getting recovered after quorum loss, failing with TranslogCorruptedException [BUG][Remote store] All Indices not getting recovered after quorum loss when master nodes are replaced, failing with TranslogCorruptedException Oct 20, 2023
@rahulkarajgikar rahulkarajgikar changed the title [BUG][Remote store] All Indices not getting recovered after quorum loss when master nodes are replaced, failing with TranslogCorruptedException [BUG][Remote store] All Indices not getting recovered after quorum loss when cluster manager nodes are replaced, failing with TranslogCorruptedException Oct 20, 2023
@shwetathareja
Member

@sachinpkale / @gbbafna can you check whether this is already fixed and, if so, resolve it?

@linuxpi linuxpi closed this as completed Oct 30, 2023

4 participants