[BUG] Seeing replica shard failure when running experimental segrep remote integrated index #8325

Closed
ashking94 opened this issue Jun 28, 2023 · 1 comment · Fixed by #8433
Labels: bug (Something isn't working), Storage:Durability (Issues and PRs related to the durability framework)

Comments

ashking94 (Member)

Describe the bug
While running the OpenSearch Benchmark (OSB) so workload against an index configured with 3 shards and 1 replica, the replica shards fail after some time with the following stack trace.

[2023-06-28T19:09:59,817][WARN ][o.o.c.r.a.AllocationService] [node-1] failing shard [failed shard, shard [so][1], node[ccufnmQhTLi6zBHeNrKvaA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=_Xgzcz-KQxyOUeYRJH4qTg], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-06-28T19:09:59.716Z], failed_attempts[4], failed_nodes[[ccufnmQhTLi6zBHeNrKvaA, BU-8J5poQRavqhIY6qB39A]], delayed=false, details[failed shard on node [BU-8J5poQRavqhIY6qB39A]: failed recovery, failure RecoveryFailedException[[so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-1}{BU-8J5poQRavqhIY6qB39A}{FiIcV_RQSXCDfhA88sNk8A}{172.32.32.222}{172.32.32.222:9300}{dim}{shard_indexing_pressure_enabled=true} ([so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-1}{BU-8J5poQRavqhIY6qB39A}{FiIcV_RQSXCDfhA88sNk8A}{172.32.32.222}{172.32.32.222:9300}{dim}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-1}{BU-8J5poQRavqhIY6qB39A}{FiIcV_RQSXCDfhA88sNk8A}{172.32.32.222}{172.32.32.222:9300}{dim}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[node-2][172.32.32.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[node-1][172.32.32.222:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: NullPointerException[Cannot invoke "org.opensearch.index.store.remote.metadata.RemoteSegmentMetadata.getMetadata()" because the return value of "org.opensearch.index.store.RemoteSegmentStoreDirectory.readLatestMetadataFile()" is null]; ], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true} ([so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[node-2][172.32.32.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[node-3][172.32.32.113:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: NullPointerException[Cannot invoke "org.opensearch.index.store.remote.metadata.RemoteSegmentMetadata.getMetadata()" because the return value of "org.opensearch.index.store.RemoteSegmentStoreDirectory.readLatestMetadataFile()" is null]; 
], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true} ([so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true})
	at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:134) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:192) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:738) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:668) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1476) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
	at java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: org.opensearch.indices.recovery.RecoveryFailedException: [so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true}
	... 8 more
Caused by: org.opensearch.transport.RemoteTransportException: [node-2][172.32.32.17:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.transport.RemoteTransportException: [node-3][172.32.32.113:9300][internal:index/shard/replication/segments_sync]
Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
	at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:458) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.StepListener.whenComplete(StepListener.java:93) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:165) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:435) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.run(SegmentReplicationTargetService.java:423) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 4 more
Caused by: java.lang.NullPointerException: Cannot invoke "org.opensearch.index.store.remote.metadata.RemoteSegmentMetadata.getMetadata()" because the return value of "org.opensearch.index.store.RemoteSegmentStoreDirectory.readLatestMetadataFile()" is null
	at org.opensearch.indices.replication.RemoteStoreReplicationSource.getCheckpointMetadata(RemoteStoreReplicationSource.java:57) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:163) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:435) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.run(SegmentReplicationTargetService.java:423) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	... 4 more

To Reproduce
Steps to reproduce the behavior:
Run OSB with the so workload against an index configured with 3 shards and 1 replica (index settings roughly as in the sketch below).
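
For context, the remote-integrated segrep index used for the test corresponds roughly to the settings sketched below. This is a sketch only: the index.remote_store.* keys and the repository names were experimental at the time and are assumptions here, so adjust them to the build under test.

```java
import java.io.IOException;

import org.opensearch.client.RequestOptions;
import org.opensearch.client.RestHighLevelClient;
import org.opensearch.client.indices.CreateIndexRequest;
import org.opensearch.common.settings.Settings;

public class CreateSoIndex {
    // Creates the "so" index with 3 primaries and 1 replica, segment replication,
    // and the experimental remote store integration. The index.remote_store.* keys
    // and the repository names below are assumptions; they changed across
    // experimental builds and must match repositories registered on the cluster.
    static void createSoIndex(RestHighLevelClient client) throws IOException {
        CreateIndexRequest request = new CreateIndexRequest("so").settings(
            Settings.builder()
                .put("index.number_of_shards", 3)
                .put("index.number_of_replicas", 1)
                .put("index.replication.type", "SEGMENT")
                .put("index.remote_store.enabled", true)                        // assumed key
                .put("index.remote_store.repository", "segment-repo")           // assumed key / repo name
                .put("index.remote_store.translog.enabled", true)               // assumed key
                .put("index.remote_store.translog.repository", "translog-repo") // assumed key / repo name
        );
        client.indices().create(request, RequestOptions.DEFAULT);
    }
}
```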

Expected behavior
Replica shards should stay assigned and replicate data through the remote store.

ashking94 added the bug and untriaged labels on Jun 28, 2023
ashking94 (Member, Author)

This issue is encountered just after the remote index is created. The current suspicion is that it is caused by a forced segment sync during peer recovery before any segments have been uploaded to the remote store.
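
If that suspicion holds, the replication source would need to handle the case where no metadata file exists in the remote store yet before dereferencing it. Below is a minimal sketch of such a guard; the class and method names come from the stack trace above, while the error message and the choice to fail explicitly (rather than, say, return an empty checkpoint or retry) are assumptions and not necessarily the actual fix in #8433.

```java
import java.io.IOException;

import org.opensearch.index.store.RemoteSegmentStoreDirectory;
import org.opensearch.index.store.remote.metadata.RemoteSegmentMetadata;
import org.opensearch.indices.replication.common.ReplicationFailedException;

// Sketch of a null guard that RemoteStoreReplicationSource#getCheckpointMetadata
// could apply before calling RemoteSegmentMetadata#getMetadata().
final class RemoteMetadataGuard {
    static RemoteSegmentMetadata readLatestOrFail(RemoteSegmentStoreDirectory directory) throws IOException {
        RemoteSegmentMetadata metadata = directory.readLatestMetadataFile();
        if (metadata == null) {
            // No metadata file has been uploaded to the remote store yet, e.g. a
            // forced segment sync during peer recovery right after index creation.
            // Surface the condition explicitly instead of letting getMetadata()
            // be invoked on a null reference (the NPE in the stack trace above).
            throw new ReplicationFailedException("remote segment store has no metadata file yet");
        }
        return metadata;
    }
}
```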

anasalkouz added the Storage:Durability label and removed the untriaged label on Jun 29, 2023