Describe the bug
While running OpenSearch Benchmark (OSB) with the so workload against an index configured with 3 shards and 1 replica, the replica shards fail after some time with the following stack trace.
[2023-06-28T19:09:59,817][WARN ][o.o.c.r.a.AllocationService] [node-1] failing shard [failed shard, shard [so][1], node[ccufnmQhTLi6zBHeNrKvaA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=_Xgzcz-KQxyOUeYRJH4qTg], unassigned_info[[reason=ALLOCATION_FAILED], at[2023-06-28T19:09:59.716Z], failed_attempts[4], failed_nodes[[ccufnmQhTLi6zBHeNrKvaA, BU-8J5poQRavqhIY6qB39A]], delayed=false, details[failed shard on node [BU-8J5poQRavqhIY6qB39A]: failed recovery, failure RecoveryFailedException[[so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-1}{BU-8J5poQRavqhIY6qB39A}{FiIcV_RQSXCDfhA88sNk8A}{172.32.32.222}{172.32.32.222:9300}{dim}{shard_indexing_pressure_enabled=true} ([so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-1}{BU-8J5poQRavqhIY6qB39A}{FiIcV_RQSXCDfhA88sNk8A}{172.32.32.222}{172.32.32.222:9300}{dim}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-1}{BU-8J5poQRavqhIY6qB39A}{FiIcV_RQSXCDfhA88sNk8A}{172.32.32.222}{172.32.32.222:9300}{dim}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[node-2][172.32.32.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[node-1][172.32.32.222:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: NullPointerException[Cannot invoke "org.opensearch.index.store.remote.metadata.RemoteSegmentMetadata.getMetadata()" because the return value of "org.opensearch.index.store.RemoteSegmentStoreDirectory.readLatestMetadataFile()" is null]; ], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true} ([so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true})]; nested: RecoveryFailedException[[so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true}]; nested: RemoteTransportException[[node-2][172.32.32.17:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[node-3][172.32.32.113:9300][internal:index/shard/replication/segments_sync]]; nested: ReplicationFailedException[Segment Replication failed]; nested: NullPointerException[Cannot invoke "org.opensearch.index.store.remote.metadata.RemoteSegmentMetadata.getMetadata()" because the return value of "org.opensearch.index.store.RemoteSegmentStoreDirectory.readLatestMetadataFile()" is null]; 
], markAsStale [true]]
org.opensearch.indices.recovery.RecoveryFailedException: [so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true} ([so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true})
at org.opensearch.indices.recovery.RecoveryTarget.notifyListener(RecoveryTarget.java:134) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.replication.common.ReplicationTarget.fail(ReplicationTarget.java:176) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.replication.common.ReplicationCollection.fail(ReplicationCollection.java:192) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.onException(PeerRecoveryTargetService.java:738) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.recovery.PeerRecoveryTargetService$RecoveryResponseHandler.handleException(PeerRecoveryTargetService.java:668) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1476) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
at java.lang.Thread.run(Thread.java:1623) [?:?]
Caused by: org.opensearch.indices.recovery.RecoveryFailedException: [so][1]: Recovery failed from {node-2}{O5KaS_uHRpe5wAmi6mroAQ}{oW0vTwL1R9-Jwt28vd3-CQ}{172.32.32.17}{172.32.32.17:9300}{dim}{shard_indexing_pressure_enabled=true} into {node-3}{ccufnmQhTLi6zBHeNrKvaA}{WYHwIkghSMmPQGwRaoqGXA}{172.32.32.113}{172.32.32.113:9300}{dim}{shard_indexing_pressure_enabled=true}
... 8 more
Caused by: org.opensearch.transport.RemoteTransportException: [node-2][172.32.32.17:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.opensearch.transport.RemoteTransportException: [node-3][172.32.32.113:9300][internal:index/shard/replication/segments_sync]
Caused by: org.opensearch.indices.replication.common.ReplicationFailedException: Segment Replication failed
at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:458) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.common.util.concurrent.ListenableFuture.addListener(ListenableFuture.java:82) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.action.StepListener.whenComplete(StepListener.java:93) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:165) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:435) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.run(SegmentReplicationTargetService.java:423) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
... 4 more
Caused by: java.lang.NullPointerException: Cannot invoke "org.opensearch.index.store.remote.metadata.RemoteSegmentMetadata.getMetadata()" because the return value of "org.opensearch.index.store.RemoteSegmentStoreDirectory.readLatestMetadataFile()" is null
at org.opensearch.indices.replication.RemoteStoreReplicationSource.getCheckpointMetadata(RemoteStoreReplicationSource.java:57) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.replication.SegmentReplicationTarget.startReplication(SegmentReplicationTarget.java:163) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.replication.SegmentReplicationTargetService.start(SegmentReplicationTargetService.java:435) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
at org.opensearch.indices.replication.SegmentReplicationTargetService$ReplicationRunner.run(SegmentReplicationTargetService.java:423) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
... 4 more
To Reproduce
Steps to reproduce the behavior:
Run OSB with the so workload against a 3-shard, 1-replica index setup (a possible invocation is sketched below).
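For reference, one way to drive this is sketched below. The target host, pipeline, and workload-params spellings here are assumptions and may vary by OSB version and cluster setup; the so workload accepts shard/replica counts via workload params in recent OSB releases:

```sh
# Assumed OSB invocation (adjust host and flags to your environment).
# Runs the StackOverflow ("so") workload against a remote-store-enabled
# cluster, creating the index with 3 primary shards and 1 replica.
opensearch-benchmark execute-test \
  --pipeline=benchmark-only \
  --workload=so \
  --target-hosts=127.0.0.1:9200 \
  --workload-params="number_of_shards:3,number_of_replicas:1"
```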
Expected behavior
Replica shards should stay assigned and replicate data through the remote store.
This issue is encountered just after the remote-backed index is created. The current suspicion is that a forced segment sync is triggered during peer recovery before any segments have been uploaded to the remote store, so RemoteSegmentStoreDirectory.readLatestMetadataFile() finds no metadata file and returns null, which the replication source then dereferences.
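To make the suspected sequence concrete, here is a minimal, self-contained Java sketch. The types and method bodies are hypothetical stand-ins, not OpenSearch's actual classes and not a proposed fix; it only illustrates how a missing remote metadata file turns into the NullPointerException seen above, and what a defensive guard at that call site would look like:

```java
import java.util.Collections;
import java.util.Map;

// Stand-in for RemoteSegmentMetadata (hypothetical, simplified).
class SketchRemoteSegmentMetadata {
    private final Map<String, String> metadata;

    SketchRemoteSegmentMetadata(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    Map<String, String> getMetadata() {
        return metadata;
    }
}

public class RemoteStoreMetadataSketch {

    // Stand-in for RemoteSegmentStoreDirectory.readLatestMetadataFile():
    // on a freshly created remote-backed index nothing has been uploaded
    // yet, so there is no metadata file and the lookup yields null.
    static SketchRemoteSegmentMetadata readLatestMetadataFile() {
        return null;
    }

    // Mirrors the path in the stack trace: an unconditional dereference
    // throws NullPointerException when the metadata file is missing.
    static Map<String, String> getCheckpointMetadataUnguarded() {
        return readLatestMetadataFile().getMetadata();
    }

    // Defensive variant: treat "no metadata file uploaded yet" as an
    // empty checkpoint instead of failing the replica shard.
    static Map<String, String> getCheckpointMetadataGuarded() {
        SketchRemoteSegmentMetadata md = readLatestMetadataFile();
        return md == null ? Collections.emptyMap() : md.getMetadata();
    }

    public static void main(String[] args) {
        System.out.println(getCheckpointMetadataGuarded()); // prints {}
    }
}
```

Whether the right fix is a null guard like this, retrying the sync, or deferring the forced segment sync until after the first upload completes is still open pending confirmation of the root cause.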