[Segment Replication BUG] Replica shard fails during segment replication during indexing / bulk indexing calls #4129

Closed
ashking94 opened this issue Aug 4, 2022 · 4 comments
Labels: bug, distributed framework

@ashking94

Describe the bug
When indexing (single or bulk), the index turns yellow once segment replication kicks in. The replica shard fails, recovery then kicks in, and the index turns green again. The below error shows up in the logs:

opensearch-node2    | "stacktrace": ["org.opensearch.OpenSearchException: Segment Replication failed",
opensearch-node2    | "at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:235) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1370) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
opensearch-node2    | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
opensearch-node2    | "at java.lang.Thread.run(Thread.java:833) [?:?]",
opensearch-node2    | "Caused by: org.opensearch.transport.RemoteTransportException: [opensearch-node1][172.25.0.3:9300][internal:index/shard/replication/get_checkpoint_info]",
opensearch-node2    | "Caused by: org.opensearch.OpenSearchException: Shard copy [test2][0] on node {opensearch-node2}{fYz26kZhTYSDXagrx9FLUw}{p7U2QewKQcq_njca3Ib4uA}{172.25.0.2}{172.25.0.2:9300}{dimr}{shard_indexing_pressure_enabled=true} already replicating",
opensearch-node2    | "at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:159) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:103) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:86) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]",
opensearch-node2    | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]",
opensearch-node2    | "at java.lang.Thread.run(Thread.java:833) ~[?:?]"] }
opensearch-node2    | {"type": "server", "timestamp": "2022-08-04T16:53:57,726Z", "level": "WARN", "component": "o.o.i.e.Engine", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node2", "message": " [test2][0] failed engine [replication failure]", "cluster.uuid": "cefVHzE2TEi5DkXNFyxGXA", "node.id": "fYz26kZhTYSDXagrx9FLUw" , 
opensearch-node2    | "stacktrace": [... same "Segment Replication failed" stack trace as above, caused by the same "already replicating" exception ...] }
opensearch-node2    | {"type": "server", "timestamp": "2022-08-04T16:53:57,731Z", "level": "WARN", "component": "o.o.i.c.IndicesClusterStateService", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node2", "message": "[test2][0] marking and sending shard failed due to [shard failure, reason [replication failure]]", "cluster.uuid": "cefVHzE2TEi5DkXNFyxGXA", "node.id": "fYz26kZhTYSDXagrx9FLUw" , 
opensearch-node2    | "stacktrace": [... same "Segment Replication failed" stack trace as above, caused by the same "already replicating" exception ...] }

To Reproduce
Steps to reproduce the behavior:

  1. Create an index with the replication type set to SEGMENT:
    curl -X PUT "localhost:9200/test2?pretty" -H 'Content-Type: application/json' -d'
    { "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 1, "replication": { "type": "SEGMENT" } } } }'
  2. Run the below curl multiple times (I ran it 10 times):
curl --location --request POST "localhost:9201/test2/_doc" \
--header 'Content-Type: application/json' \
--data-raw "{
  \"name\":\"abc\"
}"
  3. Run the below bulk indexing curl:
curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test2", "_id" : "1" } }
{ "field10" : "value1" }
{ "delete" : { "_index" : "test2", "_id" : "2" } }
{ "create" : { "_index" : "test2", "_id" : "3" } }
{ "field10" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test2"} }
{ "doc" : {"field20" : "value2"} }
'

  4. This seems to happen when a cluster state update is published while a replication call is in flight, so the steps above might need to be rerun a few times to reproduce the error.
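To observe the failure while the steps above run, cluster health and the shard state of the index can be polled:

curl -s "localhost:9200/_cluster/health?pretty"
curl -s "localhost:9200/_cat/shards/test2?v"

The index briefly goes yellow when the replica shard fails and returns to green once recovery completes.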

Expected behavior
The replica shard should not fail; it should simply catch up to the newer checkpoint once the in-progress replication completes, and the index should stay green.


ashking94 added the "bug" and "untriaged" labels on Aug 4, 2022
ashking94 changed the title from "[BUG] Segment replication fails the replica during indexing / bulk indexing calls" to "[BUG] Replica shard fails during segment replication during indexing / bulk indexing calls" on Aug 4, 2022

dreamer-89 commented Aug 4, 2022

Looks like this exception is thrown when a replica shard already has an ongoing replication, and it is expected based on the test. Maybe this needs graceful handling of the exception on the replica side to avoid the shard failure.

@mch2 : WDYT ?

mch2 changed the title from "[BUG] Replica shard fails during segment replication during indexing / bulk indexing calls" to "[Segment Replication BUG] Replica shard fails during segment replication during indexing / bulk indexing calls" on Aug 4, 2022

mch2 commented Aug 4, 2022

@dreamer-89 Yeah, this should not fail the replica; it would catch up to the new checkpoint after the current replication event completes.

I think this is happening because we are mapping DiscoveryNode -> SegmentReplicationSourceHandler here:

        if (nodesToHandlers.putIfAbsent(
            request.getTargetNode(),
            createTargetHandler(request.getTargetNode(), copyState, fileChunkWriter)
        ) != null) {
            throw new OpenSearchException(
                "Shard copy {} on node {} already replicating",
                request.getCheckpoint().getShardId(),
                request.getTargetNode()
            );
        }

This needs to be mapped by allocation ID, not by node. Edit: I just realized your repro instructions run with 1 shard, but this would be a problem with multiple replicas on a single node.
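A minimal sketch of that change, keying the handler map by the replica's allocation ID instead of the DiscoveryNode. Here allocationsToHandlers and getTargetAllocationId() are illustrative names, not necessarily what #4182 actually uses:

        // Sketch only: key ongoing replications by the replica's allocation ID so that
        // multiple replicas on the same node (or a stale entry for this replica) don't collide.
        final String allocationId = request.getTargetAllocationId();
        if (allocationsToHandlers.putIfAbsent(
            allocationId,
            createTargetHandler(request.getTargetNode(), copyState, fileChunkWriter)
        ) != null) {
            throw new OpenSearchException(
                "Shard copy {} on allocation {} already replicating",
                request.getCheckpoint().getShardId(),
                allocationId
            );
        }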

I think there is also some buggy logic in how we handle replication in the source service. The source service keeps track of which replicas are actively copying by creating a SegmentReplicationSourceHandler that holds a reference to an incRef'd CopyState. That CopyState ensures the files relating to the copy event are not wiped before the object is closed.

However, clearing that CopyState requires a subsequent GET_SEGMENT_FILES call to actually send the files, so if that call never comes the handler will remain registered indefinitely. We could add a TTL on that handler that recognizes a copy hasn't started after a certain amount of time and decRefs in that case so the CopyState is released.
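A rough sketch of that TTL idea; isCopyStarted(), removeHandler() and copyTimeout are illustrative names for the sketch, not existing APIs:

        // Sketch only: if GET_SEGMENT_FILES never arrives within the timeout, drop the
        // handler and release its CopyState so the referenced segment files can be cleaned up.
        threadPool.schedule(() -> {
            if (handler.isCopyStarted() == false) {
                removeHandler(allocationId);
                copyState.decRef();
            }
        }, copyTimeout, ThreadPool.Names.GENERIC);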

I think the source service should also be able to handle two calls to GET_CHECKPOINT_INFO without failing the replication event if the copy hasn't actually started.
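One possible sketch of that: if the same allocation asks for checkpoint info again before its copy has started, reuse the registered handler's CopyState instead of throwing (getCopyState() is an illustrative name):

        // Sketch only: make a repeated GET_CHECKPOINT_INFO idempotent while the copy is idle.
        SegmentReplicationSourceHandler existing = allocationsToHandlers.get(allocationId);
        if (existing != null && existing.isCopyStarted() == false) {
            return existing.getCopyState();
        }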


mch2 commented Aug 10, 2022

I have opened #4182 to cover keying this by allocation ID instead of node. I have not been able to repro after applying this change, but I think we should leave this open to explore more enhancements.


mch2 commented Aug 15, 2022

Closing this one because I haven't seen it since; please reopen if needed.

mch2 closed this as completed on Aug 15, 2022