[Segment Replication BUG] Replica shard fails during segment replication during indexing / bulk indexing calls #4129

Closed
ashking94 opened this issue Aug 4, 2022 · 4 comments
Labels: bug, distributed framework

@ashking94

Describe the bug
When indexing (single or bulk), the index turns yellow once segment replication kicks in. The replica shard fails, recovery then kicks in, and the index turns green again. The below error shows up in the logs:

opensearch-node2    | "stacktrace": ["org.opensearch.OpenSearchException: Segment Replication failed",
opensearch-node2    | "at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:235) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1370) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]",
opensearch-node2    | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]",
opensearch-node2    | "at java.lang.Thread.run(Thread.java:833) [?:?]",
opensearch-node2    | "Caused by: org.opensearch.transport.RemoteTransportException: [opensearch-node1][172.25.0.3:9300][internal:index/shard/replication/get_checkpoint_info]",
opensearch-node2    | "Caused by: org.opensearch.OpenSearchException: Shard copy [test2][0] on node {opensearch-node2}{fYz26kZhTYSDXagrx9FLUw}{p7U2QewKQcq_njca3Ib4uA}{172.25.0.2}{172.25.0.2:9300}{dimr}{shard_indexing_pressure_enabled=true} already replicating",
opensearch-node2    | "at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:159) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:103) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:86) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]",
opensearch-node2    | "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]",
opensearch-node2    | "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]",
opensearch-node2    | "at java.lang.Thread.run(Thread.java:833) ~[?:?]"] }
opensearch-node2    | {"type": "server", "timestamp": "2022-08-04T16:53:57,726Z", "level": "WARN", "component": "o.o.i.e.Engine", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node2", "message": " [test2][0] failed engine [replication failure]", "cluster.uuid": "cefVHzE2TEi5DkXNFyxGXA", "node.id": "fYz26kZhTYSDXagrx9FLUw" , 
opensearch-node2    | "stacktrace": [... same "Segment Replication failed" stack trace as above, caused by the same "already replicating" exception ...] }
opensearch-node2    | {"type": "server", "timestamp": "2022-08-04T16:53:57,731Z", "level": "WARN", "component": "o.o.i.c.IndicesClusterStateService", "cluster.name": "opensearch-cluster", "node.name": "opensearch-node2", "message": "[test2][0] marking and sending shard failed due to [shard failure, reason [replication failure]]", "cluster.uuid": "cefVHzE2TEi5DkXNFyxGXA", "node.id": "fYz26kZhTYSDXagrx9FLUw" , 
opensearch-node2    | "stacktrace": [... same "Segment Replication failed" stack trace as above, caused by the same "already replicating" exception ...] }

To Reproduce
Steps to reproduce the behavior:

  1. Create an index with the replication type set to SEGMENT:
    curl -X PUT "localhost:9200/test2?pretty" -H 'Content-Type: application/json' -d'
    { "settings": { "index": { "number_of_shards": 1, "number_of_replicas": 1, "replication": { "type": "SEGMENT" } } } }'
  2. Run the below curl multiple times (I ran it 10 times):
curl --location --request POST "localhost:9201/test2/_doc" \
--header 'Content-Type: application/json' \
--data-raw "{
  \"name\":\"abc\"
}"
  3. Run the below bulk indexing curl:
curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test2", "_id" : "1" } }
{ "field10" : "value1" }
{ "delete" : { "_index" : "test2", "_id" : "2" } }
{ "create" : { "_index" : "test2", "_id" : "3" } }
{ "field10" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test2"} }
{ "doc" : {"field20" : "value2"} }
'

  4. This seems to happen when a cluster state update is published while a replication call is in flight, so the steps above might need to be rerun a few times to reproduce the error.
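To observe the failure while the steps above run, cluster health and the shard state of the index can be polled:

curl -s "localhost:9200/_cluster/health?pretty"
curl -s "localhost:9200/_cat/shards/test2?v"

The index briefly goes yellow when the replica shard fails and returns to green once recovery completes.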

Expected behavior
The replica shard should not fail; it should simply catch up to the newer checkpoint once the in-progress replication completes, and the index should stay green.


ashking94 added the "bug" and "untriaged" labels on Aug 4, 2022
ashking94 changed the title from "[BUG] Segment replication fails the replica during indexing / bulk indexing calls" to "[BUG] Replica shard fails during segment replication during indexing / bulk indexing calls" on Aug 4, 2022

dreamer-89 commented Aug 4, 2022

Looks like this exception is thrown when a replica shard already has an ongoing replication, and it is expected based on the test. Maybe this needs graceful handling of the exception on the replica side to avoid the shard failure.

@mch2 : WDYT ?

mch2 changed the title from "[BUG] Replica shard fails during segment replication during indexing / bulk indexing calls" to "[Segment Replication BUG] Replica shard fails during segment replication during indexing / bulk indexing calls" on Aug 4, 2022

mch2 commented Aug 4, 2022

@dreamer-89 Yeah, this should not fail the replica; it would catch up to the new checkpoint after the current replication event completes.

I think this is happening because we are mapping DiscoveryNode -> SegmentReplicationSourceHandler here:

        if (nodesToHandlers.putIfAbsent(
            request.getTargetNode(),
            createTargetHandler(request.getTargetNode(), copyState, fileChunkWriter)
        ) != null) {
            throw new OpenSearchException(
                "Shard copy {} on node {} already replicating",
                request.getCheckpoint().getShardId(),
                request.getTargetNode()
            );
        }

This needs to be mapped by allocation ID, not by node. Edit: I just realized your repro instructions run with 1 shard, but this would be a problem with multiple replicas on a single node.
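A minimal sketch of that change, keying the handler map by the replica's allocation ID instead of the DiscoveryNode. Here allocationsToHandlers and getTargetAllocationId() are illustrative names, not necessarily what #4182 actually uses:

        // Sketch only: key ongoing replications by the replica's allocation ID so that
        // multiple replicas on the same node (or a stale entry for this replica) don't collide.
        final String allocationId = request.getTargetAllocationId();
        if (allocationsToHandlers.putIfAbsent(
            allocationId,
            createTargetHandler(request.getTargetNode(), copyState, fileChunkWriter)
        ) != null) {
            throw new OpenSearchException(
                "Shard copy {} on allocation {} already replicating",
                request.getCheckpoint().getShardId(),
                allocationId
            );
        }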

I think there is also some buggy logic in how we handle replication in the source service. The source service keeps track of which replicas are actively copying by creating a SegmentReplicationSourceHandler that holds a reference to an incRef'd CopyState. That CopyState ensures the files relating to the copy event are not wiped before the object is closed.

However, clearing that CopyState requires a subsequent GET_SEGMENT_FILES call to actually send the files, so if that call never comes the handler will remain registered indefinitely. We could add a TTL on that handler that recognizes a copy hasn't started after a certain amount of time and decRefs in that case so the CopyState is released.
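A rough sketch of that TTL idea; isCopyStarted(), removeHandler() and copyTimeout are illustrative names for the sketch, not existing APIs:

        // Sketch only: if GET_SEGMENT_FILES never arrives within the timeout, drop the
        // handler and release its CopyState so the referenced segment files can be cleaned up.
        threadPool.schedule(() -> {
            if (handler.isCopyStarted() == false) {
                removeHandler(allocationId);
                copyState.decRef();
            }
        }, copyTimeout, ThreadPool.Names.GENERIC);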

I think the source service should also be able to handle two calls to GET_CHECKPOINT_INFO without failing the replication event if the copy hasn't actually started.
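One possible sketch of that: if the same allocation asks for checkpoint info again before its copy has started, reuse the registered handler's CopyState instead of throwing (getCopyState() is an illustrative name):

        // Sketch only: make a repeated GET_CHECKPOINT_INFO idempotent while the copy is idle.
        SegmentReplicationSourceHandler existing = allocationsToHandlers.get(allocationId);
        if (existing != null && existing.isCopyStarted() == false) {
            return existing.getCopyState();
        }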


mch2 commented Aug 10, 2022

I have opened #4182 to cover keying this by allocation ID instead of node. I have not been able to repro after applying this change, but I think we should leave this open to explore more enhancements.


mch2 commented Aug 15, 2022

Closing this one because I haven't seen it since; please reopen if needed.

mch2 closed this as completed on Aug 15, 2022