Closed
Labels: :Distributed Indexing/Engine, >bug, Team:Distributed (Obsolete), v7.5.0
Description
Today (7.5.0), if we try to index a document into a shard that already contains 2147483519 documents, it is rejected by Lucene and a no-op is written to the translog to record this. However, since #30226 we also try to record no-ops themselves as documents in Lucene; the index is already full, so we fail to add the tombstone too. The failure to add this tombstone is fatal to the shard:
[2020-01-17T09:20:02,405][WARN ][o.e.i.c.IndicesClusterStateService] [node-0] [i][0] marking and sending shard failed due to [shard failure, reason [no-op origin[PRIMARY] seq#[2147483519] failed at document level]]
java.lang.IllegalArgumentException: number of documents in the index cannot exceed 2147483519
at org.apache.lucene.index.DocumentsWriterPerThread.reserveOneDoc(DocumentsWriterPerThread.java:225) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
at org.elasticsearch.index.engine.InternalEngine.innerNoOp(InternalEngine.java:1533) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:937) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:796) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:768) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:725) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:258) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:161) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:193) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:118) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:79) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:917) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:108) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:394) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:316) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2752) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:113) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:285) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:237) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2726) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:858) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:312) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:275) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) ~[?:?]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:752) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.5.0.jar:7.5.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0.jar:7.5.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
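The 2147483519 in the trace is Lucene's hard per-index document limit, IndexWriter.MAX_DOCS (Integer.MAX_VALUE - 128). The snippet below is a minimal sketch, not the InternalEngine code: it just shows where that figure comes from and why the no-op tombstone, being an ordinary Lucene document, trips over the same limit. The `_seq_no` field layout is an assumption for illustration only.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class DocLimitSketch {
    public static void main(String[] args) throws Exception {
        // Lucene's hard per-index limit, the number quoted in the exception above:
        // IndexWriter.MAX_DOCS == Integer.MAX_VALUE - 128 == 2147483519
        System.out.println("Lucene hard limit: " + IndexWriter.MAX_DOCS);

        try (ByteBuffersDirectory dir = new ByteBuffersDirectory();
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
            // The no-op tombstone is just another Lucene document, so on a shard that
            // already holds MAX_DOCS documents this addDocument call throws the same
            // IllegalArgumentException as the original indexing request, which is
            // what makes the tombstone write fatal to the shard.
            Document tombstone = new Document();
            tombstone.add(new NumericDocValuesField("_seq_no", 2147483519L)); // illustrative field only
            writer.addDocument(tombstone); // succeeds here only because this index is empty
        }
    }
}
```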
However, we immediately restart the shard:
[2020-01-17T09:20:02,408][INFO ][o.e.c.r.a.AllocationService] [node-0] Cluster health status changed from [GREEN] to [RED] (reason: [shards failed [[i][0]]]).
[2020-01-17T09:20:02,963][INFO ][o.e.c.r.a.AllocationService] [node-0] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[i][0]]]).
I think this is ok: the operation fails before it makes it to the translog, so there's nothing to replay. Still, it would be good to confirm that there's no risk we do something bad here (e.g. leak a seqno).
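For context, here is a toy illustration of what "leaking a seqno" would mean; this is not Elasticsearch's actual LocalCheckpointTracker, just a sketch of the concern: if a sequence number is issued but never marked completed, the local checkpoint can never advance past it.

```java
import java.util.BitSet;

class ToyCheckpointTracker {
    private long nextSeqNo = 0;
    private long checkpoint = -1;          // highest seqno below which everything completed
    private final BitSet completed = new BitSet();

    long generateSeqNo() {
        return nextSeqNo++;
    }

    void markCompleted(long seqNo) {
        completed.set((int) seqNo);
        // advance the checkpoint over every contiguous completed seqno
        while (completed.get((int) (checkpoint + 1))) {
            checkpoint++;
        }
    }

    long getCheckpoint() {
        return checkpoint;
    }

    public static void main(String[] args) {
        ToyCheckpointTracker tracker = new ToyCheckpointTracker();
        long s0 = tracker.generateSeqNo(); // 0
        long s1 = tracker.generateSeqNo(); // 1
        tracker.markCompleted(s1);
        // s0 was issued but never completed (e.g. the operation failed and was dropped),
        // so the checkpoint is stuck at -1 even though s1 finished: a "leaked" seqno.
        System.out.println(tracker.getCheckpoint()); // -1
        tracker.markCompleted(s0);
        System.out.println(tracker.getCheckpoint()); // 1
    }
}
```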