[CI] AssertionError: pre-6.0 shard copy ___ unexpected to send valid local checkpoint -1 #39914

droberts195 · 2019-03-11T12:23:04Z

Several tests failed in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+multijob-unix-compatibility/os=oraclelinux-7/66/console but the root cause is an assertion failure in one of the nodes in the test cluster:

[2019-03-11T11:03:27,277][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [upgraded-node-0] fatal error in thread [elasticsearch[upgraded-node-0][management][T#3]], exiting
java.lang.AssertionError: pre-6.0 shard copy xGd_6iBuSxurWc0jpsHQvg unexpected to send valid local checkpoint -1
    at org.elasticsearch.index.seqno.ReplicationTracker.updateLocalCheckpoint(ReplicationTracker.java:902) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.seqno.ReplicationTracker.updateLocalCheckpoint(ReplicationTracker.java:935) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShard.updateLocalCheckpointForShard(IndexShard.java:1919) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.updateLocalCheckpointForShard(TransportReplicationAction.java:1058) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:171) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:166) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.seqno.RetentionLeaseBackgroundSyncAction.sendReplicaRequest(RetentionLeaseBackgroundSyncAction.java:144) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1175) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:166) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:153) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:127) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:424) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:370) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:273) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:240) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2540) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:987) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:369) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:324) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:311) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) ~[?:?]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:299) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:374) ~[?:?]
    at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:185) ~[?:?]
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:150) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:245) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:305) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:316) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:243) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:195) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:138) ~[?:?]
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:133) ~[?:?]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:306) ~[?:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:686) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

The full set of rolling upgrade cluster logs for the suite that failed is cluster_logs.zip

(Note: there have been several failures of BWC tests for 5.6 upgrading to 6.7 today. They may or may not all be due to this problem. I will update the issue once I've looked, but I'm raising this issue early in case the problem is a blocker for 6.7.)

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-03-11T12:23:06Z

Pinging @elastic/es-distributed

droberts195 · 2019-03-11T12:41:10Z

Same thing in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+multijob-unix-compatibility/os=centos-7&&immutable/66/console:

[2019-03-11T11:35:35,747][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [upgraded-node-0] fatal error in thread [elasticsearch[upgraded-node-0][management][T#2]], exiting
java.lang.AssertionError: pre-6.0 shard copy VBI92HxeRnq4B6UUgBeS1A unexpected to send valid local checkpoint -1

is in x-pack/qa/rolling-upgrade/with-system-key/build/cluster/v5.6.16#oneThirdUpgradedTestCluster node0/elasticsearch-6.7.0-SNAPSHOT/logs/rolling-upgrade.log in the logs that can be downloaded from https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+multijob-unix-compatibility/os=centos-7&&immutable/66/gcsObjects/. The line numbers in the full stack trace are exactly the same as in the original issue description so I won't paste it here.

droberts195 · 2019-03-11T12:57:28Z

Same thing in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+intake/333/console

That was built off commit 63587f7 so the problem pre-dates the most recent 6.7 commits. I don't recall seeing it last Friday though, so it may have been introduced last Friday or else was masked by other failures that day.

dnhatn · 2019-03-11T12:59:01Z

I am on it now. This failure relates to #39482.

droberts195 · 2019-03-11T13:05:38Z

Thanks @dnhatn. It makes sense, as the earliest failure I can find is in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+intake/329/console, which unfortunately happened just after I finished test triage on Friday. And that was built from a24a1f5.

Do you think this is a blocker for 6.7.0?

dnhatn · 2019-03-11T13:11:15Z

@droberts195 Yes, this should be a blocker since it can prevent rolling upgrade from 5.6.x to 6.7.

dnhatn · 2019-03-11T15:23:49Z

I opened #39922.

If a primary on 6.7 and a replica on 5.6 are running more than 5 minutes (retention leases background sync interval), the retention leases background sync will be triggered, and it will trip 6.7 node due to the illegal checkpoint value. We can fix the problem by making the returned checkpoint depends on the node version. This PR, however, chooses to enforce retention leases require soft deletes, and make retention leases sync noop if soft deletes is disabled instead. Closes #39914

droberts195 added >test-failure Triaged test failures from CI :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Mar 11, 2019

dnhatn self-assigned this Mar 11, 2019

dnhatn added blocker v6.7.0 labels Mar 11, 2019

dnhatn mentioned this issue Mar 11, 2019

Enforce retention leases require soft deletes #39922

Merged

dnhatn closed this as completed in #39922 Mar 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] AssertionError: pre-6.0 shard copy ___ unexpected to send valid local checkpoint -1 #39914

[CI] AssertionError: pre-6.0 shard copy ___ unexpected to send valid local checkpoint -1 #39914

droberts195 commented Mar 11, 2019

elasticmachine commented Mar 11, 2019

droberts195 commented Mar 11, 2019

droberts195 commented Mar 11, 2019

dnhatn commented Mar 11, 2019

droberts195 commented Mar 11, 2019

dnhatn commented Mar 11, 2019

dnhatn commented Mar 11, 2019

[CI] AssertionError: pre-6.0 shard copy ___ unexpected to send valid local checkpoint -1 #39914

[CI] AssertionError: pre-6.0 shard copy ___ unexpected to send valid local checkpoint -1 #39914

Comments

droberts195 commented Mar 11, 2019

elasticmachine commented Mar 11, 2019

droberts195 commented Mar 11, 2019

droberts195 commented Mar 11, 2019

dnhatn commented Mar 11, 2019

droberts195 commented Mar 11, 2019

dnhatn commented Mar 11, 2019

dnhatn commented Mar 11, 2019