Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] AssertionError: pre-6.0 shard copy ___ unexpected to send valid local checkpoint -1 #39914

Closed
droberts195 opened this issue Mar 11, 2019 · 7 comments · Fixed by #39922
Closed
Assignees
Labels
blocker :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test-failure Triaged test failures from CI v6.7.0

Comments

@droberts195
Copy link
Contributor

Several tests failed in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+multijob-unix-compatibility/os=oraclelinux-7/66/console but the root cause is an assertion failure in one of the nodes in the test cluster:

[2019-03-11T11:03:27,277][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [upgraded-node-0] fatal error in thread [elasticsearch[upgraded-node-0][management][T#3]], exiting
java.lang.AssertionError: pre-6.0 shard copy xGd_6iBuSxurWc0jpsHQvg unexpected to send valid local checkpoint -1
    at org.elasticsearch.index.seqno.ReplicationTracker.updateLocalCheckpoint(ReplicationTracker.java:902) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.seqno.ReplicationTracker.updateLocalCheckpoint(ReplicationTracker.java:935) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShard.updateLocalCheckpointForShard(IndexShard.java:1919) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.updateLocalCheckpointForShard(TransportReplicationAction.java:1058) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:171) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:166) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.seqno.RetentionLeaseBackgroundSyncAction.sendReplicaRequest(RetentionLeaseBackgroundSyncAction.java:144) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicasProxy.performOn(TransportReplicationAction.java:1175) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplica(ReplicationOperation.java:166) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation.performOnReplicas(ReplicationOperation.java:153) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:127) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:424) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:370) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:273) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:240) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2540) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:987) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:369) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:324) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:311) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) ~[?:?]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:299) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:374) ~[?:?]
    at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:185) ~[?:?]
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:150) ~[?:?]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:245) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:305) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:316) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:243) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:195) ~[?:?]
    at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:138) ~[?:?]
    at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:133) ~[?:?]
    at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:306) ~[?:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:686) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.7.0-SNAPSHOT.jar:6.7.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

The full set of rolling upgrade cluster logs for the suite that failed is cluster_logs.zip

(Note: there have been several failures of BWC tests for 5.6 upgrading to 6.7 today. They may or may not all be due to this problem. I will update the issue once I've looked, but I'm raising this issue early in case the problem is a blocker for 6.7.)

@droberts195 droberts195 added >test-failure Triaged test failures from CI :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Mar 11, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@droberts195
Copy link
Contributor Author

Same thing in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+multijob-unix-compatibility/os=centos-7&&immutable/66/console:

[2019-03-11T11:35:35,747][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [upgraded-node-0] fatal error in thread [elasticsearch[upgraded-node-0][management][T#2]], exiting
java.lang.AssertionError: pre-6.0 shard copy VBI92HxeRnq4B6UUgBeS1A unexpected to send valid local checkpoint -1

is in x-pack/qa/rolling-upgrade/with-system-key/build/cluster/v5.6.16#oneThirdUpgradedTestCluster node0/elasticsearch-6.7.0-SNAPSHOT/logs/rolling-upgrade.log in the logs that can be downloaded from https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+multijob-unix-compatibility/os=centos-7&&immutable/66/gcsObjects/. The line numbers in the full stack trace are exactly the same as in the original issue description so I won't paste it here.

@droberts195
Copy link
Contributor Author

Same thing in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+intake/333/console

That was built off commit 63587f7 so the problem pre-dates the most recent 6.7 commits. I don't recall seeing it last Friday though, so it may have been introduced last Friday or else was masked by other failures that day.

@dnhatn dnhatn self-assigned this Mar 11, 2019
@dnhatn
Copy link
Member

dnhatn commented Mar 11, 2019

I am on it now. This failure relates to #39482.

@droberts195
Copy link
Contributor Author

Thanks @dnhatn. It makes sense, as the earliest failure I can find is in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.7+intake/329/console, which unfortunately happened just after I finished test triage on Friday. And that was built from a24a1f5.

Do you think this is a blocker for 6.7.0?

@dnhatn
Copy link
Member

dnhatn commented Mar 11, 2019

@droberts195 Yes, this should be a blocker since it can prevent rolling upgrade from 5.6.x to 6.7.

@dnhatn
Copy link
Member

dnhatn commented Mar 11, 2019

I opened #39922.

dnhatn added a commit that referenced this issue Mar 12, 2019
If a primary on 6.7 and a replica on 5.6 are running more than 5 minutes
(retention leases background sync interval), the retention leases
background sync will be triggered, and it will trip 6.7 node due to the
illegal checkpoint value. We can fix the problem by making the returned
checkpoint depends on the node version. This PR, however, chooses to
enforce retention leases require soft deletes, and make retention leases
sync noop if soft deletes is disabled instead.

Closes #39914
dnhatn added a commit that referenced this issue Mar 12, 2019
If a primary on 6.7 and a replica on 5.6 are running more than 5 minutes
(retention leases background sync interval), the retention leases
background sync will be triggered, and it will trip 6.7 node due to the
illegal checkpoint value. We can fix the problem by making the returned
checkpoint depends on the node version. This PR, however, chooses to
enforce retention leases require soft deletes, and make retention leases
sync noop if soft deletes is disabled instead.

Closes #39914
dnhatn added a commit that referenced this issue Mar 12, 2019
If a primary on 6.7 and a replica on 5.6 are running more than 5 minutes
(retention leases background sync interval), the retention leases
background sync will be triggered, and it will trip 6.7 node due to the
illegal checkpoint value. We can fix the problem by making the returned
checkpoint depends on the node version. This PR, however, chooses to
enforce retention leases require soft deletes, and make retention leases
sync noop if soft deletes is disabled instead.

Closes #39914
dnhatn added a commit that referenced this issue Mar 12, 2019
If a primary on 6.7 and a replica on 5.6 are running more than 5 minutes
(retention leases background sync interval), the retention leases
background sync will be triggered, and it will trip 6.7 node due to the
illegal checkpoint value. We can fix the problem by making the returned
checkpoint depends on the node version. This PR, however, chooses to
enforce retention leases require soft deletes, and make retention leases
sync noop if soft deletes is disabled instead.

Closes #39914
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocker :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. >test-failure Triaged test failures from CI v6.7.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants