
[ci] IndicesClusterStateServiceRandomUpdatesTests.testRandomClusterStateUpdates #32308

Closed
andyb-elastic opened this issue Jul 24, 2018 · 3 comments · Fixed by #32374
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) >test-failure Triaged test failures from CI

Comments

@andyb-elastic (Contributor) commented Jul 24, 2018

Doesn't reproduce. Has occurred 6 times in the last 90 days

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos/2576/console

build-2576.txt

REPRODUCE WITH: ./gradlew :server:test \
  -Dtests.seed=BDB828B6AFCB1CB3 \
  -Dtests.class=org.elasticsearch.indices.cluster.IndicesClusterStateServiceRandomUpdatesTests \
  -Dtests.method="testRandomClusterStateUpdates" \
  -Dtests.security.manager=true \
  -Dtests.locale=hi-IN \
  -Dtests.timezone=Canada/Mountain
FAILURE 0.16s J0 | IndicesClusterStateServiceRandomUpdatesTests.testRandomClusterStateUpdates <<< FAILURES!                                                                                  
   > Throwable #1: java.lang.AssertionError: a replica can only be promoted when active. current: [index_ytqkcgwvfrlcilx][0], node[node_002], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=96qpiMtlQ_ODQQrpk67g6g], unassigned_info[[reason=INDEX_CREATED], at[2018-07-23T23:45:43.608Z], delayed=false, allocation_status[no_attempt]] new: [index_ytqkcgwvfrlcilx][0], node[node_002], [P], recovery_source[existing recovery], s[INITIALIZING], a[id=96qpiMtlQ_ODQQrpk67g6g], unassigned_info[[reason=ALLOCATION_FAILED], at[2018-07-23T23:45:43.639Z], failed_attempts[1], delayed=false, details[failed shard on node [node_003]: fake shard failure, failure Exception[null]], allocation_status[no_attempt]]                                                   
   >    at __randomizedtesting.SeedInfo.seed([BDB828B6AFCB1CB3:C53F2165F0F0CAFA]:0)           
   >    at org.elasticsearch.indices.cluster.AbstractIndicesClusterStateServiceTestCase$MockIndexShard.updateShardState(AbstractIndicesClusterStateServiceTestCase.java:365)                 
   >    at org.elasticsearch.indices.cluster.IndicesClusterStateService.updateShard(IndicesClusterStateService.java:582)                                                                     
   >    at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:529)                                                            
   >    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:230)                                                               
   >    at org.elasticsearch.indices.cluster.IndicesClusterStateServiceRandomUpdatesTests.testRandomClusterStateUpdates(IndicesClusterStateServiceRandomUpdatesTests.java:127)               
   >    at java.lang.Thread.run(Thread.java:748) 
andyb-elastic added the >test-failure and :Distributed Coordination/Allocation labels Jul 24, 2018
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

@dnhatn (Member) commented Jul 24, 2018

@bleskes is working on the fix.

bleskes added a commit to bleskes/elasticsearch that referenced this issue Jul 25, 2018
…it. primary with the same aId

In rare cases it is possible that a node gets an instruction to replace a replica
shard that's in POST_RECOVERY with a new initializing primary with the same allocation id.
This can happen when batched cluster states include starting the replica, closing the
index, opening it again, and allocating the primary shard to the node in question. The
node should then clean up its initializing replica and replace it with a new
initializing primary.

Closes elastic#32308
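Per the commit message above, the shape of the fix is that the node must recognize this case and recreate the shard rather than update it in place. Below is a hedged, self-contained sketch of that decision logic, with hypothetical names (applyRouting, isIllegalInPlaceUpdate); the real code lives in IndicesClusterStateService.createOrUpdateShards and is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fix's shape: a routing with the SAME allocation id
// that cannot legally be reached by an in-place update (initializing replica ->
// initializing primary) leads to tear-down and re-creation, not an update.
public class CreateOrUpdateSketch {

    record Routing(String allocationId, boolean primary, boolean initializing) {}

    private final Map<String, Routing> shardsByAllocationId = new HashMap<>();

    void applyRouting(Routing newRouting) {
        Routing current = shardsByAllocationId.get(newRouting.allocationId());
        if (current != null && isIllegalInPlaceUpdate(current, newRouting)) {
            // Clean up the initializing replica and replace it with a brand-new
            // initializing primary, rather than "promoting" it in place.
            removeShard(current);
            createShard(newRouting);
        } else if (current == null) {
            createShard(newRouting);
        } else {
            updateShard(newRouting);
        }
    }

    private static boolean isIllegalInPlaceUpdate(Routing current, Routing next) {
        // An INITIALIZING replica can never become a primary via an in-place update.
        return next.primary() && !current.primary() && current.initializing();
    }

    private void createShard(Routing r) { shardsByAllocationId.put(r.allocationId(), r); }
    private void updateShard(Routing r) { shardsByAllocationId.put(r.allocationId(), r); }
    private void removeShard(Routing r) { shardsByAllocationId.remove(r.allocationId()); }
}
```

The key point of the sketch is keying on the allocation id: sharing an id normally means "same shard copy, update in place," and the bug was treating this rare batched-state case the same way.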
bleskes added a commit that referenced this issue Jul 30, 2018
…it. primary with the same aId (#32374)

In rare cases it is possible that a node gets an instruction to replace a replica
shard that's in `POST_RECOVERY` with a new initializing primary with the same allocation id.
This can happen when batched cluster states include starting the replica, closing the
index, opening it again, and allocating the primary shard to the node in question. The
node should then clean up its initializing replica and replace it with a new
initializing primary.

I'm not sure whether the test I added really adds enough value, since existing tests already found this. The main reason I added it was to allow simpler reproduction and to double-check the fix. I'm open to discussing whether we should keep it.

Closes #32308
bleskes added three more commits referencing this issue on Jul 30 and Jul 31, 2018, each with the same message as #32374 above.