IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. #31629

jtibshirani · 2018-06-27T20:14:21Z

In a 6.3 build, IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration failed with an assertion error around the total number of flushes. This looks similar to #25773, which was already addressed, but is likely different because of the details of the assertion failure: this build failed comparing 1 to 2 flushes, while the previous failure compared 1 with a very large number like 775331. I unfortunately wasn't able to reproduce this locally.

Link to the build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+multijob-unix-compatibility/os=amazon/149/console

Command to reproduce:

./gradlew :server:integTest \
  -Dtests.seed=32E55CC3C4A38C89 \
  -Dtests.class=org.elasticsearch.index.shard.IndexShardIT \
  -Dtests.method="testStressMaybeFlushOrRollTranslogGeneration" \
  -Dtests.security.manager=true \
  -Dtests.locale=he \
  -Dtests.timezone=Pacific/Johnston

Relevant excerpt from the logs:

FAILURE 0.15s J3 | IndexShardIT.testStressMaybeFlushOrRollTranslogGeneration <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <1L>
   >      but: was <2L>
   > 	at __randomizedtesting.SeedInfo.seed([32E55CC3C4A38C89:DFE968D623B47EAF]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.index.shard.IndexShardIT.lambda$testStressMaybeFlushOrRollTranslogGeneration$6(IndexShardIT.java:448)
   > 	at org.elasticsearch.index.shard.IndexShardIT.testStressMaybeFlushOrRollTranslogGeneration(IndexShardIT.java:463)
   > 	at java.lang.Thread.run(Thread.java:748)

Full log: index_shard_log.txt

The text was updated successfully, but these errors were encountered:

elasticmachine · 2018-06-27T20:15:14Z

Pinging @elastic/es-distributed

bleskes · 2018-07-06T09:32:42Z

@dnhatn any news on this one?

Relates #31629

dnhatn · 2018-07-10T22:45:48Z

I looked at this. However, we have no clue here; I enabled debug log for the test.

hub-cap · 2018-07-18T15:53:12Z

Another test failure, hopefully with more logging this time!

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+multijob-unix-compatibility/os=amazon/192/console

DaveCTurner · 2018-08-06T07:11:33Z

Unfortunately not, the log levels were only increased in 6.4-and-later.

Relates #31629

dnhatn · 2018-09-13T15:46:04Z

Discussed this with the team, we prefer to reduce the logging level of the test then close it since this test has not failed for a month.

dnhatn · 2019-02-13T23:46:03Z

Failed again https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/8188/consoleText. I am able to reproduce it now.

droberts195 · 2019-02-14T13:10:05Z

This failed again with the same error in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+intake/221/console

12:39:54 FAILURE 10.3s J1 | IndexShardIT.testStressMaybeFlushOrRollTranslogGeneration <<< FAILURES!
12:39:54    > Throwable #1: java.lang.AssertionError: flush stats: total=[0 vs 2], periodic=[0 vs 2]
12:39:54    > Expected: <1L>
12:39:54    >      but: was <2L>
12:39:54    > 	at __randomizedtesting.SeedInfo.seed([BF7D8085A35D96E6:5271B490444A64C0]:0)
12:39:54    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
12:39:54    > 	at org.elasticsearch.index.shard.IndexShardIT.lambda$testStressMaybeFlushOrRollTranslogGeneration$6(IndexShardIT.java:467)
12:39:54    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
12:39:54    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:822)
12:39:54    > 	at org.elasticsearch.index.shard.IndexShardIT.testStressMaybeFlushOrRollTranslogGeneration(IndexShardIT.java:477)
12:39:54    > 	at java.lang.Thread.run(Thread.java:748)

The repro command was:

./gradlew :server:integTest \
  -Dtests.seed=BF7D8085A35D96E6 \
  -Dtests.class=org.elasticsearch.index.shard.IndexShardIT \
  -Dtests.security.manager=true \
  -Dtests.locale=en-US \
  -Dtests.timezone=Etc/UTC \
  -Dcompiler.java=11 \
  -Druntime.java=8

This didn't reproduce locally for me.

The predicate shouldPeriodicallyFlush is determined by the uncommitted translog size and the local checkpoint. The uncommitted translog size depends on the local checkpoint. The condition shouldPeriodicallyFlush can be true twice in in the test in the following scenario: 1. Index doc-0 and advances the local checkpoint to 0, the condition shouldPeriodicallyFlush remains false. 2. Index doc-1 and add it to translog, but the local checkpoint is not advanced yet (still 0). The condition shouldPeriodicallyFlush becomes true because the uncommitted translog size is 216bytes (2ops + gen-1 + gen-2) > 180bytes and the translog generation of the new index commit would advance from 1 to 2. > [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine ] [node_s_0] > [test][0] committing writer with commit data [{local_checkpoint=0, > max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g, > min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q, > retention_leases=primary_term:1;version:0;, translog_generation=2, > max_seq_no=1}] 1. The shouldPeriodicallyFlush becomes true again after the local checkpoint is advanced to 1 because the uncommitted translog size is 216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation of the new index commit would advance from 2 to 4. > [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine ] [node_s_0] > [test][0] committing writer with commit data [{local_checkpoint=1, > max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g, > min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q, > retention_leases=primary_term:1;version:0;, translog_generation=4, > max_seq_no=1}] We need to relax the assertion in this test to cover this situation. Closes #31629

jtibshirani added >test-failure Triaged test failures from CI :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Jun 27, 2018

jtibshirani changed the title ~~[CI] IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally hits assertion error.~~ [CI] IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. Jun 27, 2018

dnhatn self-assigned this Jun 27, 2018

jtibshirani changed the title ~~[CI] IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error.~~ IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. Jun 27, 2018

dnhatn added a commit that referenced this issue Jul 10, 2018

Increase logging level for testStressMaybeFlush

030f406

Relates #31629

dnhatn added a commit that referenced this issue Jul 10, 2018

Increase logging level for testStressMaybeFlush

25cd835

Relates #31629

dnhatn added a commit that referenced this issue Sep 13, 2018

TEST: decrease logging level in the flush test

b307113

Relates #31629

dnhatn added a commit that referenced this issue Sep 13, 2018

TEST: decrease logging level in the flush test

31d6278

Relates #31629

dnhatn added a commit that referenced this issue Sep 13, 2018

TEST: decrease logging level in the flush test

4d5c4b4

Relates #31629

dnhatn closed this as completed Sep 13, 2018

dnhatn reopened this Feb 13, 2019

dnhatn mentioned this issue Feb 14, 2019

Advance max_seq_no before add operation to Lucene #38879

Merged

dnhatn mentioned this issue Feb 14, 2019

Relax testStressMaybeFlushOrRollTranslogGeneration #38918

Merged

dnhatn closed this as completed in #38918 Feb 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. #31629

IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. #31629

jtibshirani commented Jun 27, 2018

elasticmachine commented Jun 27, 2018

bleskes commented Jul 6, 2018

dnhatn commented Jul 10, 2018

hub-cap commented Jul 18, 2018

DaveCTurner commented Aug 6, 2018

dnhatn commented Sep 13, 2018

dnhatn commented Feb 13, 2019

droberts195 commented Feb 14, 2019

IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. #31629

IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. #31629

Comments

jtibshirani commented Jun 27, 2018

elasticmachine commented Jun 27, 2018

bleskes commented Jul 6, 2018

dnhatn commented Jul 10, 2018

hub-cap commented Jul 18, 2018

DaveCTurner commented Aug 6, 2018

dnhatn commented Sep 13, 2018

dnhatn commented Feb 13, 2019

droberts195 commented Feb 14, 2019