Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. #31629

Closed
jtibshirani opened this issue Jun 27, 2018 · 8 comments · Fixed by #38918
Assignees
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI

Comments

@jtibshirani
Copy link
Contributor

In a 6.3 build, IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration failed with an assertion error around the total number of flushes. This looks similar to #25773, which was already addressed, but is likely different because of the details of the assertion failure: this build failed comparing 1 to 2 flushes, while the previous failure compared 1 with a very large number like 775331. I unfortunately wasn't able to reproduce this locally.


Link to the build: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+multijob-unix-compatibility/os=amazon/149/console

Command to reproduce:

./gradlew :server:integTest \
  -Dtests.seed=32E55CC3C4A38C89 \
  -Dtests.class=org.elasticsearch.index.shard.IndexShardIT \
  -Dtests.method="testStressMaybeFlushOrRollTranslogGeneration" \
  -Dtests.security.manager=true \
  -Dtests.locale=he \
  -Dtests.timezone=Pacific/Johnston

Relevant excerpt from the logs:

FAILURE 0.15s J3 | IndexShardIT.testStressMaybeFlushOrRollTranslogGeneration <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <1L>
   >      but: was <2L>
   > 	at __randomizedtesting.SeedInfo.seed([32E55CC3C4A38C89:DFE968D623B47EAF]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.index.shard.IndexShardIT.lambda$testStressMaybeFlushOrRollTranslogGeneration$6(IndexShardIT.java:448)
   > 	at org.elasticsearch.index.shard.IndexShardIT.testStressMaybeFlushOrRollTranslogGeneration(IndexShardIT.java:463)
   > 	at java.lang.Thread.run(Thread.java:748)

Full log: index_shard_log.txt

@jtibshirani jtibshirani added >test-failure Triaged test failures from CI :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Jun 27, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@jtibshirani jtibshirani changed the title [CI] IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally hits assertion error. [CI] IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. Jun 27, 2018
@dnhatn dnhatn self-assigned this Jun 27, 2018
@jtibshirani jtibshirani changed the title [CI] IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. IndexShardIT#testStressMaybeFlushOrRollTranslogGeneration occasionally fails with an assertion error. Jun 27, 2018
@bleskes
Copy link
Contributor

bleskes commented Jul 6, 2018

@dnhatn any news on this one?

dnhatn added a commit that referenced this issue Jul 10, 2018
dnhatn added a commit that referenced this issue Jul 10, 2018
@dnhatn
Copy link
Member

dnhatn commented Jul 10, 2018

I looked at this. However, we have no clue here; I enabled debug log for the test.

@hub-cap
Copy link
Contributor

hub-cap commented Jul 18, 2018

Another test failure, hopefully with more logging this time!

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+multijob-unix-compatibility/os=amazon/192/console

@DaveCTurner
Copy link
Contributor

Unfortunately not, the log levels were only increased in 6.4-and-later.

dnhatn added a commit that referenced this issue Sep 13, 2018
dnhatn added a commit that referenced this issue Sep 13, 2018
dnhatn added a commit that referenced this issue Sep 13, 2018
@dnhatn
Copy link
Member

dnhatn commented Sep 13, 2018

Discussed this with the team, we prefer to reduce the logging level of the test then close it since this test has not failed for a month.

@dnhatn dnhatn closed this as completed Sep 13, 2018
@dnhatn
Copy link
Member

dnhatn commented Feb 13, 2019

@droberts195
Copy link
Contributor

This failed again with the same error in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+intake/221/console

12:39:54 FAILURE 10.3s J1 | IndexShardIT.testStressMaybeFlushOrRollTranslogGeneration <<< FAILURES!
12:39:54    > Throwable #1: java.lang.AssertionError: flush stats: total=[0 vs 2], periodic=[0 vs 2]
12:39:54    > Expected: <1L>
12:39:54    >      but: was <2L>
12:39:54    > 	at __randomizedtesting.SeedInfo.seed([BF7D8085A35D96E6:5271B490444A64C0]:0)
12:39:54    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
12:39:54    > 	at org.elasticsearch.index.shard.IndexShardIT.lambda$testStressMaybeFlushOrRollTranslogGeneration$6(IndexShardIT.java:467)
12:39:54    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
12:39:54    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:822)
12:39:54    > 	at org.elasticsearch.index.shard.IndexShardIT.testStressMaybeFlushOrRollTranslogGeneration(IndexShardIT.java:477)
12:39:54    > 	at java.lang.Thread.run(Thread.java:748)

The repro command was:

./gradlew :server:integTest \
  -Dtests.seed=BF7D8085A35D96E6 \
  -Dtests.class=org.elasticsearch.index.shard.IndexShardIT \
  -Dtests.security.manager=true \
  -Dtests.locale=en-US \
  -Dtests.timezone=Etc/UTC \
  -Dcompiler.java=11 \
  -Druntime.java=8

This didn't reproduce locally for me.

dnhatn added a commit that referenced this issue Feb 15, 2019
The predicate shouldPeriodicallyFlush is determined by the uncommitted
translog size and the local checkpoint. The uncommitted translog size
depends on the local checkpoint. The condition shouldPeriodicallyFlush
can be true twice in in the test in the following scenario:

1. Index doc-0 and advances the local checkpoint to 0, the condition
shouldPeriodicallyFlush remains false.

2. Index doc-1 and add it to translog, but the local checkpoint is not
advanced yet (still 0). The condition shouldPeriodicallyFlush becomes
true because the uncommitted translog size is 216bytes (2ops + gen-1 +
gen-2) > 180bytes and the translog generation of the new index commit
would advance from 1 to 2.

> [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=0,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=2,
> max_seq_no=1}]

1. The shouldPeriodicallyFlush becomes true again after the local
checkpoint is advanced to 1 because the uncommitted translog size is
216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation
of the new index commit would advance from 2 to 4.

> [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=1,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=4,
> max_seq_no=1}]

We need to relax the assertion in this test to cover this situation.

Closes #31629
dnhatn added a commit that referenced this issue Feb 15, 2019
The predicate shouldPeriodicallyFlush is determined by the uncommitted
translog size and the local checkpoint. The uncommitted translog size
depends on the local checkpoint. The condition shouldPeriodicallyFlush
can be true twice in in the test in the following scenario:

1. Index doc-0 and advances the local checkpoint to 0, the condition
shouldPeriodicallyFlush remains false.

2. Index doc-1 and add it to translog, but the local checkpoint is not
advanced yet (still 0). The condition shouldPeriodicallyFlush becomes
true because the uncommitted translog size is 216bytes (2ops + gen-1 +
gen-2) > 180bytes and the translog generation of the new index commit
would advance from 1 to 2.

> [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=0,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=2,
> max_seq_no=1}]

1. The shouldPeriodicallyFlush becomes true again after the local
checkpoint is advanced to 1 because the uncommitted translog size is
216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation
of the new index commit would advance from 2 to 4.

> [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=1,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=4,
> max_seq_no=1}]

We need to relax the assertion in this test to cover this situation.

Closes #31629
dnhatn added a commit that referenced this issue Feb 15, 2019
The predicate shouldPeriodicallyFlush is determined by the uncommitted
translog size and the local checkpoint. The uncommitted translog size
depends on the local checkpoint. The condition shouldPeriodicallyFlush
can be true twice in in the test in the following scenario:

1. Index doc-0 and advances the local checkpoint to 0, the condition
shouldPeriodicallyFlush remains false.

2. Index doc-1 and add it to translog, but the local checkpoint is not
advanced yet (still 0). The condition shouldPeriodicallyFlush becomes
true because the uncommitted translog size is 216bytes (2ops + gen-1 +
gen-2) > 180bytes and the translog generation of the new index commit
would advance from 1 to 2.

> [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=0,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=2,
> max_seq_no=1}]

1. The shouldPeriodicallyFlush becomes true again after the local
checkpoint is advanced to 1 because the uncommitted translog size is
216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation
of the new index commit would advance from 2 to 4.

> [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=1,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=4,
> max_seq_no=1}]

We need to relax the assertion in this test to cover this situation.

Closes #31629
dnhatn added a commit that referenced this issue Feb 16, 2019
The predicate shouldPeriodicallyFlush is determined by the uncommitted
translog size and the local checkpoint. The uncommitted translog size
depends on the local checkpoint. The condition shouldPeriodicallyFlush
can be true twice in in the test in the following scenario:

1. Index doc-0 and advances the local checkpoint to 0, the condition
shouldPeriodicallyFlush remains false.

2. Index doc-1 and add it to translog, but the local checkpoint is not
advanced yet (still 0). The condition shouldPeriodicallyFlush becomes
true because the uncommitted translog size is 216bytes (2ops + gen-1 +
gen-2) > 180bytes and the translog generation of the new index commit
would advance from 1 to 2.

> [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=0,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=2,
> max_seq_no=1}]

1. The shouldPeriodicallyFlush becomes true again after the local
checkpoint is advanced to 1 because the uncommitted translog size is
216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation
of the new index commit would advance from 2 to 4.

> [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=1,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=4,
> max_seq_no=1}]

We need to relax the assertion in this test to cover this situation.

Closes #31629
dnhatn added a commit that referenced this issue Feb 16, 2019
The predicate shouldPeriodicallyFlush is determined by the uncommitted
translog size and the local checkpoint. The uncommitted translog size
depends on the local checkpoint. The condition shouldPeriodicallyFlush
can be true twice in in the test in the following scenario:

1. Index doc-0 and advances the local checkpoint to 0, the condition
shouldPeriodicallyFlush remains false.

2. Index doc-1 and add it to translog, but the local checkpoint is not
advanced yet (still 0). The condition shouldPeriodicallyFlush becomes
true because the uncommitted translog size is 216bytes (2ops + gen-1 +
gen-2) > 180bytes and the translog generation of the new index commit
would advance from 1 to 2.

> [2019-02-13T23:33:58,257][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=0,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=2,
> max_seq_no=1}]

1. The shouldPeriodicallyFlush becomes true again after the local
checkpoint is advanced to 1 because the uncommitted translog size is
216bytes (2ops + gen-2 + gen-3) > 180bytes and the translog generation
of the new index commit would advance from 2 to 4.

> [2019-02-13T23:33:58,264][TRACE][o.e.i.e.Engine           ] [node_s_0]
> [test][0] committing writer with commit data [{local_checkpoint=1,
> max_unsafe_auto_id_timestamp=-1, translog_uuid=fFp1Yqd4QiqKDD4ZrC8F-g,
> min_retained_seq_no=0, history_uuid=cn31yrwVQk-Vs7qcg4bi_Q,
> retention_leases=primary_term:1;version:0;, translog_generation=4,
> max_seq_no=1}]

We need to relax the assertion in this test to cover this situation.

Closes #31629
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants