Flaky-test: recoverLongTimeAfterMultipleWriteErrors #240

Closed
rdhabalia opened this issue Feb 23, 2017 · 4 comments
Comments

@rdhabalia
Contributor

Tests run: 219, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.903 sec <<< FAILURE! - in TestSuite
recoverLongTimeAfterMultipleWriteErrors(org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest)  Time elapsed: 0.014 sec  <<< FAILURE!
java.lang.AssertionError: expected [1] but found [2]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors(ManagedLedgerErrorsTest.java:470)
Results :
Failed tests: 
  ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors:470 expected [1] but found [2]
@merlimat merlimat self-assigned this Feb 24, 2017
@merlimat merlimat added this to the 1.17 milestone Feb 24, 2017
@merlimat merlimat modified the milestones: 1.17, 1.18 Mar 31, 2017
@merlimat merlimat modified the milestones: 1.18, 1.19 Jun 15, 2017
sijie pushed a commit to sijie/pulsar that referenced this issue Mar 4, 2018
@ivankelly
Contributor

I reproduced this on a checkout from when it was reported, and it looks like a legitimate and potentially serious bug.
However, I haven't been able to get a direct repro on master, though the bug still appears to exist.

The bug occurs as follows:

  • A managed ledger has ledger 3 as its current ledger.
  • The managed ledger attempts to write two entries to 3.
  • The first entry write to 3 fails; 3 is closed, deleted, and replaced by ledger 5 in the ledger list, and the writes are retried.
  • The second entry write fails; 3 is closed again, the code tries to delete it from the ledger list (it is no longer there), and ledger 6 is created and added to the ledger list.

The solution should be to make sure ledgerClosed() checks whether the passed-in ledger is the current ledger, and only adds a new one if so (see the sketch below). However, I would like a reliable test for this first, which will require modifying the mocks.

(cc @merlimat )
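A minimal, self-contained sketch of that check (a hypothetical simplification with invented names such as currentLedgerId and a plain List<Long> ledger list; this is not the actual ManagedLedgerImpl code): ledgerClosed() only rolls over to a new ledger when the ledger that failed is still the current one, so a second failure on an already-replaced ledger becomes a no-op instead of creating an extra ledger.

import java.util.ArrayList;
import java.util.List;

public class LedgerRolloverSketch {
    // Invented stand-ins for the managed ledger's state (not the real field names).
    private long currentLedgerId = 3;                  // ledger 3 is current, as in the report
    private long nextLedgerId = 5;                     // next id handed out by the (mocked) bookkeeper
    private final List<Long> ledgers = new ArrayList<>(List.of(3L));

    // Called once per failed write, with the ledger the write failed on.
    synchronized void ledgerClosed(long failedLedgerId) {
        if (failedLedgerId != currentLedgerId) {
            // The failing ledger was already replaced by an earlier failure;
            // do not create yet another ledger.
            return;
        }
        ledgers.remove(Long.valueOf(failedLedgerId));
        currentLedgerId = nextLedgerId++;
        ledgers.add(currentLedgerId);
    }

    public static void main(String[] args) {
        LedgerRolloverSketch ml = new LedgerRolloverSketch();
        ml.ledgerClosed(3);               // first failed write: 3 is replaced by 5
        ml.ledgerClosed(3);               // second failed write on 3: ignored, no ledger 6
        System.out.println(ml.ledgers);   // prints [5] — a single replacement ledger
    }
}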

@sijie sijie removed this from the 2.2.0-incubating milestone Sep 12, 2018
ivankelly added a commit to ivankelly/pulsar that referenced this issue Sep 13, 2018
If we have a managed ledger and write 2 entries to it, and both
entries fail, both failures will end up calling ManagedLedgerImpl#ledgerClosed
with the ledger the write failed on as a parameter.

However, depending on timing, the second call to ledgerClosed could
end up adding a new ledger to the ledger list, even though the current
ledger is _not_ failing (as the failing ledger was replaced by the
first call).

This was the cause of a flake in
ManagedLedgerErrorsTest#recoverLongTimeAfterMultipleWriteErrors as
reported in apache#240. However, it's not possible to get a deterministic
test for this, as the timings need to be very precise: the failing
addComplete needs to run before the first error handling completes, but
the runnable with ledgerClosed for the second failure needs to run
after the first error handling completes and before that error handling
resends the writes.
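As a self-contained illustration of the race this commit message describes (again a hypothetical simplification with invented names, not the real ManagedLedgerImpl), an unguarded ledgerClosed() that unconditionally rolls over creates a second replacement ledger when called twice for the same failed ledger, which mirrors the original "expected [1] but found [2]" assertion failure in recoverLongTimeAfterMultipleWriteErrors.

import java.util.ArrayList;
import java.util.List;

public class UnguardedLedgerClosed {
    private long nextLedgerId = 5;
    private final List<Long> ledgers = new ArrayList<>(List.of(3L));

    // Unconditionally replaces the failed ledger, as the pre-fix code effectively did.
    synchronized void ledgerClosed(long failedLedgerId) {
        ledgers.remove(Long.valueOf(failedLedgerId)); // no-op on the second call: 3 is already gone
        ledgers.add(nextLedgerId++);                  // but a new ledger is still added
    }

    public static void main(String[] args) {
        UnguardedLedgerClosed ml = new UnguardedLedgerClosed();
        ml.ledgerClosed(3);               // first failed write on ledger 3: replaced by 5
        ml.ledgerClosed(3);               // second failed write on ledger 3: ledger 6 is also created
        System.out.println(ml.ledgers);   // prints [5, 6] — two ledgers where the test expects one
    }
}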
merlimat pushed a commit that referenced this issue Sep 13, 2018
…2573)

@ivankelly
Contributor

Closed by #2573

@liangyepianzhou
Contributor

It has become a flaky test again.

Example failure

https://github.com/apache/pulsar/actions/runs/4329269822/jobs/7559697736

Exception stacktrace

Error:  recoverLongTimeAfterMultipleWriteErrors(org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest)  Time elapsed: 0.025 s  <<< FAILURE!
java.lang.AssertionError: expected [null] but found [org.apache.bookkeeper.mledger.ManagedLedgerException$ManagedLedgerAlreadyClosedException: Waiting to recover from failure]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotSame(Assert.java:1033)
	at org.testng.Assert.assertNull(Assert.java:965)
	at org.testng.Assert.assertNull(Assert.java:953)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors(ManagedLedgerErrorsTest.java:541)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

@liangyepianzhou liangyepianzhou changed the title Intermittent test failures in ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors Flaky-test: recoverLongTimeAfterMultipleWriteErrors Mar 7, 2023
@tisonkun
Member

Replaced with a new issue, #20364, so that we have better visibility.
