Flaky-test: recoverLongTimeAfterMultipleWriteErrors #240

Closed
rdhabalia opened this issue Feb 23, 2017 · 4 comments
Comments

@rdhabalia
Contributor

Tests run: 219, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 81.903 sec <<< FAILURE! - in TestSuite
recoverLongTimeAfterMultipleWriteErrors(org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest)  Time elapsed: 0.014 sec  <<< FAILURE!
java.lang.AssertionError: expected [1] but found [2]
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors(ManagedLedgerErrorsTest.java:470)
Results :
Failed tests: 
  ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors:470 expected [1] but found [2]
@merlimat merlimat self-assigned this Feb 24, 2017
@merlimat merlimat added this to the 1.17 milestone Feb 24, 2017
@merlimat merlimat modified the milestones: 1.17, 1.18 Mar 31, 2017
@merlimat merlimat modified the milestones: 1.18, 1.19 Jun 15, 2017
sijie pushed a commit to sijie/pulsar that referenced this issue Mar 4, 2018
@ivankelly
Contributor

I reproduced this on a checkout from when it was reported, and it looks like a legitimate and potentially serious bug.
However, I haven't been able to get a direct repro on master, though the bug still appears to exist.

The bug occurs as follows:

  • A managed ledger has ledger 3 as its current ledger.
  • The managed ledger attempts to write two entries to 3.
  • The first entry write to 3 fails; 3 is closed, deleted, and replaced by ledger 5 in the ledger list, and the writes are retried.
  • The second entry write fails; 3 is closed again, the code tries to delete it from the ledger list (it is no longer there), and ledger 6 is created and added to the ledger list.

The solution should be to make sure ledgerClosed() checks whether the passed-in ledger is the current ledger, and only adds a new one if so (see the sketch below). However, I would like a reliable test for this first, which will require modifying the mocks.

(cc @merlimat )
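A minimal, self-contained sketch of that check (a hypothetical simplification with invented names such as currentLedgerId and a plain List<Long> ledger list; this is not the actual ManagedLedgerImpl code): ledgerClosed() only rolls over to a new ledger when the ledger that failed is still the current one, so a second failure on an already-replaced ledger becomes a no-op instead of creating an extra ledger.

import java.util.ArrayList;
import java.util.List;

public class LedgerRolloverSketch {
    // Invented stand-ins for the managed ledger's state (not the real field names).
    private long currentLedgerId = 3;                  // ledger 3 is current, as in the report
    private long nextLedgerId = 5;                     // next id handed out by the (mocked) bookkeeper
    private final List<Long> ledgers = new ArrayList<>(List.of(3L));

    // Called once per failed write, with the ledger the write failed on.
    synchronized void ledgerClosed(long failedLedgerId) {
        if (failedLedgerId != currentLedgerId) {
            // The failing ledger was already replaced by an earlier failure;
            // do not create yet another ledger.
            return;
        }
        ledgers.remove(Long.valueOf(failedLedgerId));
        currentLedgerId = nextLedgerId++;
        ledgers.add(currentLedgerId);
    }

    public static void main(String[] args) {
        LedgerRolloverSketch ml = new LedgerRolloverSketch();
        ml.ledgerClosed(3);               // first failed write: 3 is replaced by 5
        ml.ledgerClosed(3);               // second failed write on 3: ignored, no ledger 6
        System.out.println(ml.ledgers);   // prints [5] — a single replacement ledger
    }
}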

@sijie sijie removed this from the 2.2.0-incubating milestone Sep 12, 2018
ivankelly added a commit to ivankelly/pulsar that referenced this issue Sep 13, 2018
If we have a managed ledger and write 2 entries to it, and both
entries fail, both failures will end up calling ManagedLedgerImpl#ledgerClosed
with the ledger the write failed on as a parameter.

However, depending on timing, the second call to ledgerClosed could
end up adding a new ledger to the ledger list, even though the current
ledger is _not_ failing (as the failing ledger was replaced by the
first call).

This was the cause of a flake in
ManagedLedgerErrorsTest#recoverLongTimeAfterMultipleWriteErrors as
reported in apache#240. However, it's not possible to get a deterministic
test for this, as the timings need to be very precise: the failing
addComplete needs to run before the first error handling completes, but
the runnable with ledgerClosed for the second failure needs to run
after the first error handling completes and before that error handling
resends the writes.
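As a self-contained illustration of the race this commit message describes (again a hypothetical simplification with invented names, not the real ManagedLedgerImpl), an unguarded ledgerClosed() that unconditionally rolls over creates a second replacement ledger when called twice for the same failed ledger, which mirrors the original "expected [1] but found [2]" assertion failure in recoverLongTimeAfterMultipleWriteErrors.

import java.util.ArrayList;
import java.util.List;

public class UnguardedLedgerClosed {
    private long nextLedgerId = 5;
    private final List<Long> ledgers = new ArrayList<>(List.of(3L));

    // Unconditionally replaces the failed ledger, as the pre-fix code effectively did.
    synchronized void ledgerClosed(long failedLedgerId) {
        ledgers.remove(Long.valueOf(failedLedgerId)); // no-op on the second call: 3 is already gone
        ledgers.add(nextLedgerId++);                  // but a new ledger is still added
    }

    public static void main(String[] args) {
        UnguardedLedgerClosed ml = new UnguardedLedgerClosed();
        ml.ledgerClosed(3);               // first failed write on ledger 3: replaced by 5
        ml.ledgerClosed(3);               // second failed write on ledger 3: ledger 6 is also created
        System.out.println(ml.ledgers);   // prints [5, 6] — two ledgers where the test expects one
    }
}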
merlimat pushed a commit that referenced this issue Sep 13, 2018
…2573)

@ivankelly
Contributor

Closed by #2573

@liangyepianzhou
Contributor

It has become a flaky test again.

Example failure

https://github.com/apache/pulsar/actions/runs/4329269822/jobs/7559697736

Exception stacktrace

Error:  recoverLongTimeAfterMultipleWriteErrors(org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest)  Time elapsed: 0.025 s  <<< FAILURE!
java.lang.AssertionError: expected [null] but found [org.apache.bookkeeper.mledger.ManagedLedgerException$ManagedLedgerAlreadyClosedException: Waiting to recover from failure]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotSame(Assert.java:1033)
	at org.testng.Assert.assertNull(Assert.java:965)
	at org.testng.Assert.assertNull(Assert.java:953)
	at org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors(ManagedLedgerErrorsTest.java:541)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

@liangyepianzhou liangyepianzhou changed the title Intermittent test failures in ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors Flaky-test: recoverLongTimeAfterMultipleWriteErrors Mar 7, 2023
@tisonkun
Member

Replaced with a new issue, #20364, so that we have better visibility.
