-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flaky-test: recoverLongTimeAfterMultipleWriteErrors #240
Comments
I repro'd on a checkout from when this was reported, and it looks like a legit bug and potentially serious. The bug occurs as follows:
The solution should be to make sure the ledgerClosed() checks whether the passed in ledger is the current ledger and only try to add a new one if so. However, I would like a reliable test for this first, but we'll need to modify the mocks first. (cc @merlimat ) |
If we have a managed ledger, ml and we write 2 entries to it, if both entries fail, both will end up calling ManagedLedgerImpl#ledgerClosed with the ledger the write failed on as a parameter. However, depending on timing, the second call to ledgerClosed could end up adding a new ledger to the ledger list, even though the current ledger is _not_ failing (as the failing ledger was replaced by the first call). This was the cause of a flake in ManagedLedgerErrorsTest#recoverLongTimeAfterMultipleWriteErrors as reported in (apache#240). However, it's not possible to get a deterministic test for this as the timings need to be very precise. The failing addComplete needs to run before first error handling completes, but the runnable with ledgerClosed for the second failure needs to run after the first error handling completes, but before the write resends from the first error handling complete.
…2573) If we have a managed ledger, ml and we write 2 entries to it, if both entries fail, both will end up calling ManagedLedgerImpl#ledgerClosed with the ledger the write failed on as a parameter. However, depending on timing, the second call to ledgerClosed could end up adding a new ledger to the ledger list, even though the current ledger is _not_ failing (as the failing ledger was replaced by the first call). This was the cause of a flake in ManagedLedgerErrorsTest#recoverLongTimeAfterMultipleWriteErrors as reported in (#240). However, it's not possible to get a deterministic test for this as the timings need to be very precise. The failing addComplete needs to run before first error handling completes, but the runnable with ledgerClosed for the second failure needs to run after the first error handling completes, but before the write resends from the first error handling complete.
Closed by #2573 |
It becomes a flaky test again. Example failurehttps://github.com/apache/pulsar/actions/runs/4329269822/jobs/7559697736 Exception stacktraceError: recoverLongTimeAfterMultipleWriteErrors(org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest) Time elapsed: 0.025 s <<< FAILURE!
java.lang.AssertionError: expected [null] but found [org.apache.bookkeeper.mledger.ManagedLedgerException$ManagedLedgerAlreadyClosedException: Waiting to recover from failure]
at org.testng.Assert.fail(Assert.java:99)
at org.testng.Assert.failNotSame(Assert.java:1033)
at org.testng.Assert.assertNull(Assert.java:965)
at org.testng.Assert.assertNull(Assert.java:953)
at org.apache.bookkeeper.mledger.impl.ManagedLedgerErrorsTest.recoverLongTimeAfterMultipleWriteErrors(ManagedLedgerErrorsTest.java:541)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829) |
Replace with a new issue so that we have better visibility #20364. |
The text was updated successfully, but these errors were encountered: