Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky-test: TransactionEndToEndTest.produceCommitTest #18466

Closed
1 of 2 tasks
codelipenghui opened this issue Nov 15, 2022 · 6 comments · Fixed by #20006
Closed
1 of 2 tasks

Flaky-test: TransactionEndToEndTest.produceCommitTest #18466

codelipenghui opened this issue Nov 15, 2022 · 6 comments · Fixed by #20006

Comments

@codelipenghui
Copy link
Contributor

Search before asking

  • I searched in the issues and found nothing similar.

Example failure

https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137

Exception stacktrace

Error:  Tests run: 52, Failures: 22, Errors: 0, Skipped: 27, Time elapsed: 617.386 s <<< FAILURE! - in org.apache.pulsar.client.impl.TransactionEndToEndTest
[1386](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1387)
  Error:  produceCommitTest(org.apache.pulsar.client.impl.TransactionEndToEndTest)  Time elapsed: 300.015 s  <<< FAILURE!
[1387](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1388)
  org.testng.internal.thread.ThreadTimeoutException: Method org.apache.pulsar.client.impl.TransactionEndToEndTest.produceCommitTest() didn't finish within the time-out 300000
[1388](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1389)
  	at org.testng.internal.MethodInvocationHelper.invokeWithTimeoutWithNewExecutor(MethodInvocationHelper.java:371)
[1389](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1390)
  	at org.testng.internal.MethodInvocationHelper.invokeWithTimeout(MethodInvocationHelper.java:282)
[1390](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1391)
  	at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:605)
[1391](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1392)
  	at org.testng.internal.TestInvoker.retryFailed(TestInvoker.java:214)
[1392](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1393)
  	at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:58)
[1393](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1394)
  	at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
[1394](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1395)
  	at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
[1395](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1396)
  	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
[1396](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1397)
  	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
[1397](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1398)
  	at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
[1398](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1399)
  	at org.testng.TestRunner.privateRun(TestRunner.java:764)
[1399](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1400)
  	at org.testng.TestRunner.run(TestRunner.java:585)
[1400](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1401)
  	at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
[1401](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1402)
  	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
[1402](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1403)
  	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
[1403](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1404)
  	at org.testng.SuiteRunner.run(SuiteRunner.java:286)
[1404](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1405)
  	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
[1405](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1406)
  	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
[1406](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1407)
  	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
[1407](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1408)
  	at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
[1408](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1409)
  	at org.testng.TestNG.runSuites(TestNG.java:1069)
[1409](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1410)
  	at org.testng.TestNG.run(TestNG.java:1037)
[1410](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1411)
  	at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
[1411](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1412)
  	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
[1412](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1413)
  	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:123)
[1413](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1414)
  	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90)
[1414](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1415)
  	at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
[1415](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1416)
  	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
[1416](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1417)
  	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
[1417](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1418)
  	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
[1418](https://github.com/apache/pulsar/actions/runs/3461842392/jobs/5780214137#step:10:1419)
  	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@liangyepianzhou
Copy link
Contributor

Ok, I'm willing to resolve it.

@liangyepianzhou liangyepianzhou self-assigned this Nov 16, 2022
@liangyepianzhou
Copy link
Contributor

liangyepianzhou commented Nov 17, 2022

This flaky test can be resolved by #18486 and #17836

liangyepianzhou added a commit that referenced this issue Nov 24, 2022
## Motivation
1. fix flaky test #18466 caused by txn async send method
2. decrease run time by optimizing receive method 
## Modification
1. fix flaky test
   * modify `producer.newMessage(txn1).value(("Hello Txn - " + i).getBytes(UTF_8)).sendAsync();` to `producer.newMessage(txn1).value(("Hello Txn - " + i).getBytes(UTF_8)).send();` 
This also can be resolved by #17836 and #18486 later.
2. decrease run time by optimizing receive method 
    * modify
 `    Message<byte[]> message = consumer.receive(5, TimeUnit.SECONDS);
                   Assert.assertNull(message);` to
                   `  Message<byte[]> message = consumer.receive(300, TimeUnit.MILLISECONDS);
                            Assert.assertNull(message);`
   * modify `message = consumer.receive();` to `message = consumer.receive(5, TimeUnit.SECONDS);`
   * keep other `consumer.receive(x, y)` no change.
lifepuzzlefun pushed a commit to lifepuzzlefun/pulsar that referenced this issue Dec 9, 2022
## Motivation
1. fix flaky test apache#18466 caused by txn async send method
2. decrease run time by optimizing receive method 
## Modification
1. fix flaky test
   * modify `producer.newMessage(txn1).value(("Hello Txn - " + i).getBytes(UTF_8)).sendAsync();` to `producer.newMessage(txn1).value(("Hello Txn - " + i).getBytes(UTF_8)).send();` 
This also can be resolved by apache#17836 and apache#18486 later.
2. decrease run time by optimizing receive method 
    * modify
 `    Message<byte[]> message = consumer.receive(5, TimeUnit.SECONDS);
                   Assert.assertNull(message);` to
                   `  Message<byte[]> message = consumer.receive(300, TimeUnit.MILLISECONDS);
                            Assert.assertNull(message);`
   * modify `message = consumer.receive();` to `message = consumer.receive(5, TimeUnit.SECONDS);`
   * keep other `consumer.receive(x, y)` no change.
@github-actions
Copy link

The issue had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Dec 21, 2022
lifepuzzlefun pushed a commit to lifepuzzlefun/pulsar that referenced this issue Jan 10, 2023
## Motivation
1. fix flaky test apache#18466 caused by txn async send method
2. decrease run time by optimizing receive method 
## Modification
1. fix flaky test
   * modify `producer.newMessage(txn1).value(("Hello Txn - " + i).getBytes(UTF_8)).sendAsync();` to `producer.newMessage(txn1).value(("Hello Txn - " + i).getBytes(UTF_8)).send();` 
This also can be resolved by apache#17836 and apache#18486 later.
2. decrease run time by optimizing receive method 
    * modify
 `    Message<byte[]> message = consumer.receive(5, TimeUnit.SECONDS);
                   Assert.assertNull(message);` to
                   `  Message<byte[]> message = consumer.receive(300, TimeUnit.MILLISECONDS);
                            Assert.assertNull(message);`
   * modify `message = consumer.receive();` to `message = consumer.receive(5, TimeUnit.SECONDS);`
   * keep other `consumer.receive(x, y)` no change.
@lhotari
Copy link
Member

lhotari commented Mar 23, 2023

This flaky test can be resolved by #18486 and #17836

@liangyepianzhou this problem remains. #18522 didn't resolve this either.

Latest failure: https://github.com/apache/pulsar/actions/runs/4498077999/jobs/7914492741?pr=19904#step:11:547

  Error:  Tests run: 144, Failures: 1, Errors: 0, Skipped: 140, Time elapsed: 126.933 s <<< FAILURE! - in org.apache.pulsar.client.impl.TransactionEndToEndTest
  Error:  produceCommitTest(org.apache.pulsar.client.impl.TransactionEndToEndTest)  Time elapsed: 10.631 s  <<< FAILURE!
  org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.transaction.TransactionCoordinatorClientException$InvalidTxnStatusException: [(2,0)] with unexpected state : TIME_OUT, expect OPEN state!
  	at org.apache.pulsar.client.api.PulsarClientException.unwrap(PulsarClientException.java:1111)
  	at org.apache.pulsar.client.impl.TypedMessageBuilderImpl.send(TypedMessageBuilderImpl.java:90)
  	at org.apache.pulsar.client.impl.TransactionEndToEndTest.produceCommitTest(TransactionEndToEndTest.java:285)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base/java.lang.Thread.run(Thread.java:833)
  Caused by: java.util.concurrent.ExecutionException: org.apache.pulsar.client.api.transaction.TransactionCoordinatorClientException$InvalidTxnStatusException: [(2,0)] with unexpected state : TIME_OUT, expect OPEN state!
  	at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
  	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
  	at org.apache.pulsar.client.impl.TypedMessageBuilderImpl.send(TypedMessageBuilderImpl.java:88)
  	... 13 more
  Caused by: org.apache.pulsar.client.api.transaction.TransactionCoordinatorClientException$InvalidTxnStatusException: [(2,0)] with unexpected state : TIME_OUT, expect OPEN state!
  	at org.apache.pulsar.client.impl.transaction.TransactionImpl.checkIfOpen(TransactionImpl.java:259)
  	at org.apache.pulsar.client.impl.PartitionedProducerImpl.internalSendWithTxnAsync(PartitionedProducerImpl.java:228)
  	at org.apache.pulsar.client.impl.TypedMessageBuilderImpl.sendAsync(TypedMessageBuilderImpl.java:99)
  	at org.apache.pulsar.client.impl.TypedMessageBuilderImpl.send(TypedMessageBuilderImpl.java:80)
  	... 13 more

do you have a chance to fix this? thanks

@lhotari
Copy link
Member

lhotari commented Mar 23, 2023

I wonder if the most recent failure is different.
The stack trace contains
org.apache.pulsar.client.api.transaction.TransactionCoordinatorClientException$InvalidTxnStatusException: [(2,0)] with unexpected state : TIME_OUT, expect OPEN state!

@liangyepianzhou
Copy link
Contributor

There may have been some recent changes or unresolved issues from before. I will make another attempt to address this problem.

@github-actions github-actions bot removed the Stale label Mar 24, 2023
liangyepianzhou added a commit that referenced this issue Apr 13, 2023
Fixes #18466
### Motivation
There are two main goals in solving this issue:

1. Fix the unstable tests in `produceCommitTest`.
2. Prevent transaction timeouts created by other tests from affecting the `testTxnTimeoutAtTransactionMetadataStore` test during its execution.


### Modification
1. Change the message-sending method to synchronous. (fix `produceCommitTest`)
2. Increase the transaction timeout to 10 minutes (fix `testTxnTimeoutAtTransactionMetadataStore`).
poorbarcode pushed a commit that referenced this issue May 7, 2023
Fixes #18466
### Motivation
There are two main goals in solving this issue:

1. Fix the unstable tests in `produceCommitTest`.
2. Prevent transaction timeouts created by other tests from affecting the `testTxnTimeoutAtTransactionMetadataStore` test during its execution.

### Modification
1. Change the message-sending method to synchronous. (fix `produceCommitTest`)
2. Increase the transaction timeout to 10 minutes (fix `testTxnTimeoutAtTransactionMetadataStore`).

(cherry picked from commit 653271e)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants