Skip to content

KAFKA-13592:Fix flaky test ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions#11687

Merged
hachikuji merged 10 commits intoapache:trunkfrom
Kvicii:KAFKA-13592
Jun 7, 2022
Merged

KAFKA-13592:Fix flaky test ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions#11687
hachikuji merged 10 commits intoapache:trunkfrom
Kvicii:KAFKA-13592

Conversation

@Kvicii
Copy link
Contributor

@Kvicii Kvicii commented Jan 18, 2022

this issue similar to #11666.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@Kvicii Kvicii changed the title Fix flaky test ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions KAFKA-13592;Fix flaky test ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions Jan 18, 2022
@Kvicii Kvicii changed the title KAFKA-13592;Fix flaky test ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions KAFKA-13592Fix flaky test ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions Jan 18, 2022
@Kvicii Kvicii changed the title KAFKA-13592Fix flaky test ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions KAFKA-13592:Fix flaky test ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions Jan 18, 2022
@dajac
Copy link
Member

dajac commented Jan 19, 2022

@Kvicii Thanks for the PR. I am a bit confused because the PR refactors the code but it does not seem to fix any issues. Did I miss something? Have you been able to make the test fail?

@Kvicii
Copy link
Contributor Author

Kvicii commented Jan 19, 2022

@dajac The first build failed didn't seem to be because of this test, so I rebuilt it.Can you tell me what the expectations should be, thanks a lot

@dajac
Copy link
Member

dajac commented Jan 19, 2022

@Kvicii ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions does not fail all the time but only occasionally. It seems that there is a race condition in the test which causes it. I suggest to first try to reproduce the failure by running the test in a while loop. You can do the following:

while ./gradlew cleanTest :core:test --tests ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions; do :; done;

Once we understand the issue, we can improve the test to fix it.

@Kvicii
Copy link
Contributor Author

Kvicii commented Jan 20, 2022

@dajac I used computeUntilTrue, but it is also very difficult to reproduce using a script.I tried a few other ways, but none seem to work😔

@divijvaidya
Copy link
Member

@Kvicii can you please fix the conflicts on this one? I think that you have the correct fix here and once you have fixed the conflicts, we can ask for a review from the community.

@hachikuji
Copy link
Contributor

@Kvicii Thanks for the patch. I checked this out and ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions still fails 5-10% of the time when run locally. The new failure looks like this:

org.opentest4j.AssertionFailedError: expected same topic ID but it can not be found ==> 
Expected :Some(6_zllbH4TES3yUSJZ4ATsw)
Actual   :None
	at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:55)
	at org.junit.jupiter.api.AssertionUtils.failNotEqual(AssertionUtils.java:62)
	at org.junit.jupiter.api.AssertEquals.assertEquals(AssertEquals.java:182)
	at org.junit.jupiter.api.Assertions.assertEquals(Assertions.java:1152)
	at kafka.controller.ControllerIntegrationTest.testTopicIdUpgradeAfterReassigningPartitions(ControllerIntegrationTest.scala:1473)

I think we need to account for a delay between when the change hits zk and when it is reflected in the ControllerContext. I've pushed a patch here on top of this branch: f2dd2f5. This logic first waits for a consistent ControllerContext, then asserts zookeeper state. I wasn't able to reproduce the failure anymore with this patch. Feel free to include it here.

Also, could we move the unrelated cleanups into a separate PR? I know it can be difficult to overlook messy code, but it makes review more difficult since we have to look at every line to see which changes actually matter. And, on the other hand, it is easy to review cleanup PRs if they are not changing any logic.

@Kvicii
Copy link
Contributor Author

Kvicii commented May 31, 2022

@hachikuji ok, let me split other PR.

@hachikuji
Copy link
Contributor

@Kvicii What did you think about the changes here f2dd2f5? I think we need something like this to get rid of some remaining race conditions in the test.

@Kvicii
Copy link
Contributor Author

Kvicii commented Jun 1, 2022

@hachikuji I agree with you.If f2dd2f5 make sure the unit test pass.

@hachikuji
Copy link
Contributor

@Kvicii Thanks for checking. Do you want to pull those changes here or shall I do a separate PR?

@Kvicii
Copy link
Contributor Author

Kvicii commented Jun 1, 2022

@hachikuji I moved those changes here, thx.

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM.

@hachikuji hachikuji merged commit 09570f2 into apache:trunk Jun 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants