Skip to content

Conversation

@chickenchickenlove
Copy link
Contributor

@chickenchickenlove chickenchickenlove commented Jan 15, 2026

Description

This test was flaky because it assumes the Streams group will reach
GroupState.NOT_READY, but depending on timing and environment the
Streams changelog topic could be created successfully. When that
happened, the group progressed to ASSIGNING/RECONCILING/STABLE, the
test failed to observe NOT_READY within the timeout, and it sometimes
produced Reconciliation failed logs during consumer shutdown due to an
unfinished onTasksAssigned-related event.

This change makes the behavior deterministic by allowing
createStreamsGroup() to inject a replication factor into
StreamsRebalanceData.TopicInfo. In
testDescribeStreamsGroupsNotReady, we pass an intentionally impossible
replication factor for the current cluster (e.g., 9999), ensuring the
changelog topic creation attempt always fails. As a result, the internal
topic remains missing and the Streams group reliably stays in
NOT_READY, eliminating the timing-dependent state transition that
caused the flakiness.

Detail Flaky Case and Non Flaky Case

image

Result

Related PR

Reviewers: Lucas Brutschy lbrutschy@confluent.io, Lianet Magrans
lmagrans@confluent.io

@github-actions github-actions bot added triage PRs from the community core Kafka Broker tests Test fixes (including flaky tests) small Small PRs labels Jan 15, 2026
@chickenchickenlove
Copy link
Contributor Author

@lucasbru @lianetm Hi!
Sorry for the sudden mention.
I’m reaching out because you seem to have the best context on this area from the previous PRs and reviews around this flaky test!

When you have bandwidth, could you please take a look this PR?
This PR makes testDescribeStreamsGroupsNotReady deterministic by injecting an intentionally unavailable replication factor into StreamsRebalanceData.TopicInfo, so internal topic creation always fails and the group reliably stays in NOT_READY (instead of racing to ASSIGNING/RECONCILING/STABLE between waitUntilTrue polls).

I’d really appreciate your feedback, especially on whether this approach matches the intent of the test and the Streams group state machine.

Thanks for taking time! 🙇‍♂️

@lianetm lianetm added ci-approved and removed triage PRs from the community labels Jan 16, 2026
@lucasbru lucasbru self-assigned this Jan 16, 2026
@lucasbru lucasbru requested review from Copilot and lucasbru January 16, 2026 13:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a flaky test by making the behavior deterministic. The test testDescribeStreamsGroupsNotReady was intermittently failing because it expected a Streams group to reach GroupState.NOT_READY, but depending on timing, the changelog topic could be successfully created, causing the group to transition to other states.

Changes:

  • Added a replicationFactor parameter to the createStreamsGroup method in IntegrationTestHarness.scala
  • Updated testDescribeStreamsGroupsNotReady to pass an impossible replication factor (9999) to ensure changelog topic creation fails
  • The changelog topic creation failure ensures the group reliably stays in NOT_READY state

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
core/src/test/scala/integration/kafka/api/IntegrationTestHarness.scala Added optional replicationFactor parameter to createStreamsGroup method and wired it into TopicInfo for changelog topics
core/src/test/scala/integration/kafka/api/PlaintextAdminIntegrationTest.scala Updated test to pass an impossible replication factor (9999) to deterministically trigger NOT_READY state

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


val boxed: Optional[java.lang.Short] =
replicationFactor.map[java.lang.Short](s => java.lang.Short.valueOf(s))

Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is trailing whitespace on line 257 after the closing parenthesis. This should be removed to maintain code cleanliness and adhere to standard formatting practices.

Suggested change

Copilot uses AI. Check for mistakes.
changelogTopics = Set(testTopicName + "-changelog"),
streamsGroupId = streamsGroupId
streamsGroupId = streamsGroupId,
replicationFactor = Optional.of(unavailableReplicationFactorInThisCluster),
Copy link

Copilot AI Jan 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a trailing comma after the last parameter in the function call. While this is syntactically valid in Scala, it's inconsistent with the other usages of createStreamsGroup in the same file (e.g., lines 4421-4425, 4516-4519) where no trailing comma is used. Consider removing it for consistency.

Suggested change
replicationFactor = Optional.of(unavailableReplicationFactorInThisCluster),
replicationFactor = Optional.of(unavailableReplicationFactorInThisCluster)

Copilot uses AI. Check for mistakes.
@lucasbru
Copy link
Member

Ah, this makes sense to me. So my earlier fix was just avoiding the NPE, but we'd still time out. Check the copilot comments, otherwise LGTM.

@chickenchickenlove
Copy link
Contributor Author

@lucasbru
Thanks for your quick feedback 🙇‍♂️
I addressed your comments and Copilot’s in a new commit!
When you get a chance, please take another look 🙇‍♂️

Copy link
Member

@lianetm lianetm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM too, thanks!

@lianetm lianetm merged commit 3d6687b into apache:trunk Jan 18, 2026
33 of 35 checks passed
@chickenchickenlove
Copy link
Contributor Author

I'm so sorry 🙇‍♂️ .
It seems it's still flaky at times.
Executing streams.poll(JDuration.ofMillis(500L)) only once might also trigger a non-deterministic code path.
I'll dig into it more and open a PR if I can reproduce it / find the root cause.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved core Kafka Broker small Small PRs tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants