
KAFKA-17182: Consumer fetch sessions are evicted too quickly with AsyncKafkaConsumer #17700

Draft · wants to merge 36 commits into trunk
Conversation

@kirktrue (Collaborator) commented Nov 5, 2024

This change reduces fetch session cache evictions on the broker for AsyncKafkaConsumer by altering the logic that determines which partitions are included in fetch requests.

Consumer implementations fetch data from the cluster and temporarily buffer it in memory until the user next calls Consumer.poll(). When a fetch request is being generated, partitions that already have buffered data are not included in the fetch request.
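As a simplified illustration of that selection step (the helper class and its parameters below are hypothetical stand-ins, not the exact Kafka clients internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.kafka.common.TopicPartition;

// Simplified stand-in for the partition-selection step: partitions that
// already have data sitting in the fetch buffer are excluded, so the next
// fetch request only asks for partitions the client has nothing for yet.
public class FetchablePartitions {

    public static List<TopicPartition> select(Set<TopicPartition> assigned,
                                              Set<TopicPartition> buffered) {
        List<TopicPartition> result = new ArrayList<>();
        for (TopicPartition tp : assigned) {
            if (!buffered.contains(tp))
                result.add(tp);
        }
        return result;
    }
}
```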

The ClassicKafkaConsumer performs much of its fetch logic and network I/O in the application thread. On poll(), if there is any locally buffered data, the ClassicKafkaConsumer does not fetch any new data and simply returns the buffered data to the user from poll().

On the other hand, the AsyncKafkaConsumer splits its logic and network I/O between two threads, which results in a potential race condition during fetch. The AsyncKafkaConsumer also checks for buffered data on its application thread. If it finds none, it signals the background thread to create a fetch request. However, it is possible for the background thread to receive and buffer data from a previous fetch before the fetch request logic starts. When that occurs, the background thread skips the now-buffered partitions as it creates the new fetch request, with the unintended result that those partitions are added to the fetch request's "to remove" set. That signals the broker to remove those partitions from its internal cache.
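To make the failure mode concrete, here is a minimal sketch of how an incremental fetch session diff can produce that "to remove" set; the class and method names are illustrative, not the actual FetchSessionHandler internals:

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.kafka.common.TopicPartition;

// Illustrative diff between the partitions already tracked in the fetch
// session and the partitions in the next request. Any session partition
// that is skipped (e.g. because it happens to have buffered data at that
// moment) lands in "removed", telling the broker to drop it from the session.
public class SessionDiff {

    public static Set<TopicPartition> removed(Set<TopicPartition> inSession,
                                              Set<TopicPartition> nextRequest) {
        Set<TopicPartition> removed = new HashSet<>(inSession);
        removed.removeAll(nextRequest);
        return removed;
    }
}
```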

This issue is technically possible in the ClassicKafkaConsumer too, since the heartbeat thread performs network I/O in addition to the application thread. However, because of the frequency at which the AsyncKafkaConsumer's background thread runs, it is ~100x more likely to happen.

The core decision is: what should the background thread do if it is asked to create a fetch request and discovers there is buffered data? Multiple approaches were proposed to address this issue in the AsyncKafkaConsumer, among them:

  1. The background thread should omit buffered partitions from the fetch request as before (this is the existing behavior)
  2. The background thread should skip the fetch request generation entirely if there are any buffered partitions
  3. The background thread should include buffered partitions in the fetch request, but use a small “max bytes” value
  4. The background thread should skip fetching from the nodes that have buffered partitions

Option 3 won out. The change in AsyncKafkaConsumer is to include in the fetch request any partition with buffered data, but with a "max bytes" value of 1. This causes the fetch response to return as little data as possible for those partitions, so the consumer doesn't buffer too much data on the client before it can be returned from poll().
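A minimal sketch of the chosen approach, assuming a per-partition max-bytes map feeds the request builder (the class and constants here are illustrative, not the real FetchRequest plumbing):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import org.apache.kafka.common.TopicPartition;

// Sketch of option 3: buffered partitions stay in the request so the
// broker keeps them in the fetch session, but they are capped at 1 byte
// so the response carries (almost) no additional data for them.
public class FetchRequestSketch {

    static final int DEFAULT_MAX_BYTES = 1_048_576; // normal per-partition cap
    static final int BUFFERED_MAX_BYTES = 1;        // minimal cap for buffered partitions

    public static Map<TopicPartition, Integer> perPartitionMaxBytes(Set<TopicPartition> assigned,
                                                                    Set<TopicPartition> buffered) {
        Map<TopicPartition, Integer> maxBytes = new LinkedHashMap<>();
        for (TopicPartition tp : assigned) {
            maxBytes.put(tp, buffered.contains(tp) ? BUFFERED_MAX_BYTES : DEFAULT_MAX_BYTES);
        }
        return maxBytes;
    }
}
```

Keeping the partition in the request preserves its entry in the broker's fetch session, which is what prevents the eviction, while the 1-byte cap keeps the client from accumulating more data it cannot yet return.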

Here are the results of our internal stress testing:

  • ClassicKafkaConsumer: after the initial spike during test startup, the average rate settles down to ~0.14 evictions/second
  • AsyncKafkaConsumer (without fix): after startup, the evictions still settle down, but at ~1.48 evictions/second they are roughly 10x higher than the ClassicKafkaConsumer
  • AsyncKafkaConsumer (with fix): the eviction rate is now closer to the ClassicKafkaConsumer at ~0.22 evictions/second

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@kirktrue kirktrue changed the title KAFKA-17439: Make polling for new records an explicit action/event in the new consumer KAFKA-17182: Consumer fetch sessions are evicted too quickly with AsyncKafkaConsumer Nov 5, 2024
@kirktrue kirktrue added ctr Consumer Threading Refactor (KIP-848) ci-approved labels Nov 5, 2024
@kirktrue kirktrue added the KIP-848 The Next Generation of the Consumer Rebalance Protocol label Nov 7, 2024
@kirktrue kirktrue marked this pull request as ready for review November 23, 2024 00:21
@kirktrue kirktrue marked this pull request as draft November 26, 2024 19:19