-
Notifications
You must be signed in to change notification settings - Fork 599
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
miscellaneous idempotency fixes #22687
Conversation
/dt |
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52372#01910ff3-c809-41ea-9ea3-e5f7173b5566 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52372#01910ff3-c80d-4b77-adc4-4087884b2f17 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52372#01911009-8451-417d-8844-77ba4a69a81a ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52372#01911009-8453-4ee4-b449-039e780c418a ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-dbe3-40b3-9929-f008b7cad1a4 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/52453#01912131-ab0f-4ddc-a168-9c84cc37dc0e |
new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01910ff3-c80b-4daa-88c8-3ab32d1b3aba:
new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01911009-8451-417d-8844-77ba4a69a81a:
new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01910ff3-c80d-4b77-adc4-4087884b2f17:
new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01910ff3-c80f-4c65-916c-c4b3481bc0c7:
new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01911009-8453-4ee4-b449-039e780c418a:
new failures in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-435d-44a9-89e1-6170567876a0:
new failures in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-4362-4005-91bb-403bb648b13f:
new failures in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-dbdf-449f-b570-f961e0144371:
new failures in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-dbe5-4fc3-9236-0e445d88b5ca:
new failures in https://buildkite.com/redpanda/redpanda/builds/52422#0191150d-0bb6-471c-8f62-d05a02b7bc07:
new failures in https://buildkite.com/redpanda/redpanda/builds/52422#0191150d-dca7-4cbe-8666-acac7b6b3f54:
|
/dt |
1 similar comment
/dt |
/dt |
Factoring out the code into a utility, to be used in a later commit.
To be used to reset the producer state with new epoch for idempotent producers that decide to bump the epoch on the client side (which is totally fine as the idempotency is per session and client can independently decide to bump the epoch on it's side).
volatile Exception ex = null; | ||
volatile long counter = 0; | ||
|
||
void produceLoop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in future, it would be great to make it easier to create this applications using Java client. I was thinking about creating an interface that would be implementable from the python code. The application would then be compossed of classes implementing say a ProducerWorker
interface and it would be configurable in runtime which implementation class to load. Now there is a lot of code duplication every time we need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean a python wrapper around an embedded jvm? like py4j or some such?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess something like a pluggable class controlled by a config (similar to how the client API user can specify the partitioner class in the kafka producer config today)
#include <iostream> | ||
|
||
namespace cluster { | ||
std::ostream& operator<<(std::ostream& o, cluster::errc err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we just use errc_category::message()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that, that was formatted as a message "this is a sequence violation.." or some such which doesn't format well (grammatically) with our log lines, so used an explicit formatter.
For producers the broker no longer tracks, we now skip sequence checks and allow any non zero sequence. This can happen if the producer produced after the producer got evicted from the broker's memory (eg: log got prefix truncated, producer hit expiration thresholds etc) While kip-360 suggests that the broker should throw unknown_producer_id error in this case, Apache Kafka no longer does that. Adding to the complication not every client implements unknown_producer_id logic similiarly, this can result in different behaviors on different clients. With this patch, we just mimic what Apache Kafka does to be consistent. Apache Kafka code for future reference. https://github.com/apache/kafka/pull/7115/files#diff-5482b26d93c5d36f272f65e628c1692622b69f8ba4a2df04ba74fad23623828dR239
config definition says it is but it is not, fixed it
If expire_old_txes kicks in and there are no tx topics, it means there are no transactions, that can be logged at a lower severity.
tests/rptest/transactions/verifiers/idempotency_load_generator.py
Outdated
Show resolved
Hide resolved
tests/rptest/transactions/verifiers/idempotency_load_generator.py
Outdated
Show resolved
Hide resolved
volatile Exception ex = null; | ||
volatile long counter = 0; | ||
|
||
void produceLoop() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess something like a pluggable class controlled by a config (similar to how the client API user can specify the partitioner class in the kafka producer config today)
In some racy situations it may happen that the request is already errored out. Consider the following sequence of actions. replicate_f - succeeded but set_value() not called -- scheduling point -- term change -> sync() -> GC of inflight requests, request is marked timedout now set_value() is called in the original fiber, this triggers an assert. Relaxing the assert condition to make it idempotent. Subsequent client retry of the request will be marked success (once the change is applied in the stm and the request state is populated). Unable to reproduce in a unit test mainly due to lack of an idempotent client in the unit test fixture.
/backport v24.2.x |
/backport v24.1.x |
/backport v23.3.x |
Failed to create a backport PR to v24.2.x branch. I tried:
|
Failed to create a backport PR to v24.1.x branch. I tried:
|
Failed to create a backport PR to v23.3.x branch. I tried:
|
[backport] [v24.2.x] miscellaneous idempotency fixes #22687
I thought clients had to reset from 0 once bumping -- i.e. once the epoch is bumped, the first sequence seen must be 0, not anything. |
nvm, I forgot that Kafka does this and that I actually opened an issue with them nearly 2yr ago: https://issues.apache.org/jira/browse/KAFKA-14312 |
[backport] [v24.1.x] miscellaneous idempotency fixes #22687
Two main changes in this patch
Broker can now handle epoch bumps for idempotent producers. A client can independently bump the producer epoch in certain situations (check kip-360 and related code) as idempotency only pertains to the single session. The broker code had issues handling epoch bumps which is fixed.
For evicted producer state on the broker (eg: log prefix truncation, producer expiration etc), there are subtle differences among clients around how they handle the producer reset scenario. Java client, for example bumps the epoch on OOOSN and if there are no other requests in flight while librdkafka is pretty strict and only does it on UNKNOWN_PRODUCER_ID error code (which explicitly tells the client that the broker has no state for the producer and it should reset). Changed the code to what Apache Kafka does, upon encountering an unknown producer id, any sequence number is accepted to make forward progress because the only way a broker doesn't know about the producer is when it got evicted from memory, doesn't seem fool proof but consistent with AK behavior and more importantly works with all the client implementations.
Backports Required
Release Notes