Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

miscellaneous idempotency fixes #22687

Merged
merged 10 commits into from
Aug 6, 2024
Merged

Conversation

bharathv
Copy link
Contributor

@bharathv bharathv commented Aug 1, 2024

Two main changes in this patch

  • Broker can now handle epoch bumps for idempotent producers. A client can independently bump the producer epoch in certain situations (check kip-360 and related code) as idempotency only pertains to the single session. The broker code had issues handling epoch bumps which is fixed.

  • For evicted producer state on the broker (eg: log prefix truncation, producer expiration etc), there are subtle differences among clients around how they handle the producer reset scenario. Java client, for example bumps the epoch on OOOSN and if there are no other requests in flight while librdkafka is pretty strict and only does it on UNKNOWN_PRODUCER_ID error code (which explicitly tells the client that the broker has no state for the producer and it should reset). Changed the code to what Apache Kafka does, upon encountering an unknown producer id, any sequence number is accepted to make forward progress because the only way a broker doesn't know about the producer is when it got evicted from memory, doesn't seem fool proof but consistent with AK behavior and more importantly works with all the client implementations.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

  • none

@bharathv
Copy link
Contributor Author

bharathv commented Aug 1, 2024

/dt

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Aug 1, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01910ff3-c80b-4daa-88c8-3ab32d1b3aba:

"rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=compact"

new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01911009-8451-417d-8844-77ba4a69a81a:

"rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=compact"
"rptest.transactions.transactions_test.TransactionsTest.check_pids_overflow_test"

new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01910ff3-c80d-4b77-adc4-4087884b2f17:

"rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete"

new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01910ff3-c80f-4c65-916c-c4b3481bc0c7:

"rptest.transactions.transactions_test.TransactionsTest.check_pids_overflow_test"

new failures in https://buildkite.com/redpanda/redpanda/builds/52372#01911009-8453-4ee4-b449-039e780c418a:

"rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete"

new failures in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-435d-44a9-89e1-6170567876a0:

"rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete"

new failures in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-4362-4005-91bb-403bb648b13f:

"rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=compact"

new failures in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-dbdf-449f-b570-f961e0144371:

"rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=delete"

new failures in https://buildkite.com/redpanda/redpanda/builds/52395#019111ec-dbe5-4fc3-9236-0e445d88b5ca:

"rptest.tests.topic_creation_test.TopicRecreateTest.test_topic_recreation_while_producing.workload=IDEMPOTENT.cleanup_policy=compact"

new failures in https://buildkite.com/redpanda/redpanda/builds/52422#0191150d-0bb6-471c-8f62-d05a02b7bc07:

"rptest.transactions.transactions_test.TransactionsTest.check_pids_overflow_test"

new failures in https://buildkite.com/redpanda/redpanda/builds/52422#0191150d-dca7-4cbe-8666-acac7b6b3f54:

"rptest.transactions.transactions_test.TransactionsTest.check_pids_overflow_test"

@bharathv
Copy link
Contributor Author

bharathv commented Aug 2, 2024

/dt

1 similar comment
@bharathv
Copy link
Contributor Author

bharathv commented Aug 2, 2024

/dt

@bharathv
Copy link
Contributor Author

bharathv commented Aug 4, 2024

/dt

Factoring out the code into a utility, to be used in a later commit.
To be used to reset the producer state with new epoch for idempotent
producers that decide to bump the epoch on the client side (which is
totally fine as the idempotency is per session and client can
independently decide to bump the epoch on it's side).
src/v/cluster/rm_stm.h Outdated Show resolved Hide resolved
src/v/cluster/rm_stm.h Outdated Show resolved Hide resolved
volatile Exception ex = null;
volatile long counter = 0;

void produceLoop() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in future, it would be great to make it easier to create this applications using Java client. I was thinking about creating an interface that would be implementable from the python code. The application would then be compossed of classes implementing say a ProducerWorker interface and it would be configurable in runtime which implementation class to load. Now there is a lot of code duplication every time we need this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean a python wrapper around an embedded jvm? like py4j or some such?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess something like a pluggable class controlled by a config (similar to how the client API user can specify the partitioner class in the kafka producer config today)

#include <iostream>

namespace cluster {
std::ostream& operator<<(std::ostream& o, cluster::errc err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we just use errc_category::message()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that, that was formatted as a message "this is a sequence violation.." or some such which doesn't format well (grammatically) with our log lines, so used an explicit formatter.

For producers the broker no longer tracks, we now skip sequence
checks and allow any non zero sequence. This can happen if the producer
produced after the producer got evicted from the broker's memory (eg:
log got prefix truncated, producer hit expiration thresholds etc)

While kip-360 suggests that the broker should throw unknown_producer_id
error in this case, Apache Kafka no longer does that. Adding to the
complication not every client implements unknown_producer_id logic
similiarly, this can result in different behaviors on different clients.

With this patch, we just mimic what Apache Kafka does to be consistent.

Apache Kafka code for future reference.
https://github.com/apache/kafka/pull/7115/files#diff-5482b26d93c5d36f272f65e628c1692622b69f8ba4a2df04ba74fad23623828dR239
config definition says it is but it is not, fixed it
If expire_old_txes kicks in and there are no tx topics, it means there
are no transactions, that can be logged at a lower severity.
mmaslankaprv
mmaslankaprv previously approved these changes Aug 6, 2024
volatile Exception ex = null;
volatile long counter = 0;

void produceLoop() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess something like a pluggable class controlled by a config (similar to how the client API user can specify the partitioner class in the kafka producer config today)

In some racy situations it may happen that the request is already
errored out. Consider the following sequence of actions.

replicate_f - succeeded but set_value() not called
-- scheduling point --
term change -> sync() -> GC of inflight requests, request is marked
timedout

now set_value() is called in the original fiber, this triggers an
assert.

Relaxing the assert condition to make it idempotent. Subsequent client
retry of the request will be marked success (once the change is applied
in the stm and the request state is populated).

Unable to reproduce in a unit test mainly due to lack of an idempotent
client in the unit test fixture.
@piyushredpanda piyushredpanda added this to the v24.2.2 milestone Aug 6, 2024
@bharathv bharathv merged commit edf234b into redpanda-data:dev Aug 6, 2024
20 checks passed
@bharathv bharathv deleted the idem_fixes branch August 6, 2024 18:21
@vbotbuildovich
Copy link
Collaborator

/backport v24.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v24.1.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22687-v24.2.x-746 remotes/upstream/v24.2.x
git cherry-pick -x 26f6150198e043121b5e3e5b5deb6d8b0843e262 7ac5acfb9411bc9acfa7ae7fdd4a948236956e66 2f5b9cbcff2cb9a41290e685046d1ae8939271a6 e2c59b811810fa97b73234b1b8dd5040ee8b68f8 bc3d761b87b42b719ad9fab2172760abd4ac46d0 0ed2a90c0c071b3c6710e9cfacf6895db51b9076 3c36e7c383070b704e1bfbe43ec02fe598459533 1af75574cef77066d5c3f97695c429b4585ea5b9 f8e5c21deba4621c7fc4c8eda40b4a16270574aa 092a2b89e03450d5cf182be938980afe191d6b27

Workflow run logs.

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22687-v24.1.x-773 remotes/upstream/v24.1.x
git cherry-pick -x 26f6150198e043121b5e3e5b5deb6d8b0843e262 7ac5acfb9411bc9acfa7ae7fdd4a948236956e66 2f5b9cbcff2cb9a41290e685046d1ae8939271a6 e2c59b811810fa97b73234b1b8dd5040ee8b68f8 bc3d761b87b42b719ad9fab2172760abd4ac46d0 0ed2a90c0c071b3c6710e9cfacf6895db51b9076 3c36e7c383070b704e1bfbe43ec02fe598459533 1af75574cef77066d5c3f97695c429b4585ea5b9 f8e5c21deba4621c7fc4c8eda40b4a16270574aa 092a2b89e03450d5cf182be938980afe191d6b27

Workflow run logs.

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22687-v23.3.x-922 remotes/upstream/v23.3.x
git cherry-pick -x 26f6150198e043121b5e3e5b5deb6d8b0843e262 7ac5acfb9411bc9acfa7ae7fdd4a948236956e66 2f5b9cbcff2cb9a41290e685046d1ae8939271a6 e2c59b811810fa97b73234b1b8dd5040ee8b68f8 bc3d761b87b42b719ad9fab2172760abd4ac46d0 0ed2a90c0c071b3c6710e9cfacf6895db51b9076 3c36e7c383070b704e1bfbe43ec02fe598459533 1af75574cef77066d5c3f97695c429b4585ea5b9 f8e5c21deba4621c7fc4c8eda40b4a16270574aa 092a2b89e03450d5cf182be938980afe191d6b27

Workflow run logs.

piyushredpanda added a commit that referenced this pull request Aug 6, 2024
[backport] [v24.2.x] miscellaneous idempotency fixes #22687
@twmb
Copy link
Contributor

twmb commented Aug 8, 2024

any sequence number is accepted

I thought clients had to reset from 0 once bumping -- i.e. once the epoch is bumped, the first sequence seen must be 0, not anything.

@twmb
Copy link
Contributor

twmb commented Aug 8, 2024

nvm, I forgot that Kafka does this and that I actually opened an issue with them nearly 2yr ago: https://issues.apache.org/jira/browse/KAFKA-14312

piyushredpanda added a commit that referenced this pull request Aug 15, 2024
[backport] [v24.1.x] miscellaneous idempotency fixes #22687
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants