Txn sessions thrashing: expire all old txs instead of one at a time #13658
Conversation
/ci-repeat 5
Force-pushed from 77d10a2 to 7a1eb63
/ci-repeat 5
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37627#018acec6-3929-415e-8af0-a0bb2f3b7540
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37627#018acec6-3939-47a2-ab6c-274875032df5
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37627#018acec6-3930-44f9-bc98-89d7d2978dfd
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37627#018acebe-3c54-4bb4-bbf2-a199331d9c1d
```cpp
config::shard_local_cfg().create_topic_timeout_ms(),
true);

auto timeout = config::shard_local_cfg().create_topic_timeout_ms();
```
Why are we using `create_topic_timeout` here?
It's a refactoring, so it isn't something new, but the historical reason is that we need a timeout which covers happy-path replication: a write to disk plus a cluster-wide RTT. Ideally we should have a new property, but there is overhead associated with introducing one, and a minor release doesn't look like the place for it.
Force-pushed from e697703 to 34b485a
Refactor do_expire_old_tx to give more control to limit_init_tm_tx over error handling.
When fetch_tx fails, get_tx may remove a txn session and return tx_not_found; limit_init_tm_tx shouldn't propagate the error to the client because the method's intention is the same (to remove the oldest txn session when the txn cache is beyond its capacity).
Expiring one old txn session per new tx helps to maintain the txn session cache size at capacity, but it won't bring the cache size down if it's already beyond max_transactions_per_coordinator (this may happen when a user sets max_transactions_per_coordinator for the first time). Fix it by bulk expiring.
Cleaning the tx session cache blocks the `init_producer_id` request until the size of the cache falls below the threshold. This commit makes it less disruptive by allowing `init_producer_id` to proceed if its tx id is already in the cache, since processing it doesn't make the situation worse.
Force-pushed from 34b485a to 3d51861
/ci-repeat 5
ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/37759#018ad7df-b81d-493d-bfa0-45ac16ead531
https://buildkite.com/redpanda/redpanda/builds/37759
Known issues: one attempt is hanging (needs investigation). The hanging test is
/backport v23.2.x
/backport v23.1.x
👍
This is a follow-up to #12477.
Previously we expired one old txn session per new tx. That helped to maintain the txn session cache size at capacity, but it couldn't bring the cache size down if it was already beyond max_transactions_per_coordinator (this may happen when a user sets max_transactions_per_coordinator for the first time). This PR fixes it by bulk expiring.
Backports Required
Release Notes