
Txn sessions thrashing: expire all old txs instead of one at a time #13658

Merged

merged 6 commits into redpanda-data:dev on Sep 27, 2023

Conversation

@rystsov (Contributor) commented Sep 25, 2023

This is a follow-up to #12477

Previously we expired one old txn session per new tx. That kept the txn session cache at capacity, but it couldn't bring the cache size down if it was already beyond max_transactions_per_coordinator (which may happen when a user sets max_transactions_per_coordinator for the first time).

Fixing it by bulk expiring.
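
Roughly, the change amounts to the following hedged sketch; all names here are hypothetical stand-ins for the transaction coordinator's session cache, not the actual Redpanda code:

#include <cstddef>
#include <map>
#include <string>

struct tx_session {};

// Sketch: evict sessions until the cache fits under the cap, instead
// of evicting a single oldest session per incoming transaction.
void expire_old_txs(
  std::map<std::string, tx_session>& cache, std::size_t max_txs) {
    // With one-at-a-time eviction, a cache already above max_txs
    // shrinks by at most one entry per new tx, so it can stay over
    // the limit indefinitely. Bulk eviction drains it in one pass.
    while (cache.size() > max_txs) {
        // assume iteration order is oldest-first (e.g. LRU order)
        cache.erase(cache.begin());
    }
}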

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

  • none

@rystsov (Contributor, Author) commented Sep 25, 2023

/ci-repeat 5
dt-repeat=1


Review comment on the diff:

config::shard_local_cfg().create_topic_timeout_ms(),
true);

auto timeout = config::shard_local_cfg().create_topic_timeout_ms();
Member:

why are we using create_topic_timeout here?

@rystsov (Contributor, Author):

It's a refactoring, so it isn't something new. The historical reason is that we need a timeout which covers happy-path replication: a write to disk plus a cluster-wide RTT. Ideally we'd introduce a new property, but there is overhead associated with adding one, and a minor release doesn't look like the place for it.

Refactor do_expire_old_tx to give more control to limit_init_tm_tx
over error handling.
When fetch_tx fails, get_tx may remove a txn session and return
tx_not_found; limit_init_tm_tx shouldn't propagate the error to
the client because the method's intention is the same (to remove
the oldest txn session when the txn cache is beyond its capacity).
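
A hedged sketch of the error-handling contract this commit describes; the enum and function are assumed for illustration, not the actual Redpanda API:

enum class tx_errc { none, tx_not_found, timeout };

// Sketch: during eviction, get_tx may itself drop the session and
// report tx_not_found when fetch_tx fails. The session is gone either
// way, which is exactly what limit_init_tm_tx wanted, so that error
// is absorbed rather than surfaced to the client.
tx_errc absorb_expiry_error(tx_errc expire_result) {
    if (expire_result == tx_errc::tx_not_found) {
        return tx_errc::none; // session already removed: goal achieved
    }
    return expire_result; // real failures still propagate
}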
Expiring one old txn session per new tx helps to maintain the txn
sessions cache size at capacity, but it won't bring the cache
size down if it's already beyond max_transactions_per_coordinator
(it may happen when a user sets max_transactions_per_coordinator
for the first time).

Fixing it by bulk expiring.
Cleaning the tx session cache blocks `init_producer_id` requests until
the size of the cache falls below the threshold. This commit makes it
less disruptive by allowing `init_producer_id` to proceed if its tx id
is already in the cache, so processing it doesn't make the situation worse.
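
A minimal sketch of the admission check this commit describes; names and types are assumed, not the actual Redpanda code:

#include <cstddef>
#include <string>
#include <unordered_set>

// Sketch: only block an init_producer_id request on cache cleanup when
// admitting its tx id would grow the cache. A tx id that is already
// cached can be served immediately, because processing it cannot push
// the cache further over the threshold.
bool must_wait_for_cleanup(
  const std::unordered_set<std::string>& cached_tx_ids,
  const std::string& tx_id,
  std::size_t max_txs) {
    if (cached_tx_ids.count(tx_id) > 0) {
        return false; // known session: no new cache entry needed
    }
    return cached_tx_ids.size() >= max_txs; // new id while over capacity
}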
@rystsov (Contributor, Author) commented Sep 27, 2023

/ci-repeat 5
dt-repeat=1


@rystsov (Contributor, Author) commented Sep 27, 2023

https://buildkite.com/redpanda/redpanda/builds/37759

Known issues:

One attempt is hanging (needs investigation):

The hanging test is test_tiered_storage, which is also responsible for #13736 and #13745. Since the test has only just been added and doesn't seem to use transactions (use_transactions=False), the failure most likely isn't related to the changes in this PR.

[TestKey(test_id='rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Read==True.TS_Timequery==True', test_index=205)]

@piyushredpanda piyushredpanda merged commit 7fcbe01 into redpanda-data:dev Sep 27, 2023
9 checks passed
@vbotbuildovich (Collaborator):

/backport v23.2.x

@vbotbuildovich (Collaborator):

/backport v23.1.x

@dotnwat (Member) left a comment:

👍
