
Txn sessions thrashing: expire all old txs instead of one at a time #13658

Merged

merged 6 commits into redpanda-data:dev on Sep 27, 2023

Conversation

@rystsov (Contributor) commented Sep 25, 2023

This is a follow-up to #12477

Previously we expired one old txn session per new tx. That kept the txn session cache at capacity, but it couldn't bring the cache size down if it was already beyond max_transactions_per_coordinator (which may happen when a user sets max_transactions_per_coordinator for the first time).

Fixing it by bulk expiring.
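
Roughly, the change amounts to the following hedged sketch; all names here are hypothetical stand-ins for the transaction coordinator's session cache, not the actual Redpanda code:

#include <cstddef>
#include <map>
#include <string>

struct tx_session {};

// Sketch: evict sessions until the cache fits under the cap, instead
// of evicting a single oldest session per incoming transaction.
void expire_old_txs(
  std::map<std::string, tx_session>& cache, std::size_t max_txs) {
    // With one-at-a-time eviction, a cache already above max_txs
    // shrinks by at most one entry per new tx, so it can stay over
    // the limit indefinitely. Bulk eviction drains it in one pass.
    while (cache.size() > max_txs) {
        // assume iteration order is oldest-first (e.g. LRU order)
        cache.erase(cache.begin());
    }
}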

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

  • none

@rystsov (Contributor, Author) commented Sep 25, 2023

/ci-repeat 5
dt-repeat=1


Review comment on the diff:

config::shard_local_cfg().create_topic_timeout_ms(),
true);

auto timeout = config::shard_local_cfg().create_topic_timeout_ms();
Member:

why are we using create_topic_timeout here?

@rystsov (Contributor, Author):

It's a refactoring, so it isn't something new. The historical reason is that we need a timeout which covers happy-path replication: a write to disk plus a cluster-wide RTT. Ideally we'd introduce a new property, but there is overhead associated with adding one, and a minor release doesn't look like the place for it.

Refactor do_expire_old_tx to give more control to limit_init_tm_tx
over error handling.
When fetch_tx fails, get_tx may remove a txn session and return
tx_not_found; limit_init_tm_tx shouldn't propagate the error to
the client because the method's intention is the same (to remove
the oldest txn session when the txn cache is beyond its capacity).
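
A hedged sketch of the error-handling contract this commit describes; the enum and function are assumed for illustration, not the actual Redpanda API:

enum class tx_errc { none, tx_not_found, timeout };

// Sketch: during eviction, get_tx may itself drop the session and
// report tx_not_found when fetch_tx fails. The session is gone either
// way, which is exactly what limit_init_tm_tx wanted, so that error
// is absorbed rather than surfaced to the client.
tx_errc absorb_expiry_error(tx_errc expire_result) {
    if (expire_result == tx_errc::tx_not_found) {
        return tx_errc::none; // session already removed: goal achieved
    }
    return expire_result; // real failures still propagate
}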
Expiring one old txn session per new tx helps to maintain the txn
sessions cache size at capacity, but it won't bring the cache
size down if it's already beyond max_transactions_per_coordinator
(it may happen when a user sets max_transactions_per_coordinator
for the first time).

Fixing it by bulk expiring.
Cleaning the tx session cache blocks `init_producer_id` requests until
the size of the cache falls below the threshold. This commit makes it
less disruptive by allowing `init_producer_id` to proceed if its tx id
is already in the cache, so processing it doesn't make the situation worse.
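
A minimal sketch of the admission check this commit describes; names and types are assumed, not the actual Redpanda code:

#include <cstddef>
#include <string>
#include <unordered_set>

// Sketch: only block an init_producer_id request on cache cleanup when
// admitting its tx id would grow the cache. A tx id that is already
// cached can be served immediately, because processing it cannot push
// the cache further over the threshold.
bool must_wait_for_cleanup(
  const std::unordered_set<std::string>& cached_tx_ids,
  const std::string& tx_id,
  std::size_t max_txs) {
    if (cached_tx_ids.count(tx_id) > 0) {
        return false; // known session: no new cache entry needed
    }
    return cached_tx_ids.size() >= max_txs; // new id while over capacity
}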
@rystsov (Contributor, Author) commented Sep 27, 2023

/ci-repeat 5
dt-repeat=1


@rystsov (Contributor, Author) commented Sep 27, 2023

https://buildkite.com/redpanda/redpanda/builds/37759

Known issues:

One attempt is hanging (needs investigation):

The hanging test is test_tiered_storage, which is also responsible for #13736 and #13745. Since the test has only just been added and doesn't seem to use transactions (use_transactions=False), the failure most likely isn't related to the changes in this PR.

[TestKey(test_id='rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.ABS.test_case=.TS_Read==True.TS_Timequery==True', test_index=205)]

@piyushredpanda piyushredpanda merged commit 7fcbe01 into redpanda-data:dev Sep 27, 2023
9 checks passed
@vbotbuildovich (Collaborator):

/backport v23.2.x

@vbotbuildovich (Collaborator):

/backport v23.1.x

@dotnwat (Member) left a comment:

👍
