
kv,testserver: more easily inject transaction retries #106417

Open
stevendanna opened this issue Jul 7, 2023 · 1 comment
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), db-cy-23


stevendanna commented Jul 7, 2023

We've had a number of bugs caused by incorrect management of state within a retriable function passed to the (*sql.InternalDB).Txn or (*kv.DB).Txn methods.

For example, see the two recent issues:

We'd like to be able to more aggressively find these types of issues.

Currently, there are a few tools one can use to generate transaction retries in a test:

  • crdb_internal.force_retry
  • SET inject_retry_errors_enabled=true (a usage sketch for these two follows this list)
  • arranging for a write conflict on some unrelated scratch key via a well-placed testing hook

        RunDuringMergeTxn: func(ctx context.Context, txn *kv.Txn, startKey roachpb.Key, endKey roachpb.Key) error {
            var err error
            retryOnce.Do(func() {
                t.Logf("forcing txn retry for [%s, %s) using %s", startKey, endKey, scratch)
                if _, err := txn.Get(ctx, scratch); err != nil {
                    require.NoError(t, err)
                }
                require.NoError(t, kvDB.Put(ctx, scratch, "foo"))
                // This hits a WriteTooOldError, but we swallow the error to test
                // the txn failing after the entire txn closure has run.
                _ = txn.Put(ctx, scratch, "bar")
            })
            return err
        },
  • using the TestingRequestFilter testing knob to inject a retriable error in response to EndTxn

        TestingRequestFilter: func(ctx context.Context, r *kvpb.BatchRequest) *kvpb.Error {
            if r.Txn == nil || r.Txn.Name != txnName {
                return nil
            }
            if _, ok := r.GetArg(kvpb.EndTxn); ok {
                if !haveInjectedRetry {
                    haveInjectedRetry = true
                    // Force a retry error the first time.
                    return kvpb.NewError(kvpb.NewTransactionRetryError(kvpb.RETRY_REASON_UNKNOWN, "injected error"))
                }
            }
            return nil
        },
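
The first two tools are driven entirely from SQL. As a rough sketch of how a test might use them (assuming a sqlutils.SQLRunner named sqlDB against a test server; the exact error-message regex below is an assumption):

    // Force a retriable error inside an explicit transaction.
    // crdb_internal.force_retry returns a retriable error while the
    // transaction has been open for less than the given interval.
    sqlDB.Exec(t, `BEGIN`)
    sqlDB.ExpectErr(t, `restart transaction`, `SELECT crdb_internal.force_retry('1h')`)
    sqlDB.Exec(t, `ROLLBACK`)

    // Alternatively, make statements inside explicit transactions
    // return retry errors a few times before succeeding.
    sqlDB.Exec(t, `SET inject_retry_errors_enabled = true`)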

These work well if the code in question is designed to make such tools easy to use inside a test. However, they aren't great for finding transaction retry issues in existing code.

It would be nice to have a feature that could force all DB.Txn closures to retry with some probability, without the author having pre-arranged for this to be possible.
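
In outline, such a knob could wrap the user-provided closure and, with some probability, feed one synthetic retry through the existing retry loop after the closure first succeeds. A minimal self-contained sketch of the idea (all names here are hypothetical stand-ins, not the actual #106506 implementation or the kv package API):

    package retryinject

    import (
        "context"
        "errors"
        "math/rand"
    )

    // errInjectedRetry stands in for kv's retriable error type.
    var errInjectedRetry = errors.New("injected txn retry")

    // withRandomRetry wraps a retryable closure so that, with probability p,
    // its first successful run is followed by exactly one forced retry.
    func withRandomRetry(p float64, fn func(context.Context) error) func(context.Context) error {
        injected := false
        return func(ctx context.Context) error {
            if err := fn(ctx); err != nil {
                return err
            }
            if !injected && rand.Float64() < p {
                injected = true
                return errInjectedRetry // the retry loop treats this as retriable
            }
            return nil
        }
    }

    // runTxn is a toy retry loop standing in for (*kv.DB).Txn: it re-invokes
    // fn whenever the injected retry error comes back.
    func runTxn(ctx context.Context, fn func(context.Context) error) error {
        for {
            if err := fn(ctx); !errors.Is(err, errInjectedRetry) {
                return err
            }
        }
    }

A closure that mutates captured state (a counter, a cursor, a slice it appends to) will then run twice some fraction of the time, which is exactly the condition that flushes out non-idempotent closures.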

When experimenting with a simple version of such a feature locally (#106506), it has already proven useful in potentially reproducing a rare bug that we've seen in CI and in finding new issues, including but not limited to:

Non-trivial Issues

Likely Trivial Issues

Potential Issues

Next steps

Before we can turn something like this on by default, we need to flush out issues that would cause this to be too noisy for teams. The logictests provide one place where we can get a large amount of coverage without modifying a large number of tests:

  • Make sure existing fixes above have been backported if appropriate.
  • Add helper to make it a bit faster to write tests targeting particular transactions (one possible shape is sketched after this list).
  • Run against all logictests to identify more issues
  • Add logictest feature to opt a test out of random txn retries
  • Enable by default in logictests
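
For the helper item above, one possible shape is a filter factory built on the TestingRequestFilter knob already shown earlier. This is a hedged sketch with hypothetical names, not the helper that eventually merged in #106814:

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/kv/kvpb"
    )

    // forceTxnRetry returns a TestingRequestFilter that injects a single
    // retriable error the first time a transaction with the given name
    // tries to commit.
    func forceTxnRetry(txnName string) func(context.Context, *kvpb.BatchRequest) *kvpb.Error {
        var injected bool
        return func(ctx context.Context, r *kvpb.BatchRequest) *kvpb.Error {
            if injected || r.Txn == nil || r.Txn.Name != txnName {
                return nil
            }
            if _, ok := r.GetArg(kvpb.EndTxn); ok {
                injected = true
                return kvpb.NewError(kvpb.NewTransactionRetryError(
                    kvpb.RETRY_REASON_UNKNOWN, "injected error"))
            }
            return nil
        }
    }

A test then names its transaction (via txn.SetDebugName or equivalent) and installs the filter in the server's testing knobs.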

Jira issue: CRDB-29559

stevendanna added the C-enhancement label Jul 7, 2023
stevendanna self-assigned this Jul 7, 2023
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 10, 2023
Occasionally, we see bugs caused by mismanagement of state inside a
function that will be retried. Namely, the functions `(*kv.DB).Txn`
and `(*sql.InternalExecutor).Txn` take funcs that must be suitably
idempotent.

This testing knob provides a cheap way to help us hunt down such
bugs.

In the long run, it would be nice to turn this on by default, but we
should probably make sure TestServer runs reliably with this turned on
before flipping the default.

I also considered retrying _every_ transaction once rather than
relying on randomness. However, that would require adding more state
in kv.Txn, which I wasn't sure I really wanted to do for testing
purposes.

Informs cockroachdb#106417

Epic: none

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 12, 2023
The transaction modified here updated the curID variable controlling
the deletion batching inside the retryable transaction func.

If the last batch was retried, the getMaxIDInBatch query would return
NULL, leading to a panic. Reading the code, it seems to me that retries
on batches before the last batch would likely result in some
descriptors not being deleted.

Informs cockroachdb#106417
Informs cockroachdb#106506

Epic: none

Release note: None
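
The bug class here reduces to a small pattern: loop state advanced inside a retryable closure. A hedged miniature of it (runDeletions, getMaxIDInBatch, and deleteBatch below are illustrative stand-ins, not the actual upgrades code):

    // runDeletions batches deletions over an ID range inside retryable txns.
    func runDeletions(ctx context.Context, db *kv.DB, startID, endID int64) error {
        curID := startID
        for curID < endID {
            if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
                maxID, err := getMaxIDInBatch(ctx, txn, curID) // stand-in query
                if err != nil {
                    return err
                }
                if err := deleteBatch(ctx, txn, curID, maxID); err != nil {
                    return err
                }
                // BUG: mutating curID here survives a txn retry, so a retried
                // attempt starts past the rows it was supposed to delete.
                curID = maxID + 1
                return nil
            }); err != nil {
                return err
            }
        }
        return nil
    }

The fix is to have the closure compute the next cursor into a local that is reset on each attempt, and advance curID only after db.Txn returns successfully.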
craig bot pushed a commit that referenced this issue Jul 12, 2023
106510: backupccl: fix txn retry bug during planning r=dt a=stevendanna

If the Txn modified here was retried, in some cases we would fail with a nil pointer dereference.

The details struct passed into getBackupManifestAndDetails is passed by value and then modified inside the function. As part of that modification, we update the EncryptionOptions. But I believe that when the encryption mode is None, we actually end up setting the encryption options to nil.

https://github.com/cockroachdb/cockroach/blob/3a523711f46921916b6bd1af5153868e1489565e/pkg/ccl/backupccl/backup_planning.go#L1621
https://github.com/cockroachdb/cockroach/blob/3a523711f46921916b6bd1af5153868e1489565e/pkg/ccl/backupccl/backupencryption/encryption.go#L213-L258

Informs #106417

The change here maybe isn't the best way to solve this. We could update the encryption option handling code to better handle the case of encryption mode None. But really, ownership over these structs needs to be made much cleaner.

Epic: none

Release note: None

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 13, 2023
106701: upgrades: fix transaction retry error r=chengxiong-ruan a=stevendanna

The transaction modified here updated the curID variable controlling the deletion batching inside the retryable transaction func.

If the last batch was retried, the getMaxIDInBatch query would return NULL, leading to a panic. Reading the code, it seems to me that retries on batches before the last batch would likely result in some descriptors not being deleted.

Informs #106417
Informs #106506

Epic: none

Release note: None

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Jul 13, 2023
The transaction modified here updated the curID variable controlling
the deletion batching inside the retryable transaction func.

If the last batch was retried, the getMaxIDInBatch query would return
NULL, leading to a panic. Reading the code, it seems to me that retries
on batches before the last batch would likely result in some
descriptors not being deleted.

Informs #106417
Informs #106506

Epic: none

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 14, 2023
This helper makes it a little quicker to write a test that checks
whether a particular transaction is retry-safe.

Informs cockroachdb#106417

Epic: none

Release note: none
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 14, 2023
Previously, we would double-count the number of deleted rows in the
face of a transaction retry.

Here, we move the counting outside of the Txn func and use a small
helper function to force transaction retries during the test.

It wasn't clear to me whether we wanted the metric to be the total
number of deletion attempts or the total committed deletions.  I've
made it the latter.

A transaction retry here seems not very likely at all, but this was
found when trying to apply random transaction retries to the entire
test suite.

Informs cockroachdb#106417

Epic: none

Release note: none
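
The fix pattern described above, in miniature (hedged stand-ins; hasMoreBatches and deleteBatch below are not the actual ttljob code):

    // Before the fix, rowsDeleted was incremented inside the Txn closure and
    // double-counted on retry. Now the closure reports the batch's count and
    // the caller folds it in only after the txn commits.
    var rowsDeleted int64
    for hasMoreBatches() {
        var n int64
        if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
            var err error
            n, err = deleteBatch(ctx, txn) // n is overwritten on each attempt
            return err
        }); err != nil {
            return err
        }
        rowsDeleted += n // count committed deletions only
    }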
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 14, 2023
This helper makes it a little quicker to write a test that checks
whether a particular transaction is retry-safe.

Informs cockroachdb#106417

Epic: none

Release note: none
craig bot pushed a commit that referenced this issue Jul 14, 2023
106738: logic: skip_on_retry works when errors are expected r=Xiang-Gu a=Xiang-Gu

Previously, we had a `skip_on_retry` directive for logic tests which, when set, skips the rest of the test if a statement fails with a TransactionRetryError. However, it would not skip if the statement was expected to fail with a certain error message. This PR ensures that whenever we hit a TransactionRetryError and `skip_on_retry` is set, we always skip the rest of the test, even if the stmt is expected to fail.

fixes #104464

Release note: None

106759: streamingccl: unskip TestStreamDeleteRange r=msbutler a=stevendanna

This test had previously timed out. The timeout we saw was the result of a couple of issues.

When waiting for all delete ranges, our loop exit condition was very strict. We would only stop looking for rows if the number of delete ranges was exactly 3. If, however, we got 4 delete ranges, with 2 coming in a single batch, we would never hit this condition.

How would that happen though? One possibility is rangefeed duplicates. Another, and what appears to have been happening in this test, is that the representation of the range deletes observed by the rangefeed consumer is slightly different depending on whether the range delete is delivered as part of a catch-up scan or as part of the rangefeed's steady state. I believe this is because the range deletes overlap but are issued at different time points. When we get them as part of the steady state, we get a trimmed version of the original event. When we get them as part of the catch-up scan, we get them broken up at the point of overlap.

Fixes #93568

Epic: none

Release note: None

106814: testutils: add helper to target transactions for retries r=lidorcarmel a=stevendanna

This helper makes it a little quicker to write a test that checks whether a particular transaction is retry-safe.

Informs #106417

Epic: none

Release note: none

106822: spanconfigccl: remove uses of `TODOTestTenantDisabled` r=stevendanna a=knz

Informs #76378 .
Epic: CRDB-18499

There's a mix of tests that control their tenants directly, and tests that should really work with virtualization enabled but don't.

Followup issues: #106821 and #106818.

Release note: None

106832: server: bark loudly if the test tenant cannot be created r=herkolategan a=knz

Informs #76378 
Informs #103772. 
Epic: CRDB-18499

For context, the automatic test tenant machinery is currently dependent on a CCL enterprise license check.
(This is fundamentally not necessary - see #103772 - but sadly this is the way it is for now)

Prior to this patch, if the user or a test selected the creation of a test tenant, but the test code forgot to import the required CCL Go package, the framework would announce that "a test tenant was created" while actually silently failing to do so.

This led to confusing investigations where a test tenant was expected, a test was appearing to succeed, but with a release build the same condition would fail.

This commit enhances the situation by ensuring we have clear logging output when the test tenant cannot be created due to the missing CCL import.

Release note: None

Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
msbutler pushed a commit to msbutler/cockroach that referenced this issue Jul 17, 2023
This helper makes it a little quicker to write a test that checks
whether a particular transaction is retry-safe.

Informs cockroachdb#106417

Epic: none

Release note: none
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 21, 2023
Previously, we would double-count the number of deleted rows in the
face of a transaction retry.

Here, we move the counting outside of the Txn func and use a small
helper function to force transaction retries during the test.

It wasn't clear to me whether we wanted the metric to be the total
number of deletion attempts or the total committed deletions.  I've
made it the latter.

A transaction retry here seems not very likely at all, but this was
found when trying to apply random transaction retries to the entire
test suite.

Informs cockroachdb#106417

Epic: none

Release note: none
craig bot pushed a commit that referenced this issue Jul 21, 2023
106808: ttljob: fix minor book-keeping bug during txn retry r=knz a=stevendanna

Previously, we would double-count the number of deleted rows in the face of a transaction retry.

Here, we move the counting outside of the Txn func and use a small helper function to force transaction retries during the test.

It wasn't clear to me whether we wanted the metric to be the total number of deletion attempts or the total committed deletions.  I've made it the latter.

A transaction retry here seems not very likely at all, but this was found when trying to apply random transaction retries to the entire test suite.

Informs #106417

Epic: none

Release note: none

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Jul 21, 2023
This helper makes it a little quicker to write a test that checks
whether a particular transaction is retry-safe.

Informs #106417

Epic: none

Release note: none
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 28, 2023
Occasionally, we see bugs caused by mismanagement of state inside a
function that will be retried. Namely, the functions `(*kv.DB).Txn`
and `(*sql.InternalExecutor).Txn` take funcs that must be suitably
idempotent.

This testing knob provides a cheap way to help us hunt down such
bugs.

In the long run, it would be nice to turn this on by default, but we
should probably make sure TestServer runs reliably with this turned on
before flipping the default.

I also considered retrying _every_ transaction once rather than
relying on randomness. However, that would require adding more state
in kv.Txn, which I wasn't sure I really wanted to do for testing
purposes.

Informs cockroachdb#106417

Epic: none

Release note: None
jbowens commented Oct 6, 2023

With Go generics available, it seems feasible to add variants of these methods that are generic over an input type and an output type. Then we could write a linter enforcing that invocations of these variants don't pass funcs that close over any outside state.

There's still the possibility of accidentally mutating state through pointers in the input type, but it's at least easier to audit, since you only need to look at the input type and what's reachable through it.
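
A sketch of what such a generic variant could look like (hypothetical name and signature, not an actual kv API): the closure receives all of its state via the input value and hands results back through the return value, so a linter can flag any closure that captures outside variables.

    // TxnWithState runs fn in a retryable transaction. Everything fn needs
    // comes in via `in`; everything it produces goes out via Out. An Out
    // value from a failed attempt is overwritten on the next attempt.
    func TxnWithState[In, Out any](
        ctx context.Context, db *kv.DB, in In,
        fn func(ctx context.Context, txn *kv.Txn, in In) (Out, error),
    ) (Out, error) {
        var out Out
        err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
            var err error
            out, err = fn(ctx, txn, in)
            return err
        })
        return out, err
    }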
