
kv,testserver: more easily inject transaction retries #106417

Open
stevendanna opened this issue Jul 7, 2023 · 1 comment
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception), db-cy-23


stevendanna commented Jul 7, 2023

We've had a number of bugs caused by incorrect management of state within a retriable function passed to the (*sql.InternalDB).Txn or (*kv.DB).Txn methods.

For example, see the two recent issues:

We'd like to be able to more aggressively find these types of issues.

Currently, there are a few tools one can use to generate transaction retries in a test:

  • crdb_internal.force_retry
  • SET inject_retry_errors_enabled=true (a usage sketch for these two follows this list)
  • arranging for a write conflict on some unrelated scratch key via a well-placed testing hook

        RunDuringMergeTxn: func(ctx context.Context, txn *kv.Txn, startKey roachpb.Key, endKey roachpb.Key) error {
            var err error
            retryOnce.Do(func() {
                t.Logf("forcing txn retry for [%s, %s) using %s", startKey, endKey, scratch)
                if _, err := txn.Get(ctx, scratch); err != nil {
                    require.NoError(t, err)
                }
                require.NoError(t, kvDB.Put(ctx, scratch, "foo"))
                // This hits a WriteTooOldError, but we swallow the error to test
                // the txn failing after the entire txn closure has run.
                _ = txn.Put(ctx, scratch, "bar")
            })
            return err
        },
  • using the TestingRequestFilter testing knob to inject a retriable error in response to EndTxn

        TestingRequestFilter: func(ctx context.Context, r *kvpb.BatchRequest) *kvpb.Error {
            if r.Txn == nil || r.Txn.Name != txnName {
                return nil
            }
            if _, ok := r.GetArg(kvpb.EndTxn); ok {
                if !haveInjectedRetry {
                    haveInjectedRetry = true
                    // Force a retry error the first time.
                    return kvpb.NewError(kvpb.NewTransactionRetryError(kvpb.RETRY_REASON_UNKNOWN, "injected error"))
                }
            }
            return nil
        },
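
The first two tools are driven entirely from SQL. As a rough sketch of how a test might use them (assuming a sqlutils.SQLRunner named sqlDB against a test server; the exact error-message regex below is an assumption):

    // Force a retriable error inside an explicit transaction.
    // crdb_internal.force_retry returns a retriable error while the
    // transaction has been open for less than the given interval.
    sqlDB.Exec(t, `BEGIN`)
    sqlDB.ExpectErr(t, `restart transaction`, `SELECT crdb_internal.force_retry('1h')`)
    sqlDB.Exec(t, `ROLLBACK`)

    // Alternatively, make statements inside explicit transactions
    // return retry errors a few times before succeeding.
    sqlDB.Exec(t, `SET inject_retry_errors_enabled = true`)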

These work well if the code in question is designed to make such tools easy to use inside a test. However, they aren't great for finding transaction retry issues in existing code.

It would be nice to have a feature that could force all DB.Txn closures to retry with some probability, without the author having pre-arranged for this to be possible.
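
In outline, such a knob could wrap the user-provided closure and, with some probability, feed one synthetic retry through the existing retry loop after the closure first succeeds. A minimal self-contained sketch of the idea (all names here are hypothetical stand-ins, not the actual #106506 implementation or the kv package API):

    package retryinject

    import (
        "context"
        "errors"
        "math/rand"
    )

    // errInjectedRetry stands in for kv's retriable error type.
    var errInjectedRetry = errors.New("injected txn retry")

    // withRandomRetry wraps a retryable closure so that, with probability p,
    // its first successful run is followed by exactly one forced retry.
    func withRandomRetry(p float64, fn func(context.Context) error) func(context.Context) error {
        injected := false
        return func(ctx context.Context) error {
            if err := fn(ctx); err != nil {
                return err
            }
            if !injected && rand.Float64() < p {
                injected = true
                return errInjectedRetry // the retry loop treats this as retriable
            }
            return nil
        }
    }

    // runTxn is a toy retry loop standing in for (*kv.DB).Txn: it re-invokes
    // fn whenever the injected retry error comes back.
    func runTxn(ctx context.Context, fn func(context.Context) error) error {
        for {
            if err := fn(ctx); !errors.Is(err, errInjectedRetry) {
                return err
            }
        }
    }

A closure that mutates captured state (a counter, a cursor, a slice it appends to) will then run twice some fraction of the time, which is exactly the condition that flushes out non-idempotent closures.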

When experimenting with a simple version of such a feature locally (#106506), it has already proven useful in potentially reproducing a rare bug that we've seen in CI and in finding new issues, including but not limited to:

Non-trivial Issues

Likely Trivial Issues

Potential Issues

Next steps

Before we can turn something like this on by default, we need to flush out issues that would cause this to be too noisy for teams. The logictests provide one place where we can get a large amount of coverage without modifying a large number of tests:

  • Make sure existing fixes above have been backported if appropriate.
  • Add helper to make it a bit faster to write tests targeting particular transactions (one possible shape is sketched after this list).
  • Run against all logictests to identify more issues
  • Add logictest feature to opt a test out of random txn retries
  • Enable by default in logictests
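
For the helper item above, one possible shape is a filter factory built on the TestingRequestFilter knob already shown earlier. This is a hedged sketch with hypothetical names, not the helper that eventually merged in #106814:

    import (
        "context"

        "github.com/cockroachdb/cockroach/pkg/kv/kvpb"
    )

    // forceTxnRetry returns a TestingRequestFilter that injects a single
    // retriable error the first time a transaction with the given name
    // tries to commit.
    func forceTxnRetry(txnName string) func(context.Context, *kvpb.BatchRequest) *kvpb.Error {
        var injected bool
        return func(ctx context.Context, r *kvpb.BatchRequest) *kvpb.Error {
            if injected || r.Txn == nil || r.Txn.Name != txnName {
                return nil
            }
            if _, ok := r.GetArg(kvpb.EndTxn); ok {
                injected = true
                return kvpb.NewError(kvpb.NewTransactionRetryError(
                    kvpb.RETRY_REASON_UNKNOWN, "injected error"))
            }
            return nil
        }
    }

A test then names its transaction (via txn.SetDebugName or equivalent) and installs the filter in the server's testing knobs.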

Jira issue: CRDB-29559

stevendanna added the C-enhancement label Jul 7, 2023
stevendanna self-assigned this Jul 7, 2023
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 10, 2023
Occasionally, we see bugs caused by mismanagement of state inside a
function that will be retried. Namely, the functions `(*kv.DB).Txn`
and `(*sql.InternalExecutor).Txn` take funcs that must be suitably
idempotent.

This testing knob provides a cheap way to help us hunt down such
bugs.

In the long run, it would be nice to turn this on by default, but we
should probably make sure TestServer runs reliably with this turned on
before flipping the default.

I also considered retrying _every_ transaction once rather than
relying on randomness. However, that would require adding more state
in kv.Txn, which I wasn't sure I really wanted to do for testing
purposes.

Informs cockroachdb#106417

Epic: none

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 12, 2023
The transaction modified here updated the curID variable controlling
the deletion batching inside the retryable transaction func.

If the last batch was retried, the getMaxIDInBatch query would return
NULL, leading to a panic. Reading the code, it seems to me that retries
on batches before the last batch would likely result in some
descriptors not being deleted.

Informs cockroachdb#106417
Informs cockroachdb#106506

Epic: none

Release note: None
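
The bug class here reduces to a small pattern: loop state advanced inside a retryable closure. A hedged miniature of it (runDeletions, getMaxIDInBatch, and deleteBatch below are illustrative stand-ins, not the actual upgrades code):

    // runDeletions batches deletions over an ID range inside retryable txns.
    func runDeletions(ctx context.Context, db *kv.DB, startID, endID int64) error {
        curID := startID
        for curID < endID {
            if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
                maxID, err := getMaxIDInBatch(ctx, txn, curID) // stand-in query
                if err != nil {
                    return err
                }
                if err := deleteBatch(ctx, txn, curID, maxID); err != nil {
                    return err
                }
                // BUG: mutating curID here survives a txn retry, so a retried
                // attempt starts past the rows it was supposed to delete.
                curID = maxID + 1
                return nil
            }); err != nil {
                return err
            }
        }
        return nil
    }

The fix is to have the closure compute the next cursor into a local that is reset on each attempt, and advance curID only after db.Txn returns successfully.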
craig bot pushed a commit that referenced this issue Jul 12, 2023
106510: backupccl: fix txn retry bug during planning r=dt a=stevendanna

If the Txn modified here was retried, in some cases we would fail with a nil pointer dereference.

The details struct passed into getBackupManifestAndDetails is passed by value and then modified inside the function. As part of that modification, we update the EncryptionOptions. But I believe that when the encryption mode is None, we actually end up setting the encryption options to nil.

https://github.com/cockroachdb/cockroach/blob/3a523711f46921916b6bd1af5153868e1489565e/pkg/ccl/backupccl/backup_planning.go#L1621
https://github.com/cockroachdb/cockroach/blob/3a523711f46921916b6bd1af5153868e1489565e/pkg/ccl/backupccl/backupencryption/encryption.go#L213-L258

Informs #106417

The change here maybe isn't the best way to solve this. We could update the encryption option handling code to better handle the case of encryption mode None. But really, ownership over these structs needs to be made much cleaner.

Epic: none

Release note: None

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Jul 13, 2023
106701: upgrades: fix transaction retry error r=chengxiong-ruan a=stevendanna

The transaction modified here updated the curID variable controlling the deletion batching inside the retryable transaction func.

If the last batch was retried, the getMaxIDInBatch query would return NULL, leading to a panic. Reading the code, it seems to me that retries on batches before the last batch would likely result in some descriptors not being deleted.

Informs #106417
Informs #106506

Epic: none

Release note: None

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Jul 13, 2023
The transaction modified here updated the curID variable controlling
the deletion batching inside the retryable transaction func.

If the last batch was retried, the getMaxIDInBatch query would return
NULL, leading to a panic. Reading the code, it seems to me that retries
on batches before the last batch would likely result in some
descriptors not being deleted.

Informs #106417
Informs #106506

Epic: none

Release note: None
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 14, 2023
This helper makes it a little quicker to write a test that checks
whether a particular transaction is retry-safe.

Informs cockroachdb#106417

Epic: none

Release note: none
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 14, 2023
Previously, we would double-count the number of deleted rows in the
face of a transaction retry.

Here, we move the counting outside of the Txn func and use a small
helper function to force transaction retries during the test.

It wasn't clear to me whether we wanted the metric to be the total
number of deletion attempts or the total committed deletions.  I've
made it the latter.

A transaction retry here seems not very likely at all, but this was
found when trying to apply random transaction retries to the entire
test suite.

Informs cockroachdb#106417

Epic: none

Release note: none
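
The fix pattern described above, in miniature (hedged stand-ins; hasMoreBatches and deleteBatch below are not the actual ttljob code):

    // Before the fix, rowsDeleted was incremented inside the Txn closure and
    // double-counted on retry. Now the closure reports the batch's count and
    // the caller folds it in only after the txn commits.
    var rowsDeleted int64
    for hasMoreBatches() {
        var n int64
        if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
            var err error
            n, err = deleteBatch(ctx, txn) // n is overwritten on each attempt
            return err
        }); err != nil {
            return err
        }
        rowsDeleted += n // count committed deletions only
    }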
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 14, 2023
This helper makes it a little quicker to write a test that checks
whether a particular transaction is retry-safe.

Informs cockroachdb#106417

Epic: none

Release note: none
craig bot pushed a commit that referenced this issue Jul 14, 2023
106738: logic: skip_on_retry works when errors are expected r=Xiang-Gu a=Xiang-Gu

Previously, we had a `skip_on_retry` directive for logic tests which, when set, skips the rest of the test if a statement fails with a TransactionRetryError. However, it would not skip if the statement was expected to fail with a certain error message. This PR ensures that whenever we hit a TransactionRetryError and `skip_on_retry` is set, we always skip the rest of the test, even if the stmt is expected to fail.

fixes #104464

Release note: None

106759: streamingccl: unskip TestStreamDeleteRange r=msbutler a=stevendanna

This test had previously timed out. The timeout we saw was the result of a couple of issues.

When waiting for all delete ranges, our loop exit condition was very strict. We would only stop looking for rows if the number of delete ranges was exactly 3. If, however, we got 4 delete ranges, with 2 coming in a single batch, we would never hit this condition.

How would that happen though? One possibility is rangefeed duplicates. Another, and what appears to have been happening in this test, is that the representation of the range deletes observed by the rangefeed consumer is slightly different depending on whether the range delete is delivered as part of a catch-up scan or as part of the rangefeed's steady state. I believe this is because the range deletes overlap but are issued at different time points. When we get them as part of the steady state, we get a trimmed version of the original event. When we get them as part of the catch-up scan, we get them broken up at the point of overlap.

Fixes #93568

Epic: none

Release note: None

106814: testutils: add helper to target transactions for retries r=lidorcarmel a=stevendanna

This helper makes it a little quicker to write a test that checks whether a particular transaction is retry-safe.

Informs #106417

Epic: none

Release note: none

106822: spanconfigccl: remove uses of `TODOTestTenantDisabled` r=stevendanna a=knz

Informs #76378 .
Epic: CRDB-18499

There's a mix of tests that control their tenants directly, and tests that should really work with virtualization enabled but don't.

Followup issues: #106821 and #106818.

Release note: None

106832: server: bark loudly if the test tenant cannot be created r=herkolategan a=knz

Informs #76378 
Informs #103772. 
Epic: CRDB-18499

For context, the automatic test tenant machinery is currently dependent on a CCL enterprise license check.
(This is fundamentally not necessary - see #103772 - but sadly this is the way it is for now)

Prior to this patch, if the user or a test selected the creation of a test tenant, but the test code forgot to import the required CCL Go package, the framework would announce that "a test tenant was created" while actually silently failing to do so.

This led to confusing investigations where a test tenant was expected, a test was appearing to succeed, but with a release build the same condition would fail.

This commit enhances the situation by ensuring we have clear logging output when the test tenant cannot be created due to the missing CCL import.

Release note: None

Co-authored-by: Xiang Gu <xiang@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
msbutler pushed a commit to msbutler/cockroach that referenced this issue Jul 17, 2023
This helper makes it a little quicker to write a test that checks
whether a particular transaction is retry-safe.

Informs cockroachdb#106417

Epic: none

Release note: none
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 21, 2023
Previously, we would double-count the number of deleted rows in the
face of a transaction retry.

Here, we move the counting outside of the Txn func and use a small
helper function to force transaction retries during the test.

It wasn't clear to me whether we wanted the metric to be the total
number of deletion attempts or the total committed deletions.  I've
made it the latter.

A transaction retry here seems not very likely at all, but this was
found when trying to apply random transaction retries to the entire
test suite.

Informs cockroachdb#106417

Epic: none

Release note: none
craig bot pushed a commit that referenced this issue Jul 21, 2023
106808: ttljob: fix minor book-keeping bug during txn retry r=knz a=stevendanna

Previously, we would double-count the number of deleted rows in the face of a transaction retry.

Here, we move the counting outside of the Txn func and use a small helper function to force transaction retries during the test.

It wasn't clear to me whether we wanted the metric to be the total number of deletion attempts or the total committed deletions.  I've made it the latter.

A transaction retry here seems not very likely at all, but this was found when trying to apply random transaction retries to the entire test suite.

Informs #106417

Epic: none

Release note: none

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue Jul 21, 2023
This helper makes it a little quicker to write a test that checks
whether a particular transaction is retry-safe.

Informs #106417

Epic: none

Release note: none
stevendanna added a commit to stevendanna/cockroach that referenced this issue Jul 28, 2023
Occasionally, we see bugs caused by mismanagement of state inside a
function that will be retried. Namely, the functions `(*kv.DB).Txn`
and `(*sql.InternalExecutor).Txn` take funcs that must be suitably
idempotent.

This testing knob provides a cheap way to help us hunt down such
bugs.

In the long run, it would be nice to turn this on by default, but we
should probably make sure TestServer runs reliably with this turned on
before flipping the default.

I also considered retrying _every_ transaction once rather than
relying on randomness. However, that would require adding more state
in kv.Txn, which I wasn't sure I really wanted to do for testing
purposes.

Informs cockroachdb#106417

Epic: none

Release note: None
jbowens commented Oct 6, 2023

With Go generics available, it seems feasible to add variants of these methods that are generic over an input type and an output type. Then we could write a linter enforcing that invocations of these variants don't pass funcs that close over any outside state.

There's still the possibility of accidentally mutating state through pointers in the input type, but it's at least easier to audit, since you only need to look at the input type and what's reachable through it.
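
A sketch of what such a generic variant could look like (hypothetical name and signature, not an actual kv API): the closure receives all of its state via the input value and hands results back through the return value, so a linter can flag any closure that captures outside variables.

    // TxnWithState runs fn in a retryable transaction. Everything fn needs
    // comes in via `in`; everything it produces goes out via Out. An Out
    // value from a failed attempt is overwritten on the next attempt.
    func TxnWithState[In, Out any](
        ctx context.Context, db *kv.DB, in In,
        fn func(ctx context.Context, txn *kv.Txn, in In) (Out, error),
    ) (Out, error) {
        var out Out
        err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
            var err error
            out, err = fn(ctx, txn, in)
            return err
        })
        return out, err
    }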
