-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv,testserver: more easily inject transaction retries #106417
Comments
Occasionally, we see bugs caused by mismangement of state inside a function that will be retried. Namely, the functions `(*kv.DB).Txn` and `(*sql.InternalExecutor).Txn` take funcs that must be suitably idempotent. This is testing knob provides cheap way to help us hunt down such bugs. In the long run, it would be nice to turn this on by default, but we should probably make sure TestServer runs reliably with this turned on before flipping the default. I also considered retrying _every_ transaction once rather than relying on randomness. However, that would require adding more state in kv.Txn, which I wasn't sure I really wanted to do for testing purposes. Informs cockroachdb#106417 Epic: none Release note: None
The transaction modified here updated the curID variable controlling the deletion batching inside the retryable transaction func. If the last batch was retried, the getMaxIDInBatch query would return NULL leading to a panic. Reading the code it seems to me that retries on batches before the last batch would likely result in some descriptors not being deleted. Informs cockroachdb#106417 Informs cockroachdb#106506 Epic: none Release note: None
106510: backupccl: fix txn retry bug during planning r=dt a=stevendanna If the Txn modified here was retried, in some cases we would fail with a nil pointer dereference. The details struct passed into getBackupManifestAndDetails is passed by value and then modified inside the function. As part of that modification, we update the EncryptionOptions. But, I believe when we are in encryption mode none we actually end up setting the encryption options to nil. https://github.com/cockroachdb/cockroach/blob/3a523711f46921916b6bd1af5153868e1489565e/pkg/ccl/backupccl/backup_planning.go#L1621 https://github.com/cockroachdb/cockroach/blob/3a523711f46921916b6bd1af5153868e1489565e/pkg/ccl/backupccl/backupencryption/encryption.go#L213-L258 Informs #106417 The change here maybe isn't the best way to solve this. We can update the encryption option handling code to better handle the case of encryption mode None, But really, ownership over these structs needs to be made much cleaner. Epic: none Release note: None Co-authored-by: Steven Danna <danna@cockroachlabs.com>
106701: upgrades: fix transaction retry error r=chengxiong-ruan a=stevendanna The transaction modified here updated the curID variable controlling the deletion batching inside the retryable transaction func. If the last batch was retried, the getMaxIDInBatch query would return NULL leading to a panic. Reading the code it seems to me that retries on batches before the last batch would likely result in some descriptors not being deleted. Informs #106417 Informs #106506 Epic: none Release note: None Co-authored-by: Steven Danna <danna@cockroachlabs.com>
The transaction modified here updated the curID variable controlling the deletion batching inside the retryable transaction func. If the last batch was retried, the getMaxIDInBatch query would return NULL leading to a panic. Reading the code it seems to me that retries on batches before the last batch would likely result in some descriptors not being deleted. Informs #106417 Informs #106506 Epic: none Release note: None
This helper makes it a little quicker to write a test that tests whether a particular transaction is retry safe. Informs cockroachdb#106417 Epic: none Release note: none
Previously, we would double-count the number of deleted rows in the face of a transaction retry. Here, we move the counting outside of the Txn func and use a small helper function to force transaction retries during the test. It wasn't clear to me whether we wanted the metric to be the total number of deletion attempts or the total committed deletions. I've made it the latter. It seems likely that a transaction retry here is not very likely at all, but this was found when trying to apply random transaction retries to the entire test suite. Informs cockroachdb#106417 Epic: none Release note: none
This helper makes it a little quicker to write a test that tests whether a particular transaction is retry safe. Informs cockroachdb#106417 Epic: none Release note: none
106738: logic: skip_on_retry works when errors are expected r=Xiang-Gu a=Xiang-Gu Previously, we have `skip_on_retry` directive for logic test which, when set, it skips the rest of test if a statement fails with TransactionRetryError. However, it won't skip if the statement is expected to fail with certain error message. This PR ensures that whenever we have a TransactionRetryError and `skip_on_retry` is set, we always skip the rest of the test, even if the stmt is expected to fail. fixes #104464 Release note: None 106759: streamingccl: unskip TestStreamDeleteRange r=msbutler a=stevendanna This test had previously timed out. The timeout we saw was the result of a couple of issues. When waiting for all delete ranges, our loop exit condition was very strict. We would only stop looking for rows if the number of delete ranges was exactly 3. If, however, we got 4 delete ranges, with 2 coming in a single batch, we would never hit this condition. How would that happen though? One possibility are rangefeed duplicates. Another, and what appears to have been happening in this test, is that the representation of the range deletes observed by the rangefeed consumer is slightly different depending on whether the range delete is delivered as part of a catch-up scan or as part of the rangefeeds steady state. I believe this is because the range deletes overlap but are issued at different time points. When we get them as part of the steady state, we get a trimmed version of the original event. When we get them as part of the catch-up scan, we get them broke up at the point of overlap. Fixes #93568 Epic: none Release note: None 106814: testutils: add helper to target transactions for retries r=lidorcarmel a=stevendanna This helper makes it a little quicker to write a test that tests whether a particular transaction is retry safe. Informs #106417 Epic: none Release note: none 106822: spanconfigccl: remove uses of `TODOTestTenantDisabled` r=stevendanna a=knz Informs #76378 . Epic: CRDB-18499 There's a mix of tests that control their tenants directly, and tests that should really work with virtualization enabled but don't. Followup issues: #106821 and #106818. Release note: None 106832: server: bark loudly if the test tenant cannot be created r=herkolategan a=knz Informs #76378 Informs #103772. Epic: CRDB-18499 For context, the automatic test tenant machinery is currently dependent on a CCL enterprise license check. (This is fundamentally not necessary - see #103772 - but sadly this is the way it is for now) Prior to this patch, if the user or a test selected the creation of a test tenant, but the test code forgot to import the required CCL go package, the framework would announce that "a test tenant was created" but it was actually silently failing to do so. This led to confusing investigations where a test tenant was expected, a test was appearing to succeed, but with a release build the same condition would fail. This commit enhances the situation by ensuring we have clear logging output when the test tenant cannot be created due to the missing CCL import. Release note: None Co-authored-by: Xiang Gu <xiang@cockroachlabs.com> Co-authored-by: Steven Danna <danna@cockroachlabs.com> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
This helper makes it a little quicker to write a test that tests whether a particular transaction is retry safe. Informs cockroachdb#106417 Epic: none Release note: none
Previously, we would double-count the number of deleted rows in the face of a transaction retry. Here, we move the counting outside of the Txn func and use a small helper function to force transaction retries during the test. It wasn't clear to me whether we wanted the metric to be the total number of deletion attempts or the total committed deletions. I've made it the latter. It seems likely that a transaction retry here is not very likely at all, but this was found when trying to apply random transaction retries to the entire test suite. Informs cockroachdb#106417 Epic: none Release note: none
106808: ttljob: fix minor book-keeping bug during txn retry r=knz a=stevendanna Previously, we would double-count the number of deleted rows in the face of a transaction retry. Here, we move the counting outside of the Txn func and use a small helper function to force transaction retries during the test. It wasn't clear to me whether we wanted the metric to be the total number of deletion attempts or the total committed deletions. I've made it the latter. It seems likely that a transaction retry here is not very likely at all, but this was found when trying to apply random transaction retries to the entire test suite. Informs #106417 Epic: none Release note: none Co-authored-by: Steven Danna <danna@cockroachlabs.com>
This helper makes it a little quicker to write a test that tests whether a particular transaction is retry safe. Informs #106417 Epic: none Release note: none
Occasionally, we see bugs caused by mismangement of state inside a function that will be retried. Namely, the functions `(*kv.DB).Txn` and `(*sql.InternalExecutor).Txn` take funcs that must be suitably idempotent. This is testing knob provides cheap way to help us hunt down such bugs. In the long run, it would be nice to turn this on by default, but we should probably make sure TestServer runs reliably with this turned on before flipping the default. I also considered retrying _every_ transaction once rather than relying on randomness. However, that would require adding more state in kv.Txn, which I wasn't sure I really wanted to do for testing purposes. Informs cockroachdb#106417 Epic: none Release note: None
With Go generics available, it seems feasible to add variants of these that are generic over an input type and an output type. Then we could write a linter enforcing that invocations to these variants don't pass funcs that close over any outside state. There's still the possibility of accidentally mutating state through pointers in the input type, but it's at least easier to audit too since you only need to look at the input type and what's reachable through it. |
We've had a number of bugs caused by the incorrect management of state within a retriable function passed to the
(*sql.InternalDB).Txn
or(*kv.DB.)Txn
methods.For example, see the two recent issues:
We'd like to be able to more aggressively find these types of issues.
Currently, there are a few tools one can use to generate transaction retries in a test:
crdb_internal.force_retry
SET inject_retry_errors_enabled=true
cockroach/pkg/sql/mvcc_backfiller_test.go
Lines 568 to 581 in 6a6ebcb
cockroach/pkg/jobs/jobs_test.go
Lines 2764 to 2776 in 6a6ebcb
These work well if the code in question is designed to make such tools easy to use inside a test. However, they aren't great for finding transaction retry issues in existing code.
It would be nice to have a feature that could force all DB.Txn to retry with some probability without authors having to have pre-arranged for this to be possible.
When experimenting with a simple version of such a feature locally (#106506), it has already proven useful in potentially reproducing a rare bug that we've seen in CI and finding new issues. Including but not limited to:
Non-trivial Issues
Likely Trivial Issues
Potential Issues
Next step
Before we can turn something like this on by default, we need to flush out issues that would cause this to be too noisy for teams. The logictests provide one place where we can get a large amount of coverage without modifying a large number of tests:
Jira issue: CRDB-29559
The text was updated successfully, but these errors were encountered: