kv,storage: replica inconsistency in 23.2 scale test #114421
Comments
There's just one SET for this key on n41 in sstable 779345:
No reference in the other two checkpoints, checked with:
Some logs around the time: n39,s382
n41,s404
n18,s189
On minority node n41 (0021):
Output
It looks like the intent was for this range descriptor:
Which seems to correspond to these replication log lines.
Is there any context KV can infer from this? There's very little for storage to go on from n41's engine state or the other nodes' checkpoints, since the original write of the key was ~5hrs earlier. All history of the internal keys is gone.
The leaseholder … Which is odd, because I don't think …
Any chance this has something to do with #104539?
Is there a log entry corresponding to …?
Maybe my previous question is irrelevant. IIUC, the intent is from a RangeDescriptor change at 231111 06:17:02.833039. And the snapshot application at n41 was much earlier, at 231110 22:42:50.655325. And the RangeDescriptor has been mutated after the spurious intent, so that intent must have been resolved. Is this accurate?
It's possible. I double checked that our metamorphic tests are exercising those code paths well. A bug in the integration into the metamorphic tests previously obscured an issue with #104539. Based on the intent, I think this should have been a SINGLEDEL rather than a DELSIZED, and SINGLEDEL code paths saw minimal change in #104539. I'm going to keep looking at it from this angle.
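For context, the single delete optimization at issue works roughly like this: when an intent's metadata shows the transaction wrote it exactly once (txnDidNotUpdateMeta=true), intent resolution can remove the lock-table key with a SINGLEDEL instead of a regular (sized) tombstone. A minimal sketch of that decision, with a hypothetical helper name (not the actual CockroachDB code):

```go
package intentsketch

import "github.com/cockroachdb/pebble"

// resolveIntentKey removes the lock-table metadata key for a resolved intent.
// When the metadata was provably written only once by the transaction
// (txnDidNotUpdateMeta=true), a SingleDelete suffices and compacts away more
// cheaply; otherwise a regular tombstone is required. (Illustrative only; the
// non-optimized path in CockroachDB writes a sized tombstone, which shows up
// as DELSIZED in the manifests discussed above.)
func resolveIntentKey(b *pebble.Batch, metaKey []byte, txnDidNotUpdateMeta bool) error {
	if txnDidNotUpdateMeta {
		// Safe only if this is the one and only SET of metaKey since it was
		// last deleted; otherwise an older version can resurface.
		return b.SingleDelete(metaKey, nil)
	}
	return b.Delete(metaKey, nil)
}
```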
That matches my understanding. From the singledel angle: How do we ensure idempotence of raft application? If committing the batch errors, I guess we need to read the applied index back from the engine to determine where to pick up application?
Correct. Raft application is not immediately fsync-ed, so we rely on the atomicity of a WriteBatch (containing an update to the applied index and the write itself) to ensure that loss of durability will roll back the application of a suffix of the raft log without tearing individual raft log entries. We then read the applied index on startup and begin (re-)applying entries from there.
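A minimal sketch of that pattern against Pebble directly; the key name and helpers here are hypothetical, not the CockroachDB implementation:

```go
package main

import (
	"encoding/binary"
	"fmt"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

// appliedIndexKey is a hypothetical key holding the replica's applied index.
var appliedIndexKey = []byte("raft-applied-index")

// applyEntry applies one raft entry. The entry's write and the applied-index
// update share a single batch, so a crash persists both or neither.
func applyEntry(db *pebble.DB, index uint64, key, value []byte) error {
	b := db.NewBatch()
	defer b.Close()
	if err := b.Set(key, value, nil); err != nil {
		return err
	}
	var idx [8]byte
	binary.BigEndian.PutUint64(idx[:], index)
	if err := b.Set(appliedIndexKey, idx[:], nil); err != nil {
		return err
	}
	// NoSync: application is not immediately fsync-ed. A crash may lose a
	// suffix of applied entries, but never half of a single entry.
	return b.Commit(pebble.NoSync)
}

// appliedIndex reads back the applied index on startup to decide where to
// resume application.
func appliedIndex(db *pebble.DB) (uint64, error) {
	v, closer, err := db.Get(appliedIndexKey)
	if err == pebble.ErrNotFound {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	defer closer.Close()
	return binary.BigEndian.Uint64(v), nil
}

func main() {
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		panic(err)
	}
	defer db.Close()
	_ = applyEntry(db, 7, []byte("k"), []byte("v"))
	idx, _ := appliedIndex(db)
	fmt.Println("resume applying from index", idx+1)
}
```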
Miraculously, the MANIFESTs actually capture quite a bit of information in the file boundaries of memtable flushes. The problematic key is …
n41:
Later, miraculously, the manifest captured the problematic key's flush from the memtable, because it was the largest key flushed at the time.
A short time later, we see a different intent for the same range descriptor, again as the largest key flushed:
A short time later, we see a DELSIZED for a third unique intent on the same range descriptor:
n39: We see the snapshot ingestion in the checkpoint's manifest.
And then 4 minutes later, again?
And then 15 minutes later (a few minutes before the problematic intent is written) again:
And then ~6 minutes later, (~8 seconds after the problematic intent's timestamp), again:
Then approximately 30 seconds later, we actually flush a memtable which contains the intent 🤔 :
Then ~13 seconds later, we see a memtable flushing the SINGLEDEL to delete this intent:
n19: The relevant range's start key doesn't appear in any of the manifest entries.
Not sure there's too much actionable there, although there's confirmation that a SINGLEDEL was used to delete the intent. I'm not sure what to make of the repeated snapshot applications for this range on n39.
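For readers following along, this is why the SINGLEDEL confirmation matters: a single delete only cancels one SET beneath it, so if the same key were somehow written twice (for example by a replayed application), the older value could resurface. A self-contained sketch of the hazard using Pebble's in-memory filesystem (illustrative only; the outcome of the misuse case is undefined by design):

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/pebble"
	"github.com/cockroachdb/pebble/vfs"
)

func main() {
	db, err := pebble.Open("", &pebble.Options{FS: vfs.NewMem()})
	if err != nil {
		panic(err)
	}
	defer db.Close()

	key := []byte("lock-table-meta-key") // stand-in for an intent's metadata key

	// Intended use: exactly one Set under the SingleDelete, so the key is gone.
	// Misuse: if the Set is (re)applied a second time, the SingleDelete only
	// cancels the newest Set, and the older one may become visible again.
	_ = db.Set(key, []byte("intent-v1"), pebble.Sync)
	_ = db.Set(key, []byte("intent-v1"), pebble.Sync) // e.g. a replayed raft application
	_ = db.SingleDelete(key, pebble.Sync)

	if v, closer, err := db.Get(key); err == pebble.ErrNotFound {
		fmt.Println("key deleted (what a regular Delete would guarantee)")
	} else if err == nil {
		fmt.Printf("key resurfaced with value %q\n", v) // a possible, undefined outcome
		closer.Close()
	}
}
```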
Can we tell from the logs which nodes were the source of these snapshots?
I think we need to clean up the comments around …

@nvanbenschoten Can we infer anything about whether there could have been multiple writes to the same key by a txn, based on the intent metadata?

txn={id=c22060f6 key=/Local/Range/Table/106/1/1582225727306831864/RangeDescriptor iso=Serializable pri=0.00813128 epo=0 ts=1699683422.833100427,0 min=1699683422.833100427,0 seq=1} ts=1699683422.833100427,0 del=false klen=12 vlen=102 mergeTs= txnDidNotUpdateMeta=true

seq starts at 0, yes? So seq=1 doesn't preclude that this key was also written at seq=0? If raft application succeeded but some code in the KV layer thought it had failed and didn't advance the in-memory …
I realize state machine replication in general does not require idempotent application, but I am curious what would go wrong in our implementation. I think there are at least two kinds of state that will diverge: (a) timeseries, where we use MERGE, (b) …
seq starts at 0, but is incremented to 1 before the first write. So 0 is used to allow a transaction to read below all of its writes. Here's the code that handles this. It checks out that the range descriptor write has sequence 1, because that's the first key this transaction writes to. So …
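A toy illustration of those sequence-number semantics (hypothetical types, not the KV code):

```go
package seqsketch

// A toy model of per-transaction sequence numbers; not the KV implementation.
type txn struct {
	seq    int32                       // current sequence; starts at 0
	writes map[string]map[int32]string // key -> seq -> value
}

func newTxn() *txn {
	return &txn{writes: map[string]map[int32]string{}}
}

// write steps the sequence before stamping the write, so the transaction's
// first write carries seq=1.
func (t *txn) write(key, value string) {
	t.seq++
	if t.writes[key] == nil {
		t.writes[key] = map[int32]string{}
	}
	t.writes[key][t.seq] = value
}

// readAt returns the latest of the transaction's own writes to key at or
// below readSeq. A read at seq=0 therefore sees none of the txn's writes,
// i.e. it reads below all of them.
func (t *txn) readAt(key string, readSeq int32) (string, bool) {
	best, found, val := int32(-1), false, ""
	for s, v := range t.writes[key] {
		if s <= readSeq && s > best {
			best, found, val = s, true, v
		}
	}
	return val, found
}
```

In particular, no write ever carries seq=0 under this model, so the seq=1 intent above cannot hide an earlier seq=0 write to the same key.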
Once an ignored seq num is established by a transaction, it will be included in all later LockUpdates. We also don't see any use of savepoints in this transaction, so ignored seq nums should not be in play.
It seems difficult for a fully repeated raft application to go unnoticed, but it depends on how much of the in-memory state from the prior application is lost.
Agreed. Although this isn't the only operation which requires exactly-once application to avoid replica divergence. Another example is the …
Actually, we can exclude that possibility. The log line on …
Am I understanding correctly that the …

I agree that if we learn nothing else and decide we cannot continue to block a beta, we should disable the single delete optimization. But in the interim, I think we should keep it enabled and add additional misuse detection: …
Anything else? Within Pebble I want to give the metamorphic tests' use of SINGLEDEL a more critical look to see if we might be too conservative in writing SINGLEDELs, such that we're potentially leaving subtle interactions unexercised.
I wasn't aware of this. It was introduced in 1fe6808, where we can see that the … This was all before proposer-evaluated KV, which pretty dramatically changed how request evaluation and application interact with Raft. That RFC contains a discussion about replay protection, alluding to the protection we now have with the … I'm not aware of any changes in this area recently. @erikgrinaker, are you?
Handlers are added for SingleDeleteInvariantViolationCallback and IneffectualSingleDeleteCallback options in Pebble. By default, they crash the process. If crashing is turned off, they increment metrics, storage.single-delete.invariant-violation and storage.single-delete.ineffectual.

Informs cockroachdb#115881
Informs cockroachdb#114421
Epic: none

Release note (ops change): The cluster settings storage.single_delete.crash_on_invariant_violation.enabled and storage.single_delete.crash_on_ineffectual.enabled default to true since these can be indicative of a serious bug that could corrupt data.
116544: storage: add single delete callbacks to crash, or increment metrics r=jbowens a=sumeerbhola

Handlers are added for SingleDeleteInvariantViolationCallback and IneffectualSingleDeleteCallback options in Pebble. By default, they crash the process. If crashing is turned off, they increment metrics, storage.single-delete.invariant-violation and storage.single-delete.ineffectual. Informs #115881 Informs #114421 Epic: none Release note (ops change): The cluster settings storage.single_delete.crash_on_invariant_violation.enabled and storage.single_delete.crash_on_ineffectual.enabled default to true since these can be indicative of a serious bug that could corrupt data.

116549: sql,opt: don't validate AOST during session migration r=rafiss a=rafiss

When re-preparing a statement for a session migration, we want to skip evaluating and validating the AS OF SYSTEM TIME clause. During session migrations, we know that the statement will just be prepared, and not executed, and each statement could have different AOST timestamps. Therefore it is incorrect to evaluate the AOST clause and fix the transaction timestamp. No release note since this fixes a bug that only affects Serverless. See [here](https://cockroachlabsgcp.splunkcloud.com/en-US/app/search/search?earliest=1702141875&latest=1702401075&q=search%20index%3Dcc-app-crdb*%20%20%22could%20not%20prepare%20statement%20during%20session%20migration%22%20host_ip%3D%2210.4.0.19%22&display.page.search.mode=verbose&dispatch.sample_ratio=1&display.general.type=events&display.page.search.tab=events&workload_pool=standard_perf&sid=1702579453.274591) for an example of the error. Epic: None Release note: None

116567: roachtest: run workload after 8tb OR roachtest r=adityamaru a=msbutler

Previously the tpce workload would not run after the 8tb OR test, because of a misconfiguration in the test specs. This fixes this bug. Epic: none Release note: none

116580: roachtest: fix minimum version calculation in `UploadWorkload` r=rickystewart,DarrylWong a=renatolabs

In ARM64 builds, the `workload` binary is only available in 23.2+. Epic: none Release note: None

116587: roachtest: don't run `schemachange` workload with upgrade migrations r=annrpom,DarrylWong a=renatolabs

This has been flaking on all branches. Informs: #116586. Fixes: #116304. Fixes: #116425. Fixes: #116357. Release note: None

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Renato Costa <renato@cockroachlabs.com>
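A rough sketch of how such handlers could be wired up. The callback names come from the PR above; their exact location (Options.Experimental) and signature (func(userKey []byte)) are assumptions here and should be checked against the Pebble version in use:

```go
package sdcallbacks

import (
	"fmt"
	"sync/atomic"
)

// Counters standing in for the storage.single-delete.* metrics.
var (
	invariantViolations atomic.Int64
	ineffectualDeletes  atomic.Int64
)

// makeHandler returns a callback for a single-delete anomaly on userKey:
// crash the process when crashOnAnomaly is true, otherwise count it.
// (The `func(userKey []byte)` shape is an assumption about the Pebble option.)
func makeHandler(name string, counter *atomic.Int64, crashOnAnomaly bool) func(userKey []byte) {
	return func(userKey []byte) {
		if crashOnAnomaly {
			panic(fmt.Sprintf("%s: single delete anomaly on key %q", name, userKey))
		}
		counter.Add(1)
	}
}

// Wiring would then look roughly like (assuming the options live under
// Options.Experimental, as in recent Pebble versions):
//
//	opts.Experimental.SingleDeleteInvariantViolationCallback =
//		makeHandler("invariant-violation", &invariantViolations, crash)
//	opts.Experimental.IneffectualSingleDeleteCallback =
//		makeHandler("ineffectual", &ineffectualDeletes, crash)
```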
As elaborated in code comments, these can be false positives caused by delete-only compactions. The metrics are not disabled, under the rare chance that we see a problem in a cluster and one of the metrics (if it happens to not be a false positive) gives us a hint about the cause of the problem. For the same reason, we log such callbacks every 5 minutes, so we can correlate the key in such logs with an actual problem. False positives should be rare, especially for the invariant violation callback.

Informs cockroachdb#115881
Informs cockroachdb#114421
Epic: none

Release note (ops change): The cluster settings storage.single_delete.crash_on_invariant_violation.enabled and storage.single_delete.crash_on_ineffectual.enabled are disabled and must not be enabled.
116889: storage: disable crashing for single delete callbacks r=sumeerbhola a=sumeerbhola

As elaborated in code comments, these can be false positives caused by delete-only compactions. The metrics are not disabled, under the rare chance that we see a problem in a cluster and one of the metrics (if it happens to not be a false positive) gives us a hint about the cause of the problem. For the same reason, we log such callbacks every 5 minutes, so we can correlate the key in such logs with an actual problem. False positives should be rare, especially for the invariant violation callback.

Informs #115881
Informs #114421
Epic: none

Release note (ops change): The cluster settings storage.single_delete.crash_on_invariant_violation.enabled and storage.single_delete.crash_on_ineffectual.enabled are disabled and must not be enabled.

Co-authored-by: sumeerbhola <sumeer@cockroachlabs.com>
116319: kvnemesis: add Raft application assertions r=erikgrinaker a=erikgrinaker

This adds basic kvnemesis assertions for Raft command application:

* A request is only applied once, at a single log index and lease applied index across all replicas (i.e. no double-applies or replays).
* Commands do not regress the Raft index/term, lease applied index, and closed timestamp (assuming no node restarts).
* All replicas apply the same commands in the same order at the same positions.

Resolves #115771. Touches #114421. Epic: none Release note: None

117014: kv: add mutex tracing to tscache r=erikgrinaker a=nvanbenschoten

We added the ability to trace mutex acquisition in afb73d2. This commit uses this capability to trace the mutex acquisition of the timestamp cache. Epic: None Release note: None

117138: sctest: remove t.Parallel calls and cleanup r=rafiss a=rafiss

In 943a7fd and f397c13 these tests were made parallel. This makes debugging harder, and now that the tests run in a remote execution environment, the speed benefits seem dubious. Let's make it non-parallel to see if the tests become more stable. informs #116740 informs #117103 Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
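A toy version of the apply-once assertion described in 116319 (illustrative only, not the kvnemesis implementation): record each (command, replica) application and fail on a double-apply, on a command applying at different raft indexes on different replicas, or on a lease applied index regression.

```go
package applyassert

import "fmt"

// application records one command applying on one replica (toy model, not
// the kvnemesis types).
type application struct {
	replicaID         int
	commandID         string
	raftIndex         uint64
	leaseAppliedIndex uint64
}

// checkApplication verifies, over a trace of applications: (1) no command
// applies twice on the same replica, (2) every replica applies a given
// command at the same raft index, and (3) the lease applied index never
// regresses on a replica.
func checkApplication(apps []application) error {
	type replicaState struct {
		seen    map[string]bool
		lastLAI uint64
	}
	replicas := map[int]*replicaState{}
	cmdIndex := map[string]uint64{} // commandID -> raft index it applied at

	for _, a := range apps {
		rs := replicas[a.replicaID]
		if rs == nil {
			rs = &replicaState{seen: map[string]bool{}}
			replicas[a.replicaID] = rs
		}
		if rs.seen[a.commandID] {
			return fmt.Errorf("command %s applied twice on replica %d", a.commandID, a.replicaID)
		}
		rs.seen[a.commandID] = true

		if idx, ok := cmdIndex[a.commandID]; ok && idx != a.raftIndex {
			return fmt.Errorf("command %s applied at raft indexes %d and %d", a.commandID, idx, a.raftIndex)
		}
		cmdIndex[a.commandID] = a.raftIndex

		if a.leaseAppliedIndex < rs.lastLAI {
			return fmt.Errorf("replica %d: lease applied index regressed from %d to %d",
				a.replicaID, rs.lastLAI, a.leaseAppliedIndex)
		}
		rs.lastLAI = a.leaseAppliedIndex
	}
	return nil
}
```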
We've added all of the planned assertions and test coverage to try to track this down, but so far haven't found any problem. I'm going to close this out as unactionable for now.
Describe the problem
Replica inconsistency found on the 23.2 scale testing cluster:
Panic
The diff'd checkpoints:
Additional data / screenshots
See slack thread (internal) for connection details.
Environment:
Jira issue: CRDB-33505