-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv/kvnemesis: TestKVNemesisSingleNode failed #92189
Comments
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ f0554bc215e31ff53e644936ba645bd895ec0d15:
Parameters: |
cc @nvanbenschoten for triage |
bisecting this in #92217 (comment) |
aaand the winner is [7ff05f1] kvserver: allow certain read-only requests to drop latches before evaluation |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ dca415eddac0d659ae6d76b4e3dfdf4076adbd34:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ b5be006bedd7d3cedc3fb3d2248df168e3d64be2:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ 1a6e9f885baa124d5ff2996adb966ea15a1a9b2b:
Parameters: |
Still haven't gotten to the bottom of this, but I'll write some words down with what I have so far. All the repros seem to follow a similar pattern, so I'll just work through the last one I looked at locally. The operations are as follows:
The interesting bit here is this batch:
The test is failing because the Interestingly, the issue only manifests if the batch is retried. In this case, the batch refreshes and retries because we see a
Now, the first time the
However, once the batch is retried at a higher timestamp, we end up returning the intent at key
Given the patch we bisected this down to, I tried changing it such that we always used the intent interleaving iterator (even if we dropped latches early). I couldn't get the test to fail (20 minutes, stressing locally) with this change. Maybe we're missing something in this logic? cockroach/pkg/storage/engine.go Lines 1730 to 1740 in 8c0de8a
A few debug lines from this area indicate that the scan sees the intent because it's at a lower sequence number. This does seem okay to me. I also haven't dug into how/why using the intent interleaving iterator would help here. I'll pick this back up again tomorrow.
cc @nvanbenschoten, maybe something jumps out to you. |
Nice debugging @arulajmani! We just met to talk through the findings. We suspect that the bug has to do with Arul and I wrote out what we expect will be the fix. It looks something like:
He'll check to see if that resolves the issue and write a deterministic test, now that the setup that causes the bug is known. Beyond that, we're both still interested in why We could test that theory with the following patch:
|
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ db71e045cc3f2937bf20170d254dd7e54a620ffb:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ e7b15ebaed9c14668ade0a7827a5525aedef1ab0:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ ca079ac1fc8a8f7a5360f5b1f4fa210c14dc8410:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ ca079ac1fc8a8f7a5360f5b1f4fa210c14dc8410:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ 859e1e1f26d79af43cce16519f9ba984baaa495c:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ 859e1e1f26d79af43cce16519f9ba984baaa495c:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ 146556e19f5e4fdc8c3e6a623b280cc33aee4d18:
Parameters: |
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ 146556e19f5e4fdc8c3e6a623b280cc33aee4d18:
Parameters: |
A read only request scans the lock table before it can proceed with dropping latches. It can only evaluate if no conflicting intents are found. While doing so, it also determines if the MVCC scan evaluation needs to consult intent history (by using the interleaved iterator). The MVCC scan evaluation needs to consult intent history if we discover an intent by the transaction performing the read operation at a higher sequence number or a higher timestamp. The correct sequence numbers to compare here are those on the `BatchRequest`, and not on the transaction. Before this patch, we were using the sequence number on the transaction, which could lead us to wrongly conclude that the use of an intent interleaving iterator wasn't required. Specifically, if the batch of the following construction was retried on the server: ``` b.Scan(a, e) b.Put(b, "value") ``` The scan would end up (erroneously) reading "value" at key b. As part of this patch, I've also renamed `ScanConflictingIntents` to `ScanConflictingIntentsForDroppingLatchesEarly` -- the function isn't as generalized as the old name would suggest. Closes cockroachdb#92217 Closes cockroachdb#92189 Release note: None
This works around cockroachdb#92189 while it's being fixed (and allows me to merge the PR this commit is in, which otherwise fairly reliably fails maye_stress in CI). Release note: None
A read only request scans the lock table before it can proceed with dropping latches. It can only evaluate if no conflicting intents are found. While doing so, it also determines if the MVCC scan evaluation needs to consult intent history (by using the interleaved iterator). The MVCC scan evaluation needs to consult intent history if we discover an intent by the transaction performing the read operation at a higher sequence number or a higher timestamp. The correct sequence numbers to compare here are those on the `BatchRequest`, and not on the transaction. Before this patch, we were using the sequence number on the transaction, which could lead us to wrongly conclude that the use of an intent interleaving iterator wasn't required. Specifically, if the batch of the following construction was retried: ``` b.Scan(a, e) b.Put(b, "value") ``` The scan would end up (erroneously) reading "value" at key b. As part of this patch, I've also renamed `ScanConflictingIntents` to `ScanConflictingIntentsForDroppingLatchesEarly` -- the function isn't as generalized as the old name would suggest. Closes cockroachdb#92217 Closes cockroachdb#92189 Release note: None
91971: colflow: track closers by the vectorized flow r=yuzefovich a=yuzefovich This commit simplifies the way we track all of the `colexecop.Closer`s we need to close. Previously, we would track them using `OpWithMetaInfo` and then many operators would be responsible for closing the components in their input tree. This commit makes it so that we close them during the flow cleanup. This is ok because we know that all concurrent goroutines will have exited by the time `Cleanup` is called since we do so only after `Flow.Wait` returns (which blocks). The decision to put closers into `OpWithMetaInfo` was made in #62221 with justification "why not" that I provided. Now I have the answer to why we should not do it. Unlike the stats collection and metadata draining (which - at least currently - must happen in the "owner" goroutines), for closers we have a convenient place to close them at the flow level. The contract is such that the closure must occur eventually, but it doesn't matter much in which goroutine it's done (as long as there is no race) and whether it is a bit delayed. (The only downside I see is that the tracing spans are finished with a delay in comparison to when the relevant operations are actually done, but this has already been pretty bad, and this commit makes things only slightly worse. This "delayed" finish shows up as "over-extended" span when viewing traces via jaeger.) As a result of this refactor, the assertion that all closers are closed seems redundant - we'd effectively be asserting only that a single method is called, thus the assertion is removed. This commit also allowed to remove some of the complexity around `ExplainVec` implementation (we no longer need to tie the cleanup to closing of the corresponding planNode). Epic: None Release note: None 93077: opt: fix optbuilder partial index comment r=mgartner a=mgartner Epic: None Release note: None 93175: kv: use correct sequence number when scanning for conflicting intents r=nvanbenschoten a=arulajmani A read only request scans the lock table before it can proceed with dropping latches. It can only evaluate if no conflicting intents are found. While doing so, it also determines if the MVCC scan evaluation needs to consult intent history (by using the interleaved iterator). The MVCC scan evaluation needs to consult intent history if we discover an intent by the transaction performing the read operation at a higher sequence number or a higher timestamp. The correct sequence numbers to compare here are those on the `BatchRequest`, and not on the transaction. Before this patch, we were using the sequence number on the transaction, which could lead us to wrongly conclude that the use of an intent interleaving iterator wasn't required. Specifically, if the batch of the following construction was retried on the server: ``` b.Scan(a, e) b.Put(b, "value") ``` The scan would end up (erroneously) reading "value" at key b. As part of this patch, I've also renamed `ScanConflictingIntents` to `ScanConflictingIntentsForDroppingLatchesEarly` -- the function isn't as generalized as the old name would suggest. Closes #92217 Closes #92189 Release note: None 93225: multitenant: check query rows result in admin function tests r=knz a=ecwall Informs CRDB-16746 Some admin functions return error rows in the response instead of an error itself. Update related tests to check these rows. Release note: None 93263: docs issue generation: Update TC configuration to use new parameters r=nickvigilante a=nickvigilante Part of #72910 Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: Arul Ajmani <arulajmani@gmail.com> Co-authored-by: Evan Wall <wall@cockroachlabs.com> Co-authored-by: Nick Vigilante <vigilante@cockroachlabs.com>
kv/kvnemesis.TestKVNemesisSingleNode failed with artifacts on master @ 8357abb668a5adaff781343b394b162fb1b66c6e:
Parameters:
TAGS=bazel,gss,deadlock
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-21628
The text was updated successfully, but these errors were encountered: