roachtest: better merge testing in clearrange #29646
Guess what, folks? We got our repro right away! cc #29252.
Reviewed 3 of 3 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/cmd/roachtest/clearrange.go, line 97 at r2 (raw file):
		}
	} else {
		startHex = "bd" // extremely likely to be the right thing (b'\275').
loooool
pkg/cmd/roachtest/clearrange.go, line 110 at r2 (raw file):
		return n
	}
}()
I don't understand why this needs the outer layer of closuring. Isn't this equivalent?
conn := c.Conn(ctx, 1)
defer conn.Close()

var startHex string
// NB: set this to false to save yourself some time during development. Selecting
// from crdb_internal.ranges is very slow because it contacts all of the leaseholders.
// You may actually want to run a version of cockroach that doesn't do that because
// it'll still slow you down every time the method returned below is called.
if true {
	if err := conn.QueryRow(
		`SELECT to_hex(start_key) FROM crdb_internal.ranges WHERE "database" = 'bank' AND "table" = 'bank' ORDER BY start_key ASC LIMIT 1`,
	).Scan(&startHex); err != nil {
		t.Fatal(err)
	}
} else {
	startHex = "bd" // extremely likely to be the right thing (b'\275').
}

numBankRanges := func() int {
	var n int
	if err := conn.QueryRow(
		`SELECT COUNT(*) FROM crdb_internal.ranges WHERE substr(to_hex(start_key), 1, length($1::string)) = $1`, startHex,
	).Scan(&n); err != nil {
		t.Fatal(err)
	}
	return n
}
We saw a consistency failure in cockroachdb#29252 that would've been much more useful had it occurred close to the time around which the inconsistency must have been introduced. Instead of leaving it to chance, add a switch that runs aggressive checks in (roach) tests that want them such as the clearrange test. Release note: None
a2dd2b6 to 2a6c8e0
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/cmd/roachtest/clearrange.go, line 110 at r2 (raw file):
Previously, benesch (Nikhil Benesch) wrote…
I don't understand why this needs the outer layer of closuring. Isn't this equivalent?
Sure, that works, but I always find it a bit unsavory. Also, note that we're calling SET statement_timeout below, though you get lucky and it's on a different conn. Might just be my preference, but I like having stuff defined locally and not floating around as a quasiglobal. Makes my brain melt less.
This PR is now @benesch's to probably close and reopen under his name.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
Unfortunately, the method to determine the range count is quite slow since crdb_internal.ranges internally sends an RPC for each range to determine the leaseholder. Anecdotally, I've seen ~25% of the merges completed after less than 15 minutes. I know that it's slowing down over time, but @benesch will fix that. Also throws in aggressive consistency checks so that when something goes out of sync, we find out right there. Release note: None
2a6c8e0 to 5bd9941
This PR is now @benesch's to probably close and reopen under his name.
I'm actually going to merge it basically as-is. The one change I made I noted below; doesn't seem worthy of another review cycle. Merging as soon as I verify this passes with the fix from #29677.
Reviewed 1 of 1 files at r3, 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/cmd/roachtest/clearrange.go, line 110 at r2 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
Sure, that works, but I always find it a bit unsavory. Also, note that we're calling SET statement_timeout below, though you get lucky and it's on a different conn. Might just be my preference, but I like having stuff defined locally and not floating around as a quasiglobal. Makes my brain melt less.
Ack. Returning a closure from a closure ranks as more confusing than a stray state variable to me, but to each his own. I'm going to leave as is.
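For anyone skimming this thread later, the two styles under discussion can be sketched side by side roughly as follows. This is an illustration only: `fakeConn`, `scopedCounter`, and `flatCounter` are hypothetical stand-ins, not anything in clearrange.go, and the "timeout" is modeled as a plain boolean.

```go
package main

import "fmt"

// fakeConn is a hypothetical stand-in for a SQL connection.
type fakeConn struct {
	timeoutSet bool // models a session setting such as statement_timeout
}

// query returns a fixed fake count, or -1 once a timeout has been set on
// this connection (standing in for a query that gets cut short).
func (c *fakeConn) query(prefix string) int {
	if c.timeoutSet {
		return -1
	}
	return len(prefix)
}

// scopedCounter is the closure-from-a-closure style: conn and startHex live
// only inside the immediately invoked outer function, so nothing else can
// reach them afterwards.
func scopedCounter() func() int {
	return func() func() int {
		conn := &fakeConn{}
		startHex := "bd"
		return func() int { return conn.query(startHex) }
	}()
}

// flatCounter is the single-closure style. Returning conn models the fact
// that the variable stays in scope for whatever code follows.
func flatCounter() (func() int, *fakeConn) {
	conn := &fakeConn{}
	startHex := "bd"
	numBankRanges := func() int { return conn.query(startHex) }
	return numBankRanges, conn
}

func main() {
	scoped := scopedCounter()
	flat, conn := flatCounter()
	fmt.Println(scoped(), flat()) // prints "2 2"

	// Later code that reuses conn also changes what flat() observes.
	conn.timeoutSet = true
	fmt.Println(scoped(), flat()) // prints "2 -1"
}
```

Both variants behave identically until something else touches the shared connection, which is the whole disagreement: extra nesting versus an extra variable in scope.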
pkg/cmd/roachtest/clearrange.go, line 125 at r4 (raw file):
	return err
}
Added this to speed up merges.
Canceled (will resume)
And now Bors is stuck. Great.
bors r+
Not awaiting review
bors r-
Canceled
bors r+
29646: roachtest: better merge testing in clearrange r=benesch a=tschottdorf

Unfortunately, the method to determine the range count is quite slow since crdb_internal.ranges internally sends an RPC for each range to determine the leaseholder.

Anecdotally, I've seen ~25% of the merges completed after less than 15 minutes. I know that it's slowing down over time, but @benesch will fix that.

Also throws in aggressive consistency checks so that when something goes out of sync, we find out right there.

Release note: None

29677: storage: preserve consistency when applying widening preemptive snapshots r=benesch a=benesch

Merges can cause preemptive snapshots that widen existing replicas. For example, consider the following sequence of events:

1. A replica of range A is removed from store S, but is not garbage collected.
2. Range A subsumes its right neighbor B.
3. Range A is re-added to store S.

In step 3, S will receive a preemptive snapshot for A that requires widening its existing replica, thanks to the intervening merge.

Problematically, the code to check whether this widening was possible, in Store.canApplySnapshotLocked, was incorrectly mutating the range descriptor in the snapshot header! Applying the snapshot would then fail to clear all of the data from the old incarnation of the replica, since the bounds on the range deletion tombstone were wrong. This often resulted in replica inconsistency. Plus, the in-memory copy of the range descriptor would be incorrect until the next descriptor update, though this usually happened quickly, as the replica would apply the change replicas command, which updates the descriptor, soon after applying the preemptive snapshot.

To fix the problem, teach Store.canApplySnapshotLocked to make a copy of the range descriptor before it mutates it. To prevent regressions, add an assertion to the descriptor update path that a range's start key never changes.

With this assertion in place, but without the fix itself, TestStoreRangeMergeReadoptedLHSFollower reliably fails.

Fixes #29252.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Nikhil Benesch <nikhil.benesch@gmail.com>
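The class of bug described in 29677 (mutating a descriptor through a shared pointer instead of working on a copy) can be illustrated with a small self-contained sketch. Everything here is hypothetical: the types are stripped-down stand-ins, `checkWidenedBuggy`, `checkWidenedFixed`, and `assertStartKeyUnchanged` are invented names, and which field the real check overwrote is an assumption. It is not the actual Store.canApplySnapshotLocked code.

```go
package main

import (
	"bytes"
	"fmt"
)

// RangeDescriptor and SnapshotHeader are simplified, hypothetical stand-ins.
type RangeDescriptor struct {
	StartKey, EndKey []byte
}

type SnapshotHeader struct {
	Desc *RangeDescriptor
}

// checkWidenedBuggy narrows the descriptor to the newly covered span
// [existingEnd, EndKey) through the shared pointer, so the snapshot header
// itself is corrupted as a side effect. (Assumed shape of the bug.)
func checkWidenedBuggy(h *SnapshotHeader, existingEnd []byte) bool {
	desc := h.Desc // aliases the header's descriptor
	desc.StartKey = existingEnd
	return bytes.Compare(desc.StartKey, desc.EndKey) < 0
}

// checkWidenedFixed performs the same check on a copy of the descriptor,
// leaving the header untouched.
func checkWidenedFixed(h *SnapshotHeader, existingEnd []byte) bool {
	desc := *h.Desc // struct copy; reassigning its fields does not affect h
	desc.StartKey = existingEnd
	return bytes.Compare(desc.StartKey, desc.EndKey) < 0
}

// assertStartKeyUnchanged sketches the regression guard: a descriptor update
// must never change a range's start key.
func assertStartKeyUnchanged(old, updated *RangeDescriptor) error {
	if !bytes.Equal(old.StartKey, updated.StartKey) {
		return fmt.Errorf("start key changed from %q to %q", old.StartKey, updated.StartKey)
	}
	return nil
}

func main() {
	h := &SnapshotHeader{Desc: &RangeDescriptor{StartKey: []byte("a"), EndKey: []byte("c")}}
	original := *h.Desc

	checkWidenedFixed(h, []byte("b"))
	fmt.Println(assertStartKeyUnchanged(&original, h.Desc)) // prints "<nil>"

	checkWidenedBuggy(h, []byte("b"))
	fmt.Println(assertStartKeyUnchanged(&original, h.Desc) != nil) // prints "true"
}
```

In this toy model, the assertion fires only after the aliasing variant has run, which mirrors the statement above that the assertion without the fix makes the test fail.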
Build succeeded