-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: randomly run with runtime assertions by default #111949
Conversation
49bdf6a
to
6a0b6e0
Compare
01c896e
to
04c9709
Compare
pkg/cmd/roachtest/tests/kvbench.go
Outdated
@@ -96,6 +96,7 @@ func registerKVBenchSpec(r registry.Registry, b kvBenchSpec) { | |||
Suites: registry.ManualOnly, | |||
Tags: registry.Tags("manual"), | |||
Owner: registry.OwnerKV, | |||
Benchmark: true, // TODO: Comments imply this is a benchmark test but this was not set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test name and comments imply this is a benchmark test so I assume the missing benchmark spec is a mistake but if there is a reason why it shouldn't be set let me know.
04c9709
to
1d6e2db
Compare
edbf21b
to
09f34ee
Compare
09f34ee
to
05598c3
Compare
pkg/cmd/roachtest/test_impl.go
Outdated
cockroach string // path to main cockroach binary | ||
cockroachShort string // path to cockroach-short binary compiled with --crdb_test build tag | ||
cockroach string // path to main cockroach binary | ||
cockroachShort string // path to cockroach-short binary compiled with --crdb_test build tag |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: cockroachAssertionsEnabled
(or similar, it seems elsewhere we use cockroachEA
for "enabled assertions") is a better name than cockroachShort
. Yes, the comment explains the real difference between these two things, but everyone sees the variable name, not everyone sees the comment.
@@ -83,7 +83,7 @@ case "$component" in | |||
;; | |||
roachtest) | |||
# Roachtest binary. | |||
bazel_args=(//pkg/cmd/roachtest) | |||
bazel_args=(//pkg/cmd/roachtest --crdb_test) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What benefit does this give exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stealing from what Stan said:
Note, to fully support interop between a roachtest runner, (executing the roachtest binary), and a node, executing the cockroach binary compiled with assertions (i.e., crdb_test), the roachtest binary must also be compiled with assertions to establish the same serialization on the wire. E.g., KVNemesisSeq is used on the wire only under crdb_test builds. The corresponding (sequence) Container type is conditional on the binary being compiled under crdb_test. (See pkg/kv/kvpb/api.proto) Thus, a request serialized from the roachtest runner is compatible with the cockroach binary iff both binaries are built under crdb_test.
The solution here was to just compile all roachtest binaries with crdb_test since it touches very little code, rather than try and figure out a way to have team city selectively use roachtest binaries based on what the cockroach binary is compiled with.
I can add a comment though to make it more clear.
05598c3
to
77d45ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but is there a specific reason why we're skipping runtime checks in a few tests that legitimately break?
Specific comments on kv/split
tests below but in general no there isn't and I agree with the sentiment we should let them fail.
I'm confused about
schemachange/
bulkingest`
I want to say the connection failure was just an infra flake since I can't repro it with the same seed and it's never failed before w/ runtime assertions. But I also don't see anything that jumps out to me in the logs so I don't want to write it off just yet. Let me dig into this some more.
PR suggests the timeout was bumped to 60mins
The duration of how long to run the workload was bumped to 60 mins, the timeout is calculated as 2 * duration.
Would be interesting to run a full nightly build in this branch forcing metamorphic builds for all tests, to reveal some flakes we might not have noticed. Did you run a build like that? I'm interested to see the outcome.
Yeah, that's actually what I was doing most of the time. Here's two runs where I ran all of them #1 and #2. Unfortunately you can see they timed out, which is why I reverted it back for the final few runs.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @rickystewart, and @srosenberg)
pkg/cmd/roachtest/tests/asyncpg.go
line 54 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
In
pgjdbc
, we're forcing the value of this session var -- can't we do this here too?
The asyncpg
tests drops the entire database several times throughout the test which also drops session vars. So even setting the default session value like in pgjdbc
isn't enough.
pkg/cmd/roachtest/tests/multitenant_distsql.go
line 65 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
We shouldn't skip runtime assertions because this test fails IMO: it's precisely why we're doing this! I think the issue will get a lot more eyes when it is reported as a roachtest failure. My reading of the issue you created is that it is a legitimate bug.
SGTM
pkg/cmd/roachtest/tests/multitenant_utils.go
line 281 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Should we delete this?
Done.
pkg/cmd/roachtest/tests/split.go
line 434 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Same comment re: letting this break.
The root cause of this issue was found and from my understanding it's not really an urgent issue. Also from what I've been told the faulty QPS function is deprecated anyway. When we have a metamorphic exclusion list, we can just disable kv-batch-size
and let the other constants/assertions be on.
I'm not opposed to letting this break, I guess I'm more wondering what we do if KV determines the bug isn't worth fixing? Switch it back to t.StandardCockroach()
?
3c00c4d
to
10f3709
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specific comments on
kv/split
tests below
👍 sounds good, the exception seems reasonable.
I want to say the connection failure was just an infra flake since I can't repro it
I think the connection failure is a red herring, the test timed out after 2h.
These are full runs, but are tests running with metamorphic builds all the time? I don't see any custom parameters to force that. Also, when I look at the logs for some tests in that #2 link, I only see the Running with runtime assertions enabled on global seed
message on some tests, not all.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @rickystewart, and @srosenberg)
pkg/cmd/roachtest/test_impl.go
line 166 at r6 (raw file):
// CI build, but we can't assume it exists in every roachtest call. if path := t.RuntimeAssertionsCockroach(); path != "" { t.l.Printf("Runtime assertions enabled")
For instant knowledge of whether we are running with assertions enabled or not, we could add a log message to both branches.
pkg/cmd/roachtest/tests/asyncpg.go
line 54 at r5 (raw file):
Previously, DarrylWong wrote…
The
asyncpg
tests drops the entire database several times throughout the test which also drops session vars. So even setting the default session value like inpgjdbc
isn't enough.
Got it, thanks. Would it make sense to run these tests with assertions enabled, but disabling metamorphic constants?
c58ba1a
to
339f002
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the connection failure is a red herring, the test timed out after 2h.
Yeah after running it 5+ times I finally got it to timeout on creating indexes... Seems like an actual issue and not a flake but seems weird why it doesn't consistently fail on the same seed. Will have to investigate more.
These are full runs, but are tests running with metamorphic builds all the time?
Yeah I did it in an admittedly not very obvious way of just temporarily replacing t.StandardCockroach()
with the runtime binary in the code itself partly because I was tired of typing in a flag every time. I could do another run if you want, more testing wouldn't hurt but it will probably time out.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @rickystewart, and @srosenberg)
pkg/cmd/roachtest/test_impl.go
line 166 at r6 (raw file):
Previously, renatolabs (Renato Costa) wrote…
For instant knowledge of whether we are running with assertions enabled or not, we could add a log message to both branches.
Done.
pkg/cmd/roachtest/tests/asyncpg.go
line 54 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Got it, thanks. Would it make sense to run these tests with assertions enabled, but disabling metamorphic constants?
I was under the impression that just disabling metamorphic constants wasn't easily doable in roachtests since they get enabled on init. If it's simple functionality I'd be happy to try adding, but otherwise I wonder how much benefit it would be over a metamorphic exclusion list .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temporarily replacing
t.StandardCockroach()
with the runtime binary in the code itself
Ah, I see it in the commit now!
Seems like an actual issue
Could be, the last we hear from the index creation is
executing declarative schema change PostCommitPhase stage 2 of 7 with 1 BackfillType op (rollback=false) for CREATE INDEX
2h before the test timed out, so it got stuck there for a really long time. We should probably let sql-foundations investigate this issue when it occurs on master, as it doesn't seem related to anything we're doing.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @rickystewart, and @srosenberg)
pkg/cmd/roachtest/tests/asyncpg.go
line 54 at r5 (raw file):
Previously, DarrylWong wrote…
I was under the impression that just disabling metamorphic constants wasn't easily doable in roachtests since they get enabled on init. If it's simple functionality I'd be happy to try adding, but otherwise I wonder how much benefit it would be over a metamorphic exclusion list .
You can set COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true
when starting the cockroach process, and that will disable use of metamorphic constants in CRDB. Runtime checks (gated with buildutil.CrdbTestBuild
) will still run.
I'm saying there's still value in running tests with these runtime checks (but without changing any constants
) instead of always using "standard cockroach" in cases like asyncpg
. But this can be done in follow-up work too.
339f002
to
68e5974
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably let sql-foundations investigate this issue when it occurs on master,
Sounds good to me
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @rickystewart, and @srosenberg)
pkg/cmd/roachtest/tests/asyncpg.go
line 54 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
You can set
COCKROACH_INTERNAL_DISABLE_METAMORPHIC_TESTING=true
when starting the cockroach process, and that will disable use of metamorphic constants in CRDB. Runtime checks (gated withbuildutil.CrdbTestBuild
) will still run.I'm saying there's still value in running tests with these runtime checks (but without changing any
constants
) instead of always using "standard cockroach" in cases likeasyncpg
. But this can be done in follow-up work too.
Yeah that seems easy enough. In that case, we should only be using StandardCockroach
for excessive memory usage and timeouts.
Added it in for asyncpg
, kv/split
and copy/bank/rows=1000000
and made them all t.Cockroach()
again. Initial runs seem good.
This changes the semantics of `t.Cockroach()` to use a binary with runtime assertions and metamorphic constants enabled by default. Performance tests (indicated by the benchmark TestSpec) will continue using the standard binary, without runtime assertions, as usual. This commit also opts-out other tests that cannot easily be run with runtime assertions or metamorphic constants enabled, most often due to timeouts and metamorphic constant incompatibility. Resolves cockroachdb#86678. Informs cockroachdb#94986. Epic: none Release note: None
68e5974
to
581e671
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @rickystewart, and @srosenberg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something I was thinking about: I think it shouldn't be too difficult to add a "cluster param" in [1] indicating whether the run used metamorphic builds or not. I think it would be nice to give teams an instant way to know if we used metamorphic builds without even having to click through TC artifacts. Thoughts?
[1]
cockroach/pkg/cmd/roachtest/github.go
Line 207 in cedd409
clusterParams := map[string]string{ |
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, @rickystewart, and @srosenberg)
Adds a label indicating that a roachtest failure was on a metamorphic build on github issue posts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but unfortunately it's not the cleanest. The binary used is decided at runtime so it's not available through testSpec like the other params are. I had to pass it in as a separate argument which seems a little messy.
It should work but I'm happy to revert the commit and go back to the drawing board if you have any better suggestions.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, @rickystewart, and @srosenberg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed no impacts to docs dependencies within the relevant files.
No, what you did is kind of what I had in mind originally. There's no elegant way to aggregate random decision made at test run-time and solving that problem is out of scope for this PR. Given the value that this brings (immediately knowing if we used metamorphic binaries), I think this is a good trade-off. |
TFTRs! bors r=renatolabs, mikeCRL, rickystewart, herkolategan |
Build failed (retrying...): |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 581e671 to blathers/backport-release-23.2-111949: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
This changes the semantics of
t.Cockroach()
to use a binary with runtime assertions enabled by default. Performance tests (indicated by the benchmark TestSpec) will be able to continue using the standard binary, without runtime assertions or metamorphic constants, as usual.This commit also opts-out other tests that cannot easily be run with runtime assertions or metamorphic constants enabled, most often due to timeouts or metamorphic constant incompatibility.
Resolves #86678.
Informs #94986.
Epic: none
Release note: None