test_paginated_single_column_descending is flaky #540
FWIW, this seems to affect many tests (presumably all of them) that create a test database.
We're blowing the assertion in omicron/test-utils/src/dev/db.rs (line 335, at 2e6586b), which we're reaching unexpectedly: at that point we've forked the cockroach process, but it has apparently exited. I think this may be a slightly different failure mode than we saw last week, but I'm not positive.
Of course, we'd like to know what cockroachdb emitted to stdout and stderr when this happened. Those outputs are redirected to files in the temporary directory. While we go out of our way to preserve the temporary directory in the case where we timed out waiting for CockroachDB, we don't do that when CockroachDB just exits. I'm fixing this in #545.
With #545 (really, from commit f6b7da2 in that PR), I attempted to repro this by running the failing test in a loop. It took 54 attempts. I wasn't tracking time, but it was less than an hour.
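The exact invocation isn't preserved in this thread; a loop of roughly this shape matches the description, but the test filter and everything else here is a guess, not the command that was actually used:

```bash
# Hypothetical repro loop: run the flaky test repeatedly until it fails,
# counting attempts (the thread reports a failure on attempt 54).
i=0
while cargo test -p omicron-nexus test_paginated_single_column_descending; do
    i=$((i + 1))
    echo "passed attempt $i"
done
echo "failed on attempt $((i + 1))"
```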
If we follow the log file:
it points us to the temporary directory:
Finally we have some stderr!
Yowza. I wonder if CockroachDB is finding something invalid about its on-disk state -- if there was something not quite safe about #493.
Update here after a day of debugging -- here's my theory:
This can be tested manually with:
$ cargo build -p omicron-nexus --tests
$ ls -d $(cat /tmp/crdb-base/temp-dirs-record.txt) # This will exist
$ cargo test -p omicron-nexus db::pagination
$ ls -d $(cat /tmp/crdb-base/temp-dirs-record.txt) # This will no longer exist
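Piecing the theory together from the commands above and from the eventual fix (#553, described later in this thread), it appears that CockroachDB records its temporary directory -- as an absolute path -- in temp-dirs-record.txt inside the store, and cleans up whatever that file lists the next time a cockroach starts from that store. Since every test starts from a copy of the seed store, a test's cockroach can end up deleting a directory that still lives under the shared seed location. A hypothetical illustration (the recorded path is invented):

```bash
# The seed store's record file names an absolute path under the seed directory.
$ cat /tmp/crdb-base/temp-dirs-record.txt
/tmp/crdb-base/cockroach-temp123456789

# A test copies the seed and starts cockroach; cockroach's startup cleanup
# removes the recorded directory...
$ cargo test -p omicron-nexus db::pagination

# ...out from under the shared seed store that later tests still copy from.
$ ls -d /tmp/crdb-base/cockroach-temp123456789
/tmp/crdb-base/cockroach-temp123456789: No such file or directory
```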
(apologies for fat-fingering tab+enter; this issue is definitely not closed yet)
I've found sources from CRDB devs indicating that:
- Data directories should be movable
- Absolute paths in CRDB probably shouldn't exist
My latest understanding is that @smklein has identified what looks like a real bug (apparently in CockroachDB, since they say that what we're doing should basically work). However, I don't see how it explains all of the behavior we've seen, including the output I pasted above. Here's a bunch more data. I still don't fully understand what's happening. The output I pasted above suggested to me that something had modified the database's on-disk state, so I ran a DTrace script to watch for writes to the seed directory:
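That script isn't shown above; the one-liner below is a sketch of the idea only -- the "crdb-base" pattern and the exact probes are assumptions, not the script that was actually run:

```bash
# Sketch: report any process that opens a path containing "crdb-base".
pfexec dtrace -qn '
    syscall::open:entry, syscall::creat:entry
    /strstr(copyinstr(arg0), "crdb-base") != NULL/
    {
        printf("%Y %s[%d] %s(%s)\n", walltimestamp, execname, pid,
            probefunc, copyinstr(arg0));
    }'
```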
I then ran the test loop again, and the script reported activity in the seed directory around the exact time that a bunch of tests failed:
They failed in strange ways, all seeming to reflect broken local database state. That could make sense if something had mucked with the seed copy while they were running. What was it that did that? From the DTrace output above, it was a cockroach process.
There are a few things worth noting here:
I searched $TMPDIR and found no test that appeared to have run that cockroach process. I am really regretting that I didn't grab more information about that process while it was running.
Note that while repro'ing, I reverted #544 because it seemed like it was making this harder to hit. So I was running at a5713c6, plus a revert of commit e262a05. I applied this local change to attempt to preserve all database temp directories, regardless of failure:
Then I appeared to reproduce the problem again (although admittedly with a new failure mode):
(At this point I'm using the build described above.) I did this with a clean TMPDIR=/dangerzone/omicron_tmp2. I still found no test that appeared to have run that cockroach process. At this point, I wonder where that cockroach came from.
So, here's another data point, embarrassing confession, and another problem: since these runs take so long, I was working on #546 in parallel. You guessed it: on the same machine, with the same value of TMPDIR. Now, it's conceivable that while building that clone, I triggered a build of test-utils that kicked off the cockroach process that DTrace found. That said, from the output I have from the other workspace, that didn't happen: the last time I built it was 14:50, which was 8m before the failure. But it's possible that something like rust-analyzer did it. I think it's a legit problem that if you were to build from two clones concurrently on the same machine (and with the same TMPDIR), the second one could totally destroy the first one's test run. I'll file a new issue for that.
Updated D script:
Working on reproducing with:
I ran that for a while without reproducing a problem, so I decided to make my repro loop better reflect what happens in CI:
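The revised loop isn't shown either; presumably (this is a guess) it ran the broader test suite rather than a single test, something along these lines:

```bash
# Hypothetical loop meant to look more like CI (not the author's actual one):
# run the whole omicron-nexus suite repeatedly, keeping each iteration's
# temporary files in their own directory. Paths are invented.
for i in $(seq 1 100); do
    export TMPDIR=/dangerzone/omicron_tmp/run$i
    mkdir -p "$TMPDIR"
    echo "=== iteration $i (TMPDIR=$TMPDIR) ==="
    cargo test -p omicron-nexus || { echo "failed on iteration $i"; break; }
done
```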
After 5 iterations (30 minutes), this produced:
I believe this "unexpected EOF" is one of the failures we saw in CI when Sean and I looked last Friday. (Unlike a lot of what I was chasing yesterday, which I'm afraid were unrelated self-inflicted issues.) Unfortunately, the directory doesn't exist:
That's even though I'm building from a5713c6, which includes #545. In this case, we don't know why.
I ran another loop that failed after 16 iterations in 82m with the error that @ahl originally saw:
Fortunately, we now have the database directory, which includes CockroachDB's stderr:
This looks to me like a variant of the issue Sean mentioned above. In his example, one test's cockroach removed the temporary directory recorded in the seed database's temp-dirs-record.txt.
After another 6 iterations in 33m, I ran into another one:
Same failure mode:
I will probably put this aside for the time being and pick up trying to reproduce it in the new year once #553 is landed. Alternatively, at that point I think it would be fair to wait until we hit this organically in CI again, now that we expect to have more debugging information in the CI artifacts.
This should fix #540 by ensuring that no "temporary files" pointing to the seed directory exist after shutdown. Additionally, it adds an assertion that the temp-dirs-record.txt file is empty (as suggested in cockroachdb/cockroach#74231).
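A quick manual spot-check of the invariant that new assertion enforces (a sketch; the /tmp/crdb-base location is taken from the commands earlier in this thread and may not match the current layout):

```bash
# After the seed database is built and shut down gracefully, the record file
# should be empty: no temporary directories left registered for a later
# cockroach to "clean up".
if [ -s /tmp/crdb-base/temp-dirs-record.txt ]; then
    echo "temp-dirs-record.txt still lists temp directories:" >&2
    cat /tmp/crdb-base/temp-dirs-record.txt >&2
    exit 1
fi
echo "temp-dirs-record.txt is empty, as expected"
```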
With #553 merged, I'm considering this closed. Please feel free to re-open if we see any other flake related to DB setup / teardown / access!
In retrospect, the "found pointer to free object" failure looks similar to the one reported under #1223.
This one has some history! While working on some database-related APIs, I noticed that my tests using CockroachDB were a **lot** slower than other tests. To some degree, I expect this, but also, this was on the order of ~5-10 seconds per test doing very little other than CockroachDB setup and teardown.

After doing a little profiling, I noticed that tests took several seconds to perform teardown, which increases significantly if any schema changes occurred. Why does teardown take so long? Well, it turns out we are sending `SIGTERM` to the CockroachDB process to gracefully terminate it, instead of `SIGKILL`, which would terminate it much more abruptly.

This is where the history comes in: gracefully terminating CockroachDB was a choice we made a few years ago to avoid a test flake: #540. Basically, when creating the "seed database" -- a database where we apply schema changes that are shared between many tests -- we want to gracefully terminate to avoid leaving the database in a "dirty state", where it might need to flush work and clean up intermediate state. In the case of #540, that "dirty intermediate state" was an absolute path, which meant copies of that seed database trampled on each other if graceful shutdown did not complete.

Our approach was to apply graceful termination to all CockroachDB teardown invocations, but this was overkill. Only the seed database expects to have its storage still in use after the call to `cleanup` -- all other test-only invocations expect to immediately remove their storage. They don't need to terminate gracefully, and arguably, should just exit as quickly as they can.

This PR changes the disposition:

- `cleanup_gracefully` uses `SIGTERM`, and waits for graceful cleanup. This is still used when constructing the seed db.
- `cleanup` uses `SIGKILL`, and kills the database immediately. This is now used for all other use cases.

As an example of the performance difference, here's a comparison for some datastore tests:

## Before

```
SETUP PASS [ 1/1] crdb-seed: cargo run -p crdb-seed --profile test
PASS [ 6.996s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version
PASS [ 7.344s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress
PASS [ 8.609s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
------------
Summary [ 11.386s] 3 tests run: 3 passed, 228 skipped
```

## After

```
SETUP PASS [ 1/1] crdb-seed: cargo run -p crdb-seed --profile test
PASS [ 2.087s] nexus-db-queries db::datastore::db_metadata::test::ensure_schema_is_current_version
PASS [ 3.054s] nexus-db-queries db::datastore::db_metadata::test::schema_version_subcomponents_save_progress
PASS [ 4.442s] nexus-db-queries db::datastore::db_metadata::test::concurrent_nexus_instances_only_move_forward
------------
Summary [ 7.550s] 3 tests run: 3 passed, 228 skipped
```
See: https://buildomat.eng.oxide.computer/wg/0/details/01FQF2458RQ7G2VFKJRVBRRZK9/rdBWzehf1zRLzuNhI9iAokOFh33iGAWV7WqKZQNhv4eh5F4g/01FQF24K9BW4YFS4YKB2P2SMD2