error handling during initial data population needs work #985
Conversation
Okay, I added some tests and force-pushed for clarity, since I don't think anybody had looked at this yet. Sorry for the noise; it took a few tries to get this right. Commit aa91b32 only refactors populate.rs and adds tests for the individual populator functions. The tests check that each one is idempotent and that each fails with ServiceUnavailable when the database is down. So if you take just that commit (that is, "main" plus the new tests, without the fix for the underlying bugs), we get this failure:
That's good -- the test correctly identifies that one of the existing populators is not idempotent. If we comment out that one assertion, then we get this failure instead:
That's good too: the test correctly identifies that the populator also doesn't return ServiceUnavailable when the database is offline. The next commit fixes the underlying bugs. The new tests (and all existing tests) pass on this one.
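For illustration, the two properties each populator test checks look roughly like this. This is a hedged sketch, not the actual test code: `test_setup`, `populate_built_in_users`, and the surrounding scaffolding are hypothetical stand-ins for the real items in populate.rs and the test support crate.

```rust
use omicron_common::api::external::Error;

// Sketch of the two properties checked for each populator. The helper
// names here are hypothetical stand-ins for the real test scaffolding.
#[tokio::test]
async fn test_populator_properties() {
    // Hypothetical helper: starts CockroachDB and builds a DataStore.
    let (datastore, mut db) = test_setup().await;

    // Property 1: running a populator twice must succeed (idempotence),
    // because Nexus re-runs population on every startup and upgrade.
    populate_built_in_users(&datastore).await.expect("first populate");
    populate_built_in_users(&datastore).await.expect("second populate");

    // Property 2: with the database down, the populator must fail with
    // the retryable ServiceUnavailable error, not a terminal one.
    db.cleanup().await.unwrap();
    let error = populate_built_in_users(&datastore).await.unwrap_err();
    assert!(matches!(error, Error::ServiceUnavailable { .. }));
}
```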
Sorry again for the noise. I think this is ready for review now.
Background
Initial populating of the CockroachDB database happens in two different ways:

1. when the database is created from scratch (during rack setup, or with `omicron-dev db-run`), we create the omicron database, the schema, and the bare minimum data that needs to be there
2. when Nexus starts up, it inserts the rest of the built-in data (e.g., built-in users)

As much as possible, data should be inserted using the second mode. That's because rack setup only ever happens once, so any data we add via the first mode will never get added to systems that were deployed with a previous version of Nexus. On the other hand, if data is inserted using the second mode, then it will be inserted automatically on upgrade. That's good: it means that for the most part, if you want to add a new built-in user, you can just do it and expect it to be there when your code is running.
As for that second mode: when Nexus starts up, there are a few cases to consider (see the sketch after this list):

- the data may not be there yet, in which case Nexus should insert it
- the data may already be there (because this Nexus, or a previous version of it, inserted it before), in which case the insert should still succeed, treating the conflict as success
- the database may be temporarily unreachable, in which case Nexus should log a warning and retry until it succeeds
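A minimal sketch of startup logic that handles these cases, assuming a populator interface like the one in populate.rs (`populate_all` and `DataStore` are illustrative stand-ins, not the real API):

```rust
use std::time::Duration;

// Sketch only: `DataStore`, `Error`, and `populate_all` stand in for
// the real Nexus types; the control flow is what matters here.
async fn populate_with_retry(datastore: &DataStore) {
    loop {
        match populate_all(datastore).await {
            // Success covers both "data inserted" and "data was already
            // there" -- each populator ignores conflicts internally.
            Ok(()) => break,
            // Transient failures (e.g., CockroachDB not up yet) are
            // retryable: warn and try again shortly.
            Err(Error::ServiceUnavailable { .. }) => {
                eprintln!("warning: populate failed transiently; will retry");
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
            // Anything else reflects a real bug; give up loudly.
            Err(error) => panic!("failed to populate built-in data: {}", error),
        }
    }
}
```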
Bugs
This used to work, but a few problems have snuck in. Today, if you start Nexus without CockroachDB running, you see this:
It gave up immediately, even though this error was transient. The impact: if CockroachDB isn't running when Nexus first starts, Nexus will never insert this data, and all kinds of things will be broken from then on. The cause is a code path that replaced whatever error came back from the database code (in this case, the retryable ServiceUnavailable) with an "InternalError" (which is not retryable).
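The shape of the bug and of the fix, sketched with hypothetical function names (the real code lives in the datastore layer; `Error::internal_error` is omicron's constructor for a non-retryable internal error):

```rust
use omicron_common::api::external::Error;

// Buggy shape: the closure discards the original error, so a retryable
// ServiceUnavailable comes back out as a non-retryable InternalError.
fn load_row_buggy(result: Result<u64, Error>) -> Result<u64, Error> {
    result.map_err(|_| Error::internal_error("populate failed"))
}

// Fixed shape: pass retryable errors through unchanged so the caller's
// retry loop can recognize them; only wrap genuinely unexpected errors.
fn load_row_fixed(result: Result<u64, Error>) -> Result<u64, Error> {
    result.map_err(|error| match error {
        e @ Error::ServiceUnavailable { .. } => e,
        other => Error::internal_error(&format!("populate failed: {}", other)),
    })
}
```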
After this change, if I start Nexus without CockroachDB running, I see this instead:
Note that it's only a warning now, and it says that we will try again. If I then start a CockroachDB node (one that has already had the initial database population done), Nexus finishes loading its data:

There's another problem: with the above issue fixed, if you then restart Nexus, it fails again:
The impact of this is that if we updated Nexus to a version that delivered more built-in data, that data would never be inserted. The cause here is a code path that wasn't ignoring primary-key conflicts, which are expected here: on restart, most of the data is already present, and the insert should treat the conflict as success.
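With Diesel (which the Nexus datastore uses), the idempotent-insert shape looks roughly like this. The table, model, and ID here are hypothetical stand-ins, not the real omicron schema:

```rust
use diesel::prelude::*;

// Hypothetical table standing in for one of the built-in-data tables.
table! {
    user_builtin (id) {
        id -> Uuid,
        name -> Text,
    }
}

// Built-in rows use fixed IDs, so re-running this insert hits a
// primary-key conflict. `on_conflict_do_nothing()` turns that conflict
// into a successful no-op instead of an error, making the populator
// idempotent.
fn insert_builtin_user(conn: &mut PgConnection) -> QueryResult<usize> {
    let fixed_id: uuid::Uuid =
        "00000000-0000-4000-8000-000000000001".parse().unwrap();
    diesel::insert_into(user_builtin::table)
        .values((
            user_builtin::id.eq(fixed_id),
            user_builtin::name.eq("builtin-example"),
        ))
        .on_conflict_do_nothing()
        .execute(conn)
}
```

Something along these lines is what "ignoring primary key conflicts" means concretely, though the actual Nexus code goes through its own async datastore wrappers.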
It's a bit of a problem that there aren't automated tests for the error cases. I'll look into how hard this would be.