
error handling during initial data population needs work #985

Merged: 4 commits merged from populate-error into main on May 2, 2022

Conversation

@davepacheco (Collaborator) commented on Apr 28, 2022

Background

Initial population of the CockroachDB database happens in two different ways:

  1. during "rack setup" (or during omicron-dev db-run), we create the omicron database, schema, and the bare minimum data that needs to be there
  2. every time Nexus starts up, it attempts to insert a bunch of built-in users, roles, role assignments, silo, etc. into the database. It retries this until it successfully inserts everything it expects to or runs into some unknown user input problem (e.g., an unexpected conflict, or a SQL syntax error, or something like that).

As much as possible, data should be inserted using the second mode. That's because rack setup only ever happens once, so data added via rack setup will never reach systems that were deployed with an earlier version of Nexus. On the other hand, data inserted using the second mode is automatically inserted on upgrade. That's good: it means that, for the most part, if you want to add a new built-in user, you can just do it and expect it to be there when your code is running.

As for that second mode: when Nexus starts up, there are a few cases to consider (a sketch of the resulting retry loop follows this list):

  • Nexus comes up and CockroachDB is not available. It should retry until CockroachDB is available.
  • Nexus comes up and none of its data is present. It should go ahead and insert it.
  • Nexus comes up and some of its data is present. It should still go ahead and insert it, ignoring primary key conflicts (on the assumption that it's the same data). This deals with crashes during a previous attempt, but also the upgrade case mentioned above: future versions of Nexus can deliver more data knowing it will be inserted the first time the new version of Nexus comes online.
  • Nexus comes up and runs into a non-retryable problem doing any of this (e.g., SQL syntax error). It logs an error. This should eventually raise a support case. It shouldn't ever happen.
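
Here is a minimal, self-contained Rust sketch of the retry loop those cases imply. All names here (`PopulateError`, `populate_step`, `populate_with_retry`) are illustrative stand-ins, not the actual omicron code, which lives in nexus/src/populate.rs:

```rust
use std::time::Duration;

// Illustrative error type: one retryable variant, one permanent.
#[derive(Debug)]
enum PopulateError {
    // Transient (e.g., CockroachDB unreachable): worth retrying.
    ServiceUnavailable(String),
    // Permanent (e.g., SQL syntax error): retrying cannot help.
    Internal(String),
}

// Stand-in for inserting the built-in users, roles, silos, etc.
// Here it fails transiently a few times to exercise the retry path.
fn populate_step(attempt: u32) -> Result<(), PopulateError> {
    if attempt < 3 {
        Err(PopulateError::ServiceUnavailable("Timeout".to_string()))
    } else {
        Ok(())
    }
}

fn populate_with_retry() {
    let mut attempt = 0;
    let mut delay = Duration::from_millis(100);
    loop {
        match populate_step(attempt) {
            Ok(()) => {
                println!("INFO populated built-in data");
                return;
            }
            // Transient failure: warn, back off, and try again.
            Err(PopulateError::ServiceUnavailable(msg)) => {
                println!("WARN failed to populate; will retry in {:?}: {}", delay, msg);
                std::thread::sleep(delay);
                delay *= 2;
                attempt += 1;
            }
            // Permanent failure: log an error and give up; this should
            // eventually raise a support case.
            Err(PopulateError::Internal(msg)) => {
                println!("ERRO gave up trying to populate: {}", msg);
                return;
            }
        }
    }
}

fn main() {
    populate_with_retry();
}
```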

Bugs

This used to work, but a few problems have snuck in. Today, if you start Nexus without CockroachDB running, you see this:

Apr 28 16:10:19.151 ERRO gave up trying to populate built-in users, error_message: InternalError { internal_message: "error creating silo user: Timeout" }, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c

It gave up immediately -- even though this error was transient. The impact of this is that if CockroachDB isn't running when Nexus first starts, it will never insert this data, and all kinds of stuff will be broken after that. The cause here is a code path that was replacing whatever error we got back from the database code (in this case, the retryable ServiceUnavailable) with an "InternalError" (which is not retryable).
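
Here is a hedged sketch of that bug's shape, with hypothetical function names; the pattern below is a simplification for illustration, not the actual Nexus code:

```rust
// Simplified error type mirroring the two kinds of errors in the logs above.
#[derive(Debug)]
enum Error {
    ServiceUnavailable { internal_message: String }, // retryable
    InternalError { internal_message: String },      // not retryable
}

// Buggy pattern: collapses every failure, including the retryable
// ServiceUnavailable, into a non-retryable InternalError.
fn create_silo_user_buggy(db_result: Result<(), Error>) -> Result<(), Error> {
    db_result.map_err(|e| Error::InternalError {
        internal_message: format!("error creating silo user: {:?}", e),
    })
}

// Fixed pattern: propagate the database error unchanged, so a caller
// can still see that the failure was retryable.
fn create_silo_user_fixed(db_result: Result<(), Error>) -> Result<(), Error> {
    db_result
}

fn main() {
    let transient = || -> Result<(), Error> {
        Err(Error::ServiceUnavailable { internal_message: "Timeout".to_string() })
    };
    println!("buggy: {:?}", create_silo_user_buggy(transient()));
    println!("fixed: {:?}", create_silo_user_fixed(transient()));
}
```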

After this change, if I start Nexus without CockroachDB running, I see this instead:

Apr 28 16:50:08.414 WARN failed to populate built-in users; will retry in 588.027336ms, error_message: ServiceUnavailable { internal_message: "Timeout accessing connection pool" }, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c

Note that it's only a warning now, and it says that we will try again. If I then start a CockroachDB node (one that's already had the initial database population done), Nexus finishes loading its data:

Apr 28 17:22:37.978 INFO created 6 built-in users, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c
Apr 28 17:22:38.005 INFO created 9 built-in roles, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c
Apr 28 17:22:38.031 INFO created 5 built-in role assignments, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c
Apr 28 17:22:38.079 INFO created 1 built-in silos, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c

There's another problem: with the above issue fixed, if you then restart Nexus, it fails again:

Apr 28 17:23:14.373 ERRO gave up trying to populate built-in users, error_message: InternalError { internal_message: "unexpected database error: duplicate key value violates unique constraint \"primary\"" }

The impact of this is that if we updated Nexus to a version that delivered more data, that data would never be inserted. The cause here is a code path that wasn't ignoring primary key conflicts.
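
A minimal sketch of what the fix has to do, assuming a simplified error type; the real database layer reports this case as the "duplicate key value violates unique constraint" error shown above:

```rust
// Simplified database error type; illustrative only.
#[derive(Debug)]
enum DbError {
    // "duplicate key value violates unique constraint": the row exists.
    UniqueViolation,
    Other(String),
}

// Treat a primary key conflict as success, on the assumption that the
// conflicting row is the same built-in data from a previous run.
fn insert_builtin_row(result: Result<(), DbError>) -> Result<(), DbError> {
    match result {
        Ok(()) => Ok(()),
        Err(DbError::UniqueViolation) => Ok(()),
        Err(other) => Err(other),
    }
}

fn main() {
    // A duplicate key on a second run is not an error:
    assert!(insert_builtin_row(Err(DbError::UniqueViolation)).is_ok());
    // Anything else still propagates:
    assert!(insert_builtin_row(Err(DbError::Other("syntax error".to_string()))).is_err());
    println!("ok");
}
```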

It's a bit of a problem that there aren't automated tests for the error cases. I'll look into how hard this would be.

@davepacheco changed the title from "error handling on silo user create breaks initial population during transient error" to "error handling during initial data population needs work" on Apr 28, 2022
@davepacheco marked this pull request as ready for review on April 28, 2022 18:03
@davepacheco requested a review from jmpesp on April 28, 2022 18:11
@davepacheco (Collaborator, Author) commented on Apr 29, 2022:

(deleted -- I messed up the commits)

@davepacheco (Collaborator, Author) commented:

Okay, I added some tests and force-pushed for clarity because I don't think anybody had looked at this yet. And sorry for the noise. It took a few tries to get this right.

Commit aa91b32 only refactors populate.rs and adds tests for the individual populator functions. The tests check that each populator is idempotent and that it fails with ServiceUnavailable when the database is down. So if you take just that commit (that is, main + the new tests, without the fix for the underlying bugs), you get this failure:

$ cargo test -p omicron-nexus --lib -- --nocapture populate::
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron/nexus)
   Compiling nexus-test-utils v0.1.0 (/home/dap/omicron/nexus/test-utils)
    Finished test [unoptimized + debuginfo] target(s) in 1m 37s
     Running unittests (target/debug/deps/omicron_nexus-9137b6fff380731a)

running 1 test
log file: "/dangerzone/omicron_tmp/omicron_nexus-9137b6fff380731a-test_populator.11741.0.log"
note: configured to log to "/dangerzone/omicron_tmp/omicron_nexus-9137b6fff380731a-test_populator.11741.0.log"
thread 'populate::test::test_populators' panicked at 'called `Result::unwrap()` on an `Err` value: populator PopulateBuiltinUsers: expected to be idempotent, but it failed when run a second time

Caused by:
    Internal Error: error creating silo user: Connection(Query(DatabaseError(UniqueViolation, "duplicate key value violates unique constraint \"primary\"")))', nexus/src/populate.rs:265:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/.tmpyRGVOS
test populate::test::test_populators ... FAILED

failures:

failures:
    populate::test::test_populators

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 76 filtered out; finished in 2.66s

error: test failed, to rerun pass '-p omicron-nexus --lib'

That's good -- the test correctly identifies that one of the existing populators is not idempotent. If we comment out that one assertion, then we get this failure instead:

$ cargo test -p omicron-nexus --lib -- --nocapture populate::
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron/nexus)
   Compiling nexus-test-utils v0.1.0 (/home/dap/omicron/nexus/test-utils)
    Finished test [unoptimized + debuginfo] target(s) in 1m 31s
     Running unittests (target/debug/deps/omicron_nexus-9137b6fff380731a)

running 1 test
log file: "/dangerzone/omicron_tmp/omicron_nexus-9137b6fff380731a-test_populator.11791.0.log"
note: configured to log to "/dangerzone/omicron_tmp/omicron_nexus-9137b6fff380731a-test_populator.11791.0.log"
thread 'populate::test::test_populators' panicked at 'populator PopulateBuiltinUsers: expected ServiceUnavailable when the database was down, but got Internal Error: error creating silo user: Timeout (InternalError { internal_message: "error creating silo user: Timeout" })', nexus/src/populate.rs:297:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test populate::test::test_populators ... FAILED

failures:

failures:
    populate::test::test_populators

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 76 filtered out; finished in 5.36s

error: test failed, to rerun pass '-p omicron-nexus --lib'

That's good too: the test correctly identifies that the populator also doesn't return ServiceUnavailable when the database is offline.

The next commit fixes the underlying bugs. The new tests (and all existing tests) pass on this one.
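
For reference, here is a hypothetical, self-contained sketch of the two properties the new tests check for each populator; the real tests in nexus/src/populate.rs run the populators against a test CockroachDB instance rather than these stand-ins:

```rust
#[derive(Debug)]
enum Error {
    ServiceUnavailable { internal_message: String },
}

struct Db {
    up: bool,
}

// Stand-in for one populator (e.g., PopulateBuiltinUsers). A correct
// populator succeeds on a fresh database, succeeds again on an
// already-populated one (idempotency), and reports ServiceUnavailable
// when the database is unreachable.
fn populate(db: &Db) -> Result<(), Error> {
    if !db.up {
        return Err(Error::ServiceUnavailable {
            internal_message: "Timeout accessing connection pool".to_string(),
        });
    }
    // Inserting rows that already exist is treated as success.
    Ok(())
}

fn main() {
    let live = Db { up: true };
    populate(&live).expect("first run failed");
    // Idempotency: running the same populator again must also succeed.
    populate(&live).expect("expected to be idempotent when run a second time");

    // With the database down, the error must be the retryable variant.
    let down = Db { up: false };
    match populate(&down) {
        Err(Error::ServiceUnavailable { .. }) => println!("retryable, as expected"),
        other => panic!("expected ServiceUnavailable, got {:?}", other),
    }
}
```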

@davepacheco (Collaborator, Author) commented:

Sorry again for the noise. I think this is ready for review now.

@davepacheco enabled auto-merge (squash) on May 2, 2022 15:36
@davepacheco merged commit d2bf956 into main on May 2, 2022
@davepacheco deleted the populate-error branch on May 2, 2022 16:23
@davepacheco mentioned this pull request on Oct 7, 2022
leftwo pushed a commit that referenced this pull request on Oct 16, 2023:
Propolis changes:
PHD: refactor & add support Propolis server "environments" (#547)
Begin making Accessor interface more robust
Update Crucible and Omicron deps for Hakari fixes
Add cloud-init volume generation to standalone
Use specified toolchain version for all GHA checks
Use params to configure rust-toolchain in GHA
Update and lock GHA dependencies

Crucible changes:
Use regions_dataset path for apply_smf (#1000)
Don't unwrap when we can't create a dataset (#992)
Fix tests and update log messages. (#995)
Better backpressure (#990)
Update Rust crate proptest to 1.3.1 (#977)
Read only downstairs can skip Live Repair (#984)
Update Rust crate expectorate to 1.1.0 (#975)
Add trait for `ExtentInner` (#982)
report backpressure in upstairs_info dtrace probe (#987)
Support multiple downstairs operations in GtoS (#985)
leftwo added a commit that referenced this pull request on Oct 16, 2023, with the same Propolis and Crucible change list as above (co-authored by Alan Hanson <alan@oxide.computer>).