
error handling during initial data population needs work #985

Merged: 4 commits merged from populate-error into main on May 2, 2022

Conversation

@davepacheco (Collaborator) commented on Apr 28, 2022

Background

Initial population of the CockroachDB database happens in two different ways:

  1. during "rack setup" (or during omicron-dev db-run), we create the omicron database, schema, and the bare minimum data that needs to be there
  2. every time Nexus starts up, it attempts to insert a bunch of built-in users, roles, role assignments, silo, etc. into the database. It retries this until it successfully inserts everything it expects to or runs into some unknown user input problem (e.g., an unexpected conflict, or a SQL syntax error, or something like that).

As much as possible, data should be inserted using the second mode. That's because rack setup only ever happens once, so data added via rack setup will never reach systems that were deployed with an earlier version of Nexus. On the other hand, data inserted using the second mode is automatically inserted on upgrade. That's good: it means that, for the most part, if you want to add a new built-in user, you can just do it and expect it to be there when your code is running.

As for that second mode: when Nexus starts up, there are a few cases to consider (a sketch of the resulting retry loop follows this list):

  • Nexus comes up and CockroachDB is not available. It should retry until CockroachDB is available.
  • Nexus comes up and none of its data is present. It should go ahead and insert it.
  • Nexus comes up and some of its data is present. It should still go ahead and insert it, ignoring primary key conflicts (on the assumption that it's the same data). This deals with crashes during a previous attempt, but also the upgrade case mentioned above: future versions of Nexus can deliver more data knowing it will be inserted the first time the new version of Nexus comes online.
  • Nexus comes up and runs into a non-retryable problem doing any of this (e.g., SQL syntax error). It logs an error. This should eventually raise a support case. It shouldn't ever happen.
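
Here is a minimal, self-contained Rust sketch of the retry loop those cases imply. All names here (`PopulateError`, `populate_step`, `populate_with_retry`) are illustrative stand-ins, not the actual omicron code, which lives in nexus/src/populate.rs:

```rust
use std::time::Duration;

// Illustrative error type: one retryable variant, one permanent.
#[derive(Debug)]
enum PopulateError {
    // Transient (e.g., CockroachDB unreachable): worth retrying.
    ServiceUnavailable(String),
    // Permanent (e.g., SQL syntax error): retrying cannot help.
    Internal(String),
}

// Stand-in for inserting the built-in users, roles, silos, etc.
// Here it fails transiently a few times to exercise the retry path.
fn populate_step(attempt: u32) -> Result<(), PopulateError> {
    if attempt < 3 {
        Err(PopulateError::ServiceUnavailable("Timeout".to_string()))
    } else {
        Ok(())
    }
}

fn populate_with_retry() {
    let mut attempt = 0;
    let mut delay = Duration::from_millis(100);
    loop {
        match populate_step(attempt) {
            Ok(()) => {
                println!("INFO populated built-in data");
                return;
            }
            // Transient failure: warn, back off, and try again.
            Err(PopulateError::ServiceUnavailable(msg)) => {
                println!("WARN failed to populate; will retry in {:?}: {}", delay, msg);
                std::thread::sleep(delay);
                delay *= 2;
                attempt += 1;
            }
            // Permanent failure: log an error and give up; this should
            // eventually raise a support case.
            Err(PopulateError::Internal(msg)) => {
                println!("ERRO gave up trying to populate: {}", msg);
                return;
            }
        }
    }
}

fn main() {
    populate_with_retry();
}
```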

Bugs

This used to work, but a few problems have snuck in. Today, if you start Nexus without CockroachDB running, you see this:

Apr 28 16:10:19.151 ERRO gave up trying to populate built-in users, error_message: InternalError { internal_message: "error creating silo user: Timeout" }, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c

It gave up immediately -- even though this error was transient. The impact of this is that if CockroachDB isn't running when Nexus first starts, it will never insert this data, and all kinds of stuff will be broken after that. The cause here is a code path that was replacing whatever error we got back from the database code (in this case, the retryable ServiceUnavailable) with an "InternalError" (which is not retryable).
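
Here is a hedged sketch of that bug's shape, with hypothetical function names; the pattern below is a simplification for illustration, not the actual Nexus code:

```rust
// Simplified error type mirroring the two kinds of errors in the logs above.
#[derive(Debug)]
enum Error {
    ServiceUnavailable { internal_message: String }, // retryable
    InternalError { internal_message: String },      // not retryable
}

// Buggy pattern: collapses every failure, including the retryable
// ServiceUnavailable, into a non-retryable InternalError.
fn create_silo_user_buggy(db_result: Result<(), Error>) -> Result<(), Error> {
    db_result.map_err(|e| Error::InternalError {
        internal_message: format!("error creating silo user: {:?}", e),
    })
}

// Fixed pattern: propagate the database error unchanged, so a caller
// can still see that the failure was retryable.
fn create_silo_user_fixed(db_result: Result<(), Error>) -> Result<(), Error> {
    db_result
}

fn main() {
    let transient = || -> Result<(), Error> {
        Err(Error::ServiceUnavailable { internal_message: "Timeout".to_string() })
    };
    println!("buggy: {:?}", create_silo_user_buggy(transient()));
    println!("fixed: {:?}", create_silo_user_fixed(transient()));
}
```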

After this change, if I start Nexus without CockroachDB running, I see this instead:

Apr 28 16:50:08.414 WARN failed to populate built-in users; will retry in 588.027336ms, error_message: ServiceUnavailable { internal_message: "Timeout accessing connection pool" }, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c

Note that it's only a warning now, and it says that we will try again. If I then start a CockroachDB node (one that's already had the initial database population done), Nexus finishes loading its data:

Apr 28 17:22:37.978 INFO created 6 built-in users, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c
Apr 28 17:22:38.005 INFO created 9 built-in roles, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c
Apr 28 17:22:38.031 INFO created 5 built-in role assignments, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c
Apr 28 17:22:38.079 INFO created 1 built-in silos, component: DataLoader, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c

There's another problem: with the above issue fixed, if you then restart Nexus, it fails again:

Apr 28 17:23:14.373 ERRO gave up trying to populate built-in users, error_message: InternalError { internal_message: "unexpected database error: duplicate key value violates unique constraint \"primary\"" }

The impact of this is that if we updated Nexus to a version that delivered more data, that data would never be inserted. The cause here is a code path that wasn't ignoring primary key conflicts.
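
A minimal sketch of what the fix has to do, assuming a simplified error type; the real database layer reports this case as the "duplicate key value violates unique constraint" error shown above:

```rust
// Simplified database error type; illustrative only.
#[derive(Debug)]
enum DbError {
    // "duplicate key value violates unique constraint": the row exists.
    UniqueViolation,
    Other(String),
}

// Treat a primary key conflict as success, on the assumption that the
// conflicting row is the same built-in data from a previous run.
fn insert_builtin_row(result: Result<(), DbError>) -> Result<(), DbError> {
    match result {
        Ok(()) => Ok(()),
        Err(DbError::UniqueViolation) => Ok(()),
        Err(other) => Err(other),
    }
}

fn main() {
    // A duplicate key on a second run is not an error:
    assert!(insert_builtin_row(Err(DbError::UniqueViolation)).is_ok());
    // Anything else still propagates:
    assert!(insert_builtin_row(Err(DbError::Other("syntax error".to_string()))).is_err());
    println!("ok");
}
```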

It's a bit of a problem that there aren't automated tests for the error cases. I'll look into how hard this would be.

@davepacheco changed the title from "error handling on silo user create breaks initial population during transient error" to "error handling during initial data population needs work" on Apr 28, 2022
@davepacheco marked this pull request as ready for review on April 28, 2022 18:03
@davepacheco requested a review from jmpesp on April 28, 2022 18:11
@davepacheco (Collaborator, Author) commented on Apr 29, 2022:

(deleted -- I messed up the commits)

@davepacheco (Collaborator, Author) commented:

Okay, I added some tests and force-pushed for clarity because I don't think anybody had looked at this yet. And sorry for the noise. It took a few tries to get this right.

Commit aa91b32 only refactors populate.rs and adds tests for the individual populator functions. The tests check that each populator is idempotent and that it fails with ServiceUnavailable when the database is down. So if you take just that commit (that is, main + the new tests, without the fix for the underlying bugs), you get this failure:

$ cargo test -p omicron-nexus --lib -- --nocapture populate::
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron/nexus)
   Compiling nexus-test-utils v0.1.0 (/home/dap/omicron/nexus/test-utils)
    Finished test [unoptimized + debuginfo] target(s) in 1m 37s
     Running unittests (target/debug/deps/omicron_nexus-9137b6fff380731a)

running 1 test
log file: "/dangerzone/omicron_tmp/omicron_nexus-9137b6fff380731a-test_populator.11741.0.log"
note: configured to log to "/dangerzone/omicron_tmp/omicron_nexus-9137b6fff380731a-test_populator.11741.0.log"
thread 'populate::test::test_populators' panicked at 'called `Result::unwrap()` on an `Err` value: populator PopulateBuiltinUsers: expected to be idempotent, but it failed when run a second time

Caused by:
    Internal Error: error creating silo user: Connection(Query(DatabaseError(UniqueViolation, "duplicate key value violates unique constraint \"primary\"")))', nexus/src/populate.rs:265:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/.tmpyRGVOS
test populate::test::test_populators ... FAILED

failures:

failures:
    populate::test::test_populators

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 76 filtered out; finished in 2.66s

error: test failed, to rerun pass '-p omicron-nexus --lib'

That's good -- the test correctly identifies that one of the existing populators is not idempotent. If we comment out that one assertion, then we get this failure instead:

$ cargo test -p omicron-nexus --lib -- --nocapture populate::
   Compiling omicron-nexus v0.1.0 (/home/dap/omicron/nexus)
   Compiling nexus-test-utils v0.1.0 (/home/dap/omicron/nexus/test-utils)
    Finished test [unoptimized + debuginfo] target(s) in 1m 31s
     Running unittests (target/debug/deps/omicron_nexus-9137b6fff380731a)

running 1 test
log file: "/dangerzone/omicron_tmp/omicron_nexus-9137b6fff380731a-test_populator.11791.0.log"
note: configured to log to "/dangerzone/omicron_tmp/omicron_nexus-9137b6fff380731a-test_populator.11791.0.log"
thread 'populate::test::test_populators' panicked at 'populator PopulateBuiltinUsers: expected ServiceUnavailable when the database was down, but got Internal Error: error creating silo user: Timeout (InternalError { internal_message: "error creating silo user: Timeout" })', nexus/src/populate.rs:297:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test populate::test::test_populators ... FAILED

failures:

failures:
    populate::test::test_populators

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 76 filtered out; finished in 5.36s

error: test failed, to rerun pass '-p omicron-nexus --lib'

That's good too: the test correctly identifies that the populator also doesn't return ServiceUnavailable when the database is offline.

The next commit fixes the underlying bugs. The new tests (and all existing tests) pass on this one.
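
For reference, here is a hypothetical, self-contained sketch of the two properties the new tests check for each populator; the real tests in nexus/src/populate.rs run the populators against a test CockroachDB instance rather than these stand-ins:

```rust
#[derive(Debug)]
enum Error {
    ServiceUnavailable { internal_message: String },
}

struct Db {
    up: bool,
}

// Stand-in for one populator (e.g., PopulateBuiltinUsers). A correct
// populator succeeds on a fresh database, succeeds again on an
// already-populated one (idempotency), and reports ServiceUnavailable
// when the database is unreachable.
fn populate(db: &Db) -> Result<(), Error> {
    if !db.up {
        return Err(Error::ServiceUnavailable {
            internal_message: "Timeout accessing connection pool".to_string(),
        });
    }
    // Inserting rows that already exist is treated as success.
    Ok(())
}

fn main() {
    let live = Db { up: true };
    populate(&live).expect("first run failed");
    // Idempotency: running the same populator again must also succeed.
    populate(&live).expect("expected to be idempotent when run a second time");

    // With the database down, the error must be the retryable variant.
    let down = Db { up: false };
    match populate(&down) {
        Err(Error::ServiceUnavailable { .. }) => println!("retryable, as expected"),
        other => panic!("expected ServiceUnavailable, got {:?}", other),
    }
}
```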

@davepacheco (Collaborator, Author) commented:

Sorry again for the noise. I think this is ready for review now.

@davepacheco enabled auto-merge (squash) on May 2, 2022 15:36
@davepacheco merged commit d2bf956 into main on May 2, 2022
@davepacheco deleted the populate-error branch on May 2, 2022 16:23
@davepacheco mentioned this pull request on Oct 7, 2022
leftwo pushed a commit that referenced this pull request on Oct 16, 2023:
Propolis changes:
PHD: refactor & add support Propolis server "environments" (#547)
Begin making Accessor interface more robust
Update Crucible and Omicron deps for Hakari fixes
Add cloud-init volume generation to standalone
Use specified toolchain version for all GHA checks
Use params to configure rust-toolchain in GHA
Update and lock GHA dependencies

Crucible changes:
Use regions_dataset path for apply_smf (#1000)
Don't unwrap when we can't create a dataset (#992)
Fix tests and update log messages. (#995)
Better backpressure (#990)
Update Rust crate proptest to 1.3.1 (#977)
Read only downstairs can skip Live Repair (#984)
Update Rust crate expectorate to 1.1.0 (#975)
Add trait for `ExtentInner` (#982)
report backpressure in upstairs_info dtrace probe (#987)
Support multiple downstairs operations in GtoS (#985)
leftwo added a commit that referenced this pull request on Oct 16, 2023, with the same Propolis and Crucible change list as above (co-authored by Alan Hanson <alan@oxide.computer>).