Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[repo depot 3/n] nexus background task to replicate TUF artifacts across sleds #7129

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

iliana
Copy link
Contributor

@iliana iliana commented Nov 21, 2024

This mostly rounds out the implementation of uploading TUF repos to the rack. The module doc comment describes the flow:

//! `Nexus::updates_put_repository` accepts a TUF repository, which Nexus
//! unpacks, verifies, and reasons about the artifacts in. This uses temporary
//! storage within the Nexus zone, so the update artifacts have to go somewhere.
//! We've settled for now on "everywhere": a copy of each artifact is stored on
//! each sled's M.2 devices.
//!
//! This background task is responsible for getting locally-stored artifacts
//! onto sleds, and ensuring all sleds have copies of all artifacts.
//! `Nexus::updates_put_repository` sends the [`ArtifactsWithPlan`] object to
//! this task via an [`mpsc`] channel and activates it.
//!
//! During each activation:
//!
//! 1. The task moves `ArtifactsWithPlan` objects of the `mpsc` channel and into
//! a `Vec`.
//! 2. The task queries the list of artifacts stored on each sled, and compares
//! it to the list of artifacts in CockroachDB. Sled artifact storage is
//! content-addressed by SHA-256 checksum. Errors are logged but otherwise
//! ignored (unless no sleds respond); the task proceeds as if that sled
//! has no artifacts. (This means that the task will always be trying to
//! replicate artifacts to that sled until it comes back or is pulled out
//! of service.)
//! 3. The task compares the list of artifacts from the database to the list of
//! artifacts available locally.
//! 4. If all the artifacts belonging to an `ArtifactsWithPlan` object have
//! been replicated to at least `MIN_SLED_REPLICATION` sleds, the task drops
//! the object from its `Vec` (thus cleaning up the local storage of those
//! files).
//! 5. The task generates a list of requests that need to be sent:
//! - PUT each locally-stored artifact not present on any sleds to a random
//! sled.
//! - For each partially-replicated artifact, choose a sled that is missing
//! the artifact, and tell it (via `artifact_copy_from_depot`) to fetch the
//! artifact from a random sled that has it.
//! - DELETE all artifacts no longer tracked in CockroachDB from all sleds
//! that have that artifact.
//! 6. The task randomly choose requests up to per-activation limits and
//! sends them. Up to `MAX_REQUESTS` total requests are sent, with up to
//! `MAX_PUT_REQUESTS` PUT requests. Successful and unsuccessful responses
//! are logged.

There's an integration test, but I eventually want to add a live test too to test the behavior of MIN_SLED_REPLICATION. I am manually testing that behavior now.

(previous out-of-date text about tests)

There are no tests yet; I've been weighing my options, although I think it's unlikely any will be ready before I head out on Tuesday for a week:

  • Make parts of the background task generic enough to be unit tested. I haven't started down this path at all, and I'm not sure how much gain we'll get out of it compared to the other paths.
  • Nexus integration test. I want to have this before we start relying on this path. I am relatively close (I have a small cleanup of [repo depot 2/n] sled agent APIs to manage update artifact storage #6764 that I'll post in another PR which will make this more possible), although I will need to determine some way to start the Repo Depot API on an independent port and report that back up to the test execution context. This is the closest to being ready and I'll continue working on it today.
  • Live test. I would also like to write one of these in the limit, because it's the only way to test MIN_SLED_REPLICATION. But I will need to teach the test context to resolve the external Nexus API in order to upload the repository.

After this lands, the remaining work is:

  • When a Dropshot release with [5/n] [dropshot] add support for per-endpoint size limits dropshot#1171 lands here, we will need to adjust the size limits for Nexus system_update_put_repository and Sled Agent artifact_put.
  • Deleting a repository, which includes figuring out how to mark a repository (or individual artifacts?) as "in use" to prevent an operator from deleting a repository that Nexus might need.

@iliana iliana marked this pull request as ready for review November 21, 2024 17:36
@iliana iliana requested a review from sunshowers November 21, 2024 17:47
@iliana
Copy link
Contributor Author

iliana commented Nov 21, 2024

Update: I got enough plumbing done to get an integration test working, so I'll have that here soon.

pub(super) struct SimArtifactStorage {
root: Utf8TempDir,
backend: Arc<Mutex<Storage>>,
dirs: Arc<(Utf8TempDir, Utf8TempDir)>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we come up with a named type for this, or at least a comment? It's hard to tell what this tuple is for.

(Are there two of these to emulate having two M.2s?)

ALTER TABLE omicron.public.sled
ADD COLUMN IF NOT EXISTS repo_depot_port INT4
CHECK (port BETWEEN 0 AND 65535)
NOT NULL DEFAULT 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely do want the default here -- it's necessary to perform this upgrade -- but on existing systems, would we ever patch this value? Are we relying on the sled_upsert to update this value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are relying on sled_upsert, yeah. I am not actually certain if that runs every time Sled Agent starts, I did not look that closely — it seems that there's a background task that periodically notifies Nexus of Sled Agent's existence?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if anyone tries using this column before that value propagates, they'll try communicating with port zero on the sled agent, right?

Should we set this value to 12348 instead of zero? It's a constant anyway, that seems like it would avoid this period of "seeing repot_depo_port = 0"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 12348 the actual port in use?

If we don't want to assume that, then NULL is essentially a legal state and it seems like we should use NULL here and Nexus should generate an error if it runs into a case where it needs this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

12348 is the port in use, it's a const:

pub const SLED_AGENT_PORT: u16 = 12345;
pub const REPO_DEPOT_PORT: u16 = 12348;

As noted elsewhere the main reason this is in the database is to support tests, where the const is not used. So I think a 12348 default for current sleds is relatively safe. (Making this nullable instead I think would be the right thing if we were rolling into this upgrade instead of doing mupdate.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear: the assumption here is that any system on which the migration is running is a "real" system and for those the port has always been 12348 and so filling it in this way is definitely correct. That makes sense to me. Add a comment to that effect?

Comment on lines 154 to 156
# The default activation period for this task is 60s, but we want to activate it
# manually to test the result of each activation.
tuf_artifact_replication.period_secs = 600
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend using a disable config to just explicitly turn this off. Our tests don't run for 600 seconds, until sometimes they do -- disable = true will just stop it for good

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the comment trying to say that the config is specifying a value of 600s[, which is greater than 60s] so that it will only be activated manually? I think that could be more explicit.

disable is a nice pattern. To be clear though I believe you'd need to implement the disable flag yourself (e.g., the way instance_updater task does). I don't think that's a thing you can just set on the config for any background task and have the driver skip it for you.

But it won't work here, at least not if you implement it so that disable means "skip all activations of this task" (which is what the other implementors of disable do). That will prevent manual activations from doing anything, too.

If it's just for tests, I think it's reasonable to set this to some astronomical number (much bigger than 10m), but even better if the test can be resilient to extra activations (see my other comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking based on Sean's comment that I could disable periodic activation without disabling manual activation but that doesn't seem to be the case. If background tasks were passed the ActivationReason it'd be much easier to disable periodic activation without disabling manual activation.

It looks like our "terminate slow tests" timeout in CI is 15 minutes, so setting it to 3600s would be reasonable for tests I think.

Comment on lines 20 to 21
//! 1. The task moves `ArtifactsWithPlan` objects of the `mpsc` channel and into
//! a `Vec`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! 1. The task moves `ArtifactsWithPlan` objects of the `mpsc` channel and into
//! a `Vec`.
//! 1. The task moves `ArtifactsWithPlan` objects off the `mpsc` channel and into
//! a `Vec`.

?

Comment on lines 35 to 43
//! 5. The task generates a list of requests that need to be sent:
//! - PUT each locally-stored artifact not present on any sleds to a random
//! sled.
//! - For each partially-replicated artifact, choose a sled that is missing
//! the artifact, and tell it (via `artifact_copy_from_depot`) to fetch the
//! artifact from a random sled that has it.
//! - DELETE all artifacts no longer tracked in CockroachDB from all sleds
//! that have that artifact.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on the following scenario:

  • Nexus A runs, is really slow. Tries to tell all sleds to use "version N" of artifacts, and delete everything else.
  • Slightly later, Nexus B runs, and is much faster. Tries to tell all sleds to use "version N + 1" of artifacts, and delete everything else.
  • Sled Agents see a mix of requests from Nexus A and B, and end up with a mix of "version N" and "version N+1" artifacts.

We usually use generation numbers when sending requests to sled agents to prevent us from moving backwards. Do we have something similar here? Should we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where would we store the generation number, and who would verify it?

Presumably we'd store it in CockroachDB somewhere, but I think we want a single generation number for the complete list of all artifacts, so we can't store it for each tuf_repo row.

Would sleds store the highest generation number they've seen, and reject any requests with a lower number?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've used this pattern for storing config information about zones, datasets, and physical disks.

  • Nexus stores the generation number in cockroachdb, so that it can be agreed upon between multiple instances.
  • It gets sent to each sled agent as part of the request, and is stored on the M.2
  • Sled agents ignore requests with generation numbers lower than their last-seen value
  • If the generation number is equal to their last seen value, they can confirm that the contents are the same, and otherwise throw an error

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding the big picture right: this task isn't telling sleds to use artifacts from one repo or the other. It's just copying files around. I expect those files aren't used for anything today. A follow-on step will be to have Sled Agent use these artifacts when creating control plane zones. But when it does that, the request from Reconfigurator will say exactly which artifact to use. At no point does Sled Agent care which TUF repo the artifacts are associated with. And it's not limited to having only one artifact for each kind of zone or anything like that.

If all that's correct, then I think the essential constraints are:

  • Once a TUF repo is uploaded, then as long as it remains present in CockroachDB, all sleds should converge to having all of the artifacts in that repo.
    • Once a sled has any artifact from a TUF repo that's present in CockroachDB, it should not remove that artifact until after that repo has been removed from CockroachDB. This is just saying that we never go "backwards".
  • Once an artifact is no longer referenced by any TUF repos present in CockroachDB, then all sleds should converge to not having it.
    • This is less critical but: once a sled has removed an artifact that's not referenced by any TUF repos present in CockroachDB, then it should not download it again (unless some other TUF repo is uploaded that references it).

Importantly, for any two TUF repos N and N+1, whether they get uploaded around the same time or not, as long as both remain present in CockroachDB, it seems perfectly fine for a sled to have literally any combination of artifacts from both repos (including: none of N and all of N + 1, all of N and none of N + 1, etc.) -- as long as Nexus is making sure that all sleds eventually get all of them.

Does all that seem right @iliana?

We could model this instead as: there's a set of artifacts that all sleds should have, that has a generation number, and that number gets bumped whenever a TUF repo gets added or removed. But I don't think that's necessary or helpful here.

Copy link
Contributor Author

@iliana iliana Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is 100% correct; however Sean's concern is also correct in that two Nexuses could be running this background task simultaneously without conferring with each other. Because the only fetch the list of artifacts from the database once, the different tasks that are executing can have different views of what the correct list of artifacts is.

Sean's definition of how a problem could manifest here I think is somewhat incorrect, but I think this problem could indeed occur:

  • Nexus A runs, fetches the list of artifacts at time N, and then goes off to ask all the sleds what they've got. The first sled's response is slowed by congestion/cosmic rays/gremlins/whatever.
  • Nexus B runs because it just got a new repo from an operator, fetches the list of artifacts at time N+1, and starts replicating its local artifacts to other sleds. This all happens before Nexus A gets its first response.
  • Nexus A gets its responses, sees all these sleds have artifacts it doesn't know about, and tells all the sleds to delete them.

The risk here is that the rack loses all of the copies of an artifact. But Nexus B won't delete its local copy of the repository artifacts until the next time it runs: local repos are only dropped if the information we've collected from all the sleds indicate sufficient replication of all the local artifacts in that repository. By that time it will see that all the sleds no longer have a copy and it'll try to replicate them again.

It's also not very likely that this particular problem will occur except in a very contrived case, because each task execution is limited to performing 32 API calls. But if we ever shipped a patch repo that changed only a single artifact it becomes much less contrived. (Or bumped those consts up, because the numbers aren't really based on anything?)

A generation number would add some defense in depth here, but I am not quite sure whether it's helpful or not.

// or three on the system during an upgrade (we've sized the artifact
// datasets to fit at most 10 repositories for this reason).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give more context to the claim "We've sized the artifact datasets to fit at most 10 repositories"?

Are you saying the M.2 'update' dataset is 10x the size of "a repo"? but which repo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// TODO-correctness: This value of 20 GiB is a wild guess -- given TUF repo
// sizes as of Oct 2024, it would be capable of storing about 10 distinct system
// versions.
pub const ARTIFACT_DATASET_QUOTA: ByteCount = ByteCount::from_gibibytes_u32(20);

All of these numbers are dubious.

"failed to send artifacts for replication: {err}"
))
})?;
// Now store the artifacts in the database.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we send artifacts_with_plan, write to the database, and crash before the background task completes, what happens to the state durably recorded in the database? Wouldn't we have a tuf_repo recorded in a database table, that doesn't actualyl exist on the rack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would. There's a // TODO in the background task to mark individual repositories or artifacts as "failed" if any of the artifacts are verifiably missing. But this is not yet implemented.

However I think given the comment below I'm going to flip this logic. We still need to implement some way of noticing missing artifacts though.

// If the last sender for this channel is dropped, then the
// `Nexus` type has also been dropped. This is presumably
// a bug.
panic!("artifact replication receiver disconnected");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I know it's unlikely, but can we return an error here instead? This behavior makes implementing clean-shutdown really painful

// `Utf8TempDir`s storing the artifacts, into the artifact replication
// background task. This is done before the database insert because if
// this fails, we shouldn't record the artifacts in the database.
self.tuf_artifact_replication_tx
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're sending the artifacts on this channel before we write them into the DB -- if the background task activated after we write to this mpsc, but before we call update_tuf_repo_insert, would discard the TUF artifacts we just tried to add?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I think it would. This seems clearly worse than "we recorded the artifacts in the database, but failed to move the artifacts into the mpsc", right? Particularly since failing to move the artifacts shouldn't ever really happen unless somehow the background tasks are dead.

Probably I will flip these around, which also gives me the opportunity to avoid sending the artifacts and activating the task if the repo is either already uploaded or was rejected for whatever reason.

Comment on lines +437 to +439
if new_status == status {
panic!("TUF artifact replication stalled: {new_status:?}");
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This background task involves replicating to a random selection of sleds. If we pick the same set twice, would we stall, and trigger this panic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's possible to pick the same set twice, unless errors are occurring, and we panic within these test functions if any errors are recorded by the task.

@@ -375,6 +375,7 @@ mod test {
let sled = SledUpdate::new(
*sled_id.as_untyped_uuid(),
"[::1]:0".parse().unwrap(),
0,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are quite a lot of places where we use 0 for the repo depot port. I assume this is a sentinel value? It might be nice to use Option instead here. (see also the discussion about whether the field should be NULLable but I think this is true regardless).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general in these test functions, I attempted to follow the same usage of how the sled agent port was specified. In this case you can see the sled agent SocketAddr is localhost port 0.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense.

The end result is that there are many callers where we're repeating the same values that, if I'm understanding right, can't actually be right -- they're just unused. This makes me wonder if both of those ought to be optional. Maybe this should be a SledUpdateBuilder? But anyway it's fine to say that's out of scope here.

Comment on lines 202 to 206
/// Number of requests that were not sent during this activation because the
/// limits had been reached.
pub requests_outstanding: usize,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would read "outstanding" to mean "in-flight" rather than "not sent". Is it "outstanding" in the sense of "these requests remain to-be-done and will be done on the next activation"? How about calling this: "requests_deferred? (or: "requests_wanted" or "requested_skipped" or "requests_overflow" or "requests_over_limit")

/// The number of repositories this Nexus instance is keeping in local
/// storage. Local repositories are deleted once all their artifacts are
/// sufficiently replicated.
pub local_repos: usize,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

even nittier and feel free to ignore: what do you think of unreplicated_repos? Honestly I'm torn because I prefer the fact (the count of local repos) over the inference (they're under-replicated), but I also feel like "local_repos" doesn't communicate its importance if you don't already know how this system works.

Comment on lines +2328 to +2329
-- TODO: Repos fetched over HTTP will not have a SHA256 hash; this is an
-- implementation detail of our ZIP archives.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd strike this -- we'll have to deal with it if/when we add HTTP support but I'm not sure the comment helps.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it because it will certainly help me in the future :) It's more of a reminder.

@@ -2371,6 +2380,8 @@ CREATE TABLE IF NOT EXISTS omicron.public.tuf_artifact (
PRIMARY KEY (name, version, kind)
);

CREATE INDEX IF NOT EXISTS tuf_artifact_sha256 ON omicron.public.tuf_artifact (sha256);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CREATE INDEX IF NOT EXISTS tuf_artifact_sha256 ON omicron.public.tuf_artifact (sha256);
CREATE UNIQUE INDEX IF NOT EXISTS tuf_artifact_sha256 ON omicron.public.tuf_artifact (sha256, id);

(pagination indexes should generally be unique)

Copy link
Contributor Author

@iliana iliana Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the effect of adding a UNIQUE constraint to the sha256 column, which causes the bogus repos we generate during testing to fail to upload (because many of the bogus artifacts have the same sha256 checksum despite having different name/version/kind values). Should I instead be paginating on the composite primary key? Or is this one of the exceptions to the rule (and I should add a comment as to why)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked about this particular thing with @davepacheco: it definitely doesn't make sense to use sha256 as a pagination key here because we don't expect it to be unique. Since the correct pagination key is the primary key, which is stored across 3 columns, this requires an improvement to paginated_multicolumn which is already pretty gnarly trait-wise just for two columns.

Speaking with @sunshowers about the key we're using for tuf_artifact, we both seem to think that (name, version, kind) is a poor choice of key since it does not distribute well, so I'm going to evaluate replacing that key with a UUID in a separate PR. That will block this one but I think it's preferable.

if let Some(presence) = ArtifactHash::from_str(&hash)
.ok()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I got confused because we're swallowing the error when the hash doesn't parse. Would it be reasonable to:

let Ok(hash) = ArtifactHash::from_str(&hash) else {
    error!(log, "sled reported bogus artifact hash"; ...);
    continue;
}

if let Some(presence) = artifacts.get_mut(&hash) {
...

Comment on lines 232 to 234
presence.counts.insert(sled.id, count);
presence.sleds.push(sled);
} else {
delete_requests.push(Request::Delete { sled, hash });
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest an info-level log message here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will already emit an info! or higher-severity message if this request occurs (at the bottom of the execute function). I think this would unnecessarily spam the logs if there's any artifacts for deletion.

// Mark all the artifacts found in `self.local` as locally available.
// If all of the artifacts in an `ArtifactsWithPlan` are sufficiently
// replicated, drop the `ArtifactsWithPlan`.
let sufficient_sleds = MIN_SLED_REPLICATION.min(sleds.len());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the min here just to handle developer systems that don't have MIN_SLED_REPLICATION sleds? If so I'd put this into configuration for this task (which is already distinct for the test suite / omicron-dev vs. a real system) so that we never accidentally consider this acceptable in a production system.

Comment on lines 266 to 268
// TODO: If there are any missing artifacts with no known copies,
// check the status of the assigned Nexus for its repositories. If the
// assigned Nexus is expunged, or if we are the assigned Nexus and we
// don't have the artifact, mark the repository as failed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we file an issue for this? (I'd also strike the comment but that's less important.)

@@ -0,0 +1,447 @@
// This Source Code Form is subject to the terms of the Mozilla Public
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got two comments here about testing and observability. These aren't blockers.

On testing, and feel free to ignore or defer this: there aren't any automated tests here, but I think it might be much easier to automatically test the logic here if it's separated into three phases: (1) collect data from the sled agents and the database, (2) determine what to do (this would produce the list of requests, and potentially also a status summary), (3) make the requests. All the tricky bits would be in (2) and you could write exhaustive tests for that (that have TUF repos come and go, sleds come and go, etc.) without needing to mess around with faking up database state or fake sled agents.

On observability: I'm coming at this like I come at all subsystems like this, which is: assuming at some point we do find a bug in production with it, how can we make sure we can debug it the first time it happens? The kinds of bugs I can imagine are (and these overlap a bunch):

  • TUF repo doesn't seem to be replicating
  • artifact is unexpectedly missing from some sled
    • artifact took too long to be replicated
    • artifact was unexpectedly deleted
  • artifact is unexpectedly present on some sled
    • artifact took too long to be deleted everywhere
    • artifact was unexpectedly re-replicated

I think a few things would help us root-cause any of these problems from the first time they happen:

  • "info"-level log message when a local repo is created or deleted
  • "info"-level log entry for every download/delete request that we make to the sled
  • maybe? "info"-level log entry with each activation listing the TUF repos (or artifacts?) that are believed to be "present"
  • cumulative counters for the stuff we have last-activation counters for, plus one for "repos that we've received and then successfully replicated" and "repos that we've received and then permanently failed to replicate".

It would also be really nice to have a ringbuffer of the last N replicate/delete requests made to sleds. If we make this big enough (say, 32 sleds * 4 repos), we could conceivably have the full history for a particular TUF repo right there in the ringbuffer. This is more work than the others and redundant with the log entries, but much easier to debug with... if Nexus hasn't restarted.

I'd suggest doing at least the "info"-level log entries (some of which may already exist). Recent customer discussions have emphasized the criticality of being able to debug problems without an Oxide person having direct access to the system so I think it's better for us to be proactive rather than wait to discover a bug here.

iliana added a commit that referenced this pull request Jan 14, 2025
`(name, version, kind)` is a natural key for the `tuf_artifact` table:
given a kind and name for an artifact, we should only ever have one copy
of any particular version.

However, this natural key does not distribute well, which [makes it a
poor choice for a primary key in CockroachDB][1]. It also makes
pagination more difficult.

Instead of altering the schema we drop the tables and recreate them, as
any potential contents would be useless to us; until #7129 lands we have
no means to store the artifacts from a repository.

[1]: https://www.cockroachlabs.com/blog/how-to-choose-a-primary-key/
@iliana
Copy link
Contributor Author

iliana commented Jan 23, 2025

Okay, major refactor of this PR is now pushed and ready for review.

These are things I'm planning on punting to later issues to unblock demo work:

internal_client: &ClientTestContext,
) {
let mut status = run_tuf_artifact_replication_step(internal_client).await;
while status.local_repos > 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that we could get hung here if something's broken? It'd be nice if there were a timeout, as generous as it might need to be.

Comment on lines +69 to +72
// If we inserted a new repository, move the `ArtifactsWithPlan` (which
// carries with it the `Utf8TempDir`s storing the artifacts) into the
// artifact replication background task, then immediately activate the
// task.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tl;dr: I think this is fine.

This is different from the way background tasks are usually supposed to work so I wanted to mention it. This feels like a way to sneak input in:

//! Background task activations do not accept input, by design. See "Design
//! notes" below.

//! * Each activation of the reconciler accepts no input. That is, even when we
//! think we know what changed, we do not use that information. This ensures
//! that the reconciler really is idempotent and its actions are based solely
//! on the state that it's watching. Put differently: having reconcilers
//! accept an explicit hint about what changed (and then doing something
//! differently based on that) bifurcates the code: there's the common case
//! where that hint is available and the rarely-exercised case when it's not
//! (e.g., because Nexus crashed and it's the subsequent periodic activation
//! that's propagating this change). This is what we're trying to avoid.

Thinking out loud: I guess the way that's more consistent with the intent is that the task would periodically scan the database for metadata about repos it owns, plus the local filesystem for what it's got on-disk, and determine from first principles what to do next. I think what we've said instead here is: we only keep the important information in-memory, we pass it via this channel, and we essentially ignore anything that's on-disk on Nexus startup. Is that right? I think that's probably fine (with the understood limitation that a crash before a repo is replicated kind of sucks for the user) but just wanted to be explicit about it.

(This is certainly not the only case where a background task gets messages from the rest of Nexus. I know the saga recovery one does, too. In that case though it's not an item of work for it to do, but rather ancillary state that it needs in order to work correctly.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's right. Given the current update-common API we can't unpack the repository into a path well-known to Nexus; everything is a temporary directory. Moving the ArtifactsWithPlan through a channel into the task is conceptually similar to putting it in a well-known disk directory with the exception that if Nexus kerplodes all that information is gone.

Eventually I would like repositories to have an operator-visible status (replicating / replicated / faulted). A repo could go from replicating to faulted if Nexus crashes and loses its information about the local artifacts (this would require the relaunched Nexus to be aware it was responsible for that repo's artifacts, or that another Nexus could identify the original lost its mind); it could also go from replicated to faulted if suddenly an artifact is missing everywhere, somehow.

We certainly could migrate all of the files out of ArtifactsWithPlan into some sort of serialized on-disk format, or otherwise redesign update-common with this workflow in mind (I think it was primarily developed with the Wicket use case, which has fewer implicit guarantees about persistence).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good; I wouldn't change anything here without a more concrete reason to think it's a problem.

Comment on lines +78 to +80
Error::internal_error(&format!(
"failed to send artifacts for replication: {err}"
))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the impact of this? Will we never wind up replicating the artifacts or will it just take until the next periodic activation? It'd be worth adding a comment here.

Comment on lines +10 to +12
[features]
# Set by omicron-package based on the target configuration.
rack-topology-single-sled = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we document who uses this and what it's intended for? My guess was that it controls a bit of policy about whether we expect/require that we have multiple sleds for availability, but then I'd have expected that to be a runtime thing and also we already have stuff in this bucket (like Crucible region allocation?) so does that use some other mechanism?

(edit: more in another comment where we use it)

// The number of sleds that artifacts must be present on before the local copy
// of artifacts is dropped. This is ignored if there are fewer than this many
// sleds in the system.
#[cfg(not(feature = "rack-topology-single-sled"))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the only place this feature is used (which it looks like it is?): could you use the existing Nexus config file for this instead? There are already separate configs for omicron-dev/tests (nexus/tests/config.toml), general development (examples/{config,config-second}.toml), and production (config-partial.toml), IIRC. This won't work if we need it to be different between non-simulated deployments on a single node vs. otherwise. The only single-node non-simulated deployment I can think of is the helios-deploy job. Racklettes, rack2, rack3, and a4x2 all have at least three sleds. Even if we do need this to be different, we could have it be provided by RSS and plumbed through, similar to what we do for "is TLS enabled or not". This might be more work than is worth doing?

I don't mean to make a mountain out of a molehill. This isn't a big deal either way. But features increase the matrix of stuff we should test compilation for and also lead to unnecessary rebuilds when people switch so that's why I'm poking at it a bit.

Comment on lines +149 to +153
fn into_requests<'a>(
self,
sleds: &'a [Sled],
rng: &mut impl rand::Rng,
) -> Requests<'a> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love this function.

}
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you could put a lot of the stuff above this into a separate crate inside Nexus and be able to iterate on it without having to build/link Nexus and its test suite. (Obviously not a blocker, just a thought.)

}
}

let sleds = match self.list_sleds(opctx).await {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let sleds = match self.list_sleds(opctx).await {
let sleds = match self.list_sleds(opctx).context("listing sleds").await {

Comment on lines +580 to +581
// Two artifacts can have the same sha256 but all the values in
// this insertion step are the default value.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Two artifacts can have the same sha256 but all the values in
// this insertion step are the default value.
// Two artifacts can have the same sha256. In that case, this will clobber what's there, but that's okay because all the values inserted up to this point are the default value.

(or just replace the code with inventory.0.entry(artifact.sha256.0).or_insert_with(...), which I think communicates that the author knows the thing might already be there)

Comment on lines +533 to +537
// `Arc::make_mut` will either directly provide a mutable reference
// if there are no other references, or clone it if there are. At this
// point there should never be any other references; we only clone this
// Arc when serializing the ringbuf to a `serde_json::Value`.
let ringbuf = Arc::make_mut(&mut self.request_debug_ringbuf);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use get_mut instead? (I'm just thinking that if this is wrong for whatever reason the behavior of having multiple different sets of counters floating around might make it hard to debug.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not have to panic here if the assumption is wrong. If it does clone, the other copies of ringbuf are temporary and used during serialization.

The goal here is to prevent needing to clone the ringbuf every time we go to serialize it to JSON. There are other options here:

  • Use get_mut and clone if it fails (this is basically what Arc::make_mut does)
  • Use get_mut and panic (I'm not sure what else to do here if it returns None)
  • Create an enum similar to std::borrow::Cow that can accept a slice for serialization, but deserializes to an owned Vec
  • Give up and clone it anyway

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it does clone, the other copies of ringbuf are temporary and used during serialization.

If that's true, agreed, that's not really a problem. It wasn't obvious to me from the make_mut docs which copy would be left in self but I guess it has to be the new one (which is what we'd want here)?

Another option would be to skip the update and log a message if this happens. That might be worse, but again, we're talking about a case we believe can never happen.

I don't have a strong feeling about any of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants