@davepacheco (Collaborator) commented Oct 15, 2025

This PR aims to preserve log files from inside control plane zones in two situations where they get lost today during a self-service update:

  • when zones are shut down gracefully (like when they are expunged or restarted)
  • when the system reboots, especially ungracefully (like during SP and host OS upgrades)

These situations arise outside of self-service update, too, like when a sled panics.

There's a different situation that this does not help: sled-agent restarts (including crashes). More on that below.

Related to #3860, but I'm not going to declare that fixed by this PR (see #9226 (comment)).

@davepacheco changed the title from "Dap/archive on startup" to "sled agent should archive log files on startup and zone shutdown" on Oct 15, 2025
@jgallagher (Contributor) left a comment

I agree with the summary you've given a few times - each of these changes individually looks fine and pretty low risk; taken as a whole it's a lot harder to be as confident.

I'm happy to approve based on the manual testing you've done so far, but am also fine if we want to hold off on merging it for more testing or to try it in anger to help with the Crucible stuff or whatever.

} = request;
self
.dump_setup
.archive_former_zone_root(&path, completion_tx)
Contributor

I think this is fine but want to reason about it - if this method blocks for longer than we expect, we'll get stuck here and won't receive any changed() requests in the other select! branches. But if that happens, presumably something is wrong with the DumpSetup task, and the only thing our other branches do is instruct it to do work anyway, so presumably they would also get stuck in the same way?

Collaborator Author

I believe that's correct.
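
For context, here is a minimal, hypothetical sketch of the pattern under discussion (not the actual sled-agent code; the channel types and function names are invented): a worker loop built on tokio::select! where one branch awaits a potentially slow archival call. While that await is pending, the loop cannot poll its other branches, so changed() notifications simply queue up until it finishes.

use tokio::sync::{mpsc, watch};

async fn worker_loop(
    mut archive_rx: mpsc::Receiver<String>,
    mut config_rx: watch::Receiver<u64>,
) {
    loop {
        tokio::select! {
            Some(path) = archive_rx.recv() => {
                // If this takes a long time, the changed() branch below is
                // not polled until it completes.
                archive_former_zone_root(&path).await;
            }
            Ok(()) = config_rx.changed() => {
                // React to updated config; in practice this also asks the
                // same subsystem to do work, so it would hit the same stall.
            }
            else => break,
        }
    }
}

// Placeholder for the slow operation being discussed.
async fn archive_former_zone_root(_path: &str) {}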

}

// Make a best effort to archive the zone's log files.
if let Some(zone_dataset_root) = running_zone.root().parent() {
Contributor

I'm vaguely surprised we need .parent() here - is that because root() returns the path we need plus /root on the end?

Collaborator Author

Correct, and the dump setup stuff appends that root on its own. (This also surprised me. It burned one of my early testing laps.)
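
To illustrate the point (hypothetical paths, and assuming camino-style UTF-8 paths as used elsewhere in the repo): root() points at the ".../root" directory inside the zone's dataset, so .parent() yields the dataset mountpoint that the dump-setup code expects, since it re-appends "root" itself.

use camino::Utf8Path;

fn main() {
    // root() would return something like this (trailing "/root" included).
    let zone_root = Utf8Path::new("/pool/ext/EXAMPLE/crypt/zone/oxz_nexus_EXAMPLE/root");
    // .parent() strips the trailing "/root", leaving the dataset mountpoint.
    assert_eq!(
        zone_root.parent().map(Utf8Path::as_str),
        Some("/pool/ext/EXAMPLE/crypt/zone/oxz_nexus_EXAMPLE"),
    );
}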

Comment on lines +473 to +495
let newly_adopted_disk = match self.disks.entry(disk_id) {
Entry::Vacant(vacant) => {
let new_state = vacant.insert(disk_state);
match &new_state.state {
DiskState::Managed(disk) => {
Some((disk_id, *disk.zpool_name()))
}
DiskState::FailedToManage(..) => None,
}
}
Entry::Occupied(mut occupied) => {
let old_state = &occupied.insert(disk_state).state;
let new_state = &occupied.get().state;
match (old_state, new_state) {
(DiskState::Managed(_), _) => None,
(_, DiskState::FailedToManage(_)) => None,
(
DiskState::FailedToManage(_),
DiskState::Managed(disk),
) => Some((disk_id, *disk.zpool_name())),
}
}
};
Contributor

This might be simpler if we peeked at disk_state before inserting it. Totally untested, but something like

let is_now_managed = match &disk_state.state {
    DiskState::Managed(disk) => Some((disk_id, *disk.zpool_name())),
    DiskState::FailedToManage(..) => None,
};

let newly_adopted_disk = match self.disks.insert(disk_state) {
    Some(old_state) => match old_state.state {
        // If the disk was already managed, it isn't newly adopted.
        DiskState::Managed(_) => None,
        // The disk is newly adopted if the new state is Managed.
        DiskState::FailedToManage(..) => is_now_managed,
    }
    // We had no old state at all - the disk is newly adopted if the new state is Managed.
    None => is_now_managed,
};

Collaborator Author

I like it, but assuming you're okay with it, I'd like to leave this as-is for now. (Candidly, I don't think it's simpler enough to warrant another testing lap at the moment.)

@davepacheco (Collaborator Author) commented Oct 17, 2025

Background

Log rotation and archival

First, some background. Services in control plane zones log to files in /var/svc/log. logadm rotates these files regularly to new files in the same directory. A subsystem in sled agent called DumpSetup periodically archives the rotated log files. The archival process copies them to a debug dataset on some U.2 device and then deletes the file that's inside the zone. Generally, any log entries more than an hour or two old will wind up in the debug dataset, and we keep those until we start running out of space, at which point I believe they get pruned.
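
As a rough illustration of that copy-then-delete step (a hedged sketch only, not the DumpSetup code; the paths and helper name are invented):

use std::fs;
use std::io;
use std::path::Path;

/// Copy one rotated log file out of a zone into a debug dataset directory,
/// then remove the in-zone copy. Deleting only after a successful copy is
/// what keeps the entries safe even though the zone root is transient.
fn archive_one_rotated_log(rotated: &Path, debug_dataset_dir: &Path) -> io::Result<()> {
    let name = rotated
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "no file name"))?;
    let dest = debug_dataset_dir.join(name);
    fs::copy(rotated, &dest)?;
    fs::remove_file(rotated)?;
    Ok(())
}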

Losing logs during zone shutdown

There are several reasons that we lose logs during zone shutdown.

  1. The archival process only archives log files from running zones. So any logs rotated by logadm between the last periodic archival and when the zone was shut down will never get archived.
  2. logadm only runs periodically inside the zone. So any log entries written to the live file after the last time logadm ran won't have been rotated and so won't be archived even if we addressed the previous item.
  3. Even if we addressed the above two items, to shut down the zone, we both halt and uninstall it. I have not confirmed this but I believe the uninstall deletes the entire root filesystem, including any logs that were there.

The fix in this PR is to explicitly archive files in the zone immediately prior to shutting it down.

Losing logs after sled reset

If a sled resets (as a result of any kind of reboot or a power cycle, as happens during a self-service update), we'd expect that when it comes back up, the log files are still in the ZFS datasets that store the zones' root filesystems. So why do they get lost? When adopting a disk, sled agent always wipes the parent dataset of the datasets that store all of the zones' root filesystems. This is because we intend these filesystems to be transient and don't want software to rely on them being persistent.

The fix in this PR is to explicitly archive files from these datasets before destroying them. This required some surgery, though, because this wiping previously happened inside the sled-storage package and now happens in one of its consumers.

Losing logs after sled agent restart

If sled agent restarts, we also go through the above code path that wipes the datasets. But even before that, early during startup, sled agent halts and uninstalls all of the zones that look like they were created by the control plane. This wipes their root filesystems. This happens long before almost anything else in sled agent is set up, including the DumpSetup subsystem, config reconciler, disk adoption, etc., so there's no opportunity to trigger archival at this point.

This PR does not attempt to fix this case. We'll still lose logs after a sled agent restart (including crash).

Changes in this PR

Log archival ("dump setup" subsystem)

  • Adds an interface for external callers to request archival of a "former zone root". This archival includes the live files.
  • For testing on a4x2, I changed a check that previously prevented this subsystem from doing anything at all on a4x2 so that it now only skips setting up a dump device.

Disk adoption

A bunch of functionality moved from sled-storage/src/disk.rs to sled-agent/config-reconciler/src/reconciler_task/external_disks.rs. Previously, the config reconciler's disk_adopt path called sled_storage::disk::Disk::new(), which would wipe the zones dataset in the pool. I ripped this behavior out of sled_storage altogether. Instead, after adopting all the disks in a pass of the reconciler, we call a new function to archive and destroy the individual datasets found underneath the zones dataset.
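
A hedged sketch of that shape, with invented helper names standing in for the real ZFS and DumpSetup calls (archival failures are tolerated; a failure to destroy a dataset is propagated, which is the un-adopt path discussed under Risks below):

use anyhow::Result;

struct ChildDataset {
    name: String,       // e.g. "oxp_.../crypt/zone/oxz_..."
    mountpoint: String, // e.g. "/pool/ext/.../crypt/zone/oxz_..."
}

// Placeholders standing in for the real ZFS / DumpSetup operations.
async fn list_child_datasets(_zones_dataset: &str) -> Result<Vec<ChildDataset>> { Ok(vec![]) }
async fn archive_former_zone_root(_mountpoint: &str) -> Result<()> { Ok(()) }
async fn destroy_dataset(_name: &str) -> Result<()> { Ok(()) }

/// After adopting a disk, archive and then destroy each child dataset found
/// under that pool's "zones" dataset (but not the zones dataset itself).
async fn cleanup_former_zone_roots(zones_dataset: &str) -> Result<()> {
    for child in list_child_datasets(zones_dataset).await? {
        // Best effort: an archival failure is logged but doesn't stop us.
        if let Err(e) = archive_former_zone_root(&child.mountpoint).await {
            eprintln!("failed to archive logs from {}: {e:#}", child.mountpoint);
        }
        // A destroy failure is propagated to the caller.
        destroy_dataset(&child.name).await?;
    }
    Ok(())
}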

Zones

The zone shutdown path uses the new interface to trigger zone archival and waits for it to complete or fail.
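
That "trigger and wait" shape might look roughly like the following. This is a hypothetical sketch with invented types and names; the completion_tx in the diff excerpt earlier suggests a similar completion channel, but nothing here should be read as the actual interface.

use tokio::sync::{mpsc, oneshot};

struct ArchiveRequest {
    zone_root: String,
    completion_tx: oneshot::Sender<Result<(), String>>,
}

/// Ask the DumpSetup task to archive a zone root and wait for it to report
/// completion or failure before proceeding with zone teardown.
async fn archive_before_shutdown(
    dump_setup_tx: &mpsc::Sender<ArchiveRequest>,
    zone_root: String,
) {
    let (completion_tx, completion_rx) = oneshot::channel();
    let request = ArchiveRequest { zone_root: zone_root.clone(), completion_tx };
    // Best effort: on any failure, log and continue shutting down anyway.
    if dump_setup_tx.send(request).await.is_ok() {
        match completion_rx.await {
            Ok(Ok(())) => (),
            Ok(Err(e)) => eprintln!("archiving {zone_root} failed: {e}"),
            Err(_) => eprintln!("DumpSetup task dropped the request"),
        }
    }
}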

Risks

Given how late in the release cycle this PR is landing and the complexity of the subsystems involved, I want to be explicit about the risks here.

Risks I think are pretty well mitigated:

  • That other consumers of sled-storage were relying on the wiping behavior. There are other consumers, but they either don't use it with U.2s (e.g., installinator), are only tests, or are in code that's never used at all (which John removed in Remove StorageManager and related dead code #9229).
  • We're now destroying only the child datasets of the zones dataset, not the dataset itself. This would be a problem if there were properties on it that we wanted to reset or if there was data in it, but that shouldn't be the case.
  • Bugs in re-implementing the logic in its new home (e.g., there's new logic that determines whether to do this based on whether we think we've just adopted the disk).
  • Is there some way we could wind up archiving the same zone concurrently? No -- that's why I built all this into the existing DumpSetup, which is a single task.
  • Could failure to archive break anything? No -- the failure is not propagated, so a failure here cannot break anything that was previously working.

Other risks:

  • There's a new error path during disk adoption: it's possible that we adopt the disk, tell other subsystems about it, then do this cleanup, but fail to destroy a dataset. If that happens, we un-adopt the disk. I assume this is okay because the rest of the system has to deal with disks going away, and it should also be extremely unlikely (we shouldn't fail to destroy datasets on real systems), but it's conceivable.
  • Is there some way we could wind up archiving the live log file from a zone that's still running and will keep running? This could be bad because the service would continue to write to the deleted file, which would never get rotated. It'd eventually run out of space, and this failure mode is very hard to identify. I don't think this can happen, considering the three cases:
    • During zone shutdown, we're shutting down the zone.
    • On sled boot, no zones are running.
    • On sled agent restart, sled agent has already shut down all the zones.
  • We're now blocking a bunch of config-reconciler work behind this archival. Note that the archival doesn't have to succeed: if it fails, we'll proceed anyway. But if we get stuck doing it, that'd be a problem. The worst (ordinary) case I can imagine is that if the DumpSetup task goes off to run savecore and that takes 10 minutes, that could block disk import and the rest of the config reconciler. This would only happen on boot after a panic, and the impact should only be that the sled takes that extra amount of time to come back up. It's not great, but doesn't seem like a blocker here.

Test plan

Here's the basic idea:

  • I deployed commit dbc6e68 to the berlin racklette.
  • I picked a Nexus zone on sled 14 for most of my testing (oxz_nexus_abb7c8e7-281d-48e9-a197-ca8484a924e7).
  • I hit it with a workload that makes one request per second and logs the request id as seen by the client.
  • I performed various operations that I wanted to verify:
    • Power-cycled the sled (this is close to what self-service update does)
    • Restarted sled agent
    • Expunged the zone (using reconfigurator-cli to generate a new blueprint and import it)
  • In each case, I checked for changes in the archived log files (in /pool/ext/*/crypt/debug/*nexus*/*nexus*).
    • In the power cycle and expungement cases, there should be a new log file and it should have relatively recent request ids from my workload.
  • I spot-checked the sled-agent logs for relevant log entries to make sure I didn't see anything fishy.

I also tested some of this with a Propolis zone.

I'll post more details shortly.

@davepacheco (Collaborator Author)

Testing details

I set up berlin with commit dbc6e68 using rkadm:

dap@castle ~ $ /opt/rackletteadm/bin/rkadm --toml /opt/rackletteadm/configs/berlin/berlin-rkadm.toml run --commit-a-install dbc6e68f287f64d1f6b7c0a7808e519056b6a796 --skip-checks --skip-upgrade --commit-b-upgrade dbc6e68f287f64d1f6b7c0a7808e519056b6a796

Because I'd run into some strangeness in a4x2, I wanted to test on a system with a boundary NTP zone and also a scrimlet. I picked sled 14:

root@oxz_switch1:~# omdb db sleds
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221,[fd00:1122:3344:102::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (198.0.0)
note: listing all commissioned sleds (use -F to filter, e.g. -F in-service)          
 SERIAL       IP                             ROLE      POLICY      STATE   ID
 BRM42220082  [fd00:1122:3344:101::1]:12345  scrimlet  in service  active  48a4bb91-ae89-4da7-bed9-3bf92111eaab
 BRM06240029  [fd00:1122:3344:104::1]:12345  -         in service  active  8022e2e4-348e-462a-993f-d1ac614d00cf
 BRM42220011  [fd00:1122:3344:102::1]:12345  scrimlet  in service  active  c82cc479-28fb-4e75-b912-6db22a96f6ff
 BRM44220007  [fd00:1122:3344:103::1]:12345  -         in service  active  f878c5de-73cd-4394-9e71-7700af103344
root@oxz_switch1:~# pilot host ls
CUBBY IP                        SERIAL      IMAGE
14    fe80::aa40:25ff:fe04:219  BRM42220011 ci dbc6e68/4b23a73 2025-10-17 18:02
15    fe80::aa40:25ff:fe04:49a  BRM44220007 ci dbc6e68/4b23a73 2025-10-17 18:02
16    fe80::aa40:25ff:fe04:198  BRM42220082 ci dbc6e68/4b23a73 2025-10-17 18:02
17    fe80::aa40:25ff:fe04:b5c  BRM06240029 ci dbc6e68/4b23a73 2025-10-17 18:02
root@oxz_switch1:~# omdb nexus blueprints show target | grep boundary
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:103::4]:12232
    boundary_ntp      4a6c7ab6-4559-4b21-8caf-d6ab12e5d6b6   install dataset   in service    fd00:1122:3344:101::10
    boundary_ntp   f57a9538-e11a-4e6c-9bec-7f6d6e53c762   install dataset   in service    fd00:1122:3344:102::f

My "workload" was this bash loop, run from my laptop:

while curl -i --resolve recovery.sys.berlin.eng.oxide.computer:443:172.20.30.2 https://recovery.sys.berlin.eng.oxide.computer/daptest; do sleep 1; done

I used --resolve here to make sure I was hitting the specific Nexus instance on this system. This produces a series of output like this:

HTTP/2 404  
content-type: application/json
x-request-id: 45c8fa1c-1b5c-4d47-9f0d-2892961dacf9
date: Fri, 17 Oct 2025 01:19:19 GMT
content-length: 84

{
  "request_id": "45c8fa1c-1b5c-4d47-9f0d-2892961dacf9",
  "message": "Not Found"
}

I got one of these with request id "4f0d2150-b6a0-4924-871d-7f44d0673d8e" and confirmed that I saw it in the live Nexus log:

BRM42220011 # grep "4f0d2150-b6a0-4924-871d-7f44d0673d8e" $(svcs -z oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 -L nexus) | looker
20:16:35.971Z INFO 316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 (dropshot_external): request completed
    error_message_external = Not Found
    error_message_internal = no route found (no path in router)
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.4/src/server.rs:855
    latency_us = 12
    local_addr = 172.30.2.5:443 
    method = GET
    remote_addr = 172.20.17.110:51861
    req_id = 4f0d2150-b6a0-4924-871d-7f44d0673d8e
    response_code = 404
    uri = https://recovery.sys.berlin.eng.oxide.computer/daptest

Here's what oxlog shows for logs:

BRM42220011 # /opt/oxide/oxlog/oxlog logs --current --archived --extra oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 nexus
/pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/root/var/svc/log/oxide-nexus:default.log
/pool/ext/0015c91c-1949-46f8-8146-e9186dd01820/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.log.1760730300

We already have one archived file because I'd let this system sit a bit before I got to testing. I checked the sled agent log for messages from DumpSetup and we have the expected ones: setting things up, then one periodic archival of all the zones:

BRM42220011 # grep DumpSetup $(svcs -L sled-agent) | looker
00:00:48.138Z INFO SledAgent (DumpSetup-worker): DumpSetup poll loop started.
    file = sled-agent/config-reconciler/src/dump_setup.rs:630
00:00:50.926Z INFO SledAgent (DumpSetup-worker): Updated view of disks
    core_datasets = 2
    debug_datasets = 0
    dump_slices = 2
    file = sled-agent/config-reconciler/src/dump_setup.rs:720
00:00:50.968Z INFO SledAgent (DumpSetup-worker): Set process core dump directory to CoreDataset("/pool/int/ae50d73a-b206-4785-a55d-3f4b17acce95/crash")
    file = sled-agent/config-reconciler/src/dump_setup.rs:862
00:00:50.980Z INFO SledAgent (DumpSetup-worker): No archival destination for crash dumps yet chosen.
    file = sled-agent/config-reconciler/src/dump_setup.rs:1056
00:05:35.610Z INFO SledAgent (DumpSetup-worker): Updated view of disks
    core_datasets = 2
    debug_datasets = 9
    dump_slices = 2
    file = sled-agent/config-reconciler/src/dump_setup.rs:720
00:05:50.192Z INFO SledAgent (DumpSetup-worker): Set DumpSlicePath("/devices/pci@38,0/pci1de,fff9@3,3/pci1344,3100@0/blkdev@w00A075013280BD0C,0:e") as system dump slice
    file = sled-agent/config-reconciler/src/dump_setup.rs:986
19:45:01.045Z INFO SledAgent (DumpSetup-worker): Archiving 116 log files from oxz_switch zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:02.889Z INFO SledAgent (DumpSetup-worker): Archiving 106 log files from oxz_internal_dns_2acded62-b6c3-4694-972c-a7e7a333c02a zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:02.966Z INFO SledAgent (DumpSetup-worker): Archiving 109 log files from oxz_ntp_f57a9538-e11a-4e6c-9bec-7f6d6e53c762 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.023Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_cockroachdb_d975901a-be4f-433e-bfe3-bb8a8f09eab7 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.073Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_084484b5-1ec8-4502-8ff7-913216f41c1e zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.125Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_0157efc4-148b-45c5-a7fa-be098d2d53a5 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.181Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_8897ec64-9085-4e72-9a33-5ae154b38feb zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.233Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_802f3cee-0afb-4ce7-b11e-fcb79d56f857 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.285Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_df6c0b65-eb55-4b6b-89b8-e49f6aa5b0e9 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.341Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_8cb0a1ff-5b93-480e-89ae-1f69001fc8c2 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.394Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_9d0d0761-aa1a-494c-9e4a-094cb6b1f6be zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.448Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_2f8c07fc-c516-4f07-a87a-d491a4b572e0 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.502Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_crucible_0d715d86-0a08-4dd9-a494-72edfc456814 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.553Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
19:45:03.976Z INFO SledAgent (DumpSetup-worker): Archiving 107 log files from oxz_clickhouse_d661166f-ed05-426b-9059-a07f76203e12 zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183
20:15:07.284Z INFO SledAgent (DumpSetup-worker): Archiving 1 log files from oxz_switch zone
    file = sled-agent/config-reconciler/src/dump_setup.rs:1183

I power-cycled sled 14 from sled 16's switch zone like this:

root@oxz_switch1:~# pilot sp cycle 14
14  BRM42220011        ok
root@oxz_switch1:~# date
October 17, 2025 at 08:18:57 PM UTC
root@oxz_switch1:~#

Once everything came back up, oxlog still found just the one log file:

BRM42220011 # /opt/oxide/oxlog/oxlog logs --current --archived --extra oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 nexus
/pool/ext/0015c91c-1949-46f8-8146-e9186dd01820/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.log.1760730300

but the other file is there. It just has a different naming pattern because the live log file had a different name format to start with:

BRM42220011 # ls /pool/ext/*/crypt/debug/*/*nexus*
/pool/ext/0015c91c-1949-46f8-8146-e9186dd01820/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.log.1760730300
/pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.1760732327

That second one has a bunch of my test requests in it:

BRM42220011 # grep -c daptest /pool/ext/*/crypt/debug/*/*nexus*
/pool/ext/0015c91c-1949-46f8-8146-e9186dd01820/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.log.1760730300:0
/pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.1760732327:115

and the last one is:

BRM42220011 # grep -h daptest /pool/ext/*/crypt/debug/*/*nexus* | tail -1 | looker
20:18:47.721Z INFO 316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 (dropshot_external): request completed
    error_message_external = Not Found
    error_message_internal = no route found (no path in router)
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.4/src/server.rs:855
    latency_us = 11
    local_addr = 172.30.2.5:443
    method = GET
    remote_addr = 172.20.17.110:51990
    req_id = 40cd92df-d7a1-4f20-93f9-daefb07faddd
    response_code = 404
    uri = https://recovery.sys.berlin.eng.oxide.computer/daptest

This is indeed the very last one reported by my curl script before it failed:

HTTP/2 404
content-type: application/json
x-request-id: 40cd92df-d7a1-4f20-93f9-daefb07faddd
date: Fri, 17 Oct 2025 20:18:47 GMT
content-length: 84

{
  "request_id": "40cd92df-d7a1-4f20-93f9-daefb07faddd",
  "message": "Not Found"
}curl: (35) Recv failure: Operation timed out

This is a successful test! We have the log file from the Nexus zone right up until the sled was power-cycled.


I also tested sled agent restart in about the same way: starting a workload, then doing the restart, and looking for archived log files. I knew from previous tests that it wouldn't work. I just wanted to make sure nothing terrible happened. There wound up being another periodic archival log file:

BRM42220011 # ls /pool/ext/*/crypt/debug/*/*nexus*
/pool/ext/0015c91c-1949-46f8-8146-e9186dd01820/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.log.1760730300
/pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.1760732327
/pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.log.1760732996

But those timestamps are from before I restarted sled-agent:

BRM42220011 # svcadm restart sled-agent
BRM42220011 # svcs sled-agent
STATE          STIME    FMRI
online         20:36:38 svc:/oxide/sled-agent:default

and not even the first request id from my new curl run was in any of the archived files.


Finally, I tried zone expungement. I copied reconfigurator-cli to the sled 16 switch zone. I started the workload. Then I expunged the zone like this:

root@oxz_switch1:~# omdb reconfigurator export r1.json
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:101::3]:32221,[fd00:1122:3344:104::3]:32221,[fd00:1122:3344:104::4]:32221,[fd00:1122:3344:103::3]:32221,[fd00:1122:3344:102::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (198.0.0)
assembling reconfigurator state ... done
saving to r1.json ... done

root@oxz_switch1:~# ./reconfigurator-cli
generated RNG seed: scarcely-pliable-merlin
〉load r1.json 6a0ae138-2c4a-410c-8e20-ddaa91d3d8f7
loaded data from "r1.json"
result:
  system:
    using collection 6a0ae138-2c4a-410c-8e20-ddaa91d3d8f7 as source of sled inventory data
    loaded sleds: 48a4bb91-ae89-4da7-bed9-3bf92111eaab, 8022e2e4-348e-462a-993f-d1ac614d00cf, c82cc479-28fb-4e75-b912-6db22a96f6ff, f878c5de-73cd-4394-9e71-7700af103344
    loaded collections: 3b93a63d-8186-40a5-a5dd-e86a3e6271a6, 02069fd3-7473-4392-9bd7-54a081d2c832, cac010ae-a04a-4d8e-8b76-9af4aa2bf20a, 15496eb4-e7b0-45e1-bee2-a68cd7b13b73, 6a0ae138-2c4a-410c-8e20-ddaa91d3d8f7
    loaded blueprints: cf4d616a-45a6-4734-a1cd-a92cd1c73227
    loaded service IP pool ranges: [V4(Ipv4Range { first: 172.20.30.1, last: 172.20.30.10 })]
    loaded internal DNS generations: (none)
    loaded external DNS generations: (none)
  config:
    configured external DNS zone name: berlin.eng.oxide.computer
    configured silo names: default-silo, recovery
    active Nexus generation: 1

〉blueprint-list
T ENA ID                                   PARENT TIME_CREATED
* yes cf4d616a-45a6-4734-a1cd-a92cd1c73227 <none> 2025-10-17T19:37:19.773Z
〉blueprint-edit cf4d616a-45a6-4734-a1cd-a92cd1c73227 expunge-zones 316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
blueprint 8d34f762-c30c-4915-a305-203c59321190 created from blueprint cf4d616a-45a6-4734-a1cd-a92cd1c73227: expunged zone 316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 from sled c82cc479-28fb-4e75-b912-6db22a96f6ff

〉blueprint-save 8d34f762-c30c-4915-a305-203c59321190 expunge.json
saved blueprint 8d34f762-c30c-4915-a305-203c59321190 to "expunge.json"
〉
root@oxz_switch1:~# omdb -w nexus blueprints import expunge.json
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:103::4]:12232
uploaded new blueprint 8d34f762-c30c-4915-a305-203c59321190
root@oxz_switch1:~# omdb -w nexus blueprints target set 8d34f762-c30c-4915-a305-203c59321190 enabled
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:103::4]:12232
set target blueprint to 8d34f762-c30c-4915-a305-203c59321190

Zone list before:

BRM42220011 # zoneadm list
global
oxz_switch
oxz_internal_dns_2acded62-b6c3-4694-972c-a7e7a333c02a
oxz_ntp_f57a9538-e11a-4e6c-9bec-7f6d6e53c762
oxz_crucible_8897ec64-9085-4e72-9a33-5ae154b38feb
oxz_crucible_802f3cee-0afb-4ce7-b11e-fcb79d56f857
oxz_crucible_9d0d0761-aa1a-494c-9e4a-094cb6b1f6be
oxz_crucible_8cb0a1ff-5b93-480e-89ae-1f69001fc8c2
oxz_crucible_2f8c07fc-c516-4f07-a87a-d491a4b572e0
oxz_crucible_0d715d86-0a08-4dd9-a494-72edfc456814
oxz_crucible_084484b5-1ec8-4502-8ff7-913216f41c1e
oxz_crucible_df6c0b65-eb55-4b6b-89b8-e49f6aa5b0e9
oxz_crucible_0157efc4-148b-45c5-a7fa-be098d2d53a5
oxz_cockroachdb_d975901a-be4f-433e-bfe3-bb8a8f09eab7
oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
oxz_clickhouse_d661166f-ed05-426b-9059-a07f76203e12

After:

BRM42220011 # zoneadm list
global
oxz_switch
oxz_internal_dns_2acded62-b6c3-4694-972c-a7e7a333c02a
oxz_ntp_f57a9538-e11a-4e6c-9bec-7f6d6e53c762
oxz_crucible_8897ec64-9085-4e72-9a33-5ae154b38feb
oxz_crucible_802f3cee-0afb-4ce7-b11e-fcb79d56f857
oxz_crucible_9d0d0761-aa1a-494c-9e4a-094cb6b1f6be
oxz_crucible_8cb0a1ff-5b93-480e-89ae-1f69001fc8c2
oxz_crucible_2f8c07fc-c516-4f07-a87a-d491a4b572e0
oxz_crucible_0d715d86-0a08-4dd9-a494-72edfc456814
oxz_crucible_084484b5-1ec8-4502-8ff7-913216f41c1e
oxz_crucible_df6c0b65-eb55-4b6b-89b8-e49f6aa5b0e9
oxz_crucible_0157efc4-148b-45c5-a7fa-be098d2d53a5
oxz_cockroachdb_d975901a-be4f-433e-bfe3-bb8a8f09eab7
oxz_clickhouse_d661166f-ed05-426b-9059-a07f76203e12
oxz_nexus_6882894c-5126-4989-95ca-2a23c224ff41

Interestingly, the curl workload kept going without issue. There was a delay while the zone was down, but aside from that, there was no noticeable interruption.

We have these archive files:

BRM42220011 # ls /pool/ext/*/crypt/debug/*/*nexus*
/pool/ext/0015c91c-1949-46f8-8146-e9186dd01820/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.log.1760730300
/pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.1760732327
/pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.log.1760732996
/pool/ext/5b9cd107-fc17-46ad-9087-52acb1cf8799/crypt/debug/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6/oxide-nexus:default.1760733750

This time, we didn't have the very last response, but we had the one immediately previous to it:

BRM42220011 # grep "072cd886-b9f4-49ed-8c26-07d296b16b54" /pool/ext/*/crypt/debug/*/*nexus*
BRM42220011 # grep -h "e84e3fbc-e1ad-48d8-96e7-6b30ba109696" /pool/ext/*/crypt/debug/*/*nexus*| looker
20:42:30.381Z INFO 316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 (dropshot_external): request completed
    error_message_external = Not Found
    error_message_internal = no route found (no path in router)
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.4/src/server.rs:855
    latency_us = 16
    local_addr = 172.30.2.5:443
    method = GET
    remote_addr = 172.20.17.110:52291
    req_id = e84e3fbc-e1ad-48d8-96e7-6b30ba109696
    response_code = 404
    uri = https://recovery.sys.berlin.eng.oxide.computer/daptest

It's not surprising that we could lose a few seconds' worth of logs since we archive while the zone is still running.

Here are all the related log entries for the Nexus zone across all of this testing:

BRM42220011 # grep oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6 $(svcs -L sled-agent) | grep -i former | looker
00:03:25.487Z INFO SledAgent (ConfigReconcilerTask): archiving logs from former zone root
    file = sled-agent/config-reconciler/src/reconciler_task/external_disks.rs:920
    path = /pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
00:03:25.487Z INFO SledAgent (DumpSetup): requesting archive of former zone root
    file = sled-agent/config-reconciler/src/dump_setup.rs:381
    zone_root = /pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
00:03:27.221Z INFO SledAgent (DumpSetup-worker): Archived logs from former zone root
    file = sled-agent/config-reconciler/src/dump_setup.rs:670
    zone_root = /pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
00:03:27.221Z INFO SledAgent (ConfigReconcilerTask): destroying former zone root
    dataset_name = oxp_85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
    file = sled-agent/config-reconciler/src/reconciler_task/external_disks.rs:929
20:37:10.772Z INFO SledAgent (ConfigReconcilerTask): archiving logs from former zone root
    file = sled-agent/config-reconciler/src/reconciler_task/external_disks.rs:920
    path = /pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
20:37:10.772Z INFO SledAgent (DumpSetup): requesting archive of former zone root
    file = sled-agent/config-reconciler/src/dump_setup.rs:381
    zone_root = /pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
20:37:11.061Z INFO SledAgent (DumpSetup-worker): Archived logs from former zone root
    file = sled-agent/config-reconciler/src/dump_setup.rs:670
    zone_root = /pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
20:37:11.061Z INFO SledAgent (ConfigReconcilerTask): destroying former zone root
    dataset_name = oxp_85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
    file = sled-agent/config-reconciler/src/reconciler_task/external_disks.rs:929
20:42:31.213Z INFO SledAgent (DumpSetup): requesting archive of former zone root
    file = sled-agent/config-reconciler/src/dump_setup.rs:381
    zone_root = /pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6
20:42:31.489Z INFO SledAgent (DumpSetup-worker): Archived logs from former zone root
    file = sled-agent/config-reconciler/src/dump_setup.rs:670
    zone_root = /pool/ext/85d31101-b257-42cc-bc55-87e63b5170ea/crypt/zone/oxz_nexus_316f3b34-3a2b-4bbc-aaa8-068a8dadf4a6

At this point, I provisioned an instance that also landed on sled 14. I went through similar testing with it.

Initially:

BRM42220011 # ls -l /pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/zone/oxz_propolis-server_a0ceac88-bd2c-4587-8195-00e8eb1db0a2/root/var/svc/log//*propolis*
-rw-r--r--   1 root     root      102304 Oct 17 21:24 /pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/zone/oxz_propolis-server_a0ceac88-bd2c-4587-8195-00e8eb1db0a2/root/var/svc/log//system-illumos-propolis-server:default.log
BRM42220011 # /opt/oxide/oxlog/oxlog logs oxz_propolis-server_a0ceac88-bd2c-4587-8195-00e8eb1db0a2 propolis-server --current --archived --extra
/pool/ext/5419daad-8017-49c2-ba71-52df5793d51a/crypt/zone/oxz_propolis-server_a0ceac88-bd2c-4587-8195-00e8eb1db0a2/root/var/svc/log/system-illumos-propolis-server:default.log

I started tail -f'ing the live log file. Then I power-cycled the box:

root@oxz_switch1:~# pilot sp cycle 14
14  BRM42220011        ok
root@oxz_switch1:~# date
October 17, 2025 at 09:28:01 PM UTC

The last log entries I saw in tail were:

21:27:45.674Z INFO propolis-server (vm_state_driver): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.2/src/server.rs:867
    latency_us = 2369
    local_addr = [fd00:1122:3344:102::1:0]:35109
    method = GET
    remote_addr = [fd00:1122:3344:101::5]:57948
    req_id = e067bdc6-ab84-4f75-aa69-a60befecd9a6
    response_code = 200
    uri = /0b5e64f5-b193-45f2-a4e3-019c6e6457da
21:27:55.674Z INFO propolis-server (vm_state_driver): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.2/src/server.rs:867
    latency_us = 2272
    local_addr = [fd00:1122:3344:102::1:0]:35109
    method = GET
    remote_addr = [fd00:1122:3344:101::5]:57948
    req_id = 1498ed4c-116a-4d31-a60c-b83d62d012e0
    response_code = 200
    uri = /0b5e64f5-b193-45f2-a4e3-019c6e6457da

After things came up:

BRM42220011 # ls -l /pool/ext/*/crypt/debug/*a0ceac88-bd2c-4587-8195-00e8eb1db0a2*/*propolis*
-rw-r--r--   1 root     root      140817 Dec 28  1986 /pool/ext/73692e34-104a-45ae-abd2-3d5a62734abe/crypt/debug/oxz_propolis-server_a0ceac88-bd2c-4587-8195-00e8eb1db0a2/system-illumos-propolis-server:default.1760736465

and the last log entry is the second-to-last one from my tail:

BRM42220011 # tail -n 1 /pool/ext/*/crypt/debug/*a0ceac88-bd2c-4587-8195-00e8eb1db0a2*/*propolis* | looker
21:27:45.674Z INFO propolis-server (vm_state_driver): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.2/src/server.rs:867
    latency_us = 2369
    local_addr = [fd00:1122:3344:102::1:0]:35109
    method = GET
    remote_addr = [fd00:1122:3344:101::5]:57948
    req_id = e067bdc6-ab84-4f75-aa69-a60befecd9a6
    response_code = 200
    uri = /0b5e64f5-b193-45f2-a4e3-019c6e6457da

so we lost a few seconds (again, and as expected) but it worked.

@davepacheco (Collaborator Author)

Alternatives considered

For completeness I want to summarize the other options we considered specifically for the disk adoption path (sled boot and sled agent restart), since that's the one that feels risky.

  1. Do nothing. This has been a problem for a long time; the increased urgency just comes from the introduction of self-service update, since using that feature means we'll lose a bunch of logs. The main risk is that if we hit some update-related issue, we may not have the logs to debug it. While we had previously determined this wasn't a blocker, having hit crucible#1788 ("panic in crucible::client::DownstairsClient::process_io_completion during rack update") gave me serious pause about doing nothing.
  2. Instead of moving the "wipe" functionality like this PR did, we could have the caller supply a callback of some sort (either a &dyn Fn or a more specific trait) that does the archiving. This feels the least risky in that it's pretty straightforward to reason about, but it makes the code considerably more complicated and confusing going forward.
  3. Remove the "wipe" functionality on import altogether (and do not replace it). The risk is that (we believed) we'd now be starting zones with non-empty roots that previously could never start that way and that may break some of their assumptions. I now believe that on sled agent restart, this wouldn't be an issue (because we uninstall the zone and so wipe these datasets anyway), but it would be a problem on sled boot.
  4. Move the "wipe" functionality to where we really want it: a combination of zone start/shutdown. (Shutdown to cover the expungement case and startup to cover ungraceful restarts.) This is the right long-term answer but it's tricky and out of scope for this release.
  5. Try to do the archival before the import (and not have to change the wipe behavior). This doesn't work because the dataset is not mounted yet if the sled just booted.

@davepacheco (Collaborator Author)

On the question of: does this actually fix #3860?

For self-service update: yes, I believe this fix covers the ways we lose logs during a self-service update.

For the MUPdate-based update procedure: the procedure for parking the rack shuts down sled agent and all the zones. It then runs zoneadm uninstall for all the zones. I believe this will remove all the logs we might want. So none of this will help that case. However, I'm not sure there's a reason for us to uninstall and delete those zones, since they're not persistent anyway. If we skip these steps, then I believe the data won't be deleted and it should be archived successfully when the system comes back up. I have not tested this.

I'm going to leave #3860 open to cover the MUPdate case until we fix it some other way or change the procedure above and verify that it works.

@davepacheco marked this pull request as ready for review October 17, 2025 22:52
@davepacheco enabled auto-merge (squash) October 18, 2025 00:02
@davepacheco merged commit 27a6508 into main Oct 18, 2025
17 checks passed
@davepacheco deleted the dap/archive-on-startup branch October 18, 2025 01:14