sled agent should archive log files on startup and zone shutdown #9226
I agree with the summary you've given a few times - each of these changes individually looks fine and pretty low risk; taken as a whole it's a lot harder to be as confident.
I'm happy to approve based on the manual testing you've done so far, but am also fine if we want to hold off on merging it for more testing or to try it in anger to help with the Crucible stuff or whatever.
```rust
        } = request;
        self
            .dump_setup
            .archive_former_zone_root(&path, completion_tx)
```
I think this is fine but want to reason about it - if this method blocks for longer than we expect, we'll get stuck here and won't receive any `changed()` requests in the other `select!` branches. But if that happens, presumably something is wrong with the `DumpSetup` task, and the only thing our other branches do is instruct it to do work anyway, so presumably they would also get stuck in the same way?
I believe that's correct.
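To make the concern concrete, here is a minimal sketch (not the actual sled-agent code; the task, channel types, and names are invented) of a single task servicing two channels with `select!`. While one branch is awaiting a slow call, the other branch is not polled, so its notifications go unobserved until that call returns:

```rust
use std::time::Duration;
use tokio::sync::{mpsc, watch};

// Stand-in for the archive call; imagine it taking a long time (or hanging)
// if the downstream task it depends on is wedged.
async fn fake_archive(path: &str) {
    println!("archiving {path}");
    tokio::time::sleep(Duration::from_secs(1)).await;
}

async fn reconciler_loop(
    mut archive_requests: mpsc::Receiver<String>,
    mut config: watch::Receiver<u32>,
) {
    loop {
        tokio::select! {
            Some(path) = archive_requests.recv() => {
                // While this await is pending, the `changed()` branch below is
                // not polled, so config updates sit unobserved until archiving
                // finishes (or hangs forever).
                fake_archive(&path).await;
            }
            Ok(()) = config.changed() => {
                println!("saw config change: {}", *config.borrow_and_update());
            }
        }
    }
}
```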
```rust
        }

        // Make a best effort to archive the zone's log files.
        if let Some(zone_dataset_root) = running_zone.root().parent() {
```
I'm vaguely surprised we need `.parent()` here - is that because `root()` returns the path we need plus `/root` on the end?
Correct, and the dump setup stuff appends that `root` on its own. (This also surprised me. It burned one of my early testing laps.)
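A tiny illustration of the path shapes involved (the concrete paths here are invented, assuming the usual illumos layout where a zone's root filesystem lives at `<zonepath>/root`):

```rust
use std::path::Path;

fn main() {
    // Hypothetical example: `running_zone.root()` ends in "/root"; its parent
    // is what the dump-setup code wants, since it appends "root" itself.
    let zone_root =
        Path::new("/pool/ext/<zpool-uuid>/crypt/zone/oxz_nexus_<zone-uuid>/root");
    let zone_dataset_root = zone_root.parent().expect("zone root has a parent");
    assert_eq!(
        zone_dataset_root,
        Path::new("/pool/ext/<zpool-uuid>/crypt/zone/oxz_nexus_<zone-uuid>")
    );
}
```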
```rust
        let newly_adopted_disk = match self.disks.entry(disk_id) {
            Entry::Vacant(vacant) => {
                let new_state = vacant.insert(disk_state);
                match &new_state.state {
                    DiskState::Managed(disk) => {
                        Some((disk_id, *disk.zpool_name()))
                    }
                    DiskState::FailedToManage(..) => None,
                }
            }
            Entry::Occupied(mut occupied) => {
                let old_state = &occupied.insert(disk_state).state;
                let new_state = &occupied.get().state;
                match (old_state, new_state) {
                    (DiskState::Managed(_), _) => None,
                    (_, DiskState::FailedToManage(_)) => None,
                    (
                        DiskState::FailedToManage(_),
                        DiskState::Managed(disk),
                    ) => Some((disk_id, *disk.zpool_name())),
                }
            }
        };
```
This might be simpler if we peeked at `disk_state` before inserting it. Totally untested, but something like
```rust
let is_now_managed = match &disk_state.state {
    DiskState::Managed(disk) => Some((disk_id, *disk.zpool_name())),
    DiskState::FailedToManage(..) => None,
};
let newly_adopted_disk = match self.disks.insert(disk_state) {
    Some(old_state) => match old_state.state {
        // If the disk was already managed, it isn't newly adopted.
        DiskState::Managed(_) => None,
        // The disk is newly adopted if the new state is Managed.
        DiskState::FailedToManage(..) => is_now_managed,
    },
    // We had no old state at all - the disk is newly adopted if the new state is Managed.
    None => is_now_managed,
};
```
I like it, but assuming you're okay with it, I'd like to leave this as-is for now. (Candidly, I don't think it's simpler enough to warrant another testing lap at the moment.)
## Background

### Log rotation and archival

First, some background. Services in control plane zones log to files in /var/svc/log.

### Losing logs during zone shutdown

There are several reasons that we lose logs during zone shutdown.
The fix in this PR is to explicitly archive files in the zone immediately prior to shutting it down.

### Losing logs after sled reset

If a sled resets (as a result of any kind of reboot or a power cycle, as happens during a self-service update), we'd expect that when it comes back up, the log files are still in the ZFS datasets that store the zones' root filesystems. So why do they get lost? When adopting a disk, sled agent always wipes the parent dataset of the datasets that store all of the zones' root filesystems. This is because we intend these filesystems to be transient and don't want software to rely on them being persistent.

The fix in this PR is to explicitly archive files from these datasets before destroying them. This required some surgery, though, because currently this happens inside the

### Losing logs after sled agent restart

If sled agent restarts, we also go through the above code path that wipes the datasets. But even before that, early during startup, sled agent halts and uninstalls all of the zones that look like they were created by the control plane. This wipes their root filesystems. This happens long before almost anything else in sled agent is set up, including the DumpSetup subsystem, config reconciler, disk adoption, etc., so there's no opportunity to trigger archival at this point.

This PR does not attempt to fix this case. We'll still lose logs after a sled agent restart (including crash).

## Changes in this PR

### Log archival ("dump setup" subsystem)
### Disk adoption

A bunch of functionality moved from

### Zones

The zone shutdown path uses the new interface to trigger zone archival and waits for it to complete or fail.

## Risks

Given the late nature of this PR in the release cycle and the complexity of the subsystems involved, I want to be explicit about the risks here.

Risks I think are pretty mitigated:
Other risks:
## Test plan

Here's the basic idea:
I also tested some of this with a Propolis zone. I'll post more details shortly.
## Testing details

I set up
Because I'd run into some strangeness in a4x2, I wanted to test on a system with a boundary NTP zone and also a scrimlet. I picked sled 14:
My "workload" was this bash loop, run from my laptop:
I used
I got one of these with request id "4f0d2150-b6a0-4924-871d-7f44d0673d8e" and confirmed that I saw it in the live Nexus log:
Here's what
We already have one archived one because I'd let this system sit a bit before I got to testing. I checked the sled agent log for messages from
I power-cycled sled 14 from sled 16's switch zone like this:
Once everything came back up,
but the other is there. It just has a different pattern because the live log file had a different name format to start with:
That second one has a bunch of my test requests in it:
and the last one is:
This is indeed the very last one reported by my curl script before it failed:
This is a successful test! We have the log file from the Nexus zone right up until the sled power-cycled.

I also tested sled agent restart in about the same way: starting a workload, then doing the restart, and looking for archived log files. I knew from previous tests that it wouldn't work. I just wanted to make sure nothing terrible happened. There wound up being another periodic archival log file:
But those timestamps are from before I restarted sled-agent:
and not even the first request id from my new curl run was in any of the archived files.

Finally, I tried zone expungement. I copied
Zone list before:
After:
Interestingly, the curl workload kept going without issue. There was a delay while the zone was down, but aside from that, there was no noticeable interruption. We have these archive files:
This time, we didn't have the very last response, but we had the one immediately previous to it:
It's not surprising that we could lose a few seconds' worth of logs since we archive while the zone is still running. Here are all the related log entries for the Nexus zone across all of this testing:
At this point, I provisioned an instance that also landed on sled 14. I went through similar testing with it. Initially:
I started
The last log entries I saw in
After things came up:
and the last log entry is the second-to-last one from my
so we lost a few seconds (again, and as expected) but it worked.
## Alternatives considered

For completeness I want to summarize the other options we considered specifically for the disk adoption path (sled boot and sled agent restart), since that's the one that feels risky.
On the question of: does this actually fix #3860?

For self-service update: yes, I believe this fix covers the ways we lose logs during a self-service update.

For the MUPdate-based update procedure: the procedure for parking the rack shuts down sled agent and all the zones. It then runs

I'm going to leave #3860 open to cover the MUPdate case until we fix it some other way or change the procedure above and verify that it works.
This PR aims to preserve log files from inside control plane zones in two situations where they get lost today during a self-service update:
These affect situations other than self-service update, too, like when a sled panics.
There's a different situation that this does not help: sled-agent restarts (including crashes). More on that below.
Related to #3860, but I'm not going to declare that fixed by this PR (see #9226 (comment)).