disks: add support for nested RAID devices #1010
Conversation
Awesome! 🎉 We have no test coverage for RAID right now. If adding RAID support to the blackbox tests turns out to be too hard, let's at least fix up the old CL RAID tests in kola and add a test for this case.
👍 to adding RAID (and other device type) tests back into kola for *COS.
OK, added a test, but this now requires coreos/coreos-assembler#1573! Though I'm hitting some SELinux violations I need to investigate. Likely missing a relabel on the root of the RAID device itself.
Restarted CI on this now that coreos/coreos-assembler#1573 is in!
Woohoo!
So actually, this does race once in a while (I'm hitting it in roughly 1 out of 10 runs on QEMU). What happens is that the first layer of RAID gets created successfully, but then calling `mdadm --create` for the second layer on top of it sometimes fails. Looking at the docs though, it seems like general I/O doesn't have to wait for syncing to complete, so e.g. we could create a filesystem on top of a fresh RAID device. I guess what we could do is wait for the lower arrays to finish syncing before calling `mdadm --create` for the next layer.
We should check that the sync is actually the issue, rather than some shorter race. I'm pretty sure I've created RAIDs on top of unsynced RAIDs before, and waiting for sync is infeasible (it might take hours).
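(For reference, one quick way to confirm whether resync is still in progress on a lower array is to read its `sync_action` attribute from sysfs. This is just an illustrative sketch, not part of this PR; the package and helper names are made up, and `sync_action` only exists for redundant RAID levels like RAID1.)

```go
package disks

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// syncAction reads /sys/block/<md>/md/sync_action for an md device such as
// "md127". A value of "idle" means no resync/recovery is currently running;
// "resync" or "recover" means the array is still being built up.
func syncAction(mdName string) (string, error) {
	path := filepath.Join("/sys/block", mdName, "md", "sync_action")
	data, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("reading %s: %w", path, err)
	}
	return strings.TrimSpace(string(data)), nil
}
```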
Yeah, agreed. I investigated this by calling `mdadm --detail` on the constituent devices when the create fails:

```diff
diff --git a/internal/exec/stages/disks/raid.go b/internal/exec/stages/disks/raid.go
index 104f25e..1cdc0d7 100644
--- a/internal/exec/stages/disks/raid.go
+++ b/internal/exec/stages/disks/raid.go
@@ -77,6 +77,11 @@ func (s stage) createRaids(config types.Config) error {
 			exec.Command(distro.MdadmCmd(), args...),
 			"creating %q", md.Name,
 		); err != nil {
+			x := append([]string{"--detail"}, devs...)
+			s.Logger.LogCmd(
+				exec.Command(distro.MdadmCmd(), x...),
+				"examining devices for %q", md.Name,
+			)
 			results <- fmt.Errorf("mdadm failed: %v", err)
 		} else {
 			results <- nil
```

And from there I saw that the RAID was still syncing. It looked fine otherwise. So I presumed that was the source of the error.
Dismissing approval pending debugging
I think this would be a nice-to-have, but it doesn't seem like a common enough configuration at this point to be worth trying to debug the issues further.
Add support for RAID devices which use other RAID devices. To do this,
instead of trying to be clever and figure out the dependencies between
the various RAID devices, we just run all the operations in parallel.
The goroutines which have dependencies will naturally wait for the
lower-level devices to appear.
Hit this while testing FCOS rootfs-on-RAID. Wanted to make sure that my
code recursed correctly up the block device hierarchy when figuring out
the rootmap by doing a RAID10 (i.e. RAID0 on RAID1).
Closes: #581
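For illustration, here is a rough sketch of the parallel-creation idea described above: every array gets its own goroutine, and an array whose members are themselves md devices simply blocks until those device nodes appear. This is not the actual Ignition code; the `raidConfig` type, the stat-polling `waitForDevices` helper, and the mdadm flags are simplifications made up for this sketch.

```go
package disks

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// raidConfig is a stand-in for the real config type: the array to create and
// the devices it is built from (which may themselves be md devices).
type raidConfig struct {
	Name    string
	Level   string
	Devices []string
}

// waitForDevices polls until every constituent device node exists. In the
// nested-RAID case this is what lets the outer array's goroutine block until
// the inner arrays have been created by their own goroutines.
func waitForDevices(devs []string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		missing := ""
		for _, d := range devs {
			if _, err := os.Stat(d); err != nil {
				missing = d
				break
			}
		}
		if missing == "" {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for %s", missing)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func createRaids(configs []raidConfig) error {
	results := make(chan error, len(configs))
	// Start every array in parallel; dependency ordering falls out of the
	// device wait rather than an explicit dependency graph.
	for _, md := range configs {
		go func(md raidConfig) {
			if err := waitForDevices(md.Devices, 5*time.Minute); err != nil {
				results <- err
				return
			}
			args := append([]string{"--create", md.Name, "--run",
				"--level", md.Level,
				"--raid-devices", fmt.Sprint(len(md.Devices))},
				md.Devices...)
			results <- exec.Command("mdadm", args...).Run()
		}(md)
	}
	for range configs {
		if err := <-results; err != nil {
			return err
		}
	}
	return nil
}
```

The real code plugs into Ignition's existing device-wait and logging machinery, but the ordering works the way the commit message describes: no explicit dependency graph, just goroutines blocking until the lower-level devices appear.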