
disks: add support for nested RAID devices #1010

Closed

Conversation

@jlebon (Member) commented Jun 29, 2020

Add support for RAID devices which use other RAID devices. To do this,
instead of trying to be clever and figure out the dependencies between
the various RAID devices, we just run all the operations in parallel.
The goroutines which have dependencies will naturally wait for the
lower-level devices to appear.

Hit this while testing FCOS rootfs-on-RAID. Wanted to make sure that my
code recursed correctly up the block device hierarchy when figuring out
the rootmap by doing a RAID10 (i.e. RAID0 on RAID1).

Closes: #581
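
For illustration, here is a minimal sketch of the "just run everything in parallel" approach described above; it is not Ignition's actual implementation (the real code lives in internal/exec/stages/disks/raid.go, and the real device waits go through udev/systemd rather than polling). The device names and the waitForDevice/createArray helpers are hypothetical.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// waitForDevice polls until the device node exists. This stands in for the
// mechanism that lets a top-level array "naturally" block until its member
// arrays appear.
func waitForDevice(dev string) error {
	for i := 0; i < 600; i++ {
		if _, err := os.Stat(dev); err == nil {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for %s", dev)
}

// createArray waits for all member devices, then runs mdadm --create.
func createArray(name, level string, devs []string, results chan<- error) {
	for _, dev := range devs {
		if err := waitForDevice(dev); err != nil {
			results <- err
			return
		}
	}
	args := append([]string{"--create", name, "--run", "--level", level,
		"--raid-devices", fmt.Sprint(len(devs))}, devs...)
	results <- exec.Command("mdadm", args...).Run()
}

func main() {
	// RAID10 as RAID0 on RAID1: all three goroutines start at once; the
	// top-level stripe simply waits until both mirrors exist. No explicit
	// dependency graph is computed.
	arrays := []struct {
		name, level string
		devs        []string
	}{
		{"/dev/md/mirror1", "raid1", []string{"/dev/vdb", "/dev/vdc"}},
		{"/dev/md/mirror2", "raid1", []string{"/dev/vdd", "/dev/vde"}},
		{"/dev/md/stripe", "raid0", []string{"/dev/md/mirror1", "/dev/md/mirror2"}},
	}
	results := make(chan error, len(arrays))
	for _, a := range arrays {
		go createArray(a.name, a.level, a.devs, results)
	}
	for range arrays {
		if err := <-results; err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}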

@bgilbert (Contributor)
Awesome! 🎉

We have no test coverage for RAID right now. If adding RAID support to the blackbox tests turns out to be too hard, let's at least fix up the old CL RAID tests in kola and add a test for this case.

@arithx (Contributor) commented Jul 1, 2020

> If adding RAID support to the blackbox tests turns out to be too hard, let's at least fix up the old CL RAID tests in kola and add a test for this case.

👍 to adding RAID (and other device type) tests back into kola for *COS.

@jlebon (Member, Author) commented Jul 3, 2020

OK, added a test, but this now requires coreos/coreos-assembler#1573! Though I'm hitting some SELinux violations I need to investigate. Likely missing a relabel on the root of the RAID device itself.

@jlebon marked this pull request as ready for review on July 9, 2020
@jlebon (Member, Author) commented Jul 10, 2020

Restarted CI on this now that coreos/coreos-assembler#1573 is in!

@jlebon (Member, Author) commented Jul 10, 2020

--- PASS: ext.-ci_coreos_ignition_PR-1010-CPPJDI6YYCQTOSCQWYOT4ZSJMWH2CNT3H25RYEAFLGVH2JLKBTBQ.raid (56.58s)

Woohoo!

@bgilbert previously approved these changes on Jul 10, 2020
@jlebon (Member, Author) commented Jul 13, 2020

So actually, this does race once in a while: I see failures in roughly 1 of 10 runs on QEMU. What happens is that the first layer of RAID gets created successfully, but then mdadm is called to create the second layer; if that happens before the first layer has finished syncing, it gets EBUSY.

Looking at the docs, though, it seems that general I/O doesn't have to wait for syncing to complete; for example, we can create a filesystem on top of a freshly created RAID device. So I think the EBUSY issue is specific to nested RAID.

I guess what we could do is: before calling mdadm, go through each member device and, if it's itself a RAID device, run mdadm --wait on it. A sketch of that idea follows.
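
Here is a hedged sketch of that proposal, assuming a member can be recognized as an md array by the presence of an md directory under /sys/class/block; the helper names are made up for illustration and are not Ignition's API.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// isRaidDevice reports whether dev is itself an md array, by resolving
// symlinks like /dev/md/mirror1 and checking /sys/class/block/<name>/md.
func isRaidDevice(dev string) bool {
	target, err := filepath.EvalSymlinks(dev)
	if err != nil {
		return false
	}
	_, err = os.Stat(filepath.Join("/sys/class/block", filepath.Base(target), "md"))
	return err == nil
}

// waitOnRaidMembers runs `mdadm --wait` on every member that is itself a
// RAID device, so that creating the next layer doesn't race the initial
// resync. Per the mdadm man page, --wait exits nonzero when there was
// nothing to wait for, so the exit status is deliberately ignored here.
func waitOnRaidMembers(devs []string) {
	for _, dev := range devs {
		if !isRaidDevice(dev) {
			continue
		}
		_ = exec.Command("mdadm", "--wait", dev).Run()
	}
}

func main() {
	// Hypothetical member list for the top-level RAID0 from the test case.
	members := []string{"/dev/md/mirror1", "/dev/md/mirror2"}
	waitOnRaidMembers(members)
	fmt.Println("members quiesced; safe to mdadm --create the next layer")
}

As the next comment points out, the catch is that mdadm --wait blocks until any resync or recovery finishes, which can take a very long time on large arrays.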

@bgilbert (Contributor)
We should check that the sync is actually the issue, rather than some shorter race. I'm pretty sure I've created RAIDs on top of unsynced RAIDs before, and waiting for sync is infeasible (it might take hours).

@jlebon (Member, Author) commented Jul 13, 2020

> We should check that the sync is actually the issue, rather than some shorter race. I'm pretty sure I've created RAIDs on top of unsynced RAIDs before, and waiting for sync is infeasible (it might take hours).

Yeah, agreed. I investigated this by calling mdadm --detail in the error path:

diff --git a/internal/exec/stages/disks/raid.go b/internal/exec/stages/disks/raid.go
index 104f25e..1cdc0d7 100644
--- a/internal/exec/stages/disks/raid.go
+++ b/internal/exec/stages/disks/raid.go
@@ -77,6 +77,11 @@ func (s stage) createRaids(config types.Config) error {
                                        exec.Command(distro.MdadmCmd(), args...),
                                        "creating %q", md.Name,
                                ); err != nil {
+                                       x := append([]string{"--detail"}, devs...)
+                                       s.Logger.LogCmd(
+                                               exec.Command(distro.MdadmCmd(), x...),
+                                               "examining devices for %q", md.Name,
+                                       )
                                        results <- fmt.Errorf("mdadm failed: %v", err)
                                } else {
                                        results <- nil

And from there I saw that the RAID was still syncing. It looked fine otherwise. So I presumed that was the source of the error.

@bgilbert dismissed their stale review on September 10, 2020: "Dismissing approval pending debugging"

@jlebon (Member, Author) commented Sep 7, 2023

I think this would be a nice-to-have, but it doesn't seem like a common enough configuration at this point to be worth trying to debug the issues further.

@jlebon closed this on Sep 7, 2023
Linked issue: Ignition does not support nested RAID (#581)