
disks: add support for nested RAID devices #1010

Closed

Conversation

@jlebon (Member) commented Jun 29, 2020

Add support for RAID devices which use other RAID devices. To do this,
instead of trying to be clever and figure out the dependencies between
the various RAID devices, we just run all the operations in parallel.
The goroutines which have dependencies will naturally wait for the
lower-level devices to appear.

Hit this while testing FCOS rootfs-on-RAID. Wanted to make sure that my
code recursed correctly up the block device hierarchy when figuring out
the rootmap by doing a RAID10 (i.e. RAID0 on RAID1).

Closes: #581
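
For illustration, here is a minimal sketch of the "just run everything in parallel" approach described above; it is not Ignition's actual implementation (the real code lives in internal/exec/stages/disks/raid.go, and the real device waits go through udev/systemd rather than polling). The device names and the waitForDevice/createArray helpers are hypothetical.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// waitForDevice polls until the device node exists. This stands in for the
// mechanism that lets a top-level array "naturally" block until its member
// arrays appear.
func waitForDevice(dev string) error {
	for i := 0; i < 600; i++ {
		if _, err := os.Stat(dev); err == nil {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("timed out waiting for %s", dev)
}

// createArray waits for all member devices, then runs mdadm --create.
func createArray(name, level string, devs []string, results chan<- error) {
	for _, dev := range devs {
		if err := waitForDevice(dev); err != nil {
			results <- err
			return
		}
	}
	args := append([]string{"--create", name, "--run", "--level", level,
		"--raid-devices", fmt.Sprint(len(devs))}, devs...)
	results <- exec.Command("mdadm", args...).Run()
}

func main() {
	// RAID10 as RAID0 on RAID1: all three goroutines start at once; the
	// top-level stripe simply waits until both mirrors exist. No explicit
	// dependency graph is computed.
	arrays := []struct {
		name, level string
		devs        []string
	}{
		{"/dev/md/mirror1", "raid1", []string{"/dev/vdb", "/dev/vdc"}},
		{"/dev/md/mirror2", "raid1", []string{"/dev/vdd", "/dev/vde"}},
		{"/dev/md/stripe", "raid0", []string{"/dev/md/mirror1", "/dev/md/mirror2"}},
	}
	results := make(chan error, len(arrays))
	for _, a := range arrays {
		go createArray(a.name, a.level, a.devs, results)
	}
	for range arrays {
		if err := <-results; err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}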

@bgilbert (Contributor)
Awesome! 🎉

We have no test coverage for RAID right now. If adding RAID support to the blackbox tests turns out to be too hard, let's at least fix up the old CL RAID tests in kola and add a test for this case.

@arithx (Contributor) commented Jul 1, 2020

> If adding RAID support to the blackbox tests turns out to be too hard, let's at least fix up the old CL RAID tests in kola and add a test for this case.

👍 to adding RAID (and other device type) tests back into kola for *COS.

@jlebon (Member, Author) commented Jul 3, 2020

OK, added a test, but this now requires coreos/coreos-assembler#1573! Though I'm hitting some SELinux violations I need to investigate. Likely missing a relabel on the root of the RAID device itself.

@jlebon marked this pull request as ready for review on July 9, 2020
@jlebon (Member, Author) commented Jul 10, 2020

Restarted CI on this now that coreos/coreos-assembler#1573 is in!

@jlebon (Member, Author) commented Jul 10, 2020

--- PASS: ext.-ci_coreos_ignition_PR-1010-CPPJDI6YYCQTOSCQWYOT4ZSJMWH2CNT3H25RYEAFLGVH2JLKBTBQ.raid (56.58s)

Woohoo!

@bgilbert previously approved these changes on Jul 10, 2020
@jlebon (Member, Author) commented Jul 13, 2020

So actually, this does race once in a while: I see failures in roughly 1 of 10 runs on QEMU. What happens is that the first layer of RAID gets created successfully, but then mdadm is called to create the second layer; if that happens before the first layer has finished syncing, it gets EBUSY.

Looking at the docs, though, it seems that general I/O doesn't have to wait for syncing to complete; for example, we can create a filesystem on top of a freshly created RAID device. So I think the EBUSY issue is specific to nested RAID.

I guess what we could do is: before calling mdadm, go through each member device and, if it's itself a RAID device, run mdadm --wait on it. A sketch of that idea follows.
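
Here is a hedged sketch of that proposal, assuming a member can be recognized as an md array by the presence of an md directory under /sys/class/block; the helper names are made up for illustration and are not Ignition's API.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// isRaidDevice reports whether dev is itself an md array, by resolving
// symlinks like /dev/md/mirror1 and checking /sys/class/block/<name>/md.
func isRaidDevice(dev string) bool {
	target, err := filepath.EvalSymlinks(dev)
	if err != nil {
		return false
	}
	_, err = os.Stat(filepath.Join("/sys/class/block", filepath.Base(target), "md"))
	return err == nil
}

// waitOnRaidMembers runs `mdadm --wait` on every member that is itself a
// RAID device, so that creating the next layer doesn't race the initial
// resync. Per the mdadm man page, --wait exits nonzero when there was
// nothing to wait for, so the exit status is deliberately ignored here.
func waitOnRaidMembers(devs []string) {
	for _, dev := range devs {
		if !isRaidDevice(dev) {
			continue
		}
		_ = exec.Command("mdadm", "--wait", dev).Run()
	}
}

func main() {
	// Hypothetical member list for the top-level RAID0 from the test case.
	members := []string{"/dev/md/mirror1", "/dev/md/mirror2"}
	waitOnRaidMembers(members)
	fmt.Println("members quiesced; safe to mdadm --create the next layer")
}

As the next comment points out, the catch is that mdadm --wait blocks until any resync or recovery finishes, which can take a very long time on large arrays.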

@bgilbert (Contributor)
We should check that the sync is actually the issue, rather than some shorter race. I'm pretty sure I've created RAIDs on top of unsynced RAIDs before, and waiting for sync is infeasible (it might take hours).

@jlebon (Member, Author) commented Jul 13, 2020

> We should check that the sync is actually the issue, rather than some shorter race. I'm pretty sure I've created RAIDs on top of unsynced RAIDs before, and waiting for sync is infeasible (it might take hours).

Yeah, agreed. I investigated this by calling mdadm --detail in the error path:

diff --git a/internal/exec/stages/disks/raid.go b/internal/exec/stages/disks/raid.go
index 104f25e..1cdc0d7 100644
--- a/internal/exec/stages/disks/raid.go
+++ b/internal/exec/stages/disks/raid.go
@@ -77,6 +77,11 @@ func (s stage) createRaids(config types.Config) error {
                                        exec.Command(distro.MdadmCmd(), args...),
                                        "creating %q", md.Name,
                                ); err != nil {
+                                       x := append([]string{"--detail"}, devs...)
+                                       s.Logger.LogCmd(
+                                               exec.Command(distro.MdadmCmd(), x...),
+                                               "examining devices for %q", md.Name,
+                                       )
                                        results <- fmt.Errorf("mdadm failed: %v", err)
                                } else {
                                        results <- nil

And from there I saw that the RAID was still syncing. It looked fine otherwise. So I presumed that was the source of the error.

@bgilbert dismissed their stale review on September 10, 2020: "Dismissing approval pending debugging"

@jlebon (Member, Author) commented Sep 7, 2023

I think this would be a nice-to-have, but it doesn't seem like a common enough configuration at this point to be worth trying to debug the issues further.

@jlebon closed this on Sep 7, 2023
Linked issue: Ignition does not support nested RAID (#581)