Partition Layout #18

Closed
ajeddeloh opened this issue Jul 23, 2018 · 39 comments
Comments

@ajeddeloh
Contributor

ajeddeloh commented Jul 23, 2018

In conversations we had recently, we concluded that FCOS should have a default partition layout, similar to how CL has a standard fs layout, since it provides consistency across bare metal and clouds as well as making the image "dd-able" directly to a drive (which makes installation trivial). Any further disk modification should be done via Ignition.

What should that partition layout look like?

My (quick and not fully thought out) proposal:

1 - EFI-SYSTEM (needed for all EFI systems)
2 - BIOS-BOOT  (holds grub for systems that need bios booting)
3 - ROOT       (i.e. everything else)

Ideally we'd be able to move ROOT around using Ignition, re-deploying the OSTree to wherever we moved it between the disks and files stages. If you're on an EFI system you could even wipe away BIOS-BOOT to make more room (not that it's terribly large). There are some tricky cases with that which we're still exploring, but it should be possible at the very least in simple cases.

@dustymabe
Member

dustymabe commented Jul 23, 2018

1 - EFI-SYSTEM (needed for all EFI systems)
2 - BIOS-BOOT  (holds grub for systems that need bios booting)
3 - ROOT       (i.e. everything else)

sounds like a reasonable default.. just to be clear people would be able to add a separate mounted /var/ or /home/, etc, via ignition on boot but everyone would start from the same place ?

@ajeddeloh
Contributor Author

Yeah.
Isn't /home actually just part of /var on ostree systems?

@dustymabe
Member

Isn't /home actually just part of /var on ostree systems?

it is but I think we've enabled (because people asked for it) to have /home/ (or rather /var/home/?) be a separate block device mount.

@cgwalters
Member

Do you see us adopting the GPT generator and avoiding the need to have the OS mounts in /etc/fstab too?

@cgwalters
Member

cgwalters commented Jul 23, 2018

it is but I think we've enabled (because people asked for it) to have /home/ (or rather /var/home/?) be a separate block device mount.

Yeah, we support that; it was the primary rationale for the systemd fstab symlink patch.

@ajeddeloh
Contributor Author

We could say that all the standard top-level directories on / except /var and /boot need to be on the same partition (kinda like how we say you can't move USR-{A,B} on Container Linux). Since /var is empty by default and can be populated by systemd-tmpfiles, users can add a mount unit for it and partition it however they want. You want a separate partition for /var, /var/home and /var/srv? Go for it.

I'm not a huge fan of the GPT generator. It's pretty limited since it only supports a few partitions and reeks of magic that users may not be aware of. Much better to have explicit mount units imho.

@cgwalters
Member

cgwalters commented Jul 23, 2018

I'm not a huge fan of the GPT generator. It's pretty limited since it only supports a few partitions and reeks of magic that users may not be aware of. Much better to have explicit mount units imho.

Yeah; agree with the magic aspect. Though the specific thing I don't like about it is that if one plugs in a backup drive into a system, you might end up having the old /home partition mounted into your current OS. The scheme actually says explicitly:

Since the GPT partition table scheme cannot express which sets of partitions belong to a single OS installation, and to avoid the risk of accidentally mixing and matching incorrect combinations of these partitions, we decided to not define auto-discovery of these partition types within this specification.

Having a "full installer" that auto-generates UUIDs for each partition and binds those in /etc/fstab avoids all issues like this (except when LVM is in use).

The only way I can think of to address this in a "dd install" is to generate the machine id at install time, walk the target disk post-install and change the GUID/mount units. Ug. I guess CL has gone a long time with the current approach, and while the "foreign partition problem" does occur in the real world, it's definitely possible to work around it.

@ajeddeloh
Contributor Author

ajeddeloh commented Jul 23, 2018

CL makes use of labels instead of UUIDs for partitions*. You can bake a specific UUID into your Ignition config, but that means your machines will all have the same one. Since everyone is running off the same dd'd image anyway, there are already duplicate UUIDs across machines (this hasn't been a problem as far as I've seen). Labels actually work pretty well for this case. It means that while different machines will have partitions with different UUIDs, all of your config files and such (e.g. /etc/fstab, mount units, etc.) are the same across machines and contain no machine-specific bits.

We could use UUIDs if Ignition supported templating, but that's a rabbit hole I really don't want to go down, especially since I see the "no machine specific bits" to be a feature.

*For the root-on-raid case we actually have special type GUIDs for raid devices containing the root. This allows us to know which devices to start in the initramfs. FWIW I'd like to avoid this for FCOS. Hell, stick a config file in /boot that some tool in the initramfs knows how to read, for all I care.
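To illustrate the "no machine-specific bits" point, here is a minimal sketch that renders a mount unit which locates its device by filesystem label. The label `VAR`, the filesystem type, and the helper names are assumptions for illustration, not anything CL or FCOS ships; a partition label would use `/dev/disk/by-partlabel/` instead. The unit-name helper only covers simple paths, where `systemd-escape --path` would be the real tool.

```python
def mount_unit_name(mount_point: str) -> str:
    """Derive a mount unit file name from a simple absolute path.

    Only handles paths made of plain letters, digits and '/'
    (e.g. /var/home -> var-home.mount). Anything else needs
    proper systemd escaping.
    """
    trimmed = mount_point.strip("/")
    if not trimmed:
        return "-.mount"  # the root mount
    return trimmed.replace("/", "-") + ".mount"


def render_mount_unit(label: str, mount_point: str, fs_type: str) -> str:
    """Render a mount unit that finds its device by filesystem label."""
    return "\n".join([
        "[Unit]",
        f"Description=Mount {label} at {mount_point}",
        "",
        "[Mount]",
        # Identical on every machine: no UUID appears anywhere.
        f"What=/dev/disk/by-label/{label}",
        f"Where={mount_point}",
        f"Type={fs_type}",
        "",
        "[Install]",
        "WantedBy=local-fs.target",
        "",
    ])


if __name__ == "__main__":
    print(mount_unit_name("/var"))
    print(render_mount_unit("VAR", "/var", "xfs"))
```

The same rendered text can be dropped onto every machine, which is the property being argued for here.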

@JasonGiedymin

I have to chime in on this portion. Soon I will be open sourcing some LVM encryption work done on Atomic LVM-managed volumes. We basically store keys remotely. During that exercise, the part that stopped the system startup and decryption from being more elegant and robust was UUIDs (we're on bare metal with a diverse set of disks). If I had had a more generic interface like labels to work with, life would have been easier.

@ajeddeloh ajeddeloh added the meeting topics for meetings label Jul 25, 2018
@cgwalters
Member

One thing that's strongly implied by this is that Anaconda is probably not the primary path for bare metal installs; a whole goal of this is that "install" is just dd. By default we boot to a live image just like CL?

Side note: It'd be a nice twist to have the installer come as a privileged container.

However, a whole interesting question is how we generate that "base disk image" - I'd probably at least initially start with using Anaconda for it personally, but there's also the libguestfs-style disk image creation.

One question though; do we still support people using Anaconda (say that I want dm-crypt for everything, or XFS reflink=1, or...)? And there's a tension between Ignition for disk provisioning vs kickstart here.

@bgilbert
Contributor

bgilbert commented Jul 25, 2018

@cgwalters Off-topic for this issue, but I don't see any reason to support installing via Anaconda, even optionally. If you want dm-crypt or XFS or whatever, we should be supporting that through Ignition.

@ajeddeloh Why don't you like the type GUID approach? It seems to work pretty well for CL.

@ajeddeloh
Contributor Author

Re installer in a container: we could package it as such, but it should also be so simple you don't need to. The current one is just a few hundred lines of bash (most of which is an ASCII-armored pubkey).

Re using anaconda bare metal installation: I really really really don't like this. Your Ignition config is the declaration of what your machine looks like; there shouldn't be anything else controlling that. It's also yet another thing to support. If we want to support dmcrypt or other filesystem/partition weirdness we should implement that in Ignition.

@bgilbert It's a hack from when we were thinking we shouldn't use the boot partition. We're going to need to store some config on the boot partition anyway for supporting encryption. It'd be much cleaner to just store a "map" for mounting the root partition in a config file. That would also eliminate the need to do GPT on RAID on GPT on disks (and instead do GPT on RAID on disks).

@cgwalters
Member

@cgwalters Off-topic for this issue, but I don't see any reason to support Anaconda, even optionally. If you want dm-crypt or XFS or whatever, we should be supporting that through Ignition.

I'd say it's a distinct thread but the installer path is pretty intertwined with our default partitioning.
OK, with "full" Ignition-disks where we support re-locating the OS to the new root - yes. I know I keep getting hung up on this; the concept is foreign to me. I have concerns about the amount of overlap with Anaconda this is going to imply going forward, but...I also definitely see the value in the simplification here.

@cgwalters
Member

Soon I will be open sourcing some lvm encryption work done on Atomic LVM managed volumes. We store keys remotely basically.

Sounds like https://github.com/latchset/clevis ?

@JasonGiedymin

I’ve modified https://github.com/HouzuoGuo/cryptctl to support LVM and stronger auditing with additional logging and events actions. It is deployed on all of our bare metal atomic nodes.

@dustymabe
Member

Discussed in the meeting today. This is what we came up with:

  • we'd like to strive for a fixed partition layout for our shipped image artifacts
  • the goal here is to have an image that is dd-able to any hard drive and can be booted
  • other mounts (like /var/) can be added at runtime on first boot

Potential issues:

  • 4k sectors

With 4k sectors we need to consider the GPT partition layout as well as the filesystem on top. related CL issue. We will address this issue when we hit it and call it out as a risk for now.

I believe @bgilbert also had another item in open floor that was relevant to this ticket that was regarding a user filling up the root partition and not being able to receive updates any longer.

@cgwalters
Member

cgwalters commented Aug 1, 2018

How about

1 - EFI-SYSTEM (needed for all EFI systems)
2 - BIOS-BOOT  (holds grub for systems that need bios booting)
3 - ROOT        3GB
4 - VAR  (rest of disk)

And we also make /etc a read-only bind mount by default (after Ignition has run). This would be a pretty aggressive stance; having a separate /var avoids downloading too many container images blocking OS updates. And IMO a read-only-by-default-/etc matches nicely with Ignition's configure-once model.

A big topic that crosses this though is whether or not we use LVM by default (and if we don't whether it can be configured via ignition). If neither, then that 3GB (or whatever) rootfs is going to feel less flexible.
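As a rough sketch of what the four-partition proposal above means on a concrete disk, the following computes 1 MiB-aligned start offsets and sizes. Only the 3 GB ROOT and the "rest of disk" VAR come from the proposal; the EFI-SYSTEM and BIOS-BOOT sizes, the 1 MiB reserve at each end, and the function name are assumptions made purely for illustration.

```python
# (label, size in MiB); None means "rest of disk".
LAYOUT = [
    ("EFI-SYSTEM", 128),  # assumed size, not from the thread
    ("BIOS-BOOT", 1),     # assumed size, not from the thread
    ("ROOT", 3 * 1024),   # "3GB" from the proposal, treated as 3 GiB
    ("VAR", None),        # rest of disk
]


def plan_partitions(disk_mib, layout=LAYOUT, first_mib=1, reserve_end_mib=1):
    """Return (number, label, start_mib, size_mib) tuples.

    first_mib and reserve_end_mib leave room for the primary and
    backup GPT structures; both values are illustrative.
    """
    plan = []
    cursor = first_mib
    usable_end = disk_mib - reserve_end_mib
    for number, (label, size) in enumerate(layout, start=1):
        if size is None:
            if number != len(layout):
                raise ValueError("only the last partition may take the rest")
            size = usable_end - cursor
        if size <= 0 or cursor + size > usable_end:
            raise ValueError(f"disk too small for {label}")
        plan.append((number, label, cursor, size))
        cursor += size
    return plan


if __name__ == "__main__":
    for row in plan_partitions(8192):  # an 8 GiB disk
        print(row)
```

With these assumed sizes, anything much smaller than about 3.2 GiB cannot hold the fixed partitions at all, which is one way to see why the ROOT size is the sensitive choice.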

@dustymabe dustymabe removed the meeting topics for meetings label Aug 1, 2018
@bgilbert
Contributor

bgilbert commented Aug 1, 2018

My question in the meeting was: in CL, we can in principle continue to update a machine whose root filesystem is full, because we're only touching /boot and /usr and those are on separate partitions. Do we want to take steps to ensure the same for FCOS?

@cgwalters I like that model. In that case, the flexibility of the small root is only an issue for distro maintainers, since the user shouldn't be putting any data there. It's a potential concern (c.f. Fedora increasing the size of /boot some years ago) but seems like it should be manageable. I don't think LVM would help us anyway, since there's no guarantee that we can resize /var downward on existing systems if we later need a larger /.

@JasonGiedymin

As long as I can dmcrypt /var later, and add lvm to new disks which later I can manage to dmcrypt.

I believe that for Atomic all users are now mapped into that writable space in /var. I think /var and sysroot are also currently on a shared partition with Atomic.

@vtolstov

vtolstov commented Aug 1, 2018

I use ostree on a netbook with eMMC; a 3 GB root is too small, and sometimes I am not able to upgrade.
I have tested on servers and a netbook: the minimal rootfs size is 5-6 GB, and I vote for LVM.

@dustymabe
Member

I use ostree on a netbook with eMMC; a 3 GB root is too small

is /var/ on separate filesystem for you?

@vtolstov

vtolstov commented Aug 2, 2018

/var on rootfs
/var/home on separate lv (10Gb)

@dustymabe
Member

/var on rootfs

yeah. this is something colin is proposing we change to not be on rootfs by default, which would mean we would require less space for root.

@vtolstov

vtolstov commented Aug 2, 2018

@dustymabe No, I don't think this layout changes things.
In my case: sudo du -hsx /var/: 468M /var/
All of the space is used by ostree files: sudo du -hsx /ostree/: 2.8G /ostree/

@ajeddeloh
Contributor Author

Regarding LVM and Ignition. I want that to happen. Much like the partitioning work, it's going to be tricky to implement, but imo it's 100% worth doing. That being said I don't think we should have our standard partition layout be LVM based. Keep it simple.

Regarding moving /var to a separate partition: I'm in favor of this. Not only does it help with (although not completely avoid) the issue of / filling up, it's similar to how you can blow away the root fs on CL, which would be nice for existing CL users (as well as being nice in general).

Re: 4k sectors and GPT. I think we can ship a disk that supports both. The gpt spec lets you move the partition table around and only the actual contents of the GPT header are included in the CRCs (not the entirety of the sector). The only fixed things are that the GPT header must be at LBA1 and the backup must be at the last LBA. Since the header is <= 512 bytes, we can have both where they want to be for the primary and backup.

Here's what it would look like:

    Sector size     |
  512     |  4096   | Contents
-----------------------------------------------------------
  0       |   0     | MBR (protective GPT)
  1       |   0     | GPT header for 512-byte sectors
  2-7     |   0     | Unused
  8       |   1     | GPT header for 4k-byte sectors
  9-15    |   1     | Unused
  16-n    | 2-n/8   | Partition table
          |         |
  ...     | ...     | Padding, partitions, etc.
          |         |
  N-(n+8) |N-(1+n/8)| Backup partition table
  N-7     |   N     | Backup GPT header for 4k sectors
  N       |   N     | Backup GPT header for 512-byte sectors
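The arithmetic behind the table can be checked with a small sketch. It assumes the common GPT defaults of 128 entries of 128 bytes each and a disk size that is a multiple of 4096 bytes; the function names are made up for illustration.

```python
ENTRY_BYTES = 128   # size of one GPT partition entry (common default)
ENTRY_COUNT = 128   # number of entries (common default)


def lba_to_offset(lba, sector_size):
    """Byte offset of a logical block for a given sector size."""
    return lba * sector_size


def hybrid_layout(disk_bytes):
    """Byte offsets of the GPT structures as seen with each sector size."""
    if disk_bytes % 4096:
        raise ValueError("disk size must be a multiple of 4096 bytes")
    table_bytes = ENTRY_BYTES * ENTRY_COUNT
    layout = {}
    for sector in (512, 4096):
        last_lba = disk_bytes // sector - 1
        layout[sector] = {
            "primary_header": lba_to_offset(1, sector),        # must be LBA 1
            "backup_header": lba_to_offset(last_lba, sector),  # must be last LBA
        }
    # Shared table placement from the table: 512-byte LBA 16 is 4k LBA 2.
    layout["table_offset"] = lba_to_offset(16, 512)
    layout["table_sectors"] = {
        512: table_bytes // 512,
        4096: table_bytes // 4096,
    }
    return layout


if __name__ == "__main__":
    print(hybrid_layout(1 << 30))  # a 1 GiB image
```

The 4k primary header lands at byte 4096, which is 512-byte LBA 8, and the 4k backup header lands seven 512-byte sectors before the 512-byte backup header, matching the N-7 and N rows.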

There's a few problems/risks:

  1. I don't know of any partitioning tool that supports creating this
  2. On a 4k disk only the 4k header will be updated. Likewise, on a 512-byte disk only the 512-byte header will be updated. This means as soon as you change a partition or even the disk GUID the other one becomes invalid. This is fine imo.
  3. Some partitioning tools might not implement the spec correctly and might expect the partition table / backup table to be at LBA2 / LBAn-1. We should audit the common ones.

cc @lucab for the GPT stuff.

@dustymabe
Member

closing this ticket as we've decided that a static partition layout is suitable for FCOS. Implementation details can be worked out later I believe.

@ajeddeloh
Contributor Author

Writing down a couple things mentioned elsewhere about the 4k/512 hybrid plan for the record.

My proposal technically breaks the GPT spec since the space in a sector after the GPT bits is defined as being all zeros. Whether anything cares is another question. It would also use just about every "feature" GPT has, and thus be at risk of not working on machines with poorly implemented EFIs (or very well implemented EFIs that check things are zeroed accordingly).
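A small sketch of the two checks in tension here: the header CRC32 only covers the header's own bytes (92 in the standard layout), so extra data later in the same sector does not invalidate it, while a strict implementation could additionally insist that the remainder of the block is zero. The helper names and the placeholder field values (usable LBAs, all-zero disk GUID, uncomputed entry-array CRC) are assumptions for illustration only.

```python
import struct
import zlib

HEADER_FORMAT = "<8sIIIIQQQQ16sQIII"  # the 92 defined bytes of a GPT header
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
CRC_OFFSET = 16                        # byte offset of the header CRC32 field


def build_header(my_lba, alternate_lba, entries_lba):
    """Pack a header with placeholder values, then fill in its CRC32."""
    raw = struct.pack(
        HEADER_FORMAT,
        b"EFI PART",   # signature
        0x00010000,    # revision 1.0
        HEADER_SIZE,   # header size in bytes
        0,             # header CRC32, zero while computing
        0,             # reserved
        my_lba,
        alternate_lba,
        48,            # first usable LBA (placeholder)
        1000,          # last usable LBA (placeholder)
        bytes(16),     # disk GUID (placeholder)
        entries_lba,
        128,           # number of partition entries
        128,           # bytes per entry
        0,             # CRC32 of the entry array (not computed here)
    )
    crc = zlib.crc32(raw)
    return raw[:CRC_OFFSET] + struct.pack("<I", crc) + raw[CRC_OFFSET + 4:]


def header_crc_ok(sector):
    """Check the CRC over header_size bytes with the CRC field zeroed."""
    header_size = struct.unpack_from("<I", sector, 12)[0]
    stored = struct.unpack_from("<I", sector, CRC_OFFSET)[0]
    zeroed = sector[:CRC_OFFSET] + bytes(4) + sector[CRC_OFFSET + 4:header_size]
    return zlib.crc32(zeroed) == stored


def reserved_is_zero(sector):
    """The stricter check: everything after the header must be zero."""
    header_size = struct.unpack_from("<I", sector, 12)[0]
    return not any(sector[header_size:])
```

For example, the last 4k block of a hybrid disk would hold the 4k backup header at offset 0 and the 512-byte backup header at offset 3584: `header_crc_ok` passes on that block, but `reserved_is_zero` does not.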

@arithx arithx mentioned this issue Aug 17, 2018
cgwalters added a commit to coreos/fedora-coreos-config that referenced this issue Sep 13, 2018
Try to match the design in coreos/fedora-coreos-tracker#18

 - no lvm
 - separate /var
cgwalters added a commit to coreos/fedora-coreos-config that referenced this issue Sep 14, 2018
This is part of coreos/fedora-coreos-tracker#18

For now, this just drops LVM to make it easier to use Ignition
to both build images, and help enable ignition-disks.

Note that I tried to use a separate `/var` but this currently
does not work with our Ignition, which would need to learn
how to mount `/var` in the initramfs.

We add growpart logic adapted from
projectatomic/container-storage-setup@d4994e6

(Probably at some point should teach growpart how to grow based
 on mount point paths...)
@cgwalters
Member

Having a split /var is blocked on https://github.com/dustymabe/ignition-dracut/issues/18

@cgwalters
Member

The fire alarm went off at the Westford office today and I happened to be standing near Vivek Goyal and Mike Snitzer (kernel filesystem/block people). Mike in particular said that supporting both 4k and 512b in one disk image couldn't be done because the filesystems rely on sector writes being atomic.

It seems like the simplest plan is to just make two disk images?

@ajeddeloh
Contributor Author

Yeah probably. SGTM.

@snitm

snitm commented Oct 9, 2018

The fire alarm went off at the Westford office today and I happened to be standing near Vivek Goyal and Mike Snitzer (kernel filesystem/block people). Mike in particular said that supporting both 4k and 512b in one disk image couldn't be done because the filesystems rely on sector writes being atomic.

It seems like the simplest plan is to just make two disk images?

Right, sadly you cannot issue 512b IO to a native 4K device. A filesystem (e.g. XFS) that is formatted to use 512b assumes 512b is the atomic unit of IO. It'll fail to mount if the underlying device is actually a native 4K block device.

You might think to go the other way and try to format the filesystem with a 4K blocksize and use that single FS image for both 512b and 4K devices. BUT, there is increased potential for a partial 4K write to a 512b device to leave the device with 512b IOs having been written (yet the larger 4K being incomplete) -- this is also known as "torn writes".
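If two images are published, an installer has to choose the one matching the target device. A minimal sketch, assuming Linux sysfs is available and using made-up image file names (the thread does not name the artifacts):

```python
from pathlib import Path

# Hypothetical artifact names, for illustration only.
IMAGES = {512: "fcos-512.raw", 4096: "fcos-4k.raw"}


def logical_block_size(device, sysfs_root="/sys/block"):
    """Read the logical sector size the kernel reports for e.g. 'sda'."""
    path = Path(sysfs_root) / device / "queue" / "logical_block_size"
    return int(path.read_text().strip())


def pick_image(sector_size, images=IMAGES):
    """Return the image built for this logical sector size."""
    try:
        return images[sector_size]
    except KeyError:
        raise ValueError(
            f"no image built for {sector_size}-byte sectors"
        ) from None
```

Calling `pick_image(logical_block_size("sda"))` before the dd step would refuse to write an image whose filesystem was formatted for the wrong sector size.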

cgwalters added a commit to cgwalters/coreos-assembler that referenced this issue Nov 10, 2018
We discussed possibly using a `var` partition for FCOS in
coreos/fedora-coreos-tracker#18

I would like to do so for my own Silverblue install, and possibly
for Silverblue by default.

So let's mount that partition if it exists, which means the other
code that cleans out what Anaconda did in `/var` will work.
@cgwalters
Member

I've been thinking about the dm-crypt aspect again. Some prior discussion is in coreos/ignition#577

One thing I'm wavering on a bit is how clunky it feels to rewrite all of the operating system files on boot. If we're in a cloud scenario, we don't have a lot of choice unless we provide people a tool for creating new snapshots (big implications there).

On bare metal though, I think we could instead do a "minimal re-partitioner" (not quite an installer) that created dm-crypt on the target system, then took the raw disk image and mounted it, and did a filesystem-level copy.

(Aside: I believe Android images encrypt on boot when you initialize them the first time, and this is probably a lot nicer since they switched to using ext4 encryption. Although I'm not sure the OS is ever encrypted, it's dm-verity.)

@cgwalters
Member

(The reason I'm thinking about dm-verity is that there are definitely server-side uses for it, but I'd like Silverblue to inherit as much technology as possible from CoreOS, and dm-crypt is really quite important on client-side devices)

@ajeddeloh
Contributor Author

I hope we're using LUKS not just dm-crypt unless there's a good reason not to.

We want to ensure that Ignition remains the only "source of truth" for configuration. The Ignition config may not be known at install time, so we don't want to do anything special at install time.

I instead wonder if we could add an "optimization" to Ignition/the initramfs to detect if we're recreating the root and save the repo to a tmpfs or something similar, so if we blow away the root we can repopulate it from a local source instead.

Finally, what are the use cases for encrypting more than just /var?

@cgwalters
Member

I hope we're using LUKS not just dm-crypt unless there's a good reason not to.

Hmm; I had to look up the distinction between the layers here; I had always been using them interchangeably.

Finally, what are the use cases for encrypting more than just /var?

One tricky thing is a lot of use cases want /etc encrypted too. Having /usr encrypted provides integrity at least, and some backstop confidentiality in case there are any secrets.

@jlebon
Member

jlebon commented Jan 21, 2019

How about

1 - EFI-SYSTEM (needed for all EFI systems)
2 - BIOS-BOOT  (holds grub for systems that need bios booting)
3 - ROOT        3GB
4 - VAR  (rest of disk)

Hmm, we'll have to think long and hard before choosing a size for ROOT. We don't want to realize down the line that e.g. the f31 -> f32 update won't fit. One issue too is that there might be more than just two commits in there. E.g. layered pkgs & pinned deployments (and I think there were discussions making the number of rollback deployments configurable?). So settling on the right size will be tricky no matter what.

@ajeddeloh
Contributor Author

Agreed. We could publish guidelines saying "resize to X if you plan on doing a bunch of pinning or other things that take up a lot of space", but that's not ideal. It shouldn't be nearly as bad as it was with CL since ostree dedups across deployments. My guess is ~2x the size of a single deployment should be ok.

@cgwalters
Member

(Happened to stumble across https://bugzilla.redhat.com/show_bug.cgi?id=1061478 )

@cgwalters
Member

Let's call what we have now with the FCOS preview release "phase 1". Phase 2 work: #94
