Automated Policy-based Disk Encryption on Boot Images #15

darkmuggle · 2019-09-17T16:02:18Z

Starting with OpenShift 4.3, CoreOS boot images will be encryption-ready. On first boot either administator provided or automated policy will be used for the root disk encryption. Bootstrapping and node provisioning MUST fail when the policy cannot be applied unless the administrator has opted out.

darkmuggle · 2019-09-17T16:53:49Z

/assign @childsb

enhancements/coreos-encrypted-by-default.md

darkmuggle · 2019-09-17T19:28:06Z

/cc @miabbott @ashcrow @imcleod

cgwalters · 2019-09-18T12:51:42Z

I started writing some code around the ignition+ostree redeployment bit: coreos/ignition-dracut#107

enhancements/coreos-encrypted-by-default.md

chancez · 2019-09-18T15:20:36Z

/unassign @chancez

darkmuggle · 2019-09-18T19:44:47Z

I started writing some code around the ignition+ostree redeployment bit: coreos/ignition-dracut#107

As noted in the Ignition Dracut PR, Ignition does not support LUKS creation.

enhancements/coreos-encrypted-by-default.md

cgwalters · 2019-09-18T22:37:04Z

Let me try to rephrase/capture my concerns.

I'd like to see this proposal define a very high level way (e.g. via install-config.yaml or via MachineConfig) that admins can configure this. Let's avoid other implementation details that expose to administrators how it's implemented - e.g. rather than documenting a coreos.no_encryption=1 kernel commandline option - we only document an install-config.yaml stanza.

The reason is: this proposal creates yet another substantial difference between FCOS and RHCOS. Further, it creates that divergence as a short-term fix for one point in larger problem space around root filesystem configuration. This proposal is also implicitly saying that root filesystem encryption is e.g. more important than switching RHCOS to Ignition spec 3 and adding LUKS support to Ignition - that may be the case, but we should understand that. (For example, we want to support LUKS on RAID, LVM etc.)

I'd like to ensure that whatever we do here can be transparently upgraded to full Ignition 3 support later.

Now, this gets into my concern around exactly what admins can configure. My 2c on this is that we start from the very simplest: by default, the root filesystem is encrypted with a key that is not accessible after boot. If a TPM is available, we use that. If a TPM is not available...well, we can only easily do hacks - or we do something like have an in-cluster Tang server, i.e. we do NBDE by default. We need to clearly call out the disaster recovery implications of this - we need to make clear, there's no "recovery key" or anything like that. (Or, alternatively we require the admin specify a recovery key?)

We know we want to support external-to-the cluster Tang servers - that's an easy case. But again, let's define what admins write in an install-config.yaml (Or a MachineConfig extension).

darkmuggle · 2019-09-18T23:44:08Z

Unpacking @cgwalters concerns here (which I greatly appreciate!)

I'd like to see this proposal define a very high level way (e.g. via install-config.yaml or via MachineConfig) that admins can configure this. Let's avoid other implementation details that expose to administrators how it's implemented - e.g. rather than documenting a coreos.no_encryption=1 kernel commandline option - we only document an install-config.yaml stanza.

That's fair. I spend the time defining the OS level perspective and I wanted to get the base idea out for review first.

The reason is: this proposal creates a substantial difference between FCOS and RHCOS.

FCOS is for general clusters and container workloads, while RHCOS is single-purpose in running Openshift. Alignment is highly desirable; divergence should be tolerable when it enables key features for that single-use.

Further, it creates that divergence as a short-term fix for one point in larger problem space around root filesystem configuration. This proposal is also implicitly saying that root filesystem encryption is e.g. more important than switching RHCOS to Ignition spec 3 and adding LUKS support to Ignition - that may be the case, but we should understand that. (For example, we want to support LUKS on RAID, LVM etc.)

Calling this a "short-term" does not do the proposal justice. To our users, encrypted data-at-risk is very germane and it can make the difference between adoption. Users are asking for this now and its an adoption blocker. For example, without encryption, FIPS 140-2 compliance cannot happen.

Simply put, our users do not care about Ignition or Ignition specs unless they have to; that's a developer concern. The users care about encryption.

What this proposal does is to us developers give a PATH and TIME at the cost of short-term divergence in on-disk format. This proposal solves the problem in a universal way, and gives developers time to ensure that we do the Ignition work well.

I'd like to ensure that whatever we do here can be transparently upgraded to full Ignition 3 support later.

One of the merits of this idea is that it does not require Ignition or MCO support today; both would be channels for delivering the Cleivs config.

I have spoken to the Ignition maintainers about the migration path and did so specifically before submitting this enhancement. Once we have LUKS support in Ignition, we can use a default Ignition configuration that replicates the behavior. I would argue that since the change to a LUKS container is for the boot image, this gives a path forward since it does not have a surface in Ignition.

Now, this gets into my concern around exactly what admins can configure. My 2c on this is that we start from the very simplest: by default, the root filesystem is encrypted with a key that is not accessible after boot.

By key, I would assume you mean the LUKS passphrase?

The proposal state the key is not accessible after boot without doing per-node spelunking. The random passphrase is provisioned only for the purpose of re-encrypting the LUKS container. After the Clevis binding happens, the key-slot with the random passphrase is destroyed, and the only way to access it is through the token written by Clevis.

If a TPM is available, we use that. If a TPM is not available...well, we can only easily do hacks - or we do something like have an in-cluster Tang server, i.e. we do NBDE by default.

In-cluster Tang was considered but rejected due to the complexity of the idea. How do you deal with a rolling reboot or a power-outage?

By "hack" are you referencing the proposed plain-text pin for Clevis? I would agree its less than perfect -- and the proposal calls out it specifically -- but it solves the situation where a cluster is bootstrapped and then an admin on day N wishes to move to a NBDE model.

We need to clearly call out the disaster recovery implications of this - we need to make clear, there's no "recovery key" or anything like that. (Or, alternatively we require the admin specify a recovery key?)

Under the "Risk and Mitigations," the proposal illustrates how recovery would look if it was booted into the initramfs: clevis-luks-open -d ... In terms of Openshift, though, unless you're in a cluster down the situation, restoring access to the network-bound service (for Tang or KMS) and rebooting should be sufficient in 99% of the cases. Otherwise, doing the Openshift method of reprovisioning the node would be the preferable case.

We know we want to support external-to-the cluster Tang servers - that's an easy case. But again, let's define what admins write in an install-config.yaml (Or a MachineConfig extension).

I work on that tomorrow

cgwalters · 2019-09-19T12:35:35Z

Simply put, our users do not care about Ignition or Ignition specs unless they have to; that's a developer concern. The users care about encryption.

MachineConfigs embed Ignition - so Ignition is very much user facing. We want to dig ourselves out of the pit we're in now where searching for Ignition stuff lands on coreos.com and one gets confused about spec 2 vs 3 and whether it's talking about Container Linux or... etc.

I think we definitely want administrators to write Ignition (or MachineConfigs) and finish the vision of supporting re-laying out the rootfs. Let's be sure we're on the same page on that! This document seems to more describe changing Ignition as "out of scope". Agreed to change the wording there to make it clear that e.g. in the future, RHCOS will switch back to unifying with FCOS and not come as a null cipher LUKS by default?

cgwalters · 2019-09-19T13:15:45Z

BTW, another reason we want the full Ignition re-partitioning support is that today one can provide an Ignition config which creates /var/lib/etcd as a separate partition for the masters which is a good practice I think (maybe it's on a hardware raid device, separate higher-IOPS EBS volume or whatever). Today one can't easily LUKS encrypt that, and this enhancement in general doesn't have any user story for how one does LUKS for anything except the default /.

ashcrow · 2019-09-19T15:27:40Z

FCOS being the upstream of RHCOS is super important, but it can't be the reason we don't move forward adding encryption at rest support for OCP.

We want to dig ourselves out of the pit we're in now where searching for Ignition stuff lands on coreos.com and one gets confused about spec 2 vs 3 and whether it's talking about Container Linux or... etc.

100%

I think we definitely want administrators to write Ignition (or MachineConfigs) and finish the vision of supporting re-laying out the rootfs. Let's be sure we're on the same page on that!

I'm not sure we all are on that same page. Allowing (and in some cases encouraging) MC editing was not originally what we wanted admins to do. However, it became a necessity. We also now have fcct as a way to hide some of the complexity.

I'm a bit confused on how writing MCs are strongly linked to re-laying out rootfs. Is this meaning re-formatting? Encryption/re-encrption? etc..

BTW, another reason we want the full Ignition re-partitioning support is that today one can provide an Ignition config which creates /var/lib/etcd as a separate partition for the masters [..]

True. People can do that, but they can do a lot of things with the awesome tool set Linux and OCP provide. Do you see /var/lib/etcd re-partitioning and encryption a requirement for the initial encryption story?

enhancements/coreos-encrypted-by-default.md

cgwalters · 2019-09-19T16:25:21Z

I'm not sure we all are on that same page. Allowing (and in some cases encouraging) MC editing was not originally what we wanted admins to do. However, it became a necessity. We also now have fcct as a way to hide some of the complexity.

For sure there's been debate on this. But ultimately we shipped OpenShift 4.1, and the way to configure the host is MachineConfig and it's documented.
(Although now that I look, not very well...the upstream docs are better argh)

fcct as you know only applies to FCOS. However I see MachineConfigs as also increasingly playing the role of FCCT in offering "high level sugar" around Ignition for RHCOS/OpenShift.

I'm a bit confused on how writing MCs are strongly linked to re-laying out rootfs. Is this meaning re-formatting? Encryption/re-encrption? etc..

Well, this hasn't been implemented or designed fully, but the way I was thinking of this is that an administrator should be able to create a MachineConfig object that includes the storage section (disallowed today). If implemented in a basic way, this would cause new workers to spawn using that storage configuration. So yes, if you wanted to e.g. turn on or off LUKS that could be a day-2 operation at least for workers. Implementing it in a more sophisticated way would require machineAPI integration - the MCO would have to say "OK I can't live-update storage, let me ask the machineAPI to reprovision this machine".

ashcrow · 2019-09-20T16:29:14Z

Approved from my POV 👍

cgwalters · 2019-09-20T16:51:26Z

In prow terms,
/approve

There's other edits/improvements to make here, but the broad outline LGTM. We probably need to be sure architects are good with this as well before trying to revise a lot of the details.

darkmuggle · 2019-09-20T20:58:32Z

/assign @smarterclayton @imcleod

miabbott

I think this is an excellent starting point for the data encryption story. 👍

miabbott · 2019-09-23T13:55:12Z

Since there are explicit requirements on the Installer for supporting this, let's get @abhinavdahiya and @wking aware of this, too.

ashcrow · 2019-09-24T17:22:47Z

/lgtm

openshift-ci-robot · 2019-09-24T17:23:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, cgwalters, darkmuggle, miabbott

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ashcrow]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Part of: openshift/enhancements#15 We added FIPS to the MCO a while ago: openshift/machine-config-operator#889 However, during some discussion it became clear that the main use case for FIPS is "day 1" - it doesn't make sense to turn it on "day 2" because the standard requires that e.g. long-term key material was created with FIPS enabled. Further, it's unlikely that admins will want to turn it *off* if they ever had it on. This is a good candidate for an install config.

Today for IaaS clouds, we default to an instance type, which then in turn usually provides a default size. RHCOS resizes on boot to that size, distinct from its "default" 16G size. However, libvirt installs were inheriting our default size. We'd like to shrink it because we plan to land encryption: openshift/enhancements#15 And the less data we need to encrypt, the better. (In the future I'd like to make this configurable with a variable, but let's just prepare for the encryption work now)

Part of: openshift/enhancements#15 We added FIPS to the MCO a while ago: openshift/machine-config-operator#889 However, during some discussion it became clear that the main use case for FIPS is "day 1" - it doesn't make sense to turn it on "day 2" because the standard requires that e.g. long-term key material was created with FIPS enabled. Further, it's unlikely that admins will want to turn it *off* if they ever had it on. This is a good candidate for an install config.

Encrypted clusters at the OS-level by default.

ad625ce

openshift-ci-robot requested review from chancez and smarterclayton September 17, 2019 16:02

openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 17, 2019

cgwalters reviewed Sep 17, 2019

View reviewed changes

openshift-ci-robot requested review from ashcrow, imcleod and miabbott September 17, 2019 19:28

cgwalters mentioned this pull request Sep 18, 2019

WIP: Redeploy rootfs coreos/ignition-dracut#107

Closed

ashcrow reviewed Sep 18, 2019

View reviewed changes