Add filesystem cleanExcept directive to preserve wanted files #1316

Closed
wants to merge 1 commit into from


pothos
Contributor

@pothos pothos commented Feb 2, 2022

The wipeFilesystem directive causes all state to be lost. A more
fine-grained mechanism is needed to clean a filesystem from previous
unwanted state while allowing for some files or directories to be kept.

Add a new cleanExcept directive that when specified will remove all
directories and files on the filesystem except those that match a list
of regular expressions.

(Didn't look into tests yet)

Please give any feedback, I find this valuable for handling configuration changes without config drift but it could also be used for reusing data disks as alternative to wiping them.
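
A minimal sketch of the semantics described above, not the code from this PR. It assumes that each regular expression must match a whole absolute path, that everything below a matched directory is kept, and that parents of kept paths survive; `plan_clean_except` is a hypothetical name, and it only plans deletions over a list of paths rather than touching a real filesystem.

```python
import re
from pathlib import PurePosixPath


def plan_clean_except(paths, keep_patterns):
    """Return the sorted paths that would be removed (nothing is deleted here)."""
    compiled = [re.compile(p) for p in keep_patterns]

    def matched(path):
        # Assumption: a pattern has to match the entire path.
        return any(rx.fullmatch(path) for rx in compiled)

    def kept(path):
        # Keep a path if it matches, or if it lives below a matching directory.
        parents = (str(a) for a in PurePosixPath(path).parents)
        return matched(path) or any(matched(a) for a in parents)

    protected = set()
    for path in paths:
        if kept(path):
            protected.add(path)
            # Parents of a kept path must survive, or the path would vanish too.
            protected.update(str(a) for a in PurePosixPath(path).parents)
    return sorted(p for p in paths if p not in protected)


listing = [
    "/etc", "/etc/hostname", "/etc/ssh", "/etc/ssh/ssh_host_rsa_key",
    "/etc/ssh/sshd_config", "/test", "/var", "/var/lib",
    "/var/lib/docker", "/var/lib/docker/overlay2", "/var/tmp",
]
deleted = plan_clean_except(listing, ["/var/lib/docker", "/etc/ssh/ssh_host_.*"])
print(deleted)
```

With this listing, the SSH host key and the Docker directory tree stay, while `/etc/hostname`, `/etc/ssh/sshd_config`, `/test`, and `/var/tmp` are planned for removal.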

@pothos pothos force-pushed the kai/ignitionv3-rerun-poc branch from fa72e70 to 4a3227b Compare February 2, 2022 17:58
@pothos
Contributor Author

pothos commented Feb 2, 2022

Ok, I haven't figured out yet how to do the translation (panic: Translator not defined for types.Config to types.Config [recovered] panic: Translator not defined for types.Config to types.Config), but for the regular case with 3.4.0-experimental this config worked (on a Flatcar test build):

{
  "ignition": {
    "version": "3.4.0-experimental"
  },
  "storage": {
    "files": [
      {
        "path": "/test",
        "contents": {
          "source": "data:,helloworld%0A",
          "verification": {}
        },
        "mode": 420
      }
    ],
    "filesystems": [
      {
        "device": "/dev/disk/by-label/ROOT",
        "format": "ext4",
        "label": "ROOT",
        "cleanExcept": ["/var/lib/docker", "/var/lib/containerd", "/etc/ssh/ssh_host_.*", "/var/log"]
      }
    ]
  }
}
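
For readers unfamiliar with the file entry in the config above: the `source` is a percent-encoded `data:` URL and `mode` is written in decimal. The short sketch below decodes both; `decode_data_url` is a hypothetical helper that handles only the plain (non-base64) form used here.

```python
from urllib.parse import unquote


def decode_data_url(source: str) -> str:
    """Decode a plain (non-base64) 'data:,' URL into its text payload."""
    prefix = "data:,"
    if not source.startswith(prefix):
        raise ValueError("only plain 'data:,' URLs are handled in this sketch")
    return unquote(source[len(prefix):])


contents = decode_data_url("data:,helloworld%0A")
mode = 420

print(repr(contents))  # 'helloworld\n'
print(oct(mode))       # 0o644, i.e. rw-r--r--
```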

@bgilbert
Contributor

bgilbert commented Feb 2, 2022

Thanks for the PR! Note that the storage.filesystems section, and the disks stage generally, don't deal with the contents of filesystems. Deletion of files/directories or directory trees could make sense as part of the files stage, though (see #739).

I'd like to know more about the use case:

> I find this valuable for handling configuration changes without config drift but it could also be used for reusing data disks as alternative to wiping them.

The second one is the use case I was imagining. Could you expand on the first one? It sounds as though you're trying to rerun Ignition to modify an existing OS installation, which is explicitly not supported.

@pothos
Contributor Author

pothos commented Feb 3, 2022

The files stage is also good. I had just placed it close to wipeFilesystem, which is the only similar option available now, and initially I wanted it to work per filesystem. When moved to the files stage it would clean all filesystems, but that's ok if it's documented.

On the use case, yes, this is about rerunning Ignition - currently this is possible with wipeFilesystem but it still discards too much local data, thus the idea here of preserving selectively.
I think this is the missing piece to be able to actually use Ignition in many environments, and it goes in line with "Ignition encourages immutable infrastructure, in which machine modification requires that users discard the old node and re-provision the machine. This maintains the user’s machines in a well known state with relatively simple tooling." because it (still) gives a way to discard the node's state without having to start provisioning from scratch.
Taking a one-line config change as an example, do you really need to throw away all local data and start a new node? In many cases this is not easily possible at all because it takes too long, especially if data has to be moved to the node again, and it is disruptive because the IP address, SSH host key, and local data all get lost. Due to this, one currently has to resort to things like Ansible to manage the configuration - which makes Ignition almost useless because it would just hold duplicate initial configuration, and there are often other means available for injecting SSH keys to let Ansible do its job.
One can already rerun Ignition in combination with wipeFilesystem, but it still loses too much local data. With the new feature here one can reprovision the instance with Ignition and hit a good compromise that doesn't need Ansible. Doing the userdata change and triggering Ignition again is a very customized action already. I'm not recommending rerunning Ignition in general, because it requires the user to understand which files may be preserved and which may not in order to maintain the properties Ignition has.

@bgilbert
Contributor

bgilbert commented Feb 3, 2022

I see a few related issues here:

  1. The desire to modify nodes in place, via some sort of configuration management similar to Ansible, rather than reset them and start over. Ignition does not and will not support this. Ignition is designed to apply a container-like workflow to nodes: update your configuration, test it in staging, launch some new nodes in production, migrate your workloads, then tear down the old nodes.

    If you try to modify nodes in place, you'll inevitably miss some state drift, may not have fully tested the configuration changes you're applying, and (if you're applying them to a running node) will hit race conditions inherent to modifying configs at runtime. (These are the same reasons you wouldn't shell into a running container and update its software.) Don't do this. Immutable infrastructure requires a fundamental shift in how nodes are managed, but it's worth it.

  2. Reducing the wall-clock time and infrastructure needed to reprovision nodes. Some hardware spends a lot of time in POST, and it's also inefficient to redownload and write out an entire OS image identical to the OS already installed. A "factory reset" capability would be a useful optimization (for example, see "Add factory reset capability", fedora-coreos-tracker#399), but it's better done by the OS than by Ignition for a couple of reasons:

    1. The OS knows how to reset itself, and the user doesn't. Resetting might involve rerunning the OS installer with some special options, or might involve removing files and copying other files around. A simple (incomplete) implementation on Fedora CoreOS might involve deleting /var and replacing /etc with a copy of /usr/etc. It's awkward to ask the user to write a config that correctly handles this for their OS.
    2. Even if we asked the user to specify what to reset, Ignition can't handle that unless we allow Ignition to run again on a system that's already been provisioned. Because of point 1 above, we won't do that.
  3. Reprovisioning nodes without wiping user data. This is a useful optimization, since the node might contain a lot of cached data that could be reused, or persistent data that could be reused without copying it from another replica. The usual approach for doing this today is to put user data on a separate filesystem which is defined in the Ignition config with wipeFilesystem false. Ignition will provision that filesystem on initial install, and then leave it alone on reinstalls.

    The OS's implementation of factory reset could help here. If the OS has a way to reset the root filesystem while leaving certain data intact, it could offer an option to do so, for example a command like factory-reset --preserve /srv/important. That would allow retaining state across reprovisions without putting it on a separate filesystem.
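
As a rough illustration of the separate-filesystem approach from point 3, the sketch below builds a config fragment in which the data filesystem has wipeFilesystem set to false. The device path, label, and mount path are hypothetical, `data_filesystem` is an illustrative helper, and the partition definition and the mount unit a real config would need are omitted.

```python
import json


def data_filesystem(device: str, label: str, mount_path: str,
                    fs_format: str = "ext4") -> dict:
    """Describe a filesystem that is created once and reused afterwards."""
    return {
        "device": device,
        "format": fs_format,
        "label": label,
        "path": mount_path,
        # False: an existing filesystem that matches is left alone.
        "wipeFilesystem": False,
    }


config = {
    "ignition": {"version": "3.3.0"},
    "storage": {
        "filesystems": [
            # Hypothetical device, label, and mount path for user data.
            data_filesystem("/dev/disk/by-partlabel/data", "DATA", "/var/lib/data"),
        ]
    },
}
print(json.dumps(config, indent=2))
```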

@pothos
Contributor Author

pothos commented Feb 3, 2022

Thanks for the link to the OS reset discussions.
Maybe I should say again that I approached this from the Flatcar Container Linux view, i.e., wipeFilesystem for the rootfs is already quite a good OS "factory reset" and what I added here is the --preserve /srv/important part you suggested.
This is not about rerunning Ignition "at runtime" but through a reboot. This means I'm already as convinced as you regarding point 1.

I think it makes sense to have that in Ignition because it is close to the configuration the user specifies. Whether preserving something is valid or not is tied to the application. Yes, you are right, it's even more tied to the OS and it seems strange to put this in the hands of the user, but in Flatcar's case we can wipe everything on the rootfs and the user would only state the paths to preserve that they care about. Having this as part of the OS reset is also ok. Maybe we could clean the rootfs in the systemd shutdown hook shortly before the reboot instead of continuing with this PR; it just gets "farer" away from the single config file Ignition would give me to express all this.

@bgilbert
Contributor

bgilbert commented Feb 3, 2022

Point 1 isn't only about not reconfiguring nodes while they're running; it's about not reconfiguring them at all. If you want to rerun Ignition, the only safe way is to reset the node to factory state first. Otherwise, if the user's config fails to properly perform the reset, the node will still boot successfully, but with a mix of old and new customizations. (And if the machine ID isn't cleared, any new systemd services won't be enabled correctly.) This violates the principle that Ignition should fail if it can't deliver what was requested. Even if we accepted that possibility for Flatcar, we'd still be adding a special feature for factory reset which doesn't help OSes with more complicated reset procedures.

I agree that it's unfortunate to require the user to specify preserved directories in a separate command, outside the Ignition config. One option is for Flatcar to define a config file, say /etc/preserve-on-reset, which lists the paths to preserve. Ignition configs could write that file, and the factory-reset tool could read it. The list would be defined by the previous config instead of the new one, which might be a bit unexpected, but the previous config is the one that knows how the machine is currently being used.

(Or the tool could parse the config file out of the new Ignition config, but that's ugly.)
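
The /etc/preserve-on-reset idea could look roughly like the sketch below. The file format (one absolute path per line, `#` comments) and both function names are assumptions made for illustration; no such file or tool is defined in this thread.

```python
from urllib.parse import quote, unquote

PRESERVE_LIST_PATH = "/etc/preserve-on-reset"


def preserve_file_entry(paths):
    """Build a storage.files entry that writes the preserve list."""
    body = "".join(f"{p}\n" for p in paths)
    return {
        "path": PRESERVE_LIST_PATH,
        "contents": {"source": "data:," + quote(body)},
        "mode": 420,  # decimal for octal 0644
    }


def parse_preserve_list(text):
    """What a hypothetical factory-reset tool would do when reading the file."""
    result = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            result.append(line)
    return result


entry = preserve_file_entry(["/srv/important", "/var/lib/docker"])
written = unquote(entry["contents"]["source"][len("data:,"):])
print(entry["contents"]["source"])
print(parse_preserve_list(written))
```

The list is written by the config that provisions the node and read later by the reset tool, which matches the observation that the previous config is the one that knows how the machine is being used.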

By the way, unless Flatcar fixed this, wiping / in Container Linux will give you a working system but is not exactly a factory reset. /etc ships with files that aren't recreated by systemd-tmpfiles.

@pothos
Contributor Author

pothos commented Feb 4, 2022

The idea was that cleanExcept deletes everything except the paths given there, to avoid ending up with a mix of old and new customizations. The requirement is that the preserved paths do not include old configuration. In this regard your idea to couple it to the old config instead of the new config makes sense.

I will close this and rather move the cleaning step to a new factory reset action. In the end for Flatcar it's the same thing, just done before the reboot instead of by Ignition. The factory reset action could also ensure that the user doesn't try to preserve /etc/machine-id and instead offer a flag that sets the kernel arg systemd.machine_id (as I called out in the docs entry here). For preventing other mistakes there could be a warning that /etc/systemd is likely not the right thing to preserve - and maybe even more warnings or hard errors could be done based on the old ignition config; still it will require some thinking on what is valid and what not for the software in question.
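
The checks mentioned above might be sketched as follows. This is only an illustration under assumptions: `check_preserve_paths` is a hypothetical function, the forbidden and discouraged sets contain just the two paths named in this comment, and a real tool would need a list tailored to the OS.

```python
from pathlib import PurePosixPath

FORBIDDEN = {PurePosixPath("/etc/machine-id")}   # hard error
DISCOURAGED = {PurePosixPath("/etc/systemd")}    # warning only


def covers(parent, child):
    """True if preserving `parent` would also preserve `child`."""
    return parent == child or parent in child.parents


def check_preserve_paths(paths):
    """Return (errors, warnings) for a list of paths to preserve."""
    errors, warnings = [], []
    for raw in paths:
        path = PurePosixPath(raw)
        if not path.is_absolute():
            errors.append(f"{raw}: must be an absolute path")
        elif any(covers(path, f) for f in FORBIDDEN):
            # Preserving /etc would silently keep /etc/machine-id as well.
            errors.append(f"{raw}: would preserve the machine ID")
        elif any(covers(path, d) or covers(d, path) for d in DISCOURAGED):
            warnings.append(f"{raw}: likely carries old configuration")
    return errors, warnings


errors, warnings = check_preserve_paths(
    ["/etc/machine-id", "/etc", "/etc/systemd/system", "/srv/important", "relative"]
)
print(errors)
print(warnings)
```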

@pothos pothos closed this Feb 4, 2022
@bgilbert
Contributor

bgilbert commented Feb 6, 2022

Sounds good. Thanks for the discussion!
