F33: transitioning existing systems to systemd-resolved on upgrade #646

dustymabe · 2020-10-07T03:16:18Z

With the change to systemd-resolved we need to do some sort of intervention to make the systemd-resolved change take effect on existing systems that are upgraded.

The scriptlets for the systemd rpm have something for this but "upgrade" logic for OSTree systems doesn't really work because the compose starts with a fresh world view every time:

$ rpm -q --scripts systemd
<snip>
# Create /etc/resolv.conf symlink.
# We would also create it using tmpfiles, but let's do this here
# too before NetworkManager gets a chance. (systemd-tmpfiles invocation above
# does not do this, because it's marked with ! and we don't specify --boot.)
# https://bugzilla.redhat.com/show_bug.cgi?id=1873856
if systemctl -q is-enabled systemd-resolved.service &>/dev/null; then
  ln -fsv ../run/systemd/resolve/stub-resolv.conf /etc/resolv.conf
fi
<snip>

For us on existing systems the resolv.conf file will already exist and contain some contents like:

[core@fedora ~]$ cat /etc/resolv.conf 
# Generated by NetworkManager
nameserver 192.168.1.1

I suggest we write some migration logic that detects the # Generated by NetworkManager header in /etc/resolv.conf and, if it is present, runs ln -fsv ../run/systemd/resolve/stub-resolv.conf /etc/resolv.conf. According to the change document, as long as that symlink is set up NetworkManager knows what to do to take advantage of systemd-resolved.

Any resolv.conf that had been hand edited and not managed by NetworkManager would be left alone.
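
A minimal sketch of the detection step proposed above, assuming the marker is always the first line of the file (as in the example output). The function name is_nm_generated is hypothetical and not part of any existing tooling.

```shell
#!/bin/bash
# Hypothetical helper: succeed only when the file looks NetworkManager-generated.
is_nm_generated() {
    local file="$1"
    # Must be a regular file; a symlink means something else already owns it.
    [ -f "$file" ] && [ ! -L "$file" ] || return 1
    # NetworkManager writes its marker as the first line of the file.
    [ "$(head -n 1 "$file")" = "# Generated by NetworkManager" ]
}
```

A hand-edited file without the marker makes the function fail, so a caller would leave it alone.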

cgwalters · 2020-10-07T13:46:27Z

Yeah, all upgrade logic like this that depends on per-user/per-system state needs to be a systemd unit.
RPM %post should only be about things like generating cache files (ldconfig) etc.

Related discussion: https://mail.gnome.org/archives/ostree-list/2020-February/msg00000.html

dustymabe · 2020-10-07T13:53:36Z

Discussed with Luca and Jonathan. We agreed we don't have to solve this migration problem for the first release of F33 into the next stream, since DNS will continue to work through the NetworkManager-controlled resolv.conf for now.

We also noted that this is a problem that will need to be solved for other OSTree distributions for upgrades as well.

slankes · 2020-10-11T09:58:57Z

Just a data point: This change broke an install of mine that has unbound running as a caching DNS server in a container. Because that container could no longer start, all the others that depend on working DNS ceased to work as well. I have fixed this for now by masking systemd-resolved.service.

dustymabe · 2020-10-12T22:26:34Z

First off, thank you for running next and helping find issues for yourself and other users/community members.

Just so I understand fully, the unbound caching DNS server in a container is now failing to start because it's trying to bind to the same ports that systemd-resolved is now using and that's the conflict?

dustymabe · 2020-10-13T20:19:05Z

Discussed this briefly with @jlebon and @bgilbert. In the past we may have considered only shipping the enabled systemd-resolved in newly installed systems and leaving upgraded systems alone. However we would like to minimize "drift" from what the rest of Fedora is doing. The current proposal is:

For users running local resolvers (like @slankes) we'll put out a coreos-status post that details the problem and recommends they mask systemd-resolved sometime between now and the time Fedora 33 hits testing/stable. They'll need to do it anyway for fresh installs, so there is some action on their part needed anyway.

[MANUAL RUN]
ln -sf /dev/null /etc/systemd/system/systemd-resolved.service

[via FCCT]
storage:
  links:
  - path: /etc/systemd/system/systemd-resolved.service
    target: /dev/null

For all other users we'll auto migrate them by using a systemd service in a barrier release. This systemd service will run before NetworkManager and systemd-resolved. It will detect if systemd-resolved is enabled (i.e. it won't be if it's masked) and update resolv.conf to be a symlink to ../run/systemd/resolve/stub-resolv.conf if it's detected to have been managed by NM (detected via the # Generated by NetworkManager at the top of the file).
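
The proposal above could be sketched roughly as follows. This is an illustration under stated assumptions, not necessarily the script that would ship: the function name migrate_resolv_conf is made up, and the ordering before NetworkManager and systemd-resolved would have to come from the unit file itself (for example via Before=), which is not shown here.

```shell
#!/bin/bash
# Hypothetical one-shot migration; the default path is /etc/resolv.conf.
migrate_resolv_conf() {
    local file="${1:-/etc/resolv.conf}"
    # is-enabled fails for masked or disabled units, so those systems are skipped.
    systemctl -q is-enabled systemd-resolved.service >/dev/null 2>&1 || return 0
    # Only touch a regular file, never an existing symlink.
    [ -f "$file" ] && [ ! -L "$file" ] || return 0
    # Only touch files that NetworkManager generated.
    [ "$(head -n 1 "$file")" = "# Generated by NetworkManager" ] || return 0
    # Replace the file with the relative symlink used by the systemd scriptlet.
    ln -fs ../run/systemd/resolve/stub-resolv.conf "$file"
}
```

Every early exit returns 0 so that a skipped migration would not mark the unit as failed.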

slankes · 2020-10-14T07:24:40Z

That sounds sensible - thanks for picking the issue up.

travier · 2020-10-14T13:43:08Z

With FCCT, you can also use:

systemd:
  units:
    - name: systemd-resolved.service
      mask: true

dustymabe · 2020-10-14T16:26:55Z

> With FCCT, you can also use:
>
> systemd:
>   units:
>     - name: systemd-resolved.service
>       mask: true

The only problem there that I think people will run into is that there is no systemd-resolved.service in our current stable and testing streams. So I think it will fail.

dustymabe · 2020-10-14T16:29:55Z

oh actually, I forgot we just added it (but disabled by default).

jlebon · 2020-10-14T17:03:28Z

> For all other users we'll auto migrate them by using a systemd service in a barrier release. This systemd service will run before NetworkManager and systemd-resolved. It will detect if systemd-resolved is enabled (i.e. it won't be if it's masked) and update resolv.conf to be a symlink to ../run/systemd/resolve/stub-resolv.conf if it's detected to have been managed by NM (detected via the # Generated by NetworkManager at the top of the file).

One thing we discussed in the community meeting was whether we should have the "conditional on NM-managed" bit. It technically deviates from the Fedora version of this.

It makes sense offhand, though... since we're not actively disabling systemd-resolved in that case, it basically means that anyone who's touched /etc/resolv.conf will have it running, even though it's likely they would have wanted it disabled.

If instead we just say "we only look at is-enabled, just like the rest of Fedora", that simplifies the message and forces sysadmins to really see what they want there. (Also, if you've tweaked /etc/resolv.conf just because you were testing something temporarily, you don't lose out on the auto-migration.)

Not sure, I think I'm good either way, but just wanted to flag this.

basvdlei · 2020-10-14T21:16:06Z

I ran into the following issue when trying out next release 33.20201006.1.0: https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues

systemd-resolved configures a stub listener on the loopback interface, while the kubelet by default has pods with dnsPolicy: "Default" inherit the host's /etc/resolv.conf. Since containers have their own loopback interface, they will not be able to connect to systemd-resolved.

As far as I can tell, this will probably break DNS resolution on most Kubernetes installations, since most clusters will have CoreDNS deployed with dnsPolicy: "Default".

dustymabe · 2020-10-15T18:51:19Z

Thanks for pointing that out @basvdlei - the doc you linked to mentions using --resolv-conf /run/systemd/resolve/resolv.conf. Does that work in your testing?

That file isn't the stub (which is at /run/systemd/resolve/stub-resolv.conf), so it won't have the 127.0.0.53 address in it.

cc @dghubble @vrutkovs - do OKD and Typhoon have any special casing here?

vrutkovs · 2020-10-15T19:26:02Z

We're running OKD 4.6 nightlies with systemd-resolved enabled and didn't hit any problems so far

basvdlei · 2020-10-15T20:37:25Z

@dustymabe for historical reasons our kubelet container is running with /etc/resolv.conf volume mounts, which is why I hit this issue. Setting --resolv-conf /run/systemd/resolve/resolv.conf does work. But that's not compatible with the current Fedora CoreOS stable.
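
One way to bridge the stream incompatibility mentioned above would be to pick the path at start-up. A minimal sketch, assuming /run/systemd/resolve/resolv.conf only exists (and is non-empty) while systemd-resolved is running; the function name kubelet_resolv_conf and its optional root-prefix argument are hypothetical, the latter existing only to make the logic testable.

```shell
#!/bin/bash
# Hypothetical helper: print the resolv.conf path a kubelet could be pointed at.
# The optional first argument is a root prefix used for testing only.
kubelet_resolv_conf() {
    local root="${1:-}"
    # Prefer the non-stub file that systemd-resolved maintains, when present.
    if [ -s "$root/run/systemd/resolve/resolv.conf" ]; then
        echo "/run/systemd/resolve/resolv.conf"
    else
        # Fall back to the classic location on systems without systemd-resolved.
        echo "/etc/resolv.conf"
    fi
}
```

The result could then be passed along as --resolv-conf "$(kubelet_resolv_conf)" when launching the kubelet.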

Both podman and moby now have (mostly undocumented) workarounds that use systemd-resolved's non-stub resolv.conf when provisioning the container's resolv.conf. A kubelet running in one of those runtimes should get the correct DNS server(s) by default, which would explain why OKD and Typhoon didn't run into this.

I can probably also remove the resolv.conf volume mount. This should work with all streams and allow the nodes to be updated as well. But I'll have to do some additional testing.

Hopefully I'm the only one crazy enough to have done this 😀

dghubble · 2020-10-15T20:38:53Z

FCOS 32

/etc/resolv.conf (nameserver upstream)

FCOS 33

/etc/resolv.conf --> /run/systemd/resolve/stub-resolv.conf
/run/systemd/resolve/resolv.conf (nameserver upstream)
/run/systemd/resolve/stub-resolv.conf (nameserver 127.0.0.53)

CL / Flatcar

/etc/resolv.conf --> /run/systemd/resolve/resolv.conf 
/run/systemd/resolve/resolv.conf (nameserver upstream)
/run/systemd/resolve/stub-resolv.conf (nameserver 127.0.0.53)

Kubelet uses the default /etc/resolv.conf. So initially I'd expect what @basvdlei mentioned. But the Typhoon Kubelet is run as an image by podman. podman determines the /etc/resolv.conf Kubelet sees. And the effective /etc/resolv.conf has the upstream nameserver. I'm not sure if podman is intentionally handling this or it's a coincidence.

$ sudo podman exec -it 75e08b27e104 /bin/bash
$ cat /etc/resolv.conf
search region.compute.internal
nameserver 10.0.0.2

dghubble · 2020-10-15T20:42:41Z

Oh nice, thanks @basvdlei. So looks like podman is indeed intentionally seeing that /etc/resolv.conf is a symlink and using /run/systemd/resolve/resolv.conf (nameserver upstream). So I'd expect no change.
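
To tell which of the layouts listed above a given host is in, a small diagnostic along these lines may help. It is only a sketch: the function name resolv_conf_mode and the labels it prints are made up here, and it is not how podman itself makes the decision.

```shell
#!/bin/bash
# Hypothetical diagnostic: report how a resolv.conf path is wired up.
resolv_conf_mode() {
    local file="${1:-/etc/resolv.conf}"
    if [ ! -L "$file" ]; then
        # Not a symlink: either a plain file (FCOS 32 style) or nothing at all.
        if [ -e "$file" ]; then echo "file"; else echo "missing"; fi
        return 0
    fi
    # Compare the raw link target, since it may be relative (../run/...).
    case "$(readlink "$file")" in
        */systemd/resolve/stub-resolv.conf) echo "stub" ;;
        */systemd/resolve/resolv.conf)      echo "uplink" ;;
        *)                                  echo "other" ;;
    esac
}
```

In terms of the listing above, FCOS 32 would report file, FCOS 33 stub, and CL / Flatcar uplink.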

basvdlei · 2020-10-16T13:14:13Z

@dghubble yeah, it's not really obvious/transparent. Thanks for looking and all your hard work on Typhoon!

@dustymabe just wanted to confirm I can work around this issue by relying on podman to provision the resolv.conf in my kubelet container. I've never tried running the kubelet directly on the host, but if anyone is doing that, they might still run into this when upgrading.

Thinking out loud, so feel free to ignore. With "everything running in containers" on FCOS, almost nothing will be able to use the stub listener. Container workloads get none of the benefits, while DNS queries might give different results depending on whether you're inside a container or not. Not using the stub config would be a safer upgrade path.

dustymabe · 2020-10-16T15:51:14Z

> @dustymabe for historical reasons our kubelet container is running with /etc/resolv.conf volume mounts, which is why I hit this issue. Setting --resolv-conf /run/systemd/resolve/resolv.conf does work. But that's not compatible with the current Fedora CoreOS stable.

Right. The systemd-resolved change is only in next for now because that's the only stream that has been rebased to F33.

This systemd unit migrates the /etc/resolv.conf file on systems to point to ../run/systemd/resolve/stub-resolv.conf if users haven't set up a custom resolv.conf. It will only run on Fedora 33 systems and will only execute once (a single migration). Fixes: coreos/fedora-coreos-tracker#646

dustymabe · 2020-12-17T14:28:37Z

Note that due to complications we decided to not use systemd-resolved, but leave it enabled for now. This migration script will still run and systemd-resolved is still serving a function of populating entries in the file that is pointed to by /etc/resolv.conf, but it won't be the stub listener and glibc's resolver won't query systemd-resolved for DNS.

icedream · 2020-12-18T09:57:27Z

Some of the servers I maintain updated to the latest Fedora CoreOS stable release this morning, and all of them suddenly ran into DNS issues. The cause turned out to be systemd-resolved having been automatically enabled.

After the update, /etc/resolv.conf pointed to /run/systemd/resolve/stub-resolv.conf which does not exist (before the update /etc/resolv.conf was a file produced by NetworkManager).

systemd-resolved itself logged a rather confusing error: "Failed to symlink /run/systemd/resolve/stub-resolv.conf: Permission denied". Considering the last comment in this issue I think this is intended? For now I fixed it by disabling systemd-resolved, deleting the symlink and restarting NetworkManager.

Just wanted to ask if I should actually go ahead and completely mask systemd-resolved instead if I am already on F33 or whether what I have done is enough to solve the issue.

EDIT: Even with the service being masked before upgrading to F33 it seems the resolv.conf file still gets replaced with a symlink to stub-resolv.conf. Not sure if this has any relevance or whether that's erroneous behavior.
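
The symptom described above, /etc/resolv.conf pointing at a stub-resolv.conf that does not exist, can be checked for directly. A minimal sketch; the function name is_dangling_symlink is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: succeed when the path is a symlink whose target is missing.
is_dangling_symlink() {
    local file="${1:-/etc/resolv.conf}"
    # -L tests the link itself; -e follows it and fails if the target is absent.
    [ -L "$file" ] && [ ! -e "$file" ]
}
```

A success here only says that DNS configuration is broken on the host, not why systemd-resolved failed to create the target.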

dustymabe · 2020-12-18T13:21:54Z

hey @icedream - I'm almost certain you're hitting the SELinux issue mentioned in https://discussion.fedoraproject.org/t/fedora-coreos-rebasing-to-fedora-33-features-and-known-issues/25474. If that's the case, follow the steps to restore the SELinux policy from the base config and then apply your settings back on top. If possible please leave systemd-resolved unmasked so that you can stay with the defaults provided by FCOS, which will lead to fewer problems in the future. Otherwise, what you've done to restore /etc/resolv.conf to be managed by NetworkManager should suffice. Either way, you'll want to bring your SELinux policy up to date.

icedream · 2020-12-18T16:00:57Z

It turns out I do have a changed SELinux policy due to me launching SSH without systemd socket activation on a separate port. That change occurs on every boot though due to a system service I set up, so I will follow this and see if it works, thank you very much for the info @dustymabe!

icedream · 2021-01-18T18:13:12Z

Unfortunately after a bit of researching it seems that semanage port is the only way to label a port as being usable by the SSH server, and from what I understand it always overwrites the policy files. For now I will have to manually restore the policy, which unfortunately also means I will have to do manual upgrades instead of automatic ones to avoid this error for future updates until this is solved in another way.

EDIT: Experimental thought, but one could immediately overwrite the policy files via rsync as soon as semanage port has done its job, but that feels fragile and I doubt it will go well.

Anyways, that will do as a workaround for me to get DNS with systemd-resolved back working.

dustymabe mentioned this issue Oct 7, 2020

next: new release on 2020-10-07 (33.20201006.1.0) coreos/fedora-coreos-streams#197

Closed

35 tasks

lucab added the fallout/f33 label Oct 7, 2020

dustymabe mentioned this issue Oct 14, 2020

tracker: Fedora 33 rebase work #609

Closed

dustymabe mentioned this issue Oct 16, 2020

next: new release on 2020-10-20 (33.20201020.1.0) coreos/fedora-coreos-streams#207

Closed

35 tasks

dustymabe mentioned this issue Oct 19, 2020

overlay: 15fcos: add systemd unit to migrate to systemd-resolved coreos/fedora-coreos-config#700

Merged

dustymabe self-assigned this Oct 20, 2020

dustymabe added the jira for syncing to jira label Oct 20, 2020

dustymabe closed this as completed in coreos/fedora-coreos-config#700 Oct 21, 2020

sinnykumari added a commit to sinnykumari/fedora-coreos-streams that referenced this issue Oct 21, 2020

next: rollout 33.20201020.1.0 (barrier release)

73f18ca

See coreos/fedora-coreos-tracker#646

sinnykumari mentioned this issue Oct 21, 2020

next: rollout 33.20201020.1.0 (barrier release) coreos/fedora-coreos-streams#212

Merged

sinnykumari added a commit to coreos/fedora-coreos-streams that referenced this issue Oct 21, 2020

next: rollout 33.20201020.1.0 (barrier release)

d70968b

See coreos/fedora-coreos-tracker#646

dustymabe mentioned this issue Oct 28, 2020

2020-10-28: gather status update for Fedora Council #650

Closed

dustymabe added a commit to dustymabe/fedora-coreos-streams that referenced this issue Nov 18, 2020

testing: add barrier to the 33.20201116.2.0 release.

f2d5226

This is the first f33 release on the `testing` stream. Let's make it a barrier as agreed upon in coreos/fedora-coreos-tracker#646.

dustymabe mentioned this issue Nov 18, 2020

testing: add barrier to the 33.20201116.2.0 release. coreos/fedora-coreos-streams#227

Merged

dustymabe added a commit to dustymabe/fedora-coreos-streams that referenced this issue Nov 18, 2020

testing: add barrier to the 33.20201116.2.0 release.

bc5d7d2

This is the first f33 release on the `testing` stream. Let's make it a barrier as agreed upon in coreos/fedora-coreos-tracker#646.

dustymabe added a commit to coreos/fedora-coreos-streams that referenced this issue Nov 19, 2020

testing: add barrier to the 33.20201116.2.0 release.

2bc3cec

This is the first f33 release on the `testing` stream. Let's make it a barrier as agreed upon in coreos/fedora-coreos-tracker#646.

dustymabe mentioned this issue Dec 11, 2020

default hostname now is fedora, used to be localhost #649

Closed

gtema mentioned this issue Jan 21, 2021

Fresh Installation on FCOS33 fails due to https://github.com/coreos/fedora-coreos-tracker/issues/646 okd-project/okd#477

Closed

mecseid mentioned this issue Jul 13, 2021

DNSConfigForming event on Fedora CoreOS with hostNetwork rancher/rke#2570

Closed

xscd mentioned this issue Aug 28, 2021

[BUG] Silverblue uses systemd-resolved in foreign mode by default instead of stub mode fedora-silverblue/issue-tracker#192

Closed
