google-guest-agent after 20240701.00 persists a file that locks systemd-networkd to a specific interface device name #401

Open
char8 opened this issue Jul 16, 2024 · 21 comments

char8 commented Jul 16, 2024

We pulled in a new release of the guest agent (1:20240701.00-g1), incorporating #396 and #386, during a Packer build of a new VM image.

This guest agent now writes a file /etc/netplan/20-google-guest-agent-ethernet.yaml with the contents:

network:
    version: 2
    ethernets:
        ens5:
            match:
                name: ens5
            mtu: 1460
            dhcp4: true
            dhcp4-overrides:
                use-domains: true

versus the previous default /etc/netplan/90-default.yaml:

network:
    version: 2
    ethernets:
        all-en:
            match:
                name: en*
            dhcp4: true
            dhcp4-overrides:
                use-domains: true
            dhcp6: true
            dhcp6-overrides:
                use-domains: true
        all-eth:
            match:
                name: eth*
            dhcp4: true
            dhcp4-overrides:
                use-domains: true
            dhcp6: true
            dhcp6-overrides:
                use-domains: true

The interface on the build instance is ens4, and the 20-google-guest-agent-ethernet.yaml file hardcodes that interface name into the image.

When the image is run on a new VM, if that VM has a different network interface name (e.g. we're seeing ens5 on some VMs), the interface fails to come up since the matching declaration in the config file is missing. This effectively breaks networking on the box: the ens5 interface is never brought up because /run/systemd/network/10-netplan-all-en.network is missing (see the sketch after the list below).

  • we confirmed this by upgrading the guest agent on a running VM and observing that /run/systemd/network/10-netplan-all-en.network and /etc/netplan/90-default.yaml are missing post-upgrade.
  • we see no evidence that network device naming is predictable/persistent between reboots; a VM that comes up with a different interface name after a reboot will no longer bring up networking due to this change
  • our workflow for creating custom machine images is now broken, as the builder machine has a different network interface name (ens4) than the VMs in our managed instance group (which come up with ens5).
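
A quick way to confirm this state on an affected VM (a diagnostic sketch; the 10-netplan-all-en.network unit name is derived by netplan from the all-en key in 90-default.yaml):

ip -br link                    # interfaces the kernel sees
networkctl list                # what systemd-networkd is managing
ls -l /run/systemd/network/    # the networkd units netplan actually generated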

We're running the debian-cloud/debian-12 image with:

netplan.io           0.106-2+deb12u1
systemd              252.26-1~deb12u2
google-guest-agent   1:20240701.00-g1

Post-reboot, the guest agent crashes because it can't reach the metadata API (since ens5 is not up), so presumably it wouldn't be able to regenerate the config for the new interface name.

2024-07-16T03:39:09.740174+00:00 packer-6695cb52-72cf-9b08-c0c0-dcfffc97fcf8 google_guest_agent[1121]: ERROR instance_setup.go:159 Failed to reach MDS(all retries exhausted): exhausted all (100) retries, last error: request failed with status code: [-1], error: [error connecting to metadata server: Get "http://169.254.169.254/computeMetadata/v1/?alt=json&recursive=true&timeout_sec=60": dial tcp 169.254.169.254:80: connect: network is unreachable]
2024-07-16T03:39:09.740991+00:00 packer-6695cb52-72cf-9b08-c0c0-dcfffc97fcf8 systemd[1]: google-guest-agent.service: Main process exited, code=exited, status=1/FAILURE
2024-07-16T03:39:09.741072+00:00 packer-6695cb52-72cf-9b08-c0c0-dcfffc97fcf8 systemd[1]: google-guest-agent.service: Failed with result 'exit-code'.

char8 commented Jul 16, 2024

Updated the ticket after tracing the behaviour change to #386. We've pinned ourselves to 1:20240528.00-g1 for the moment (see the sketch below). It would seem that the guest agent should regenerate the netplan configs on boot based on the interface name, but we're not seeing it do that.
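
For anyone else pinning on Debian, a minimal sketch (assumes the known-good version is still available from the configured repo):

# downgrade to the last known-good release, then hold it
sudo apt-get install --allow-downgrades google-guest-agent=1:20240528.00-g1
sudo apt-mark hold google-guest-agent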


char8 commented Jul 16, 2024

We've further identified the cause of the interface name change: the addition of local SSDs on the VM running the image (the VM running Packer to build the image had no NVMe devices). The NVMe device takes PCIe slot 4, pushing the NIC to PCIe slot 5, which changes the name from ens4 to ens5 under systemd's predictable naming scheme.
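
One way to see where the slot-based name comes from (a diagnostic sketch, using the ens5 name from above):

# show the udev net_id properties systemd uses for predictable naming;
# ID_NET_NAME_SLOT reflects the PCIe slot number
sudo udevadm test-builtin net_id /sys/class/net/ens5 2>/dev/null

# cross-check the NIC's PCI address
readlink -f /sys/class/net/ens5/device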

@ChaitanyaKulkarni28
Member

We're looking into this issue. Does the instance have any networking, or is it completely broken?

drewhli self-assigned this Jul 16, 2024
@JakeCooper

Networking is completely broken. Cannot SSH or anything

@restingbull

We have the same issue.


gaughen commented Jul 16, 2024

We have a package built from a previous stable state (with a newer version number) and are finishing up upgrade testing from the version with the issue to the previous stable package. We hope to release it shortly, and will update this bug with an ETA as soon as we have it.

@mikehardenize

We also have broken networking after the guest-agent updated on our system.

We're getting logs like:

Jul 17 13:19:34 app1-staging-app google_guest_agent[1683]: Rolling back systemd-networkd
Jul 17 13:19:34 app1-staging-app google_guest_agent[1683]: rolling back changes for systemd-networkd

It seems related to this:

Jul 17 12:55:30 app1-staging-app google_guest_agent[1665]: Setting up NetworkManager
Jul 17 12:55:30 app1-staging-app NetworkManager[779]: <warn>  [1721220930.0963] keyfile: guest-agent: invalid setting name 'guest-agent'
Jul 17 12:55:30 app1-staging-app s3fs[1037]: ### retrying...

We're running a Rocky 8 system and NetworkManager doesn't seem to understand the config that google-guest-agent is placing at /etc/NetworkManager/system-connections/google-guest-agent-eth0.nmconnection. The version of NetworkManager on this system is NetworkManager-1.40.16-15.el8_9.x86_64. We have a Rocky 9 system where this isn't a problem, but that one has a newer version of NetworkManager which does understand the config: NetworkManager-1.46.0-8.el9_4.x86_64


gaughen commented Jul 17, 2024

The team will be releasing the new (old) version this morning.

@ChaitanyaKulkarni28
Member

Thanks for bringing up the issues you're seeing. We have rolled back the changes and released a new agent version, 20240716.00.


a-crate commented Jul 17, 2024

Hi @mikehardenize, we believe this might be a different issue. Can you provide reproduction steps and describe what functionality you expect to work that isn't working? The log messages you posted are expected: the agent logs Rolling back X for every network management service it's not going to configure, and it prints Setting up NetworkManager with no subsequent error, indicating it successfully wrote the NetworkManager config files.

The NetworkManager warnings indicate that it doesn't understand the guest-agent ini section, but this is expected, and as far as we can tell the configuration file is otherwise applied and the NIC is activated and configured correctly. Is there some functionality breakage you're seeing? We can't reproduce the broken networking on Rocky Linux 8.

tpdownes added a commit to tpdownes/hpc-toolkit that referenced this issue Jul 17, 2024
Fix the versions for local google guest VM services so that they do not
upgrade to versions that are known to have boot-time issues for the
following combination:

- building image using Packer on a build VM without local NVME devices
- final image used on a VM with local NVME devices

In this combination, network configurations persist that do not match
the final naming conventions of the network interfaces because of
differing PCI bus layout.

JakeCooper commented Jul 17, 2024

I gotta say, it is scary as heck to see pull requests like this:

  • No description
  • Hundreds of lines of critical networking changes
  • No tests (from what I can see?)

[screenshot of the pull request: CleanShot 2024-07-17 at 13 51 01]

We're lucky we run a fleet of canary instances and caught this early before it made it to our other instances.

I won't be dogmatic about the solution, but given the litany of issues we've run into with Google Cloud, this is one more nail in the coffin of "this cannot be trusted for production infrastructure".


drewhli commented Jul 18, 2024

For anyone still facing these issues, follow these steps to work around the problem and update the guest agent (a consolidated script follows the list). Similar steps can be found here.

  1. Detach the boot disk from the affected VM.
  2. Attach the detached disk to a different VM that's working.
  3. SSH into the second VM and mount the new disk to a folder. If the new disk is designated as sdb, the command may look like the following:
sudo mkdir /mnt/data && sudo mount /dev/sdb1 /mnt/data
  4. Delete the file /mnt/data/etc/netplan/20-google-guest-agent-ethernet.yaml.
  5. Re-add the default netplan configuration file: create a new file /mnt/data/etc/netplan/90-default.yaml and paste the following contents inside:
network:
    version: 2
    ethernets:
        all-en:
            match:
                name: en*
            dhcp4: true
            dhcp4-overrides:
                use-domains: true
            dhcp6: true
            dhcp6-overrides:
                use-domains: true
        all-eth:
            match:
                name: eth*
            dhcp4: true
            dhcp4-overrides:
                use-domains: true
            dhcp6: true
            dhcp6-overrides:
                use-domains: true
  6. Unmount the disk:
sudo umount /dev/sdb1
  7. Detach the disk from the second VM and re-attach it to the original VM. SSH and networking should be working again.
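
Consolidated, steps 3–6 look roughly like this on the rescue VM (a sketch assuming the disk shows up as /dev/sdb):

sudo mkdir -p /mnt/data
sudo mount /dev/sdb1 /mnt/data

# step 4: remove the interface-specific config baked into the image
sudo rm /mnt/data/etc/netplan/20-google-guest-agent-ethernet.yaml

# step 5: restore the wildcard default config shown above
sudo tee /mnt/data/etc/netplan/90-default.yaml >/dev/null <<'EOF'
network:
    version: 2
    ethernets:
        all-en:
            match:
                name: en*
            dhcp4: true
            dhcp4-overrides:
                use-domains: true
            dhcp6: true
            dhcp6-overrides:
                use-domains: true
        all-eth:
            match:
                name: eth*
            dhcp4: true
            dhcp4-overrides:
                use-domains: true
            dhcp6: true
            dhcp6-overrides:
                use-domains: true
EOF

# step 6: unmount before detaching
sudo umount /mnt/data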

@mikehardenize

Hi @mikehardenize, we believe this might be a different issue. Can you provide reproduction steps and describe what functionality you expect to work that isn't working?

It's a Rocky Linux 8 system. We are using WireGuard via systemd-networkd. After the guest-agent upgrade, a default route is added for the WireGuard gateway IP, which breaks our networking. The default route wasn't added prior to the upgrade.
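
A quick check for the unwanted route (a diagnostic sketch, assuming the WireGuard interface is named wg0 as in the configs further down):

# is there now a default route via the WireGuard gateway?
ip route show default
ip -6 route show default

# which routes are attached to the WireGuard interface?
ip route show dev wg0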

sharabiani pushed a commit to sharabiani/hpc-toolkit that referenced this issue Aug 12, 2024
@mikehardenize

This problem happened again this morning. Our Rocky Linux 8 server upgraded from google-guest-agent-20241022.00-g1.el8.x86_64 to google-guest-agent-20241031.00-g1.el8.x86_64 and we lost our network. How do we downgrade back to google-guest-agent-20241022.00-g1.el8.x86_64?

@ChaitanyaKulkarni28
Member

Hi @mikehardenize, can you elaborate a bit on your issue? Is it the same as you previously described (new routes getting added) or something else? This version of the guest agent does not manage the primary NIC; that's done by NetworkManager. So it's unlikely that the agent is configuring anything for the primary NIC here.

@mikehardenize

We've had to scrap and rebuild that box now, so I'm not sure if it was the same issue as before (a new default route being added to the WireGuard interface). We have another Rocky 8 box which still had the old RPM installed, so we ended up manually copying across the files that RPM installed during the build of the new server, and it worked. We'll pin to google-guest-agent-20241022.00-g1.el8.x86_64 on our existing boxes for now.
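
For the pinning itself, a minimal sketch using the dnf versionlock plugin (assumes the known-good version is currently installed):

sudo dnf install -y python3-dnf-plugin-versionlock
sudo dnf versionlock add google-guest-agent   # locks the currently installed version
sudo dnf versionlock list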

I don't suppose you'd be able to supply a copy of the google-guest-agent-20241022.00-g1.el8.x86_64 RPM for us so we can do this more cleanly going forward? I'm struggling to find this RPM anywhere.

It's just a Rocky 8 system, with networkd installed, wireguard installed from elrepo-release, and a WireGuard interface configured in /etc/systemd/network/:

wg0.netdev:

[NetDev]
Name=wg0
Kind=wireguard
Description=Wireguard Server

[WireGuard]
PrivateKey=*OBFUSCATED*

[WireGuardPeer]
PublicKey=*OBFUSCATED*
PresharedKey=*OBFUSCATED*
AllowedIPs=0.0.0.0/0
AllowedIPs=::/0
Endpoint=*OBFUSCATED*

wg0.network:

[Match]
Name=wg0

[Network]
Address=10.111.114.2/24
Address=fc00:114::2/64

[Route]
Gateway = 10.111.114.1
Destination = 10.111.114.0/24
GatewayOnLink = true

[Route]
Gateway = fc00:114::1
Destination = fc00:114::0/64
GatewayOnLink = true

@ChaitanyaKulkarni28
Member

I don't suppose you'd be able to supply a copy of the google-guest-agent-20241022.00-g1.el8.x86_64 RPM...

That's right, we serve only the single latest version of the guest agent.

The config files you shared are not written by the guest agent. Is there anything in /etc/NetworkManager/system-connections that the agent might've written?

Note: if this is a single-NIC instance, the agent will not configure that interface; secondary interfaces are configured only on multi-NIC instances.


mikehardenize commented Nov 13, 2024

The config I supplied is for our WireGuard interface, which we added manually and which is brought up by networkd. Right now, that interface doesn't have a default route, by design. However, after your previous upgrade (the one that had to be rolled back), one started being added. I'm not sure if this is what happened in this case again, as we don't have the original disk anymore.

In /etc/NetworkManager/system-connections the first time things broke, we had a file google-guest-agent-eth0.nmconnection containing:

[guest-agent]
ManagedByGuestAgent = true

[connection]
interface-name = eth0
id             = google-guest-agent-eth0
type           = ethernet

[ipv4]
method = auto

[ipv6]
method = auto

NetworkManager did not understand this config:

Jul 17 12:55:30 app1-staging-app google_guest_agent[1665]: Setting up NetworkManager
Jul 17 12:55:30 app1-staging-app NetworkManager[779]: <warn>  [1721220930.0963] keyfile: guest-agent: invalid setting name 'guest-agent'

@ChaitanyaKulkarni28
Member

@mikehardenize I think this file might be stale, left over from the older July version (the previous version that had this issue). I double-checked by launching a Rocky 8 instance running the 20241031.00 version of the agent, and it does not write any configs for the primary NIC:

# curl "http://metadata.google.internal/computeMetadata/v1/instance/image" -H "Metadata-Flavor: Google"
projects/rocky-linux-cloud/global/images/rocky-linux-8-v20241112
#
# ls /etc/NetworkManager/system-connections/
#
# rpm -qa | grep google-guest
google-guest-agent-20241031.00-g1.el8.x86_64

The guest agent does not manage the primary NIC config unless it is explicitly configured to do so by setting manage_primary_nic = true.

Refer to this for more details.
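
For completeness, a hypothetical sketch of opting in (assumes manage_primary_nic lives in the [NetworkInterfaces] section of the standard /etc/default/instance_configs.cfg; check the linked docs for the authoritative location and section name):

# opt the agent in to managing the primary NIC (section name is an assumption)
sudo tee -a /etc/default/instance_configs.cfg >/dev/null <<'EOF'
[NetworkInterfaces]
manage_primary_nic = true
EOF
sudo systemctl restart google-guest-agent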

About NetworkManager not understanding the config, I believe that's just a warning about the unknown key:

[guest-agent]
ManagedByGuestAgent = true

We add this to a file to identify it as agent-managed before making any modifications. Regardless, if you're running the 20241031.00 version of the guest agent, you should not see any agent-managed configs on a single-NIC instance; on a multi-NIC instance they will exist only for secondary NICs.

@mikehardenize

Hi, yes, that file was from the previous time we had a problem. We don't have a copy of the disk from this time's failure, so I can't tell you what was on it. Is there any way we can get a copy of the google-guest-agent-20241022.00-g1.el8.x86_64 RPM so we can keep a local copy of it and install and pin it?


ChaitanyaKulkarni28 commented Nov 14, 2024

No, we don't currently keep a full history of every package we've published; we serve only the current latest version.

Would you mind trying the 20241031.00 version in your test environment? I'm certain that it won't manage primary NIC configs (unless explicitly configured to, as mentioned before), meaning it won't write the file you were seeing before. If there is a bug, we want to make sure we fix it.
