google-guest-agent after 20240701.00 persists a file that locks systemd-networkd to a specific interface device name #401
updated the ticket after tracing the behaviour change to #386. We've pinned ourselves to
We've further identified the reason for the network interface name change to be the addition of local SSDs on the VM running the image (the VM running packer to build the image lacked NVMes). The NVMe takes PCIe slot ID 4, pushing the NIC to PCIe slot 5, resulting in the change from ens4 -> ens5 based on systemd's naming scheme.
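The renaming described above can be sketched as follows. This is a simplified illustration of systemd's predictable "slot" naming (`en` + `s<PCI hotplug slot>`), assuming the slot index alone determines the suffix; on a real instance the derivation can be inspected with `udevadm test-builtin net_id /sys/class/net/<iface>`:

```shell
# Illustration only: systemd's slot-based scheme names an Ethernet NIC
# "ens<N>" after its PCI hotplug slot index N.
slot_without_nvme=4   # build VM: NIC sits in slot 4 -> ens4 baked into the image
slot_with_nvme=5      # target VM: local NVMe SSD takes slot 4, NIC shifts to 5
echo "ens${slot_without_nvme}"   # name the persisted config expects
echo "ens${slot_with_nvme}"      # name the interface actually gets at boot
```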
We're looking into this issue. Does the instance have any networking, or is it completely broken?
Networking is completely broken. Cannot SSH or anything.
We have the same issue.
We have a package built from a previous stable state (with a newer version number) and are finishing up upgrade testing from the version with the issue to the previous stable package. We hope to release it shortly, and will update this bug with an ETA as soon as we have it.
We also have broken networking after guest-agent updated on our system. We're getting logs like:
It seems related to this:
We're running a Rocky 8 system and NetworkManager doesn't seem to understand the config that google-guest-agent is placing at /etc/NetworkManager/system-connections/google-guest-agent-eth0.nmconnection. The version of NetworkManager on this system is
The team will be releasing the new (old) version this morning.
Thanks for bringing up the issues you're seeing. We have rolled back the changes and released another agent
Hi @mikehardenize, we believe this might be a different issue. Can you provide reproduction steps and describe what functionality you expect to work that isn't working? The log messages you have posted are expected; the agent will log
The NetworkManager warnings indicate that it doesn't understand the
Fix the versions for local Google guest VM services so that they do not upgrade to versions that are known to have boot-time issues for the following combination:

- building the image using Packer on a build VM without local NVMe devices
- final image used on a VM with local NVMe devices

In this combination, network configurations persist that do not match the final naming conventions of the network interfaces because of the differing PCI bus layout.
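The version-fixing step above could look like the sketch below. These are hedged examples, assuming the package name used throughout this thread (`google-guest-agent`) and standard Debian/RHEL package managers; release the hold once a fixed agent ships:

```shell
# Debian/Ubuntu: prevent unattended upgrades of the agent
apt-mark hold google-guest-agent

# RHEL/Rocky: pin via the dnf versionlock plugin
dnf install python3-dnf-plugin-versionlock
dnf versionlock add google-guest-agent
```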
I gotta say it is scary as heck to see pull requests like this
We're lucky we run a fleet of canary instances and caught this early, before it made it to our other instances. I won't be dogmatic about the solution, but, given the total litany of issues we've run into with Google Cloud, this is one of the many nails in the coffin of "This cannot be trusted for production infrastructure".
For anyone still facing these issues, follow these steps to work around and update the
It's a Rocky Linux 8 system. We are using WireGuard via systemd-networkd. After the upgrade of guest-agent, a default route was added for the WireGuard gateway IP. This breaks our networking. The default route wasn't added prior to the upgrade.
This problem happened again this morning. Our Rocky Linux 8 server upgraded from
Hi @mikehardenize, can you elaborate a bit on your issue? Is this the same as you previously described (new routes getting added) or something else? This version of guest-agent does not manage the primary NIC; that's done by
We've had to scrap and rebuild that box now, so I'm not sure if it was the same issue as before re adding a new default route to the WireGuard interfaces. We have another Rocky 8 box which still had the old RPM installed, so we ended up just manually copying across the files that RPM installed during the build of the new server, and it worked. We'll pin to google-guest-agent-20241022.00-g1.el8.x86_64 on our existing boxes for now. I don't suppose you'd be able to supply a copy of the google-guest-agent-20241022.00-g1.el8.x86_64 RPM so we can do this more cleanly going forwards? I'm struggling to find this RPM anywhere. It's just a Rocky 8 system, with networkd installed, WireGuard installed from elrepo-release, and a WireGuard interface configured in /etc/systemd/network/. wg0.netdev:
wg0.network:
That's right, we serve only the single latest version of guest-agent. These config files you shared are not written by guest-agent; is there anything in
Note: if this is a single-NIC instance, the agent will not configure that interface; only on multi-NIC instances are secondary interfaces configured.
The config I supplied is for our WireGuard interface, which is brought up by networkd and was manually added by ourselves. Right now, that interface doesn't have a default route, by design. However, after your previous upgrade (which had to be rolled back) it started adding one. I'm not sure if this is what happened in this case again, as we don't have the original disk anymore. In
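For context, a systemd-networkd WireGuard interface with no default route typically looks like the sketch below. This is a hypothetical reconstruction (the actual wg0.network contents are not shown in this thread), with made-up addresses:

```ini
# /etc/systemd/network/wg0.network -- minimal sketch, hypothetical values.
[Match]
Name=wg0

[Network]
# Tunnel address only. There is deliberately no Gateway= line here, so
# networkd installs only the subnet route for 10.8.0.0/24, not a default
# route (0.0.0.0/0) via the tunnel.
Address=10.8.0.2/24
```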
NetworkManager did not understand this config:
@mikehardenize I think this file might be stale, left over from the older July version (the previous version that had this issue). I double-checked by launching a Rocky 8 instance running
The Guest Agent does not manage the primary NIC config unless it is explicitly configured to do so by setting
Refer to this for more details. About NetworkManager not understanding the config: I believe that's just a warning about an unknown flag -
we add this to the file to identify that it's agent-managed before making any modifications. But regardless, if you're installing
Hi, yes that file was left over from the previous time we had a problem. We don't have a copy of the disk from this time's failure, so I can't tell you what was on it. Is there any way we can get a copy of the google-guest-agent-20241022.00-g1.el8.x86_64 RPM so we can just keep a local copy of that and install and pin it?
No, currently we don't keep the full history of packages we've published; we always serve just the current latest. Would you mind trying
We pulled in a new release of the guest agent (`1:20240701.00-g1`) incorporating #396 and #386 during a packer build of a new VM image. This guest agent now writes a file `/etc/netplan/20-google-guest-agent-ethernet.yaml` with the contents:

vs. the previous default `/etc/netplan/90-default.yaml`.

The interface on the build instance is `ens4`, and the `20-google-guest-agent-ethernet.yaml` file hardcodes that interface name into the image. When the image is run on a new VM, if that VM has a different network interface name (e.g. we're seeing `ens5` on some VMs), the network interface fails to come up since the declaration in the config file is missing. This effectively breaks networking on the box, as the `ens5` interface is never brought up because `/run/systemd/network/10-netplan-all-en.network` is missing.

`/run/systemd/network/10-netplan-all-en.network` and `/etc/netplan/90-default.yaml` are missing post upgrade.

(built with `ens4`) to the VMs on our managed instance group (which come up with `ens5`).

We're running the `debian-cloud/debian-12` image with:

Post reboot, the guest agent is crashing because it can't reach the metadata API (since `ens5` is not up), so it presumably cannot regenerate the config for the new interface name.
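The failure mode described above can be illustrated with the following netplan sketches. These are reconstructions, not the actual elided file contents: the `all-en` definition is inferred from the generated filename `10-netplan-all-en.network`, and the hardcoded name from the report that the agent bakes `ens4` into the image:

```yaml
# Sketch of the previous default (/etc/netplan/90-default.yaml):
# a glob match covers any en* interface, so ens4 and ens5 both work.
network:
  version: 2
  ethernets:
    all-en:
      match:
        name: "en*"
      dhcp4: true
---
# Sketch of the agent-written file
# (/etc/netplan/20-google-guest-agent-ethernet.yaml): the build-time name
# ens4 is hardcoded, so a VM whose NIC enumerates as ens5 never matches
# and its interface is never brought up.
network:
  version: 2
  ethernets:
    ens4:
      dhcp4: true
```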