Azure parse_network_config uses fallback cfg when generate IMDS network cfg fails #549
…lure to generate network config from IMDS
@anhvoms @Moustafa-Moustafa @trstringer This prevents invalid/corrupted IMDS network metadata from causing provisioning as a whole to fail.
@johnsonshi thanks for this PR. Could you attach, as a comment, the output of cloud-init query --all? I'm specifically interested in the network section as surfaced by IMDS on one of these failed nodes. If reproducible
One general question I have: if IMDS is in a state that cloud-init can't parse, is the IMDS network config content recoverable? As in, would retries buy us anything in these cases?
I'm afraid not. The IMDS inconsistencies and delays aren't deterministic, and it's impossible to determine whether the network metadata returned is actually complete. The root cause of these inconsistencies is a platform issue.
We discovered these by looking at the
Thanks @johnsonshi, sorry for the delayed response. I think we can drop that UT you mentioned. Please also update the pull request description to note the additional mlx5_core driver blacklist functionality in the fallback config.
We'll be using the pull request description as your squashed merge commit for this PR when this lands.
Apologies, I've got a couple of high-priority items, so this has been delayed.
@blackboxsw I've updated the PR description + added the comments on the VM instance SKUs with
Excellent @johnsonshi, thanks for the additions. I've launched an instance, upgraded it to this version of cloud-init, and see correct network configuration emitted and the network properly set up on the non-InfiniBand eth0 device.
ubuntu@SRU-worked-azure:~$ grep Trace /var/log/cloud-init.log
ubuntu@SRU-worked-azure:~$ cloud-init status --long
status: done
time: Thu, 24 Sep 2020 16:43:35 +0000
detail:
DataSourceAzure [seed=/var/lib/waagent]
ubuntu@SRU-worked-azure:~$ cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: true
            dhcp4-overrides: &id001
                route-metric: 100
            dhcp6: true
            dhcp6-overrides: *id001
            match:
                driver: hv_netvsc
                macaddress: 00:0d:3a:e2:d9:0e
            set-name: eth0
        eth1:
            dhcp4: true
            dhcp4-overrides: &id002
                route-metric: 200
            dhcp6: true
            dhcp6-overrides: *id002
            match:
                driver: hv_netvsc
                macaddress: 00:0d:3a:e2:de:17
            set-name: eth1
    version: 2
ubuntu@SRU-worked-azure:~$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:e2:d9:0e brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.4/24 brd 10.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 ace:cab:deca:deed::4/128 scope global dynamic noprefixroute
       valid_lft 17279947sec preferred_lft 8639947sec
    inet6 fe80::20d:3aff:fee2:d90e/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:e2:de:17 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fee2:de17/64 scope link
       valid_lft forever preferred_lft forever
4: rename4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 00:0d:3a:e2:d9:0e brd ff:ff:ff:ff:ff:ff
5: rename5: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth1 state UP group default qlen 1000
    link/ether 00:0d:3a:e2:de:17 brd ff:ff:ff:ff:ff:ff
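For anyone repeating this verification, the manual check of the rendered netplan file can be approximated in a few lines of Python. This is only a sketch: the function name interfaces_on_blacklisted_drivers is hypothetical and not part of cloud-init, and the dict below is a hand-copied subset of the netplan output above rather than something parsed from disk.

```python
# Hypothetical helper, not part of cloud-init: list rendered interfaces
# whose match.driver is one we expect fallback config to skip.
def interfaces_on_blacklisted_drivers(netplan_cfg, blacklist=("mlx5_core",)):
    ethernets = netplan_cfg.get("network", {}).get("ethernets", {})
    return sorted(
        name
        for name, cfg in ethernets.items()
        if cfg.get("match", {}).get("driver") in blacklist
    )


# Hand-copied subset of /etc/netplan/50-cloud-init.yaml from the run above.
rendered = {
    "network": {
        "version": 2,
        "ethernets": {
            "eth0": {
                "dhcp4": True,
                "match": {"driver": "hv_netvsc",
                          "macaddress": "00:0d:3a:e2:d9:0e"},
                "set-name": "eth0",
            },
            "eth1": {
                "dhcp4": True,
                "match": {"driver": "hv_netvsc",
                          "macaddress": "00:0d:3a:e2:de:17"},
                "set-name": "eth1",
            },
        },
    }
}

offenders = interfaces_on_blacklisted_drivers(rendered)
print(offenders or "no blacklisted drivers in rendered config")
```

An empty result matches what was observed here: both configured devices are hv_netvsc.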
Azure datasource's parse_network_config throws a fatal uncaught exception when an exception is raised during generation of network config from IMDS metadata. This happens when IMDS metadata is invalid/corrupted (such as when it is missing network or interface metadata). This causes the rest of provisioning to fail.

This changes parse_network_config to be a non-fatal implementation. Additionally, when generating network config from IMDS metadata fails, fall back on generating fallback network config (_generate_network_config_from_fallback_config).

This also changes fallback network config generation (_generate_network_config_from_fallback_config) to blacklist an additional driver: mlx5_core.
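
The control flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: generate_from_imds and generate_fallback are hypothetical stand-ins for the real generators, the metadata shape and the (name, driver) NIC list are assumptions, and the mlx4_core entry in the blacklist is assumed for illustration (the description only states that mlx5_core is added).

```python
import logging

LOG = logging.getLogger(__name__)

# mlx5_core comes from the PR description; mlx4_core is an assumption.
BLACKLIST_DRIVERS = ["mlx4_core", "mlx5_core"]


def generate_from_imds(imds_metadata):
    """Hypothetical stand-in: build v2 network config from IMDS metadata.

    Raises KeyError when the network or interface sections are missing.
    """
    ethernets = {}
    for idx, intf in enumerate(imds_metadata["network"]["interface"]):
        name = "eth%d" % idx
        ethernets[name] = {
            "dhcp4": True,
            "match": {"macaddress": intf["macAddress"].lower()},
            "set-name": name,
        }
    return {"version": 2, "ethernets": ethernets}


def generate_fallback(nics, blacklist_drivers=BLACKLIST_DRIVERS):
    """Hypothetical stand-in: DHCP on the first NIC not on a blacklisted driver.

    nics is an assumed list of (name, driver) tuples.
    """
    for name, driver in nics:
        if driver not in blacklist_drivers:
            return {
                "version": 2,
                "ethernets": {name: {"dhcp4": True, "set-name": name}},
            }
    return {"version": 2, "ethernets": {}}


def parse_network_config(imds_metadata, nics):
    """Non-fatal: any failure in IMDS-based generation falls back."""
    if imds_metadata:
        try:
            return generate_from_imds(imds_metadata)
        except Exception as e:
            LOG.error(
                "Failed generating network config from IMDS metadata: %s", e
            )
    return generate_fallback(nics)


# Metadata missing its network section no longer aborts provisioning.
corrupted = {"compute": {}}
nics = [("ib0", "mlx5_core"), ("eth0", "hv_netvsc")]
print(parse_network_config(corrupted, nics))
```

Catching the broad Exception is deliberate in this sketch: the point of the change is that no parsing failure on malformed metadata should be fatal, at the cost of possibly masking unrelated bugs behind a log message.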