cloud-init upgrade causes vultr init networking to fail. #5092

Open
pnearing opened this issue Mar 23, 2024 · 10 comments
Labels
incomplete Action required by submitter

Comments

@pnearing

pnearing commented Mar 23, 2024

On upgrading an Ubuntu Mantic server on Vultr, I started getting an error on boot:

2024-03-23 15:02:27,710 - url_helper.py[WARNING]: Calling 'None' failed [119/120s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7aefa0738690>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]

And now this message appears on login:


This system is using the EC2 Metadata Service, but does not appear to
be running on Amazon EC2 or one of cloud-init's known platforms that
provide a EC2 Metadata service. In the future, cloud-init may stop
reading metadata from the EC2 Metadata Service unless the platform can
be identified.

If you are seeing this message, please file a bug against
cloud-init at
https://github.com/canonical/cloud-init/issues
Make sure to include the cloud provider your instance is
running on.

For more information see
#2795

After you have filed a bug, you can disable this warning by
launching your instance with the cloud-config below, or
putting that content into
/etc/cloud/cloud.cfg.d/99-ec2-datasource.cfg

#cloud-config
datasource:
  Ec2:
    strict_id: false


Disable the warnings above by:
touch /root/.cloud-warnings.skip
or
touch /var/lib/cloud/instance/warnings/.skip

If you require any more information, please let me know.

@blackboxsw
Collaborator

Can you please perform a cloud-init collect-logs on the system and attach the .tgz to this bug to give us a bit more information? Also, are there any other files besides /etc/cloud/cloud.cfg at play here?
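
(As a rough sketch, assuming a default Ubuntu install, the collection step looks like this; cloud-init.tar.gz is the command's default output name:)

sudo cloud-init collect-logs
# writes cloud-init.tar.gz to the current directory; attach that file to this issue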

@blackboxsw
Collaborator

Generally speaking, we'd expect the Vultr datasource to be discovered here if we are using the latest cloud-init on Mantic. So I'm presuming there is an issue earlier in the logs that led to Ec2 being detected instead of Vultr. The cloud-init collect-logs requested above will hopefully give us all the information we need to discover how this instance managed to not detect Vultr and fell back to Ec2. The logs of most interest here (which will be included in that tar file) are /run/cloud-init/ds-identify.log (initial datasource detection) and /var/log/cloud-init.log (which will potentially show us why Vultr wasn't detected). cloud-init status --format=json may also show you known errors quickly.
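
(A quick sketch of those checks run on the affected instance; the grep is just a convenience, not an official cloud-init command:)

cloud-init status --format=json   # check the "errors" field for known failures
grep -i vultr /run/cloud-init/ds-identify.log /var/log/cloud-init.log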

@blackboxsw blackboxsw added the incomplete Action required by submitter label Mar 25, 2024
@blackboxsw
Copy link
Collaborator

Yes, something else is going on here besides just the upgrade path. I launched a Vultr Mantic 23.10 instance with cloud-init 23.3.3, upgraded to the latest cloud-init 23.4.4, and rebooted with no issues in Vultr datasource detection.

I did recognize a known small bug involving a warning about scripts/vendor, for which a fix has already landed in #4986. But that issue would not have caused the Vultr datasource to go undiscovered.

root@test-mantic:~# cloud-init --version
/usr/bin/cloud-init 23.4.4-0ubuntu0~23.10.1
root@test-mantic:~# cloud-id
vultr

@blackboxsw
Collaborator

CC: @eb3095, just FYI, as I don't see a problem at the moment, but we'll wait on logs.

@pnearing
Author

Please find the logs attached. Also, I've not changed any config in /etc/cloud.
cloud-init.tar.gz

@blackboxsw
Collaborator

Thanks a lot for the logs @pnearing. As near as I can tell, something between 03/09 and the reboot after 03/23 altered the list of datasources that cloud-init tried to discover on this system from [ Vultr, None ] to the full list of all potential datasources:
2024-03-23 13:31:15,412 - __init__.py[DEBUG]: Looking for data source in: ['NoCloud', 'ConfigDrive', 'OpenNebula', 'DigitalOcean', 'Azure', 'AltCloud', 'OVF', 'MAAS', 'GCE', 'OpenStack', 'CloudSigma', 'SmartOS', 'Bigstep', 'Scaleway', 'AliYun', 'Ec2', 'CloudStack', 'Hetzner', 'IBMCloud', 'Oracle', 'Exoscale', 'RbxCloud', 'UpCloud', 'VMware', 'Vultr', 'LXD', 'NWCS', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']

Normally /usr/lib/cloud-init/ds-identify would filter this list of datasources down to only what could be viable, but there is configuration on this Vultr instance setting manual_cache_clean: true, which prevents ds-identify from filtering this list of datasources in the systemd generator timeframe. You can see that breadcrumb comment in /run/cloud-init/ds-identify.log: "manual_cache_clean enabled. Not writing datasource_list." This prevents ds-identify from writing out /run/cloud-init/cloud.cfg with a limited datasource_list: [ Vultr, None ] set of values. Therefore you now see in the latest cloud-init logs that cloud-init spends a long time trying to detect a whole bunch of inapplicable datasources, along with that lovely error banner message telling you to file a bug.
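
(For anyone else hitting this, a minimal sketch of confirming the same breadcrumbs on an affected instance, using only the paths mentioned above:)

grep manual_cache_clean /run/cloud-init/ds-identify.log
grep -r manual_cache_clean /etc/cloud/cloud.cfg /etc/cloud/cloud.cfg.d/
cat /run/cloud-init/cloud.cfg   # per the above, the filtered datasource_list will be absent here when manual_cache_clean is set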

Normally I would expect to see /etc/cloud/cloud.cfg with a limited datasource_list: ['Vultr']. Did something change in /etc/cloud or /etc/cloud/cloud.cfg.d/*.cfg files to change this default setting? This ultimately is likely the problem we are seeing here.

Either the file /var/lib/cloud/instance/manual-clean exists on this machine or manual_cache_clean: true was set in configuration or user-data; when the manual-clean marker exists, ds-identify skips writing the filtered datasource_list as described above.

@TheRealFalcon
Member

Did something change in /etc/cloud or /etc/cloud/cloud.cfg.d/*.cfg files to change this default setting? This ultimately is likely the problem we are seeing here.

No config change needed. The Python version changed (presumably on upgrade), causing the cache to clear.

@eb3095
Contributor

eb3095 commented Mar 26, 2024

Yeah, this sounds like an extension of one of the issues I was dealing with in IRC. manual_cache_clean: true was added (I had mentioned we were doing this) because we were seeing issues where, if something broke with our host networking or DHCP failed, the server would re-init as NoCloud or whatever that default was, cycling all the keys and the root user. That was a terrible UX for our users, and it caused the server to re-init yet again when networking came back. That was the only solution we were able to find to prevent this behavior, but we didn't make that the default option; we change it in our own provided cloud.cfg in our images.

I'd be happy to reopen this issue and find a more amicable solution so we don't need to do that.

@blackboxsw
Collaborator

blackboxsw commented Mar 26, 2024

@eb3095 I did see the provided /etc/cloud/cloud.cfg in your images, which does limit datasource_list: [ Vultr, None ]. If you are generating images and packaging files delivered to /etc/cloud for configuration, you may want to write the datasource_list configuration to a file like /etc/cloud/cloud.cfg.d/95-ds-vultr.cfg containing:

datasource_list: [ Vultr, None ]

The reason being that cloud-init upstream (and dpkg-reconfigure cloud-init) will write /etc/cloud/cloud.cfg.d/90_dpkg.cfg, which overrides the default datasource_list with the potentially long list of all datasources we see above, causing tracebacks and errors because the Ec2 datasource will get detected on Vultr platforms if Ec2 comes before Vultr in the datasource_list. Whatever /etc/cloud/cloud.cfg.d file you choose, it will need to sort lexicographically later than 90_dpkg.
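
(A minimal sketch of creating that file, assuming root access and the filename suggested above; any name that sorts after 90_dpkg works:)

cat > /etc/cloud/cloud.cfg.d/95-ds-vultr.cfg <<'EOF'
# sorts lexicographically after 90_dpkg.cfg, so this datasource_list takes precedence
datasource_list: [ Vultr, None ]
EOF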

@eb3095
Contributor

eb3095 commented Mar 26, 2024

Fantastic, I will get right on that. Thanks for the advice.

@github-actions github-actions bot added the Stale label Sep 13, 2024
@aciba90 aciba90 removed the Stale label Oct 3, 2024