
🌱 Set provider ID through kubelet in ubuntu templates #1551

Merged

Conversation

lentzi90
Contributor

@lentzi90 lentzi90 commented May 15, 2023

What this PR does / why we need it:

To avoid issues with potential mismatches between node names and
openstack servers we can configure the kubelet to set the provider ID.
Otherwise, the cloud controller will try to match nodes and servers
based on just the name. The names can differ because of special
characters, like dots. When this happens, the cloud controller will be
unable to match them and thus believe that the node has no underlying
server.
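
Concretely, the kubelet is given the provider ID via kubeadm's `kubeletExtraArgs`, with cloud-init substituting the server UUID at boot. A trimmed sketch of the relevant fragment (field nesting abbreviated from the full KubeadmControlPlane/KubeadmConfigTemplate spec; the values match the generated templates in this PR):

```yaml
initConfiguration:
  nodeRegistration:
    kubeletExtraArgs:
      cloud-provider: external
      # cloud-init replaces {{ instance_id }} with the OpenStack server UUID,
      # so the cloud controller can look the server up directly instead of
      # matching by (possibly mangled) node name.
      provider-id: openstack:///'{{ instance_id }}'
```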

Also removes some unused (I think) cluster templates and kustomizations.

I will make a follow up PR for the flatcar templates.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1547 or at least works around it.

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.
  2. I tagged this as 🌱, but I'm not sure if it makes sense.

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

Some of the kustomizations and cluster-templates seem to be no longer
used and can be removed. These were left over when the default template
was switched to use the external cloud provider.
@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 15, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lentzi90

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@netlify

netlify bot commented May 15, 2023

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

🔨 Latest commit: 8c7b85e
🔍 Latest deploy log: https://app.netlify.com/sites/kubernetes-sigs-cluster-api-openstack/deploys/64706adae285560008bd93ba
😎 Deploy Preview: https://deploy-preview-1551--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 15, 2023
@lentzi90
Contributor Author

@tormath1 can you confirm if the flatcar template is used/needed or not? I removed it because there was no make target for it, so I assumed it was just forgotten.

Contributor

@mdbooth mdbooth left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 15, 2023
@tormath1
Contributor

tormath1 commented May 15, 2023

@tormath1 can you confirm if the flatcar template is used/needed or not? I removed it because there was no make target for it, so I assumed it was just forgotten.

Thanks for the ping - I think it was missed in the previous PR that promoted the external-cloud-provider as the default one (https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1522/files). So we can remove it :)

@jichenjc
Contributor

/test pull-cluster-api-provider-openstack-e2e-test

@jichenjc
Contributor

I'm curious where {{ instance_id }} comes from. Is this a Cluster API definition, and does CAPO feed it somewhere?
I don't see it used in CAPO currently, so I'm curious how the template becomes a real provider ID.

@lentzi90
Contributor Author

I'm curious where {{ instance_id }} comes from. Is this a Cluster API definition, and does CAPO feed it somewhere? I don't see it used in CAPO currently, so I'm curious how the template becomes a real provider ID.

It comes with the instance data from cloud-init. Basically, all of the kubeadm config is injected through cloud-init (or Ignition) and can contain templated values that are filled in from the instance data.
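
To illustrate the mechanism, here is a minimal sketch (not cloud-init's actual renderer) of the jinja-style substitution: each `{{ key }}` placeholder in the config is replaced with the corresponding value from the instance data.

```python
import re

# Hypothetical instance data, mimicking what cloud-init exposes on OpenStack
# (instance_id is the Nova server UUID; value taken from this thread).
instance_data = {"instance_id": "36db0850-9770-41ee-8310-eca39bb62547"}

# The templated kubelet argument from the cluster template
template = "provider-id: openstack:///'{{ instance_id }}'"

# Substitute each {{ key }} placeholder with its value from the instance
# data, leaving unknown placeholders untouched.
rendered = re.sub(
    r"\{\{\s*(\w+)\s*\}\}",
    lambda m: str(instance_data.get(m.group(1), m.group(0))),
    template,
)
print(rendered)
```

The key point is that the substitution happens on the node at boot time, so the kubelet registers with the real server UUID without CAPO having to know it in advance.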

@lentzi90 lentzi90 force-pushed the lentzi90/remove-unused-templates branch from cf69675 to ac86215 Compare May 16, 2023 05:40
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 16, 2023
@lentzi90 lentzi90 force-pushed the lentzi90/remove-unused-templates branch from ac86215 to b0186f4 Compare May 16, 2023 05:52
@lentzi90
Contributor Author

Hmm, I'm trying to figure out where the issue is for flatcar, but I'm not finding any relevant logs. Any clue, @tormath1?

@lentzi90
Contributor Author

Documenting some debugging progress here.

The service that is responsible for setting the variables for flatcar is coreos-metadata:

$ systemctl cat coreos-metadata                                                                                                                              
# /usr/lib/systemd/system/coreos-metadata.service                                                                                                                                                          
[Unit]                                                                                                                                                                                                     
Description=Flatcar Metadata Agent                                                                                                                                                                         
                                                                                                                                                                                                           
[Service]                                                                                                                                                                                                  
Type=oneshot                                                                                                                                                                                               
RemainAfterExit=yes                                                                                                                                                                                        
Restart=on-failure                                                                                                                                                                                         
RestartSec=10                                                                                                                                                                                              
Environment=COREOS_METADATA_OPT_PROVIDER=--cmdline                                                                                                                                                         
ExecStart=/usr/bin/coreos-metadata ${COREOS_METADATA_OPT_PROVIDER} --attributes=/run/metadata/flatcar                                                                                                      
ExecStartPost=/usr/bin/sed --in-place "s/AFTERBURN/COREOS/g ; s/AWS/EC2/g ; s/GCP/GCE/g" /run/metadata/flatcar                                                                                             
ExecStartPost=/usr/bin/ln -fs /run/metadata/flatcar /run/metadata/coreos                                                                                                                                   
                                                                                                                                                                                                           
[Install]                                                                                                                                                                                                  
RequiredBy=metadata.target                                                                                                                                                                                 
Alias=afterburn.service

It ran successfully and filled in the /run/metadata/flatcar file:

$ systemctl status coreos-metadata                                                                                                                           
● coreos-metadata.service - Flatcar Metadata Agent                                                                                                                                                         
     Loaded: loaded (/usr/lib/systemd/system/coreos-metadata.service; disabled; preset: disabled)                                                                                                          
     Active: active (exited) since Thu 2023-05-25 08:17:17 UTC; 3min 1s ago                                                                                                                                
   Main PID: 1090 (code=exited, status=0/SUCCESS)                                                                                                                                                          
        CPU: 108ms                                                                                                                                                                                         
                                                                                                                                                                                                           
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.369 INFO Fetch successful                                                                          
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.369 INFO Fetching http://169.254.169.254/latest/meta-data/instance-id: Attempt #1                  
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.422 INFO Fetch successful                                                                          
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.423 INFO Fetching http://169.254.169.254/latest/meta-data/instance-type: Attempt #1                
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.475 INFO Fetch successful                                                                          
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.475 INFO Fetching http://169.254.169.254/latest/meta-data/local-ipv4: Attempt #1                   
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.528 INFO Fetch successful                                                                          
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.528 INFO Fetching http://169.254.169.254/latest/meta-data/public-ipv4: Attempt #1                  
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal coreos-metadata[1090]: May 25 08:17:17.578 INFO Fetch successful                                                                          
May 25 08:17:17 cluster-e2e-6tznc9-control-plane-zxn8l.novalocal systemd[1]: Finished coreos-metadata.service.

However, the instance ID is incorrect, as @tormath1 had already found out:

$ cat /run/metadata/flatcar                                                                                                                                  
COREOS_OPENSTACK_IPV4_LOCAL=10.6.0.65                                                                                                                                                                      
COREOS_OPENSTACK_IPV4_PUBLIC=172.24.4.96                                                                                                                                                                   
COREOS_OPENSTACK_INSTANCE_TYPE=m1.medium                                                                                                                                                                   
COREOS_OPENSTACK_HOSTNAME=cluster-e2e-nfkf4l-control-plane-pvng4.novalocal                                                                                                                                 
COREOS_OPENSTACK_INSTANCE_ID=i-00000005

As you can see, the metadata service gets the data from http://169.254.169.254/latest/meta-data. But there is another layer of data under http://169.254.169.254/openstack/latest/:

$ curl http://169.254.169.254/openstack/latest/meta_data.json
{"uuid": "36db0850-9770-41ee-8310-eca39bb62547", "public_keys": {"cluster-api-provider-openstack-sigs-k8s-io": "ssh-rsa AAA..."}, "keys": [{"name": "cluster-api-provider-openstack-sigs-k8s-io", "type": "ssh", "data": "ssh-rsa AAA..."}], "hostname": "cluster-e2e-6tznc9-control-plane-zxn8l.novalocal", "name": "cluster-e2e-6tznc9-control-plane-zxn8l", "launch_index": 0, "availability_zone": "testaz1", "random_seed": "CM0...", "project_id": "d2f2d9be8c2544d3b417ecf581dee54b", "devices": [], "dedicated_cpus": []}

The uuid has the correct value that we are after. When I compare with the cloud-init instance data, I can also see the wrong instance ID there in one place, but it is correct in a few other places. I'm not sure where to go from here.

Could it be that this is an issue in the coreos-metadata service? I.e. it is fetching the wrong data?
Or could it be that the devstack is not configured correctly?
Maybe it is all correct and we just have to manually get the correct instance ID in a different way?
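
For comparison, pulling the correct ID out of the native endpoint's JSON is straightforward; a small sketch using a trimmed version of the payload above:

```python
import json

# Trimmed sample of the OpenStack-native metadata payload
# (from http://169.254.169.254/openstack/latest/meta_data.json).
payload = """
{
  "uuid": "36db0850-9770-41ee-8310-eca39bb62547",
  "hostname": "cluster-e2e-6tznc9-control-plane-zxn8l.novalocal",
  "name": "cluster-e2e-6tznc9-control-plane-zxn8l"
}
"""

meta = json.loads(payload)
# uuid is the real server ID (unlike the EC2 layer's i-00000005),
# so it is what the kubelet's provider-id should be built from.
provider_id = f"openstack:///{meta['uuid']}"
print(provider_id)
```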

@mdbooth
Contributor

mdbooth commented May 25, 2023

There are 2 problems here:

  1. coreos-metadata (aka afterburn?) is using OpenStack's EC2 metadata service rather than OpenStack's native metadata service.
  2. It seems there's a bug in the ec2 metadata service which means it's returning an i-style ID (i-00000005) rather than the actual uuid.

And a third problem:
3. ec2 metadata is deep frozen, hopelessly bitrotted, and chances are nobody's going to fix it.

We should probably try to fix it in afterburn, tbh. @tormath1 what's the relationship between afterburn and coreos-metadata?

@lentzi90
Contributor Author

I can open an issue in afterburn. (From what I can see, coreos-metadata is the same thing as afterburn.) For some reason it checks the EC2 service when fetching over the network. When using config-drive, it combines some fields from the EC2 service and some from OpenStack, which seems really weird to me.

https://github.com/coreos/afterburn/blob/f22131c775094c559baf5c5d712425f341f3556e/src/providers/openstack/configdrive.rs#L146-L151

@mdbooth
Contributor

mdbooth commented May 25, 2023

I can open an issue in afterburn. (From what I can see, coreos-metadata is the same thing as afterburn.) For some reason it checks the EC2 service when fetching over the network. When using config-drive, it combines some fields from the EC2 service and some from OpenStack, which seems really weird to me.

https://github.com/coreos/afterburn/blob/f22131c775094c559baf5c5d712425f341f3556e/src/providers/openstack/configdrive.rs#L146-L151

Beat you to it: coreos/afterburn#930

@mdbooth
Contributor

mdbooth commented May 25, 2023

Because they're separate templates, we have the option to fix this in Flatcar independently, right? I wonder if we should move on with the Ubuntu fix for now and raise a separate issue to fix it for Flatcar. @lentzi90 @tormath1 Thoughts?

@lentzi90
Contributor Author

Yes that works for me! I can split this PR into one for ubuntu and one for flatcar.

@lentzi90 lentzi90 force-pushed the lentzi90/remove-unused-templates branch from b0186f4 to 88d8bf5 Compare May 25, 2023 11:39
@lentzi90 lentzi90 changed the title 🌱 Set provider ID through kubelet in templates 🌱 Set provider ID through kubelet in ubuntu templates May 25, 2023
@lentzi90 lentzi90 force-pushed the lentzi90/remove-unused-templates branch from 88d8bf5 to 6693c30 Compare May 25, 2023 12:05
This commit sets the provider ID in the ubuntu templates.
To avoid issues with potential mismatches between node names and
openstack servers we can configure the kubelet to set the provider ID.
Otherwise, the cloud controller will try to match nodes and servers
based on just the name. The names can differ because of special
characters, like dots. When this happens, the cloud controller will be
unable to match them and thus believe that the node has no underlying
server.
@lentzi90 lentzi90 force-pushed the lentzi90/remove-unused-templates branch from 6693c30 to 8c7b85e Compare May 26, 2023 08:16
Contributor

@mdbooth mdbooth left a comment


/lgtm

Comment on lines +26 to +28
# Fixme(lentzi90): This is here just to override the value set in the default
# kustomization. It will be replaced with a value that works for flatcar in
# https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1564
Contributor


You added this comment in both places other than the one where I asked for it 🤣 I'll try not to take it personally.

It's not worth updating imho. The important thing is that somebody looking will see the comment, and they'll see one of these.

Contributor Author


OMG sorry! I'm not sure how I managed to miss the first occurrence 🤦 😂

@@ -20,6 +20,7 @@ spec:
nodeRegistration:
kubeletExtraArgs:
cloud-provider: external
provider-id: openstack:///'{{ instance_id }}'
Contributor


Any idea why we're not inheriting this from default? The answer to this question would be outside the scope of this PR, though.

Contributor Author


This is the generated template! So it is actually inherited from default and that is why we see it here. 🙂 The without-lb patch is over here and does not have any provider-id set.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 26, 2023
@lentzi90
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 26, 2023
@k8s-ci-robot k8s-ci-robot merged commit b21411b into kubernetes-sigs:main May 26, 2023
@lentzi90 lentzi90 deleted the lentzi90/remove-unused-templates branch May 26, 2023 11:32

Successfully merging this pull request may close these issues.

Node/server name mismatch due to dots in name
5 participants