VM can't join on cluster #438

Closed
HakjunMIN opened this issue Jul 15, 2024 · 3 comments
Labels
triage/duplicate Indicates an issue is a duplicate of other open issue.

Comments


HakjunMIN commented Jul 15, 2024

Version

Karpenter Version: v0.0.0

Kubernetes Version: v1.0.0

Expected Behavior

Once a NodeClaim is created, the Node is created normally and should then join the cluster.

Actual Behavior

Once a NodeClaim is created, the Node is created normally but can't join the cluster, so a new node is recreated repeatedly and the old one is deleted.

No VM is ever added to the cluster:

 k get node
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-38787940-vmss000000   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000001   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000002   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000003   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000004   Ready    agent   4h27m   v1.28.10
aks-nodepool1-38787940-vmss000005   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000006   Ready    agent   4h28m   v1.28.10
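
For reference, one way to see why a launched node never registers is to inspect the NodeClaim's status conditions. A minimal sketch, assuming Karpenter's standard nodeclaims.karpenter.sh resource and the gpu-qmtr5 NodeClaim name from the logs below:

 # List NodeClaims and check whether any ever reach Registered/Initialized
 kubectl get nodeclaims
 # Inspect status conditions and events for the failing NodeClaim
 kubectl describe nodeclaim gpu-qmtr5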

Steps to Reproduce the Problem

Set up following this repo's README.

Resource Specs and Logs

ms"}
{"level":"INFO","time":"2024-07-15T07:07:17.444Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"bbaa9b7","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-07-15T07:07:17.459Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"bbaa9b7","nodepool":"gpu","nodeclaim":"gpu-qmtr5","requests":{"cpu":"250m","memory":"470Mi","nvidia.com/gpu":"1","pods":"7"},"instance-types":"Standard_NC16as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC4as_T4_v3, Standard_NC64as_T4_v3 and 8 other(s)"}
{"level":"INFO","time":"2024-07-15T07:07:17.490Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_NV6ads_A10_v5","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"INFO","time":"2024-07-15T07:07:17.497Z","logger":"controller.nodeclaim.lifecycle","message":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202407.08.0 for instance type Standard_NV6ads_A10_v5","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:07:17.497Z","logger":"controller.nodeclaim.lifecycle","message":"Returning 2 IPv4 backend pools: [/subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/aksOutboundBackendPool /subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes]","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:07:17.497Z","logger":"controller.nodeclaim.lifecycle","message":"Creating network interface aks-gpu-qmtr5","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:07:24.784Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:25.785Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:26.786Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:27.428Z","logger":"controller.provisioner","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:27.786Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:28.719Z","logger":"controller.nodeclaim.lifecycle","message":"Successfully created network interface: /subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Network/networkInterfaces/aks-gpu-qmtr5","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:07:28.719Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine aks-gpu-qmtr5 (Standard_NV6ads_A10_v5)","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:08:23.918Z","logger":"controller.nodeclaim.lifecycle","message":"Created  virtual machine AKS identifying extension for aks-gpu-qmtr5, with an id of /subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-qmtr5/extensions/computeAksLinuxBilling","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"INFO","time":"2024-07-15T07:08:23.918Z","logger":"controller.nodeclaim.lifecycle","message":"launched new instance","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5","launched-instance":"/subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-qmtr5","hostname":"aks-gpu-qmtr5","type":"Standard_NV6ads_A10_v5","zone":"1","capacity-type":"on-demand"}
{"level":"INFO","time":"2024-07-15T07:08:23.918Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5","provider-id":"azure:///subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/mc_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-qmtr5","instance-type":"Standard_NV6ads_A10_v5","zone":"","capacity-type":"on-demand","allocatable":{"cpu":"5840m","ephemeral-storage":"128G","memory":"46288Mi","nvidia.com/gpu":"1","pods":"110"}}

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
HakjunMIN (Author) commented Jul 29, 2024

This looks related to #247.

Just running an empty update on the cluster (az aks update -n cluster -g rg) resolved this.
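
Spelled out with long-form flags, this is just the workaround described above, with placeholder cluster and resource group names:

 # Per the comment above, an empty update (no settings changed) was enough
 # for newly provisioned nodes to join; the names below are placeholders.
 az aks update --name <cluster-name> --resource-group <resource-group>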

I thought this issue was already resolved by some release. Anyway, I'll investigate further using my cluster.

Bryce-Soghigian (Collaborator)

This issue is only solved for managed Karpenter. Self-hosted will still have this side effect.

tallaxes added the triage/duplicate label on Oct 22, 2024
tallaxes (Collaborator)

Closing as duplicate
