VM can't join on cluster #438

Closed
HakjunMIN opened this issue Jul 15, 2024 · 3 comments
Labels
triage/duplicate Indicates an issue is a duplicate of other open issue.

Comments


HakjunMIN commented Jul 15, 2024

Version

Karpenter Version: v0.0.0

Kubernetes Version: v1.0.0

Expected Behavior

Once a NodeClaim is created, the Node is created normally and should then join the cluster.

Actual Behavior

Once a NodeClaim is created, the Node is created normally but can't join the cluster, so a new node is recreated repeatedly and the old one is deleted.

No VM is ever added to the cluster:

 k get node
NAME                                STATUS   ROLES   AGE     VERSION
aks-nodepool1-38787940-vmss000000   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000001   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000002   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000003   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000004   Ready    agent   4h27m   v1.28.10
aks-nodepool1-38787940-vmss000005   Ready    agent   4h28m   v1.28.10
aks-nodepool1-38787940-vmss000006   Ready    agent   4h28m   v1.28.10
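
For reference, one way to see why a launched node never registers is to inspect the NodeClaim's status conditions. A minimal sketch, assuming Karpenter's standard nodeclaims.karpenter.sh resource and the gpu-qmtr5 NodeClaim name from the logs below:

 # List NodeClaims and check whether any ever reach Registered/Initialized
 kubectl get nodeclaims
 # Inspect status conditions and events for the failing NodeClaim
 kubectl describe nodeclaim gpu-qmtr5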

Steps to Reproduce the Problem

Set up following this repo's README.

Resource Specs and Logs

ms"}
{"level":"INFO","time":"2024-07-15T07:07:17.444Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"bbaa9b7","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-07-15T07:07:17.459Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"bbaa9b7","nodepool":"gpu","nodeclaim":"gpu-qmtr5","requests":{"cpu":"250m","memory":"470Mi","nvidia.com/gpu":"1","pods":"7"},"instance-types":"Standard_NC16as_T4_v3, Standard_NC24ads_A100_v4, Standard_NC48ads_A100_v4, Standard_NC4as_T4_v3, Standard_NC64as_T4_v3 and 8 other(s)"}
{"level":"INFO","time":"2024-07-15T07:07:17.490Z","logger":"controller.nodeclaim.lifecycle","message":"Selected instance type Standard_NV6ads_A10_v5","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"INFO","time":"2024-07-15T07:07:17.497Z","logger":"controller.nodeclaim.lifecycle","message":"Resolved image /CommunityGalleries/AKSUbuntu-38d80f77-467a-481f-a8d4-09b6d4220bd2/images/2204gen2containerd/versions/202407.08.0 for instance type Standard_NV6ads_A10_v5","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:07:17.497Z","logger":"controller.nodeclaim.lifecycle","message":"Returning 2 IPv4 backend pools: [/subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/aksOutboundBackendPool /subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Network/loadBalancers/kubernetes/backendAddressPools/kubernetes]","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:07:17.497Z","logger":"controller.nodeclaim.lifecycle","message":"Creating network interface aks-gpu-qmtr5","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:07:24.784Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:25.785Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:26.786Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:27.428Z","logger":"controller.provisioner","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:27.786Z","logger":"controller.disruption","message":"waiting on cluster sync","commit":"bbaa9b7"}
{"level":"DEBUG","time":"2024-07-15T07:07:28.719Z","logger":"controller.nodeclaim.lifecycle","message":"Successfully created network interface: /subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Network/networkInterfaces/aks-gpu-qmtr5","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:07:28.719Z","logger":"controller.nodeclaim.lifecycle","message":"Creating virtual machine aks-gpu-qmtr5 (Standard_NV6ads_A10_v5)","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"DEBUG","time":"2024-07-15T07:08:23.918Z","logger":"controller.nodeclaim.lifecycle","message":"Created  virtual machine AKS identifying extension for aks-gpu-qmtr5, with an id of /subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-qmtr5/extensions/computeAksLinuxBilling","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5"}
{"level":"INFO","time":"2024-07-15T07:08:23.918Z","logger":"controller.nodeclaim.lifecycle","message":"launched new instance","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5","launched-instance":"/subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/MC_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-qmtr5","hostname":"aks-gpu-qmtr5","type":"Standard_NV6ads_A10_v5","zone":"1","capacity-type":"on-demand"}
{"level":"INFO","time":"2024-07-15T07:08:23.918Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"bbaa9b7","nodeclaim":"gpu-qmtr5","provider-id":"azure:///subscriptions/12f55838-824f-4f06-a39a-e452ff7fdb7a/resourceGroups/mc_kubeflow_kubeflowcl_japaneast/providers/Microsoft.Compute/virtualMachines/aks-gpu-qmtr5","instance-type":"Standard_NV6ads_A10_v5","zone":"","capacity-type":"on-demand","allocatable":{"cpu":"5840m","ephemeral-storage":"128G","memory":"46288Mi","nvidia.com/gpu":"1","pods":"110"}}

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
HakjunMIN (Author) commented Jul 29, 2024

This looks related to #247.

Just running an empty update on the cluster (az aks update -n cluster -g rg) resolved this.
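
Spelled out with long-form flags, this is just the workaround described above, with placeholder cluster and resource group names:

 # Per the comment above, an empty update (no settings changed) was enough
 # for newly provisioned nodes to join; the names below are placeholders.
 az aks update --name <cluster-name> --resource-group <resource-group>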

I thought this issue was already resolved by some release. Anyway, I'll investigate further using my cluster.

Bryce-Soghigian (Collaborator)

This issue is only solved for managed Karpenter. Self-hosted will still have this side effect.

tallaxes added the triage/duplicate label on Oct 22, 2024
tallaxes (Collaborator)

Closing as duplicate
