In tree 1.21.2 Windows ELB fails to reconcile #706

Closed
CecileRobertMichon opened this issue Jul 14, 2021 · 10 comments · Fixed by kubernetes/kubernetes#103997

@CecileRobertMichon
Contributor

CecileRobertMichon commented Jul 14, 2021

What happened:

With standalone VMs (CAPZ cluster) using vmType "vmss", this reproduces 100% of the time with the in-tree cloud provider on k8s 1.21.2:

Linux ILB + ELB and Windows ILB reconcile successfully:
I0714 13:08:21.026400 1 azure_loadbalancer.go:1502] reconcileLoadBalancer for service(default/webroha7z-ilb): lb(capz-e2e-3k3e35-internal) finished
I0714 13:08:43.680525 1 azure_loadbalancer.go:1502] reconcileLoadBalancer for service(default/webroha7z-elb): lb(capz-e2e-3k3e35) finished
I0714 13:11:36.489670 1 azure_loadbalancer.go:1502] reconcileLoadBalancer for service(default/web-windowsat01ng-ilb): lb(capz-e2e-3k3e35-internal) finished

Then Windows ELB fails with this error:

I0714 13:12:16.946939       1 azure_loadbalancer.go:1098] reconcileLoadBalancer for service(default/web-windowsat01ng-elb) - wantLb(true): started
I0714 13:12:16.947571       1 endpointslice_controller.go:318] Finished syncing service "default/web-windowsat01ng-elb" endpoint slices. (139.302µs)
I0714 13:12:16.947981       1 endpoints_controller.go:381] Finished syncing service "default/web-windowsat01ng-elb" endpoints. (358.705µs)
I0714 13:12:16.951031       1 endpoints_controller.go:381] Finished syncing service "default/web-windowsat01ng-ilb" endpoints. (12.889669ms)
I0714 13:12:16.953040       1 garbagecollector.go:580] "Deleting object" object="default/web-windowsat01ng-ilb-vmx2b" objectUID=6b695e9d-98f6-471a-8cd2-0febab6e90bf kind="EndpointSlice" propagationPolicy=Background
I0714 13:12:16.961648       1 resource_quota_monitor.go:355] QuotaMonitor process object: discovery.k8s.io/v1, Resource=endpointslices, namespace default, name web-windowsat01ng-ilb-vmx2b, uid 6b695e9d-98f6-471a-8cd2-0febab6e90bf, event type delete
I0714 13:12:16.985196       1 azure_backoff.go:285] LoadBalancerClient.List(capz-e2e-3k3e35) success
I0714 13:12:16.985245       1 azure_loadbalancer.go:1106] reconcileLoadBalancer for service(default/web-windowsat01ng-elb): lb(capz-e2e-3k3e35/capz-e2e-3k3e35) wantLb(true) resolved load balancer name
I0714 13:12:16.985307       1 azure_vmss.go:1379] Can not extract scale set name from ipConfigurationID (/subscriptions/===REDACTED===/resourceGroups/capz-e2e-3k3e35/providers/Microsoft.Network/networkInterfaces/capz-e2e-3k3e35-md-0-k9qll-nic/ipConfigurations/pipConfig), assuming it is managed by availability set
E0714 13:12:16.985391       1 azure_loadbalancer.go:189] reconcileLoadBalancer(default/web-windowsat01ng-elb) failed: not a vmss instance
I0714 13:12:16.985549       1 controller.go:837] Finished syncing service "default/web-windowsat01ng-elb" (68.624396ms)
E0714 13:12:16.986239       1 controller.go:307] error processing service default/web-windowsat01ng-elb (will retry): failed to ensure load balancer: not a vmss instance
I0714 13:12:16.986223       1 event.go:291] "Event occurred" object="default/web-windowsat01ng-elb" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"

Full logs: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/1524/pull-cluster-api-provider-azure-e2e-windows/1415291396272689152/artifacts/clusters/capz-e2e-3k3e35/kube-system/kube-controller-manager-capz-e2e-3k3e35-control-plane-bhf94/kube-controller-manager.log
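
The "Can not extract scale set name" line is consistent with the NIC's IP configuration ID having no `virtualMachineScaleSets` segment. Below is a minimal sketch of that distinction, assuming the usual Azure resource ID layout. The regular expression, the function name, and the sample scale set ID are illustrative and not taken from the cloud provider source.

```go
package main

import (
	"fmt"
	"regexp"
)

// vmssIPConfigRE is an illustrative pattern (not copied from the provider)
// for IP configuration IDs that belong to a scale set instance.
var vmssIPConfigRE = regexp.MustCompile(
	`(?i)/providers/Microsoft\.Compute/virtualMachineScaleSets/([^/]+)/virtualMachines/[^/]+/networkInterfaces/`)

// scaleSetNameFromIPConfigID returns the scale set name embedded in the ID,
// or false when the ID does not follow the scale set layout.
func scaleSetNameFromIPConfigID(id string) (string, bool) {
	m := vmssIPConfigRE.FindStringSubmatch(id)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	// Same shape as the ID in the log above; the subscription is a placeholder.
	standalone := "/subscriptions/sub/resourceGroups/capz-e2e-3k3e35/providers/Microsoft.Network/networkInterfaces/capz-e2e-3k3e35-md-0-k9qll-nic/ipConfigurations/pipConfig"
	// Hypothetical scale set instance ID, for comparison only.
	vmss := "/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/demo-vmss/virtualMachines/0/networkInterfaces/nic/ipConfigurations/ipconfig1"

	for _, id := range []string{standalone, vmss} {
		name, ok := scaleSetNameFromIPConfigID(id)
		fmt.Printf("vmss=%v name=%q\n", ok, name)
	}
}
```

A standalone VM's NIC lives directly under `Microsoft.Network/networkInterfaces`, so a scale-set-only parser has nothing to extract from it.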

What you expected to happen:

How to reproduce it:

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@CecileRobertMichon
Contributor Author

The same error happens with k8s 1.20.8.

@CecileRobertMichon
Contributor Author

/assign @nilo19

@nilo19
Contributor

nilo19 commented Jul 27, 2021

Can we add a windows + standalone vm check-in or regular test?

@nilo19
Contributor

nilo19 commented Jul 27, 2021

@CecileRobertMichon could you test the out-of-tree CCM, which contains the fix?

name, rg, err := ss.availabilitySet.GetNodeNameByIPConfigurationID(ipConfigurationID)
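
The line quoted above suggests that the scale set code path hands standalone VMs over to the availability set implementation instead of failing. The sketch below shows only that delegation pattern. Every name in it (`lookupFunc`, `vmssLookup`, `availabilitySetLookup`, `nodeNameWithFallback`) is hypothetical, and the stub bodies are assumptions, not the provider's actual logic.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errNotVMSSInstance mirrors the "not a vmss instance" message in the logs.
var errNotVMSSInstance = errors.New("not a vmss instance")

// lookupFunc resolves an IP configuration ID to a node name (hypothetical type).
type lookupFunc func(ipConfigurationID string) (string, error)

// vmssLookup is a stand-in that only understands scale set IDs.
func vmssLookup(id string) (string, error) {
	if !strings.Contains(strings.ToLower(id), "/virtualmachinescalesets/") {
		return "", errNotVMSSInstance
	}
	return "vmss-node", nil
}

// availabilitySetLookup is a stand-in for the standalone VM path.
func availabilitySetLookup(id string) (string, error) {
	return "standalone-node", nil
}

// nodeNameWithFallback tries the primary lookup first and delegates to the
// fallback only when the ID is reported as not being a scale set instance.
// Any other error is returned unchanged.
func nodeNameWithFallback(id string, primary, fallback lookupFunc) (string, error) {
	name, err := primary(id)
	if errors.Is(err, errNotVMSSInstance) {
		return fallback(id)
	}
	return name, err
}

func main() {
	id := "/subscriptions/sub/resourceGroups/rg/providers/Microsoft.Network/networkInterfaces/a-nic/ipConfigurations/pipConfig"
	name, err := nodeNameWithFallback(id, vmssLookup, availabilitySetLookup)
	fmt.Println(name, err)
}
```

Without the fallback branch, the sentinel error would propagate to the caller, which matches the `failed to ensure load balancer: not a vmss instance` failure reported for the in-tree provider.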

@nilo19
Contributor

nilo19 commented Jul 27, 2021

If it works, I will cherry pick it into k/k.

@CecileRobertMichon
Contributor Author

CecileRobertMichon commented Jul 27, 2021

> Can we add a windows + standalone vm check-in or regular test?

opened #705

> could you test out-of-tree ccm, which contains the fix.

will do

@CecileRobertMichon
Contributor Author

@nilo19 I can confirm I cannot repro this with out-of-tree. This only happens with in-tree. Please go ahead and cherry-pick in k/k.

@CecileRobertMichon
Contributor Author

@nilo19 we can also repro this with Linux:

I0729 16:20:17.835405       1 event.go:291] "Event occurred" object="default/nginx-lb" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
E0729 16:20:17.886545       1 azure_loadbalancer.go:193] reconcileLoadBalancer(default/nginx-lb) failed: not a vmss instance
E0729 16:20:17.890728       1 controller.go:275] error processing service default/nginx-lb (will retry): failed to ensure load balancer: not a vmss instance
I0729 16:20:17.891595       1 event.go:291] "Event occurred" object="default/nginx-lb" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: not a vmss instance"

@CecileRobertMichon
Contributor Author

@nilo19 @feiskyer would it be possible to also cherry pick this to 1.20 and 1.21?
