What happened:
When creating a cluster and then creating the first service with externalTrafficPolicy: Local, we are observing that new nodes are not added to the load balancer as promptly as they should be.
It takes one of three things to trigger the nodes to be added to the load balancer:
Restarting the CCM
Adding another node (this triggers the existing nodes to be added, but not always the new node)
Waiting for the full resync to occur (IIRC 10 hours?)
What you expected to happen:
Nodes should be added to the load balancer as soon as they are added to the cluster.
How to reproduce it (as minimally and precisely as possible):
Create a cluster with the standard load balancer and create a single service with eTP: Local. This doesn't always happen, but we are observing with good frequency that freshly added worker nodes are not added to the load balancer when it is created.
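For reference, a minimal reproduction Service of that shape could be created like the following (a sketch using client-go; the name, namespace, selector, and ports are placeholders):

```go
// Create a LoadBalancer Service with externalTrafficPolicy: Local.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use the default kubeconfig; adjust the path for your environment.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "etp-local-test", Namespace: "default"},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeLoadBalancer,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app": "etp-local-test"},
			Ports:                 []corev1.ServicePort{{Port: 80, TargetPort: intstr.FromInt(8080)}},
		},
	}

	if _, err := client.CoreV1().Services(svc.Namespace).Create(context.TODO(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```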
Anything else we need to know?:
I believe this to be an incompatibility between the predicates used in the cloud-provider core service controller and the logic in the Azure provider that excludes unready nodes from being added to the load balancer.
In kubernetes/kubernetes#109706, the service controller was updated so that it tracks the set of nodes it passes into UpdateLoadBalancer. The PR aimed to reduce the number of times the load balancer membership is resynced.
The UpdateLoadBalancer function will now only be called when the set of nodes changes, which means a node is added or removed, excluded via the exclusion label, tainted, or has a transition of its provider ID.
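For illustration, a rough Go sketch of that gating (simplified, with hypothetical helper names; not the actual cloud-provider service controller code): an update is only triggered when the filtered node set changes, and readiness is not part of that filter.

```go
// Sketch only: hypothetical names loosely modelled on the behaviour described
// above, not the real service controller implementation.
package sketch

import corev1 "k8s.io/api/core/v1"

// includeNode mirrors the kind of predicates the core service controller
// applies: exclusion label, taints, provider ID. Readiness is NOT checked here.
func includeNode(node *corev1.Node) bool {
	if _, excluded := node.Labels["node.kubernetes.io/exclude-from-external-load-balancers"]; excluded {
		return false
	}
	for _, taint := range node.Spec.Taints {
		if taint.Key == "node.cloudprovider.kubernetes.io/uninitialized" {
			return false
		}
	}
	return node.Spec.ProviderID != ""
}

// nodeSetChanged decides whether UpdateLoadBalancer needs to be called at all;
// if the filtered sets are identical, the controller skips the call entirely.
func nodeSetChanged(oldNodes, newNodes []*corev1.Node) bool {
	filter := func(nodes []*corev1.Node) map[string]struct{} {
		set := map[string]struct{}{}
		for _, n := range nodes {
			if includeNode(n) {
				set[n.Name] = struct{}{}
			}
		}
		return set
	}
	oldSet, newSet := filter(oldNodes), filter(newNodes)
	if len(oldSet) != len(newSet) {
		return true
	}
	for name := range newSet {
		if _, ok := oldSet[name]; !ok {
			return true
		}
	}
	return false
}
```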
The call to update the load balancers eventually calls down to EnsureHostsInPool, which then calls out to ShouldNodeExcludedFromLoadBalancer; this determines whether the node should be included or not. When a node is not ready, it is excluded, which means there is a race when a node is first added: if the CCM observes the load balancer update before the Node becomes Ready, the node is excluded from the load balancer. With the new reduction of calls to UpdateLoadBalancer, when the node then transitions to Ready, it no longer gets added to the load balancers, because nothing calls UpdateLoadBalancer; as far as the core CCM code is concerned, nothing has changed.
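To make the mismatch concrete, here is a matching sketch of the Azure-side readiness exclusion (again simplified, with a hypothetical function standing in for the check inside ShouldNodeExcludedFromLoadBalancer):

```go
// Sketch only: a simplified stand-in for the readiness-based exclusion applied
// on the Azure side when building the backend pool.
package sketch

import corev1 "k8s.io/api/core/v1"

// excludedFromLoadBalancer returns true for nodes the Azure provider would
// skip. The readiness condition is the problematic part: the core service
// controller does not consider readiness when deciding whether to call
// UpdateLoadBalancer again, so a node excluded here while NotReady is never
// re-added once it becomes Ready.
func excludedFromLoadBalancer(node *corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
			return true // excluded while NotReady: this is where the race starts
		}
	}
	return false
}
```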
In my opinion, and that of KEP-3458, we need to remove the readiness check from the Azure CCM and allow nodes to be added independently of their readiness.
To do this, though, we must first fix the broken health checks (#3499), which already has a PR up (#3887). This is also related to #3500, which adds another reason not to manage load balancer membership based on whether a node is ready.
Environment:
Kubernetes version (use kubectl version): 1.27
Cloud provider or hardware configuration: Azure 1.27
OS (e.g: cat /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
Network plugin and version (if this is a network-related bug):
Others:
Thanks @JoelSpeed for reporting the issues and adding the KEP links. Fully agreed with you that health probes should be improved and aligned with KEP-3836 and KEP-3458.