[BUG] Statefulset deployment leaves zombie lsp #4738

Open
patriziobassi opened this issue Nov 16, 2024 · 7 comments
Labels
bug Something isn't working

Comments

patriziobassi commented Nov 16, 2024

Kube-OVN Version

1.12.6

Kubernetes Version

1.30

Operation-system/Kernel Version

Ubuntu 22.04

Description

Hi,

When deploying a StatefulSet (it has happened several times with StatefulSets, and possibly with other resources too), it is pretty easy to end up with dead LSPs when trying to redeploy.

This happens, for instance, when you create/delete/recreate the same StatefulSet, or when you create a StatefulSet and then do a rolling update of the worker nodes (for instance in order to patch them).

When trying to redeploy or restart the StatefulSet, the pod stays in Pending status and "describe" shows error 500.

With "kubectl ko nbctl show | grep <namespace>" I can get all the pod ports of the affected namespace, and then with the lsp-del command I can manually clean them up. For instance, it happened with a Harbor Helm deploy and redeploy:

kubectl ko nbctl lsp-del harbor-database-0.harbor-prod
kubectl ko nbctl lsp-del harbor-redis-0.harbor-prod
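
A small sketch of that manual cleanup for a single namespace, assuming the <pod>.<namespace> LSP naming convention seen above (a best-effort shell helper, not an official Kube-OVN tool):

  ns=harbor-prod
  # every port name in the namespace, parsed from "kubectl ko nbctl show" as in the grep above
  kubectl ko nbctl show | awk '/^ *port /{print $2}' | grep "\.${ns}$" | while read lsp; do
    pod=${lsp%.${ns}}
    # delete the LSP only if the pod it belonged to no longer exists
    if ! kubectl -n "$ns" get pod "$pod" >/dev/null 2>&1; then
      echo "deleting orphaned lsp $lsp"
      kubectl ko nbctl lsp-del "$lsp"
    fi
  done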

Thank you

Steps To Reproduce

Already mentioned in the description above.

Current Behavior

Leftover (zombie) LSPs remain in the OVN NB database.

Expected Behavior

The LSPs are cleaned up automatically when the pods are deleted or recreated.

patriziobassi added the bug label Nov 16, 2024

dosubot bot commented Nov 16, 2024

Hey @patriziobassi,

I'm here to help with any bugs, questions, or contributions you have. Let's work together to tackle this issue!

To address the issue of dead logical switch ports (LSPs) when redeploying or restarting a StatefulSet in Kube-OVN version 1.12.6, you can use the markAndCleanLSP function. This function is designed to automatically clean up LSPs that are no longer associated with any active pods or nodes. It works by:

  1. Listing all pods and nodes to determine the current state of the cluster.
  2. Building a set of logical switch port names that are currently in use.
  3. Retrieving all logical switch ports from the OVN database.
  4. Identifying and deleting LSPs that are not in use and not marked as arp_proxy.
  5. Releasing IP addresses associated with deleted LSPs if they are not reserved.

Integrating this function into your controller logic will help ensure that dead LSPs are cleaned up automatically, preventing issues like pods staying in a pending state due to leftover LSPs [1][2].
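
A rough, read-only shell rendering of steps 1-4 above (not the controller's actual Go code); it assumes pod LSPs are named <pod>.<namespace> and node LSPs node-<node>, and it only prints candidates rather than deleting anything:

  # build the set of LSP names that should exist (pods + nodes)
  kubectl get pods -A --no-headers -o custom-columns=NAME:.metadata.name,NS:.metadata.namespace \
    | awk '{print $1"."$2}' > /tmp/expected-lsps
  kubectl get nodes -o name | sed 's|^node/|node-|' >> /tmp/expected-lsps

  # list every port in the OVN NB database and print the ones with no matching pod or node;
  # infrastructure ports (router, localnet, arp_proxy ports) will show up as false positives,
  # which is why nothing is deleted here automatically
  kubectl ko nbctl show | awk '/^ *port /{print $2}' | while read lsp; do
    grep -qx "$lsp" /tmp/expected-lsps || echo "candidate zombie lsp: $lsp"
  done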


patriziobassi changed the title from "[BUG] Statefulset deployment leave zombie lsp" to "[BUG] Statefulset deployment leaves zombie lsp" on Nov 16, 2024
@patriziobassi
Author

Hi,

I'm wondering if #4304 is pretty much the same issue.

@oilbeater
Collaborator

Hi @patriziobassi, what did you mean by "do a rolling update of the worker node"? Did you hit this issue when upgrading kubelet?

@patriziobassi
Author

I meant just a normal "apt upgrade" of the machine in order to patch the kernel, followed by a reboot. I'm not sure what the root cause is at the moment, but we have experienced this several times. I wonder whether some sort of garbage collector for dead LSPs could be implemented. In issue #4304 you mention a controller restart; when we do a rolling reboot of the nodes the situation may be pretty similar.

@oilbeater
Collaborator

It seems to be the same problem; if the kube-ovn-controller stops during a Pod recreate, this issue will occur.

@patriziobassi
Author

Yes, I agree, and I think there are two cases:

  1. As mentioned, the pod cannot get an IP, ending with error 500.
  2. Even worse: the pod gets an IP, but there is already an LSP with a different MAC address, creating a blackhole for the traffic (a quick check for this case is sketched below).

We experienced both; the fix was always to manually find and delete the orphaned LSP.
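
A hedged diagnostic sketch for case 2, assuming Kube-OVN's ovn.kubernetes.io/mac_address pod annotation and the standard lsp-get-addresses command (the names below are just taken from the Harbor example above):

  ns=harbor-prod
  pod=harbor-database-0
  lsp="${pod}.${ns}"
  # MAC the running pod was actually assigned (from the Kube-OVN annotation)
  pod_mac=$(kubectl -n "$ns" get pod "$pod" -o yaml | grep 'ovn.kubernetes.io/mac_address' | awk '{print $2}')
  # MAC/IP the (possibly stale) LSP still carries in the NB database
  lsp_addr=$(kubectl ko nbctl lsp-get-addresses "$lsp")
  echo "pod annotation: $pod_mac"
  echo "lsp addresses:  $lsp_addr"
  # if the two MACs differ, traffic to the pod IP is blackholed until the stale LSP is removed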

@oilbeater
Collaborator

@patriziobassi do you have any detailed logs and error messages? The reason might be that kube-ovn-controller doesn't GC LSPs when it starts. We hit a very long GC (over half an hour) in a large-scale cluster incident: no workload could go into Running status because kube-ovn-controller was busy deleting zombie LSPs, so we moved the GC to after start-up.

We have only seen this conflict issue with static-IP pods that have different names. For randomly allocated IPs and StatefulSets, the IP CRs should have enough information to avoid conflicts. I am afraid there are unknown causes at play.
