[BUG] Statefulset deployment leaves zombie lsp #4738

Open
patriziobassi opened this issue Nov 16, 2024 · 7 comments
Labels
bug Something isn't working

Comments

patriziobassi commented Nov 16, 2024

Kube-OVN Version

1.12.6

Kubernetes Version

1.30

Operation-system/Kernel Version

Ubuntu 22.04

Description

Hi,

When deploying a StatefulSet (it has happened several times with StatefulSets, and possibly with other resources too), it is pretty easy to end up with dead LSPs when trying to redeploy.

This happens, for instance, when you create/delete/recreate the same StatefulSet, or when you create a StatefulSet and then do a rolling update of the worker nodes (for instance in order to patch them).

When trying to redeploy or restart the StatefulSet, the pod stays in Pending status and "describe" shows error 500.

With "kubectl ko nbctl show | grep <namespace>" I can get all the pod ports of the affected namespace, and then with the lsp-del command I can manually clean them up. For instance, it happened with a Harbor Helm deploy and redeploy:

kubectl ko nbctl lsp-del harbor-database-0.harbor-prod
kubectl ko nbctl lsp-del harbor-redis-0.harbor-prod
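
A small sketch of that manual cleanup for a single namespace, assuming the <pod>.<namespace> LSP naming convention seen above (a best-effort shell helper, not an official Kube-OVN tool):

  ns=harbor-prod
  # every port name in the namespace, parsed from "kubectl ko nbctl show" as in the grep above
  kubectl ko nbctl show | awk '/^ *port /{print $2}' | grep "\.${ns}$" | while read lsp; do
    pod=${lsp%.${ns}}
    # delete the LSP only if the pod it belonged to no longer exists
    if ! kubectl -n "$ns" get pod "$pod" >/dev/null 2>&1; then
      echo "deleting orphaned lsp $lsp"
      kubectl ko nbctl lsp-del "$lsp"
    fi
  done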

Thank you

Steps To Reproduce

Already mentioned in the description above.

Current Behavior

Leftover (zombie) LSPs remain in the OVN NB database.

Expected Behavior

The LSPs are cleaned up automatically when the pods are deleted or recreated.

patriziobassi added the bug label Nov 16, 2024

dosubot bot commented Nov 16, 2024

Hey @patriziobassi,

I'm here to help with any bugs, questions, or contributions you have. Let's work together to tackle this issue!

To address the issue of dead logical switch ports (LSPs) when redeploying or restarting a StatefulSet in Kube-OVN version 1.12.6, you can use the markAndCleanLSP function. This function is designed to automatically clean up LSPs that are no longer associated with any active pods or nodes. It works by:

  1. Listing all pods and nodes to determine the current state of the cluster.
  2. Building a set of logical switch port names that are currently in use.
  3. Retrieving all logical switch ports from the OVN database.
  4. Identifying and deleting LSPs that are not in use and not marked as arp_proxy.
  5. Releasing IP addresses associated with deleted LSPs if they are not reserved.

Integrating this function into your controller logic will help ensure that dead LSPs are cleaned up automatically, preventing issues like pods staying in a pending state due to leftover LSPs [1][2].
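
A rough, read-only shell rendering of steps 1-4 above (not the controller's actual Go code); it assumes pod LSPs are named <pod>.<namespace> and node LSPs node-<node>, and it only prints candidates rather than deleting anything:

  # build the set of LSP names that should exist (pods + nodes)
  kubectl get pods -A --no-headers -o custom-columns=NAME:.metadata.name,NS:.metadata.namespace \
    | awk '{print $1"."$2}' > /tmp/expected-lsps
  kubectl get nodes -o name | sed 's|^node/|node-|' >> /tmp/expected-lsps

  # list every port in the OVN NB database and print the ones with no matching pod or node;
  # infrastructure ports (router, localnet, arp_proxy ports) will show up as false positives,
  # which is why nothing is deleted here automatically
  kubectl ko nbctl show | awk '/^ *port /{print $2}' | while read lsp; do
    grep -qx "$lsp" /tmp/expected-lsps || echo "candidate zombie lsp: $lsp"
  done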


patriziobassi changed the title from "[BUG] Statefulset deployment leave zombie lsp" to "[BUG] Statefulset deployment leaves zombie lsp" on Nov 16, 2024
@patriziobassi
Author

Hi,

I'm wondering if #4304 is pretty much the same issue.

@oilbeater
Collaborator

Hi @patriziobassi, what did you mean by "do a rolling update of the worker node"? Did you hit this issue when upgrading kubelet?

@patriziobassi
Author

I meant just a normal "apt upgrade" of the machine in order to patch the kernel, followed by a reboot. I'm not sure what the root cause is at the moment, but we have experienced this several times. I wonder whether some sort of garbage collector for dead LSPs could be implemented. In issue #4304 you mention a controller restart; when we do a rolling reboot of the nodes the situation may be pretty similar.

@oilbeater
Collaborator

It seems to be the same problem; if the kube-ovn-controller stops during a Pod recreate, this issue will occur.

@patriziobassi
Author

Yes, I agree, and I think there are two cases:

  1. As mentioned, the pod cannot get an IP, ending with error 500.
  2. Even worse: the pod gets an IP, but there is already an LSP with a different MAC address, creating a blackhole for the traffic (a quick check for this case is sketched below).

We experienced both; the fix was always to manually find and delete the orphaned LSP.
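
A hedged diagnostic sketch for case 2, assuming Kube-OVN's ovn.kubernetes.io/mac_address pod annotation and the standard lsp-get-addresses command (the names below are just taken from the Harbor example above):

  ns=harbor-prod
  pod=harbor-database-0
  lsp="${pod}.${ns}"
  # MAC the running pod was actually assigned (from the Kube-OVN annotation)
  pod_mac=$(kubectl -n "$ns" get pod "$pod" -o yaml | grep 'ovn.kubernetes.io/mac_address' | awk '{print $2}')
  # MAC/IP the (possibly stale) LSP still carries in the NB database
  lsp_addr=$(kubectl ko nbctl lsp-get-addresses "$lsp")
  echo "pod annotation: $pod_mac"
  echo "lsp addresses:  $lsp_addr"
  # if the two MACs differ, traffic to the pod IP is blackholed until the stale LSP is removed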

@oilbeater
Collaborator

@patriziobassi do you have any detailed logs and error messages? The reason might be that kube-ovn-controller doesn't GC LSPs when it starts. We hit a very long GC (over half an hour) in a large-scale cluster incident: no workload could go into Running status because kube-ovn-controller was busy deleting zombie LSPs, so we moved the GC to after start-up.

We have only seen this conflict issue with static-IP pods that have different names. For randomly allocated IPs and StatefulSets, the IP CRs should have enough information to avoid conflicts. I am afraid there are unknown causes at play.
