missing interface in NSE after relocation #9863

Closed
ljkiraly opened this issue Sep 19, 2023 · 9 comments · Fixed by networkservicemesh/cmd-registry-k8s#406
Labels
ASAP (The issue should be completed as soon as possible), bug (Something isn't working)

Comments

@ljkiraly
Contributor

ljkiraly commented Sep 19, 2023

Expected Behavior

The restoration time of interfaces in the NSE should be more deterministic.

Current Behavior

When multiple clients are deployed with two registry instances, some interfaces are missing on the NSE after NSE relocation.
The node where the NSE pod was running was cordoned and the NSE was relocated. After the NSE is restarted on another node, one connection to an NSC fails to be restored. Based on the logs, the deleted NSE remains stored in the registry/etcd.

Failure Information (for bugs)

The restoration time varies: sometimes it takes more than 150 seconds, sometimes the connection is restored promptly, and sometimes it is never restored. Note that two instances of registry-k8s are used. An IPv6 address range is configured on the NSE because the issue seems to pop up more frequently with IPv6 addresses.

What is the expected NSE expiration time? How does it depend on NSM_MAX_TOKEN_LIFETIME? (Is the description in this PR still valid? networkservicemesh/sdk#1404)
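For reference, one quick way to check which token lifetime is actually configured in a running setup is to grep the deployment env vars (just a sketch, assuming NSM_MAX_TOKEN_LIFETIME is set as an environment variable as in the stock manifests):

# show where NSM_MAX_TOKEN_LIFETIME is set and its current value
kubectl get deployments -n nsm-system -o yaml | grep -A1 'name: NSM_MAX_TOKEN_LIFETIME'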

Steps to Reproduce

  • Create a kind cluster with 4 workers
cluster-config
 > cat cluster-config-4worker.yaml 
---
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
  - role: worker
kind create cluster --config cluster-config-4worker.yaml --wait 300s
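A quick optional sanity check (not part of the original steps) that the control-plane and all four worker nodes came up:

kubectl get nodes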
  • Basic setup should be executed
  • Start a second registry-k8s pod
kubectl scale --replicas=2 -n nsm-system deployment/registry-k8s
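Optionally, wait until both registry replicas are ready before continuing (a small extra check; it assumes the registry pods carry the app: registry label, as in apps/registry-k8s/registry-k8s.yaml):

kubectl -n nsm-system rollout status deployment/registry-k8s --timeout=1m
kubectl -n nsm-system get pods -l app=registry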
  • The remote-nse-death test case from the heal feature was modified:
NSE, NSC yamls
> git diff
diff --git a/examples/heal/remote-nse-death/base/kustomization.yaml b/examples/heal/remote-nse-death/base/kustomization.yaml
index 86c9b5db475..babceecab6b 100644
--- a/examples/heal/remote-nse-death/base/kustomization.yaml
+++ b/examples/heal/remote-nse-death/base/kustomization.yaml
@@ -7,4 +7,6 @@ namespace: ns-remote-nse-death
 resources:
 - netsvc.yaml
 - client.yaml
+- client1.yaml
+- client2.yaml
 - ../../../../apps/nse-kernel
diff --git a/examples/heal/remote-nse-death/nse-before-death/patch-nse.yaml b/examples/heal/remote-nse-death/nse-before-death/patch-nse.yaml
index a57dc99c9cf..90e6eff49a0 100644
--- a/examples/heal/remote-nse-death/nse-before-death/patch-nse.yaml
+++ b/examples/heal/remote-nse-death/nse-before-death/patch-nse.yaml
@@ -10,7 +10,7 @@ spec:
         - name: nse
           env:
             - name: NSM_CIDR_PREFIX
-              value: 172.16.1.100/31
+              value: 2001:db8::/110
             - name: NSM_SERVICE_NAMES
               value: "remote-nse-death"
             - name: NSM_REGISTER_SERVICE
> cat base/client1.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: alpine1
  labels:
    app: alpine1
  annotations:
    networkservicemesh.io: kernel://remote-nse-death/nsm-1
spec:
  containers:
  - name: alpine
    image: alpine:3.15.0
    imagePullPolicy: IfNotPresent
    # simple `sleep` command would work
    # but we need `trap` to be able to delete pods quickly
    command: ["/bin/sh", "-c", "trap : TERM INT; sleep infinity & wait"]

> cat base/client2.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: alpine2
  labels:
    app: alpine2
  annotations:
    networkservicemesh.io: kernel://remote-nse-death/nsm-1
spec:
  containers:
  - name: alpine
    image: alpine:3.15.0
    imagePullPolicy: IfNotPresent
    # simple `sleep` command would work
    # but we need `trap` to be able to delete pods quickly
    command: ["/bin/sh", "-c", "trap : TERM INT; sleep infinity & wait"]
  • Execute the steps as follows:
README.md
> cat README.md
# Remote NSE death

This example shows that NSM keeps working after the remote NSE death.

NSC and NSE use the kernel mechanism to connect to their local forwarders.
Forwarders use the vxlan mechanism to connect with each other.

Requires

Make sure that you have completed steps from basic or memory setup.

Run

Deploy NSC and NSE:

kubectl apply -k ./nse-before-death

Wait for applications ready:

kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine1 -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine2 -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=nse-kernel -n ns-remote-nse-death

Find NSE pod by label:

NSE=$(kubectl get pods -l app=nse-kernel -n ns-remote-nse-death --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')

Check the NSE:

LINKS=$(kubectl exec -n ns-remote-nse-death $NSE -- ip a | grep remote-nse | wc -l)
test $LINKS -eq 3
kubectl exec -n ns-remote-nse-death $NSE -- ip a
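The same check can also be done from the client side (an optional sketch, reusing the labels from the wait commands above):

for APP in alpine alpine1 alpine2; do
  NSC=$(kubectl get pods -l app=$APP -n ns-remote-nse-death --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
  kubectl exec -n ns-remote-nse-death $NSC -- ip a
done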

Cordon NSE node:

NSE_NODE=$(kubectl get pods -l app=nse-kernel -n ns-remote-nse-death --template '{{range .items}}{{.spec.nodeName}}{{"\n"}}{{end}}')
kubectl cordon $NSE_NODE
kubectl scale --replicas=0 -n ns-remote-nse-death deployment/nse-kernel

New NSE:

kubectl scale --replicas=1 -n ns-remote-nse-death deployment/nse-kernel

Wait for new NSE to start:

kubectl wait --for=condition=ready --timeout=2m pod -l app=nse-kernel -n ns-remote-nse-death

Find new NSE pod:

NSE=$(kubectl get pods -l app=nse-kernel -n ns-remote-nse-death --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')

Check the new NSE:

kubectl exec -n ns-remote-nse-death $NSE -- ip a
LINKS=$(kubectl exec -n ns-remote-nse-death $NSE -- ip a | grep remote-nse | wc -l)
test $LINKS -eq 3
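Because the restoration time varies (see Failure Information above), a polling variant of this last check shows how long the interfaces actually take to come back instead of failing on a single snapshot (a sketch; the ~180 s budget is chosen to cover the 150+ s cases observed):

# poll for up to ~180s until all three remote-nse interfaces reappear on the new NSE
for i in $(seq 1 36); do
  LINKS=$(kubectl exec -n ns-remote-nse-death $NSE -- ip a | grep remote-nse | wc -l)
  echo "attempt $i: $LINKS interface(s) restored"
  test "$LINKS" -eq 3 && break
  sleep 5
done
test "$LINKS" -eq 3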

Cleanup

Delete ns:

kubectl delete ns ns-remote-nse-death
kubectl uncordon $NSE_NODE

Failure Logs

missing-interface.zip

@denis-tingaikin added the bug (Something isn't working) and ASAP (The issue should be completed as soon as possible) labels on Sep 19, 2023
@NikitaSkrynnik
Collaborator

@ljkiraly Hi, I tried to reproduce the problem several times, but I didn't see any errors. I tested this setup on the main branch. Could you please test it using the main branch too?

@ljkiraly
Contributor Author

@NikitaSkrynnik My apologies, I missed that I had made a change in my basic setup. I mentioned it in the description, but I forgot to add to the reproduction steps that two registry-k8s instances are used.

> git diff ../../../apps/registry-k8s/registry-k8s.yaml
diff --git a/apps/registry-k8s/registry-k8s.yaml b/apps/registry-k8s/registry-k8s.yaml
index 54efa62aaa4..1682b00bc2f 100644
--- a/apps/registry-k8s/registry-k8s.yaml
+++ b/apps/registry-k8s/registry-k8s.yaml
@@ -6,6 +6,7 @@ metadata:
   labels:
     app: registry
 spec:
+  replicas: 2
   selector:
     matchLabels:
       app: registry

I will update the description also.

Maybe this behavior has the same root cause as the issue described at the last community call. In that setup two registry-k8s pods were also running, and the unregister request was sent to a registry that could not handle it.

@denis-tingaikin
Member

@ljkiraly Should be fixed in v1.11.0. Please let us know if it's still reproducible.

@ljkiraly
Contributor Author

@denis-tingaikin Verification result is good. Thanks.

@ljkiraly
Contributor Author

ljkiraly commented Jan 5, 2024

@denis-tingaikin @NikitaSkrynnik Unfortunately, with NSM v1.11.2 the problem occurred again. I was able to reproduce it using the steps described above. The logs are attached:
test-nse-cordon.zip

Our nightly tests succeeded with previous NSM versions; they only started failing now, with NSM v1.11.2.

@ljkiraly
Contributor Author

Hi @denis-tingaikin , @NikitaSkrynnik ,
Just to clarify: this issue is a corner case and not a really severe problem. It has medium priority only because it was believed to have been fixed before.
I was able to reproduce the issue in v1.12.0-rc.1. Logs attached.
testCordon-1_12.zip
I will try to increase the QPS and the number of registry-k8s instances and note the results here.

@denis-tingaikin moved this to In Progress in Release v1.12.0 on Jan 10, 2024
@glazychev-art
Contributor

Thanks @ljkiraly,
Could you please recheck the attached logs? They are empty for me.

@ljkiraly
Contributor Author

@glazychev-art Compressed and uploaded again. Thanks for the notice.
testCordon-1_12.zip

@denis-tingaikin
Member

Seems like this one is fixed. Please feel free to reopen if it's still reproducible.

@github-project-automation bot moved this from Under review to Done in Release v1.12.0 on Jan 30, 2024