missing interface in NSE after relocation #9863

Closed
ljkiraly opened this issue Sep 19, 2023 · 9 comments · Fixed by networkservicemesh/cmd-registry-k8s#406
Labels
ASAP (The issue should be completed as soon as possible), bug (Something isn't working)

Comments

@ljkiraly
Contributor

ljkiraly commented Sep 19, 2023

Expected Behavior

The restoration time of interfaces in the NSE should be more deterministic.

Current Behavior

When multiple clients are deployed with two registry instances, some interfaces are missing on the NSE after NSE relocation.
The node where the NSE pod was running was cordoned and the NSE was relocated. After the NSE is restarted on another node, one connection to an NSC fails to be restored. Based on the logs, the deleted NSE remains stored in the registry/etcd.

Failure Information (for bugs)

The restoration time varies: sometimes it takes more than 150 seconds, sometimes the connection is restored promptly, and sometimes it is never restored. Note that two instances of registry-k8s are used. An IPv6 address range is configured on the NSE because the issue seems to pop up more frequently with IPv6 addresses.

What is the expected NSE expiration time? How does it depend on NSM_MAX_TOKEN_LIFETIME? (Is the description in this PR still valid? networkservicemesh/sdk#1404)
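For reference, one quick way to check which token lifetime is actually configured in a running setup is to grep the deployment env vars (just a sketch, assuming NSM_MAX_TOKEN_LIFETIME is set as an environment variable as in the stock manifests):

# show where NSM_MAX_TOKEN_LIFETIME is set and its current value
kubectl get deployments -n nsm-system -o yaml | grep -A1 'name: NSM_MAX_TOKEN_LIFETIME'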

Steps to Reproduce

  • Create a kind cluster with 4 workers
cluster-config
 > cat cluster-config-4worker.yaml 
---
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
  - role: worker
  - role: worker
kind create cluster --config cluster-config-4worker.yaml --wait 300s
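A quick optional sanity check (not part of the original steps) that the control-plane and all four worker nodes came up:

kubectl get nodes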
  • Basic setup should be executed
  • Start a second registry-k8s pod
kubectl scale --replicas=2 -n nsm-system deployment/registry-k8s
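Optionally, wait until both registry replicas are ready before continuing (a small extra check; it assumes the registry pods carry the app: registry label, as in apps/registry-k8s/registry-k8s.yaml):

kubectl -n nsm-system rollout status deployment/registry-k8s --timeout=1m
kubectl -n nsm-system get pods -l app=registry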
  • The remote-nse-death test case from the heal feature was modified:
NSE, NSC yamls
> git diff
diff --git a/examples/heal/remote-nse-death/base/kustomization.yaml b/examples/heal/remote-nse-death/base/kustomization.yaml
index 86c9b5db475..babceecab6b 100644
--- a/examples/heal/remote-nse-death/base/kustomization.yaml
+++ b/examples/heal/remote-nse-death/base/kustomization.yaml
@@ -7,4 +7,6 @@ namespace: ns-remote-nse-death
 resources:
 - netsvc.yaml
 - client.yaml
+- client1.yaml
+- client2.yaml
 - ../../../../apps/nse-kernel
diff --git a/examples/heal/remote-nse-death/nse-before-death/patch-nse.yaml b/examples/heal/remote-nse-death/nse-before-death/patch-nse.yaml
index a57dc99c9cf..90e6eff49a0 100644
--- a/examples/heal/remote-nse-death/nse-before-death/patch-nse.yaml
+++ b/examples/heal/remote-nse-death/nse-before-death/patch-nse.yaml
@@ -10,7 +10,7 @@ spec:
         - name: nse
           env:
             - name: NSM_CIDR_PREFIX
-              value: 172.16.1.100/31
+              value: 2001:db8::/110
             - name: NSM_SERVICE_NAMES
               value: "remote-nse-death"
             - name: NSM_REGISTER_SERVICE
> cat base/client1.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: alpine1
  labels:
    app: alpine1
  annotations:
    networkservicemesh.io: kernel://remote-nse-death/nsm-1
spec:
  containers:
  - name: alpine
    image: alpine:3.15.0
    imagePullPolicy: IfNotPresent
    # simple `sleep` command would work
    # but we need `trap` to be able to delete pods quickly
    command: ["/bin/sh", "-c", "trap : TERM INT; sleep infinity & wait"]

> cat base/client2.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: alpine2
  labels:
    app: alpine2
  annotations:
    networkservicemesh.io: kernel://remote-nse-death/nsm-1
spec:
  containers:
  - name: alpine
    image: alpine:3.15.0
    imagePullPolicy: IfNotPresent
    # simple `sleep` command would work
    # but we need `trap` to be able to delete pods quickly
    command: ["/bin/sh", "-c", "trap : TERM INT; sleep infinity & wait"]
  • Execute the steps as follows:
README.md
> cat README.md
# Remote NSE death

This example shows that NSM keeps working after the remote NSE death.

NSC and NSE use the kernel mechanism to connect to their local forwarders.
Forwarders use the vxlan mechanism to connect with each other.

Requires

Make sure that you have completed steps from basic or memory setup.

Run

Deploy NSC and NSE:

kubectl apply -k ./nse-before-death

Wait for applications ready:

kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine1 -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=alpine2 -n ns-remote-nse-death
kubectl wait --for=condition=ready --timeout=1m pod -l app=nse-kernel -n ns-remote-nse-death

Find NSE pod by label:

NSE=$(kubectl get pods -l app=nse-kernel -n ns-remote-nse-death --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')

Check the NSE:

LINKS=$(kubectl exec -n ns-remote-nse-death $NSE -- ip a | grep remote-nse | wc -l)
test $LINKS -eq 3
kubectl exec -n ns-remote-nse-death $NSE -- ip a
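The same check can also be done from the client side (an optional sketch, reusing the labels from the wait commands above):

for APP in alpine alpine1 alpine2; do
  NSC=$(kubectl get pods -l app=$APP -n ns-remote-nse-death --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
  kubectl exec -n ns-remote-nse-death $NSC -- ip a
done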

Cordon NSE node:

NSE_NODE=$(kubectl get pods -l app=nse-kernel -n ns-remote-nse-death --template '{{range .items}}{{.spec.nodeName}}{{"\n"}}{{end}}')
kubectl cordon $NSE_NODE
kubectl scale --replicas=0 -n ns-remote-nse-death deployment/nse-kernel

New NSE:

kubectl scale --replicas=1 -n ns-remote-nse-death deployment/nse-kernel

Wait for new NSE to start:

kubectl wait --for=condition=ready --timeout=2m pod -l app=nse-kernel -n ns-remote-nse-death

Find new NSE pod:

NSE=$(kubectl get pods -l app=nse-kernel -n ns-remote-nse-death --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')

Check the new NSE:

kubectl exec -n ns-remote-nse-death $NSE -- ip a
LINKS=$(kubectl exec -n ns-remote-nse-death $NSE -- ip a | grep remote-nse | wc -l)
test $LINKS -eq 3
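Because the restoration time varies (see Failure Information above), a polling variant of this last check shows how long the interfaces actually take to come back instead of failing on a single snapshot (a sketch; the ~180 s budget is chosen to cover the 150+ s cases observed):

# poll for up to ~180s until all three remote-nse interfaces reappear on the new NSE
for i in $(seq 1 36); do
  LINKS=$(kubectl exec -n ns-remote-nse-death $NSE -- ip a | grep remote-nse | wc -l)
  echo "attempt $i: $LINKS interface(s) restored"
  test "$LINKS" -eq 3 && break
  sleep 5
done
test "$LINKS" -eq 3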

Cleanup

Delete ns:

kubectl delete ns ns-remote-nse-death
kubectl uncordon $NSE_NODE

Failure Logs

missing-interface.zip

@denis-tingaikin added the bug (Something isn't working) and ASAP (The issue should be completed as soon as possible) labels on Sep 19, 2023
@NikitaSkrynnik
Collaborator

@ljkiraly Hi, I tried to reproduce the problem several times, but I didn't see any errors. I tested this setup on the main branch. Could you please test it using the main branch too?

@ljkiraly
Contributor Author

@NikitaSkrynnik My apologies, I missed that I had made a change in my basic setup. I mentioned it in the description, but I forgot to add to the reproduction steps that two registry-k8s instances are used.

> git diff ../../../apps/registry-k8s/registry-k8s.yaml
diff --git a/apps/registry-k8s/registry-k8s.yaml b/apps/registry-k8s/registry-k8s.yaml
index 54efa62aaa4..1682b00bc2f 100644
--- a/apps/registry-k8s/registry-k8s.yaml
+++ b/apps/registry-k8s/registry-k8s.yaml
@@ -6,6 +6,7 @@ metadata:
   labels:
     app: registry
 spec:
+  replicas: 2
   selector:
     matchLabels:
       app: registry

I will update the description also.

Maybe this behavior has the same root cause as the issue described at the last community call. In that setup two registry-k8s pods were also running, and the unregister request was sent to a registry that could not handle it.

@denis-tingaikin
Member

@ljkiraly Should be fixed in v1.11.0. Please let us know if it's still reproducible.

@ljkiraly
Contributor Author

@denis-tingaikin Verification result is good. Thanks.

@ljkiraly
Contributor Author

ljkiraly commented Jan 5, 2024

@denis-tingaikin @NikitaSkrynnik Unfortunately, with NSM v1.11.2 the problem occurred again. I was able to reproduce it using the steps described above. The logs are attached:
test-nse-cordon.zip

Our nightly tests succeeded with previous NSM versions; they only started failing now, with NSM v1.11.2.

@ljkiraly
Contributor Author

Hi @denis-tingaikin , @NikitaSkrynnik ,
Just to clarify: this issue is a corner case and not a really severe problem. It has medium priority only because it was believed to have been fixed before.
I was able to reproduce the issue in v1.12.0-rc.1. Logs attached.
testCordon-1_12.zip
I will try to increase the QPS and the number of registry-k8s instances and note the results here.

@denis-tingaikin moved this to In Progress in Release v1.12.0 on Jan 10, 2024
@glazychev-art
Contributor

Thanks @ljkiraly,
Could you please recheck the attached logs? They are empty for me.

@ljkiraly
Contributor Author

@glazychev-art Compressed and uploaded again. Thanks for the notice.
testCordon-1_12.zip

@denis-tingaikin
Member

Seems like this one is fixed. Please feel free to reopen if it's still reproducible.

@github-project-automation bot moved this from Under review to Done in Release v1.12.0 on Jan 30, 2024