interface leak on initial Request cancel #1133

Open · zolug opened this issue Jul 5, 2024 · 16 comments
Labels: bug (Something isn't working)

@zolug (Contributor) commented Jul 5, 2024

An interface can be leaked if the initial Request's context gets cancelled before the connection can be established.

The problem can occur in kernelTap if the context gets cancelled during the vppConn tapCreate call: the call returns an error, so the interface never gets saved into the ifindex storage.

Therefore, even though the failed kernelTap create triggers a Close (e.g. in kernelTapServer), del won't be able to look up the interface in the storage.
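
To illustrate the failure mode, here is a minimal, self-contained Go sketch (the names `createTap`, `deleteTap` and `ifindexStore` are hypothetical stand-ins, not the actual sdk-vpp code): the create path fails on cancel without recording the interface index, so the later Close has nothing to delete.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// ifindexStore mimics the per-connection SwIfIndex metadata.
var ifindexStore sync.Map // connID -> uint32

// createTap stands in for the vppConn TapCreateV3 call: VPP may finish
// creating the interface even though the caller's ctx is already done.
func createTap(ctx context.Context, connID string) (uint32, error) {
	swIfIndex := uint32(11) // pretend VPP allocated this index
	select {
	case <-ctx.Done():
		// The interface may already exist in VPP, but we report failure,
		// so the index is never stored.
		return 0, fmt.Errorf("vppapi TapCreateV3 returned error: %w", ctx.Err())
	case <-time.After(50 * time.Millisecond):
		ifindexStore.Store(connID, swIfIndex) // only stored on success
		return swIfIndex, nil
	}
}

// deleteTap is the Close/del path: without a stored SwIfIndex it has nothing to delete.
func deleteTap(connID string) error {
	if _, ok := ifindexStore.Load(connID); !ok {
		return errors.New("no SwIfIndex stored for connection, interface leaks")
	}
	ifindexStore.Delete(connID)
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	if _, err := createTap(ctx, "conn-1"); err != nil {
		fmt.Println("create failed:", err)
	}
	fmt.Println("close:", deleteTap("conn-1")) // the interface stays behind
}
```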

NSM version: v1.13.2-rc.1

Logs:

Follow this request:

Jul  5 14:59:32.903 [TRAC] [id:9a29a6db-aed2-48d4-b5fa-c75f7337328e] [type:networkService] (1.1)   request={"connection":{"id":"9a29a6db-aed2-48d4-b5fa-c75f7337328e","network_service":"load-balancer-a1.trench-a.default","context":{"ip_context":{"src_ip_addrs":["172.16.0.31/24","fd00::1f/64"],"dst_ip_addrs":["172.16.0.30/24","fd00::1e/64"],"excluded_prefixes":["10.96.0.0/16","10.244.0.0/16"]},"MTU":1500},"labels":{"nodeName":"kind-worker3"},"path":{"index":1,"path_segments":[{"name":"proxy-load-balancer-a1-xvq6h","id":"f161b4b1-2ffa-4095-ace0-9d2027f4e356-proxy-load-balancer-a1-xvq6h-load-balancer-a1.trench-a.default-stateless-lb-frontend-attr-a1-f5864fc94-59nlh","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTcyMDE5MjE3Mn0.Xz6pp29LEzAhDBFG4KBhQBMZ5BuL-GNQOP5e3l4NHPwEYRuLTg-FWlhAW2PO3SqQB67TLO6LRXzJ5rXlhV3dwA","expires":{"seconds":1720192172,"nanos":876126486},"metrics":{"client_interface":"KERNEL/unknown"}},{"name":"nsmgr-pbj2f","id":"9a29a6db-aed2-48d4-b5fa-c75f7337328e","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvZGVmYXVsdC9zYS9kZWZhdWx0Il0sImV4cCI6MTcyMDE5MjE3Mn0.QamUnYKNhPziEGeN2hbFpNpKq2W8rtm0gZBiLaZ05lp0HSH9oVLsy54-vRblf5cUV5GK7gWlrbJdIc63sLDFaw","expires":{"seconds":1720192172,"nanos":877054244}}]},"network_service_endpoint_name":"stateless-lb-frontend-attr-a1-f5864fc94-59nlh","payload":"ETHERNET"},"mechanism_preferences":[{"cls":"LOCAL","type":"KERNEL","parameters":{"inodeURL":"inode://4/4026535834","name":"load-balan-2ezO"}}]}

Use filter string:
load-balan-2ezO|9a29a6db-aed2-48d4-b5fa-c75f7337328e|95b833d9-10fc-43d4-9c30-54f2ad7f1c2b|load-balan-tZer

The NSC-side load-balan-2ezO interface will not be removed.
nsm_forwarder-vpp-mwbc2_forwarder-vpp.log

@zolug added the bug label Jul 5, 2024
@zolug (Contributor, Author) commented Jul 7, 2024

Another example, where the NSE-side interface load-balan-t6nu got stuck.

Jul  5 14:59:33.372 [TRAC] [id:9a5ea784-40f2-4788-b286-c27f28eb9747] [type:networkService] (1.1)   request={"connection":{"id":"9a5ea784-40f2-4788-b286-c27f28eb9747","network_service":"load-balancer-a1.trench-a.default","context":{"ip_context":{"src_ip_addrs":["172.16.2.37/24","fd00:0:0:2::21/64"],"dst_ip_addrs":["172.16.2.36/24","fd00:0:0:2::20/64"],"excluded_prefixes":["10.96.0.0/16","10.244.0.0/16"]},"MTU":1500},"labels":{"nodeName":"kind-worker2"},"path":{"index":3,"path_segments":[{"name":"proxy-load-balancer-a1-77zlr","id":"991fe339-7d91-49d4-ba7f-2d4c1d7a01bd-proxy-load-balancer-a1-77zlr-load-balancer-a1.trench-a.default-stateless-lb-frontend-attr-a1-f5864fc94-59nlh","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTcyMDE5MjE3Mn0.UUxJKoSOP-ee0Du3DstHxgecmKYzmpEnieVEBOdZiex8eUOpmA5P-KNr45hxsw-8PccGVUnNWcduISetmnQTNA","expires":{"seconds":1720192172,"nanos":880397227},"metrics":{"client_interface":"KERNEL/unknown"}},{"name":"nsmgr-pfss7","id":"a9211a4f-500d-4d7d-8a3b-ed44a173762e","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzIwMTkyMTczfQ.9GqG4MtdzWz_F157_V0AF6CiCsl0rZxwNIEtattssADqMzl-An7EGGZV-WPqzZ5NqijLH0uq_-zmp1YZ9ii6nw","expires":{"seconds":1720192173,"nanos":76868291}},{"name":"forwarder-vpp-772b9","id":"789260f9-be2e-401a-b601-40b3a913c067","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiXSwiZXhwIjoxNzIwMTkyMTczfQ.KA1890RkZBfrFcYBiytD6HLZp-DctqnJ27yJY-eOVqVE4bjsX-VPQ44Vn5mRnbos85sXxngjec2sxFtoUwbStQ","expires":{"seconds":1720192173,"nanos":183239052}},{"name":"nsmgr-pbj2f","id":"9a5ea784-40f2-4788-b286-c27f28eb9747","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzIwMTkyMTczfQ.cqMK_9_P-J9tkEkCATe0-lZgej5FO7OJg-918c_MRaj70aB1SFK86sWvb1VQW_5fbbpFxfl9jjUA-rkrF2svOQ","expires":{"seconds":1720192173,"nanos":184038754}}]},"network_service_endpoint_name":"stateless-lb-frontend-attr-a1-f5864fc94-59nlh","payload":"ETHERNET"},"mechanism_preferences":[{"cls":"REMOTE","type":"VXLAN","parameters":{"MTU":"1446","src_ip":"172.18.0.6","src_port":"4789"}},{"cls":"REMOTE","type":"VLAN"}]}

filter string:
load-balan-t6nu|667aa022-1a1a-49ec-a938-37b9dddb2da6|9a5ea784-40f2-4788-b286-c27f28eb9747

(same forwarder logs)

@zolug (Contributor, Author) commented Jul 9, 2024

I was able to reproduce it on a 4 node KIND cluster, relying on a basic deployment with kernel-nse and kernel-nsc, by tweaking the NSM_REQUEST_TIMEOUT parameter.

I double-checked that in the NSC the requests kept failing due to context timeout, left the system running for a couple of minutes, and monitored the forwarder logs for "vppapi TapCreateV3 returned error" printouts.

cat <<EOF | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nsc-kernel
  labels:
    app: nsc-kernel
spec:
  selector:
    matchLabels:
      app: nsc-kernel
  template:
    metadata:
      labels:
        app: nsc-kernel
        "spiffe.io/spiffe-id": "true"
    spec:
      initContainers:
        - name: sysctl-init
          image: registry.nordix.org/cloud-native/meridio/busybox:1.29
          securityContext:
            privileged: true
          command: ["/bin/sh"]
          args:
            - -c
            - sysctl -w net.ipv4.ping_group_range='0 1' ;
      containers:
        - name: nsc
          image: ghcr.io/networkservicemesh/cmd-nsc:v1.13.2-rc.1
          imagePullPolicy: Always
          env:
            - name: SPIFFE_ENDPOINT_SOCKET
              value: unix:///run/spire/sockets/agent.sock
            - name: NSM_LOG_LEVEL
              value: TRACE
            - name: NSM_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NSM_NETWORK_SERVICES
              value: kernel://icmp-responder
            - name: NSM_LIVENESSCHECKENABLED
              value: "false"
            - name: NSM_REQUEST_TIMEOUT
              value: "180ms"
          volumeMounts:
            - name: spire-agent-socket
              mountPath: /run/spire/sockets
              readOnly: true
            - name: nsm-socket
              mountPath: /var/lib/networkservicemesh
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 40Mi
            limits:
              memory: 80Mi
              cpu: 200m
      volumes:
        - name: spire-agent-socket
          hostPath:
            path: /run/spire/sockets
            type: Directory
        - name: nsm-socket
          hostPath:
            path: /var/lib/networkservicemesh
            type: DirectoryOrCreate
EOF



cat <<EOF | kubectl apply -f -
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nse-kernel
  labels:
    app: nse-kernel
spec:
  selector:
    matchLabels:
      app: nse-kernel
  template:
    metadata:
      labels:
        app: nse-kernel
        "spiffe.io/spiffe-id": "true"
    spec:
      containers:
        - name: nse
          image: ghcr.io/networkservicemesh/cmd-nse-icmp-responder:v1.13.2-rc.1
          imagePullPolicy: IfNotPresent
          env:
            - name: SPIFFE_ENDPOINT_SOCKET
              value: unix:///run/spire/sockets/agent.sock
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NSM_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: NSM_LOG_LEVEL
              value: TRACE
            - name: NSM_CONNECT_TO
              value: unix:///var/lib/networkservicemesh/nsm.io.sock
            - name: NSM_CIDR_PREFIX
              value: 172.16.0.0/16
            - name: NSM_SERVICE_NAMES
              value: "icmp-responder"
          volumeMounts:
            - name: spire-agent-socket
              mountPath: /run/spire/sockets
              readOnly: true
            - name: nsm-socket
              mountPath: /var/lib/networkservicemesh
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 40Mi
            limits:
              memory: 80Mi
              cpu: 200m
      volumes:
        - name: spire-agent-socket
          hostPath:
            path: /run/spire/sockets
            type: Directory
        - name: nsm-socket
          hostPath:
            path: /var/lib/networkservicemesh
            type: DirectoryOrCreate
EOF
NAME                          READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
nsc-kernel-7cc77f58c8-ftl7f   1/1     Running   0          30m   10.244.1.12   kind-worker2   <none>           <none>
nse-kernel-85fbcffb44-ghz5s   1/1     Running   0          30m   10.244.2.9    kind-worker    <none>           <none>


kubectl get pods -l app=nsc-kernel -o=jsonpath='{.items[*].metadata.name}'| tr ' ' '\n' | xargs -I{} sh -c 'echo === {} === && kubectl exec -t {}  -- ip a';kubectl get pods -l app=nse-kernel -o=jsonpath='{.items[*].metadata.name}'| tr ' ' '\n' | xargs -I{} sh -c 'echo === {} === && kubectl exec -t {}  -- ip a'
=== nsc-kernel-7cc77f58c8-ftl7f ===
Defaulted container "nsc" out of: nsc, sysctl-init (init)
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@if15: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP 
    link/ether 1e:5f:01:00:e9:70 brd ff:ff:ff:ff:ff:ff
    inet 10.244.1.12/24 brd 10.244.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::1c5f:1ff:fe00:e970/64 scope link 
       valid_lft forever preferred_lft forever
=== nse-kernel-85fbcffb44-ghz5s ===
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0@icmp-respo-Rw5D: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
    link/ether 9a:ea:0c:ea:b3:2e brd ff:ff:ff:ff:ff:ff
    inet 10.244.2.9/24 brd 10.244.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::98ea:cff:feea:b32e/64 scope link 
       valid_lft forever preferred_lft forever
3: icmp-respo-CTW4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:03:96:b0:a1 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:3ff:fe96:b0a1/64 scope link 
       valid_lft forever preferred_lft forever
4: icmp-respo-SSwm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:5b:6a:f2:04 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:5bff:fe6a:f204/64 scope link 
       valid_lft forever preferred_lft forever
5: icmp-respo-ePzm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:11:74:cb:fa brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:11ff:fe74:cbfa/64 scope link 
       valid_lft forever preferred_lft forever
6: icmp-respo-yJZR: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:6b:54:49:71 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:6bff:fe54:4971/64 scope link 
       valid_lft forever preferred_lft forever
7: icmp-respo-QShs: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:57:ae:f2:13 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:57ff:feae:f213/64 scope link 
       valid_lft forever preferred_lft forever
8: icmp-respo-rFaQ: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:19:d3:7a:09 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:19ff:fed3:7a09/64 scope link 
       valid_lft forever preferred_lft forever
9: icmp-respo-rg7S: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:f3:44:29:86 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:f3ff:fe44:2986/64 scope link 
       valid_lft forever preferred_lft forever
10: icmp-respo-A6cp: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:fb:2f:3e:1a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:fbff:fe2f:3e1a/64 scope link 
       valid_lft forever preferred_lft forever
11: icmp-respo-Rw5D: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:01:11:de:76 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:1ff:fe11:de76/64 scope link 
       valid_lft forever preferred_lft forever
12: icmp-respo-r4Db: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:97:4a:57:2f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:97ff:fe4a:572f/64 scope link 
       valid_lft forever preferred_lft forever
13: icmp-respo-KvCJ: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:29:d6:43:50 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:29ff:fed6:4350/64 scope link 
       valid_lft forever preferred_lft forever
14: icmp-respo-z0kO: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:3d:95:f7:58 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:3dff:fe95:f758/64 scope link 
       valid_lft forever preferred_lft forever
15: icmp-respo-V3EG: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:bd:60:0b:8e brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:bdff:fe60:b8e/64 scope link 
       valid_lft forever preferred_lft forever
16: icmp-respo-CRvI: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:21:ad:cc:f0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:21ff:fead:ccf0/64 scope link 
       valid_lft forever preferred_lft forever
17: icmp-respo-ipuI: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:f9:e7:d9:db brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:f9ff:fee7:d9db/64 scope link 
       valid_lft forever preferred_lft forever
18: icmp-respo-3zFH: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:b9:f8:db:d7 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:b9ff:fef8:dbd7/64 scope link 
       valid_lft forever preferred_lft forever
19: icmp-respo-sDW0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:e7:d7:c3:17 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:e7ff:fed7:c317/64 scope link 
       valid_lft forever preferred_lft forever
20: icmp-respo-GLAw: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:49:d9:c5:99 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:49ff:fed9:c599/64 scope link 
       valid_lft forever preferred_lft forever
21: icmp-respo-shCm: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:2f:2f:e2:d3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:2fff:fe2f:e2d3/64 scope link 
       valid_lft forever preferred_lft forever
22: icmp-respo-erCg: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:d1:20:9a:a2 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:d1ff:fe20:9aa2/64 scope link 
       valid_lft forever preferred_lft forever
23: icmp-respo-AztN: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:e7:69:12:57 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:e7ff:fe69:1257/64 scope link 
       valid_lft forever preferred_lft forever
24: icmp-respo-u8VJ: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:f1:49:85:66 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:f1ff:fe49:8566/64 scope link 
       valid_lft forever preferred_lft forever
25: icmp-respo-xviy: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:b1:fa:1b:ca brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:b1ff:fefa:1bca/64 scope link 
       valid_lft forever preferred_lft forever
26: icmp-respo-MCO7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:19:b2:6d:09 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:19ff:feb2:6d09/64 scope link 
       valid_lft forever preferred_lft forever
27: icmp-respo-wqOy: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:9f:8f:eb:19 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:9fff:fe8f:eb19/64 scope link 
       valid_lft forever preferred_lft forever
28: icmp-respo-DalE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:3f:d4:d8:b1 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:3fff:fed4:d8b1/64 scope link 
       valid_lft forever preferred_lft forever
29: icmp-respo-iryd: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:73:72:8d:3b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:73ff:fe72:8d3b/64 scope link 
       valid_lft forever preferred_lft forever
30: icmp-respo-GNxN: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:b7:97:05:0d brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:b7ff:fe97:50d/64 scope link 
       valid_lft forever preferred_lft forever
31: icmp-respo-c957: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:6b:05:d8:ad brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:6bff:fe05:d8ad/64 scope link 
       valid_lft forever preferred_lft forever
32: icmp-respo-uxJE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:7f:55:07:47 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:7fff:fe55:747/64 scope link 
       valid_lft forever preferred_lft forever
33: icmp-respo-jXu5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:3f:85:51:95 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:3fff:fe85:5195/64 scope link 
       valid_lft forever preferred_lft forever
34: icmp-respo-uX9z: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:ed:17:52:30 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:edff:fe17:5230/64 scope link 
       valid_lft forever preferred_lft forever
35: icmp-respo-ctZo: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:79:82:1b:3f brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:79ff:fe82:1b3f/64 scope link 
       valid_lft forever preferred_lft forever
36: icmp-respo-x7yM: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:47:c8:fe:7b brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:47ff:fec8:fe7b/64 scope link 
       valid_lft forever preferred_lft forever
37: icmp-respo-bW5f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:1b:99:19:32 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:1bff:fe99:1932/64 scope link 
       valid_lft forever preferred_lft forever
38: icmp-respo-oO3g: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:f7:81:66:7a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:f7ff:fe81:667a/64 scope link 
       valid_lft forever preferred_lft forever
39: icmp-respo-NPQR: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:01:07:78:69 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:1ff:fe07:7869/64 scope link 
       valid_lft forever preferred_lft forever
40: icmp-respo-qrmC: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:17:89:31:f0 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:17ff:fe89:31f0/64 scope link 
       valid_lft forever preferred_lft forever
41: icmp-respo-CRCU: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:4b:51:7a:13 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:4bff:fe51:7a13/64 scope link 
       valid_lft forever preferred_lft forever
42: icmp-respo-GdjO: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:43:56:de:1a brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:43ff:fe56:de1a/64 scope link 
       valid_lft forever preferred_lft forever
43: icmp-respo-uNRG: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:bf:22:a5:e5 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:bfff:fe22:a5e5/64 scope link 
       valid_lft forever preferred_lft forever
44: icmp-respo-U7wo: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:c3:80:c3:b1 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:c3ff:fe80:c3b1/64 scope link 
       valid_lft forever preferred_lft forever
45: icmp-respo-k3p8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UNKNOWN qlen 1000
    link/ether 02:fe:ad:ee:8c:db brd ff:ff:ff:ff:ff:ff
    inet6 fe80::fe:adff:feee:8cdb/64 scope link tentative 
       valid_lft forever preferred_lft forever


@ljkiraly (Contributor) commented:

I checked two ideas to handle the interface cleanup:
1/ Fetch the interfaces via the SwInterfaceDump govpp API call and compare the interface index lists before and after the TAP interface creation. When creation fails with a context cancel error, remove the added interfaces (the diff of the swIfIndex lists) - this approach can run into other timing problems.
2/ Try to remove the interface by name in Close. This should work if we generate and store the names of TAP interfaces in NSM (see the sketch after this list).

Other possibilities to check:

  • The response of API call should be returned in case of context cancel error
  • Add rollback callback function to vpp API to handle the cleanup.
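
A rough sketch of option 2, under the assumption that the generated TAP name is available at Close time; the `vppAPI` interface below is a hypothetical stand-in for the relevant govpp calls (interface dump, tap delete), not their real signatures:

```go
// Hypothetical minimal surface of the VPP API used below; the real govpp
// types (SwInterfaceDump, TapDeleteV2, ...) differ in detail.
type iface struct {
	SwIfIndex uint32
	Name      string
}

type vppAPI interface {
	DumpInterfaces(ctx context.Context) ([]iface, error)
	DeleteTap(ctx context.Context, swIfIndex uint32) error
}

// deleteByName implements option 2: when Close has no stored SwIfIndex,
// fall back to the generated interface name kept in connection metadata.
func deleteByName(ctx context.Context, vpp vppAPI, name string) error {
	ifaces, err := vpp.DumpInterfaces(ctx)
	if err != nil {
		return err
	}
	for _, i := range ifaces {
		if i.Name == name {
			return vpp.DeleteTap(ctx, i.SwIfIndex)
		}
	}
	return nil // nothing matching the generated name, nothing to clean up
}
```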

@zolug (Contributor, Author) commented Jul 26, 2024

I'm not in favor of option 1, since IMHO parallel actions could alter the interface index list in the meantime.

What if create employed an independent (background) context to execute the VPP API call TapCreateV3?
A hard deadline of, let's say, 10 seconds could also be set just in case. Otherwise, IMHO the API call itself is not costly; it's mostly bad luck and timing (the input context being close to its expiration) when it fails due to an expired context.
An independent context would significantly lower the probability of leaked interfaces due to context timeout. (Considering that a proper solution seems rather complex to implement, this might be a good enough fix for the time being.)
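
A minimal sketch of that idea, assuming Go 1.21+ (`context.WithoutCancel`); `doCreate` stands in for the actual vppConn TapCreateV3 call and the 10 second cap is the example value from above:

```go
// Run the create call on a context detached from the incoming Request
// context, capped by its own hard deadline, so a client-side cancel or
// timeout cannot abort the VPP call halfway.
func createDetached(requestCtx context.Context, doCreate func(context.Context) error) error {
	// Keep the parent's values (loggers, metadata) but drop its cancellation.
	ctx, cancel := context.WithTimeout(context.WithoutCancel(requestCtx), 10*time.Second)
	defer cancel()
	return doCreate(ctx)
}
```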

@denis-tingaikin (Member) commented:

The problem should be resolved in v1.14.0-rc.3. Could you please check it?

@zolug (Contributor, Author) commented Sep 13, 2024

Hi,

Regarding this one: #1133 (comment)
Using NSM v1.14.0-rc.3 did not resolve the problem. I still saw numerous excess interfaces in the nsc and nse in no time. Moreover, the forwarder collocated with the nsc or nse encountered frequent OOM kills (memory limit and request both set to 500Mi). (The nse got OOM-killed occasionally as well.)
It did not matter whether nsc and nse were using sdk 1.13.2 or 1.14.0-rc.3.

Also, I ran a test with NSM 1.14.0-rc.4, which looked much better than rc.3; there were still excess interfaces in the NSE, but way fewer than with rc.3. I couldn't spot forwarder restarts due to OOM either.

@NikitaSkrynnik (Contributor) commented:

Hi, @zolug. Could you please also test v1.14.0-rc.2?

@zolug (Contributor, Author) commented Sep 17, 2024

Hi, @zolug. Could you please also test v1.14.0-rc.2?

Hi,

In case of #1133 (comment):
Some interfaces also piled up when running with v1.14.0-rc.2, using a small enough request timeout in the NSC. However, in my tests the NSC did manage to establish a connection eventually, and after 10 minutes the excess interfaces were gone from both the NSC and NSE.
Btw, if the NSC fails to establish a connection any time soon (which mostly happens when I only have 1 registry POD instance), there's a huge, steep jump in memory usage for forwarder-vpp and registry-k8s.

When running Meridio, where the LB NSE was removed continuously by a script and was thus only available intermittently for short periods, the excess interfaces were not cleaned up even after stopping the script that interfered with the availability of the NSE.
while [ 1 ]; do kubectl delete nses -n nsm stateless-lb-frontend-attr-a1-f5864fc94-9r7mb; sleep 1; done

What I observed in the second case was that, despite the request context's timeout/cancel, begin let the "connect" continue. So from the perspective of nsmgr and forwarder a connection, including interfaces, was established, and the NSC was able to recover the connection on a subsequent try/request via Monitor Connection through nsmgr. But as soon as, during one of the "reconnects", the nsmgr got a failure response from the forwarder because the NSE was not around, the connection state was changed to Down. In the NSC this triggered a reconnect with reselect instead of an ordinary request, prompting the nsmgr to close the "old" connection first. Now, this Close also failed in my case (due to the NSE not being found) despite the 1 minute timeout in the beginServer, yet the cleanup could not remove the interface because the context had already expired.

Sep 17 12:01:29.524 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (9)         ⎆ sdk/pkg/networkservice/common/mechanisms/recvfd/recvFDServer.Close()
...
Sep 17 12:02:29.494 [ERRO] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (2.2)    context deadline exceeded;  Error returned from sdk/pkg/networkservice/common/begin/
beginServer.Close;      github.com/networkservicemesh/sdk/pkg/networkservice/core/trace/traceverbose.logError;          /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/
pkg/networkservice/core/trace/traceverbose/common.go:211;       github.com/networkservicemesh/sdk/pkg/networkservice/core/trace/traceverbose.(*beginTraceServer).Close;         /root/go
/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/core/trace/traceverbose/server.go:71;        github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(
*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/core/next/server.go:65;      github.com/networkservicemesh/sdk/pkg/ne
tworkservice/core/trace.(*traceServer).Close;           /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/core/trace/server.go:54;     github.com/netwo
rkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/core/next/server.go:6
5;      github.com/networkservicemesh/sdk/pkg/networkservice/core/trace/traceverbose.(*endTraceServer).Close;           /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/
pkg/networkservice/core/trace/traceverbose/server.go:86;        github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.
com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/core/next/server.go:65;      github.com/networkservicemesh/sdk/pkg/networkservice/common/updatepath.(*updatePathServer).Close
;               /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/common/updatepath/server.go:72;      github.com/networkservicemesh/sdk/pkg/networkser
vice/core/trace/traceverbose.(*beginTraceServer).Close;         /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/core/trace/traceverbose/server.go:69;
        github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservi
ce/core/next/server.go:65;      github.com/networkservicemesh/sdk/pkg/networkservice/core/trace.(*traceServer).Close;           /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14
.0-rc.2/pkg/networkservice/core/trace/server.go:54;     github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/core/next/server.go:65;      github.com/networkservicemesh/sdk/pkg/networkservice/core/next.(*nextServer).Close;             /root/go/pkg/mod/github.com/networkservicemesh/sdk@v1.14.0-rc.2/pkg/networkservice/core/next/server.go:65;      github.com/networkservicemesh/api/pkg/api/networkservice._NetworkService_Close_Handler;         /root/go/pkg/mod/github.com/networkservicemesh/api@v1.14.0-rc.2/pkg/api/networkservice/networkservice.pb.go:295;        google.golang.org/grpc.(*Server).processUnaryRPC;               /root/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1372; google.golang.org/grpc.(*Server).handleStream;          /root/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1783; google.golang.org/grpc.(*Server).serveStreams.func2.1;          /root/go/pkg/mod/google.golang.org/grpc@v1.60.1/server.go:1016; runtime.goexit;         /go/src/runtime/asm_amd64.s:1598;
Sep 17 12:02:29.495 [TRAC] [id:d18b3ea8-7849-4697-a0b9-1bba3fe8b09c] [type:networkService] (1.3)   close-response={"id":"3de28dd1-3e5c-4196-8b04-98fbb2ed795a","network_service":"load-balancer-a1.trench-a.default","mechanism":{"cls":"LOCAL","type":"KERNEL","parameters":{"inodeURL":"inode://4/4026536324","name":"load-balan-HHUW"}},"context":{"ip_context":{"src_ip_addrs":["172.16.2.3/24","fd00:0:0:2::3/64"],"dst_ip_addrs":["172.16.2.2/24","fd00:0:0:2::2/64"],"excluded_prefixes":["10.96.0.0/16","10.244.0.0/16"]},"MTU":1446},"labels":{"nodeName":"kind-worker3"},"path":{"index":2,"path_segments":[{"name":"proxy-load-balancer-a1-zl9cl","id":"proxy-load-balancer-a1-zl9cl-load-balancer-a1.trench-a.default-stateless-lb-frontend-attr-a1-f5864fc94-9r7mb","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL2RlZmF1bHQiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL25zbWdyLXNhIl0sImV4cCI6MTcyNjU3NDkzOX0.1mWNaxuEXL7IBG5yB2CfUbxLp1pc8Trg7inRXVZS0tSQ0JZdpdo2w0GUpXwhLbK_Rq7UqZG56PWm0n9bFGgnuQ","expires":{"seconds":1726574939,"nanos":176004176},"metrics":{"client_interface":"KERNEL/unknown"}},{"name":"nsmgr-2qljn","id":"d18b3ea8-7849-4697-a0b9-1bba3fe8b09c","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJleHAiOjE3MjY1NzUwODl9.DzQ4fxJq4yvbKqYEn4iPd-RhQSnd32H-gvyenlXIErrrkis5CJneRuvSEXhsbk5QWeshF3hzH9oy7RKLeTistQ","expires":{"seconds":1726575089,"nanos":494725910}},{"name":"forwarder-vpp-8wjlw","id":"3de28dd1-3e5c-4196-8b04-98fbb2ed795a","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiXSwiZXhwIjoxNzI2NTc0OTM5fQ.aXRYktq-HHZITWk-184u0sde9WW0Sr_KKMmyxc_lCgm4BwK4ocUdRz6o57ftPOkaq423rJPcCcaa5pXLKml1IQ","expires":{"seconds":1726574939,"nanos":311850181},"metrics":{"client_drops":"2","client_interface":"VXLAN/vxlan_tunnel3","client_rx_bytes":"450","client_rx_packets":"3","client_tx_bytes":"866","client_tx_packets":"5","server_drops":"0","server_interface":"VIRTIO/tap0","server_rx_bytes":"686","server_rx_packets":"5","server_tx_bytes":"150","server_tx_packets":"1"}},{"name":"nsmgr-l8t2d","id":"ec2047bf-40b3-434f-9801-10ff48ba402e","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvbnNtZ3Itc2EiLCJhdWQiOlsic3BpZmZlOi8vZXhhbXBsZS5vcmcvbnMvbnNtL3NhL2RlZmF1bHQiXSwiZXhwIjoxNzI2NTc0OTM5fQ.m0_Y7J1STG00924kcdT19U4-W0BW7gXu4oFrOqcZezY0KICUyj9uX1Zt4AEZUP_WRLyVbAEI57yZbcrZFhcsAQ","expires":{"seconds":1726574939,"nanos":343117634}},{"name":"forwarder-vpp-wjhj7","id":"39f6dab6-30d7-414a-88cb-b7311437e505","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9uc20vc2EvZGVmYXVsdCIsImF1ZCI6WyJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy9kZWZhdWx0L3NhL21lcmlkaW8tZmVzIl0sImV4cCI6MTcyNjU3NDkzOX0.TFBMOqE0L35zFKH1YJAa6v7D9boic6TCPGpZz1-NL8ZvWA77sctIAIn1JKDk1oVWsmviFMQIQJrLrpyVy2L_zw","expires":{"seconds":1726574939,"nanos":431882120},"metrics":{"client_drops":"2","client_interface":"VIRTIO/tap2","client_rx_bytes":"796","client_rx_packets":"6","client_tx_bytes":"686","client_tx_packets":"5","server_drops":"0","server_interface":"VXLAN/vxlan_tunnel23","server_rx_bytes":"686","server_rx_packets":"5","server_tx_bytes":"680","server_tx_packets":"4"}},{"name":"stateless-lb-frontend-attr-a1-f5864fc94-9r7mb","id":"116d4bd9-42d1-4345-b6b1-9b898fe248a5","token":"eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJzcGlmZmU6Ly9leGFtcGxlLm9yZy9ucy
9kZWZhdWx0L3NhL21lcmlkaW8tZmVzIiwiYXVkIjpbInNwaWZmZTovL2V4YW1wbGUub3JnL25zL25zbS9zYS9kZWZhdWx0Il0sImV4cCI6MTcyNjU3NDkzOX0._A5CN3H7803hxNbgzxygVlZDLnqMjtwf67_9U_BrX1-7HlGMgSgLDBYhEU9aDVk7-_G-DkDFt_ucWws_uRWgug","expires":{"seconds":1726574939,"nanos":432567121},"metrics":{"server_interface":"KERNEL/load-balan-ahl6"}}]},"network_service_endpoint_name":"stateless-lb-frontend-attr-a1-f5864fc94-9r7mb","payload":"ETHERNET"}
Sep 17 12:02:29.495 [ERRO] [id:d18b3ea8-7849-4697-a0b9-1bba3fe8b09c] [type:networkService] (1.4)   Error returned from sdk/pkg/networkservice/common/begin/beginServer.Close: context deadline exceeded
Sep 17 12:02:29.497 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (10)          ⎆ sdk/pkg/networkservice/common/mechanisms/sendfd/sendFDServer.Close()
Sep 17 12:02:29.497 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (11)           ⎆ sdk/pkg/networkservice/common/discover/discoverCandidatesServer.Close()
...
Sep 17 12:02:29.517 [ERRO] [type:registry] (25.1)                           Error returned from sdk/pkg/registry/common/connect/connectNSEClient.Recv: rpc error: code = DeadlineExceede
d desc = context deadline exceeded
Sep 17 12:02:29.517 [ERRO] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [discoverCandidatesServer:Close] [type:networkService] (11.1)             endpoint is not found: stateless-lb-front
end-attr-a1-f5864fc94-9r7mb: network service endpoint stateless-lb-frontend-attr-a1-f5864fc94-9r7mb not found
Sep 17 12:02:29.517 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (12)            ⎆ sdk/pkg/networkservice/common/roundrobin/selectEndpointServer.Close()
Sep 17 12:02:29.517 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (13)             ⎆ sdk/pkg/networkservice/core/next/nextServer.Close()
Sep 17 12:02:29.517 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (14)              ⎆ sdk-vpp/pkg/networkservice/metrics/stats/statsServer.Close()
Sep 17 12:02:29.518 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (15)               ⎆ sdk-vpp/pkg/networkservice/metrics/ifacename/ifaceNamesServer.Close()
Sep 17 12:02:29.518 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (16)                ⎆ sdk-vpp/pkg/networkservice/up/upServer.Close()
Sep 17 12:02:29.518 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (17)                 ⎆ sdk/pkg/networkservice/core/next/nextServer.Close()
Sep 17 12:02:29.518 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (18)                  ⎆ sdk-vpp/pkg/networkservice/xconnect/l2xconnect/l2XconnectServer.Close
()
Sep 17 12:02:29.518 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (19)                   ⎆ sdk-vpp/pkg/networkservice/xconnect/l3xconnect/l3XconnectServer.Close()
time="2024-09-17T12:02:29Z" level=error msg="Channel ID not known, ignoring the message." channel=404 msg_id=851
Sep 17 12:02:29.518 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (20)                    ⎆ sdk-vpp/pkg/networkservice/xconnect/l2bridgedomain/l2BridgeDomainServer.Close()
Sep 17 12:02:29.519 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (21)                     ⎆ sdk/pkg/networkservice/core/next/nextServer.Close()
Sep 17 12:02:29.519 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (22)                      ⎆ sdk-kernel/pkg/kernel/networkservice/connectioncontextkernel/mtu/mtuServer.Close()
Sep 17 12:02:29.519 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (23)                       ⎆ sdk-kernel/pkg/kernel/networkservice/connectioncontextkernel/ipcontext/ipneighbors/ipNeighborsServer.Close()
Sep 17 12:02:29.519 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (24)                        ⎆ sdk-kernel/pkg/kernel/networkservice/connectioncontextkernel/ipcontext/iprule/ipruleServer.Close()
Sep 17 12:02:29.520 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (25)                         ⎆ sdk-kernel/pkg/kernel/networkservice/connectioncontextkernel/ipcontext/routes/routesServer.Close()
Sep 17 12:02:29.520 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (26)                          ⎆ sdk-kernel/pkg/kernel/networkservice/connectioncontextkernel/ipcontext/ipaddress/ipaddressServer.Close()
Sep 17 12:02:29.520 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (27)                           ⎆ sdk-kernel/pkg/kernel/networkservice/connectioncontextkernel/pinggrouprange/pinggrouprangeServer.Close()
Sep 17 12:02:29.520 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (28)                            ⎆ sdk-kernel/pkg/kernel/networkservice/ethernetcontext/vfEthernetContextServer.Close()
Sep 17 12:02:29.520 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (29)                             ⎆ sdk-vpp/pkg/networkservice/tag/tagServer.Close()
Sep 17 12:02:29.521 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (30)                              ⎆ sdk-vpp/pkg/networkservice/connectioncontext/mtu/mtuServer.Close()
Sep 17 12:02:29.521 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (31)                               ⎆ sdk/pkg/networkservice/common/mechanisms/mechanismsServer.Close()
Sep 17 12:02:29.521 [TRAC] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (32)                                ⎆ sdk-vpp/pkg/networkservice/mechanisms/kernel/kerneltap/kernelTapServer.Close()
Sep 17 12:02:29.521 [ERRO] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] (32.1)                                  unable to delete connection with SwIfIndex 11: context deadline exceeded
...

IMHO, it would be preferable to have a separate context for tap cleanup (and probably for create as well).
The same applies to the vxlan connection:

Sep 17 12:02:29.530 [ERRO] [id:3de28dd1-3e5c-4196-8b04-98fbb2ed795a] [type:networkService] [vxlan:client] (68.1)                                                                      error while deleting vxlan connection: vppapi AddNodeNext returned error: context deadline exceeded

@denis-tingaikin denis-tingaikin moved this from Under review to Moved to next release in Release v1.14.0 Sep 24, 2024
@NikitaSkrynnik (Contributor) commented:

Hi, @zolug! The problem is related to the small context timeout before any create call in sdk-vpp. We've decided to use separate contexts with generous timeouts for all create operations.

@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.14.1 Oct 24, 2024
@denis-tingaikin (Member) commented:

@Ex4amp1e Could you please share the current technical solution and decomposition to cover the issue? Thanks.

@Ex4amp1e (Contributor) commented Nov 5, 2024

@denis-tingaikin Currently we have tried a draft solution that overrides the contexts where vpp operations are used - networkservicemesh/sdk-vpp#867. But we decided to go with another solution: override the vpp connection instead of overriding the create/delete function contexts individually. This will cover and make safe all vpp operations (a rough sketch below).
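
Not being the final implementation, here is a rough sketch of what "overriding the vpp connection" could look like; the `vppConn` interface below is a simplified stand-in for the real govpp connection surface, and Go 1.21+ is assumed for `context.WithoutCancel`:

```go
// Minimal stand-in for the govpp connection surface; the real connection
// interface is larger and its method signatures differ.
type vppConn interface {
	Invoke(ctx context.Context, req, reply any) error
}

// extendedConn wraps the vpp connection so every call runs on a context that
// cannot be cancelled by the caller and has at least minTimeout left.
type extendedConn struct {
	vppConn
	minTimeout time.Duration
}

func (c *extendedConn) Invoke(ctx context.Context, req, reply any) error {
	timeout := c.minTimeout
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) > timeout {
		timeout = time.Until(deadline) // keep a longer deadline if the caller has one
	}
	ctx, cancel := context.WithTimeout(context.WithoutCancel(ctx), timeout)
	defer cancel()
	return c.vppConn.Invoke(ctx, req, reply)
}
```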

Decomposition and plan:

  • Investigate interface behaviour with short request timeout (recreation) - 1h
  • Try to override less vpp connection functions, test - 3h
  • Investigate important operations in the create/del functions with safe context, test - 4h
  • Investigate new solution if previous steps do not help, testing - 4h
  • Final testing - 4h
    ROM: 16h

@denis-tingaikin (Member) commented:

This problem looks resolved.

@zolug if you get a chance, could you check this image ghcr.io/networkservicemesh/ci/cmd-forwarder-vpp:4d41019?

@zolug (Contributor, Author) commented Dec 4, 2024

This problem looks resolved.

@zolug if you get a chance, could you check this image ghcr.io/networkservicemesh/ci/cmd-forwarder-vpp:4d41019?

Unfortunately, I can still see the problem.
That's because IMHO it's not just about whether the input context has a (low) timeout; the context might also get cancelled independently, prematurely, before its deadline. This might not apply to the example network service clients in the nsm repos, but e.g. some of our clients might need to cancel connection requests depending on the availability (or rather unavailability) of the NSE instance they want to connect to.
Meaning that, from the perspective of the vpp-forwarder, the context can have plenty of time until its deadline, yet can get cancelled prematurely from the client side, in which case the original problem can still arise.
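
The distinction is easy to demonstrate: a context can be far from its deadline and still be cancelled by the caller, so checking the remaining time alone does not help (plain Go, no NSM specifics):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	// The context is nowhere near its 1 minute deadline...
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel() // ...but the client gives up early, e.g. because the NSE disappeared
	}()
	<-ctx.Done()
	fmt.Println(ctx.Err()) // prints "context canceled", not "context deadline exceeded"
}
```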

I have attached a forwarder log file where the problem is visible. Check id: 8142320f-733e-4f99-915e-631d5516b6eb
The interface load-balan-adRC got stuck in my client POD.

nsm_forwarder-vpp-jw4fp_forwarder-vpp.log

@denis-tingaikin (Member) commented:

@NikitaSkrynnik It looks like we missed the branches where the incoming request had enough time but was cancelled. At the moment we just return the same context there, so it can still be cancelled.

I think these lines should be covered:

  1. https://github.com/networkservicemesh/vpphelper/blob/main/extendtimeout/connection.go#L60
  2. https://github.com/networkservicemesh/vpphelper/blob/main/extendtimeout/connection.go#L65
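
If that reading is right, the gap and a possible fix could look roughly like this; this is a hedged sketch, not the actual vpphelper extendtimeout code, and assumes Go 1.21+ for `context.WithoutCancel`:

```go
// Suspected gap: when the incoming context has "enough" time, it is returned
// unchanged, so a client-side cancel still propagates into the VPP call.
func extendSuspected(ctx context.Context, minTimeout time.Duration) (context.Context, context.CancelFunc) {
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) >= minTimeout {
		return ctx, func() {} // still cancellable by the caller
	}
	return context.WithTimeout(context.WithoutCancel(ctx), minTimeout)
}

// Possible fix: always detach from the parent's cancellation and only keep
// (or extend) the deadline, so premature cancels no longer reach VPP.
func extendFixed(ctx context.Context, minTimeout time.Duration) (context.Context, context.CancelFunc) {
	timeout := minTimeout
	if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) > timeout {
		timeout = time.Until(deadline)
	}
	return context.WithTimeout(context.WithoutCancel(ctx), timeout)
}
```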

@denis-tingaikin (Member) commented Dec 9, 2024

@zolug We've managed the mentioned context case, so please check ghcr.io/networkservicemesh/cmd-forwarder-vpp:v1.14.2-rc.1 when you get a chance.
