
After component restart there are more interfaces in NSE than expected #11371

Open
szvincze opened this issue Mar 12, 2024 · 26 comments
Labels: bug (Something isn't working)

@szvincze
Contributor

szvincze commented Mar 12, 2024

Expected Behavior

We are testing restarts of different combinations of NSM components while traffic is running.
Besides traffic recovery once the components are up and running again, it would be desirable that the NSE ends up with the same number of interfaces as before the restart.

Current Behavior

I attached logs from a case where we restarted an NSC and a forwarder-vpp in the system, and afterwards there was one more interface in the NSE (5) than the expected 4.

We observed the same behavior several times with different combinations of restarted components, but the NSC was always one of them. So it seems that the NSC restart has an influence on this.

Failure Information (for bugs)

Described above in the Current Behavior section.

Steps to Reproduce

  1. The setup is based on the NSM v1.11.2 basic kernel-to-ethernet-to-kernel example with 4 worker nodes.
    There is an NSC on each node and 2 NSEs (one for IPv4 and one for IPv6).

  2. Started traffic between the NSCs and NSEs.

  3. An NSC and a forwarder-vpp were restarted. After both came up again, there was an additional interface in the NSE (nse-ipv4-c9cd8cf77-gwnth) besides the original 4 (a quick way to check this is sketched below).
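
For reference, a quick way to check this from outside the pod (the pod name, the nsm-ep namespace and the icmp-respo interface prefix are taken from our setup; adjust them as needed):

    # each NSM interface created in the NSE shows up as icmp-respo-<id>
    kubectl exec -n nsm-ep nse-ipv4-c9cd8cf77-gwnth -- ip -o link show | grep -c icmp-respo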

Context

  • Kubernetes Version: 1.27.1
  • NSM Version: v1.11.2 / v1.12.1-rc.1

Failure Logs

logs.tar.gz

@glazychev-art glazychev-art moved this from Todo to In Progress in Release v1.13.0 Mar 18, 2024
@glazychev-art glazychev-art moved this from In Progress to Blocked in Release v1.13.0 Mar 19, 2024
@edwarnicke
Member

Related to networkservicemesh/sdk#1020

@szvincze
Contributor Author

@edwarnicke, @denis-tingaikin: I got the information that the extra allocated IPs are not cleaned up even after 10 minutes.

@denis-tingaikin
Member

@szvincze This is not expected and looks unhealthy, so the problem will be looked into. Also, it would be nice to have logs from the NSE right after the request and again 10 minutes after the forwarder died, if possible (for example, collected as sketched below).
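
For example, something along these lines would be enough (pod name and namespace are placeholders):

    kubectl logs -n nsm-ep <nse-pod> > nse-right-after-request.log
    # ...wait roughly 10 minutes after the forwarder died...
    kubectl logs -n nsm-ep <nse-pod> > nse-10-minutes-later.log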

@denis-tingaikin denis-tingaikin added the bug Something isn't working label Apr 25, 2024
@denis-tingaikin denis-tingaikin moved this from Blocked to Todo in Release v1.14.0 Apr 25, 2024
@Ex4amp1e Ex4amp1e moved this from Todo to In Progress in Release v1.14.0 Jul 25, 2024
@Ex4amp1e
Contributor

I have tried several cluster configurations on Kubernetes 1.27.1 with NSM v1.13.2.
Some results:
1 NSE, 1 NSC

  • local - cleaned up on restart
  • remote - reproduced the original issue, but the interface was cleaned up after the timeout

1 NSC, 2 NSEs

  • local - cleaned up on restart
  • remote - cleaned up on restart

2 NSCs, 2 NSEs

  • local - cleaned up on restart
  • remote - cleaned up on restart

So, the most similar steps for me were:

  1. Set up 1 NSC and 1 NSE on different nodes
  2. Restart the forwarder on the same node as the client
  3. Right after that, restart the NSC
  4. Check the NSE interfaces (a command sketch follows below)
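
The restart sequence I used looks roughly like this (pod names are placeholders; the namespaces follow the layout from the logs in this issue, so adjust them to your deployment):

    # forwarder running on the client's node
    kubectl delete pod -n nsm forwarder-vpp-xxxxx
    # right after that, restart the client
    kubectl delete pod -n nsm-ep nsc-xxxxx
    # then inspect the endpoint's interfaces and addresses
    kubectl exec -n nsm-ep nse-xxxxx -- ip -o addr show scope global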

As a result, there was an additional interface on the NSE, but there were no additional refresh requests for it and it was cleaned up after 10 minutes. That looks like expected behavior.

So additional steps are required to reproduce the case where the leaked interface is not cleaned up after 10 minutes.

@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.14.0 Jul 30, 2024
@szvincze
Contributor Author

@Ex4amp1e: I have checked the current situation with my colleagues to see whether the same behavior can be reproduced with the NSM v1.13.2 release. It seems the leaking interface is always removed within 10 minutes. We will have a meeting to clarify and validate the test cases used. I will come back with feedback soon, so please keep this issue open until then.

@szvincze
Contributor Author

Hi @Ex4amp1e,
I have checked the situation with my colleagues. It seems there are cases where everything works as expected and the extra interface is deleted within 10 minutes, but they have still seen occurrences where the interface remains.
The last one they mentioned was when the NSE and registry-k8s pods were restarted. No particular trick was needed, just running this test several times.
We are checking the frequency of the occurrences and I will share it soon.

@szvincze
Contributor Author

As far as the test results show, in the vast majority of the successful reproductions the restarted components were registry-k8s and the NSE; in a few cases registry-k8s and the NSC.

@Ex4amp1e
Contributor

Hi @szvincze,
I tried restarting the NSE and registry-k8s pods many times and haven't reproduced the issue. Also, restarting the NSE/registry is more about healing; are you sure that in such a case we are getting an interface leak?
We need to understand the overall use case and get more details about the reproduction; maybe we are missing some preconditions, or everything is going as expected.
We are still blocked, so could you please provide more details:

  1. exact steps to reproduce
  2. observed behaviour
  3. expected behaviour

@szvincze
Contributor Author

Here I attach the reproduction, the respective logs and an analysis of the issue.
Note that the tests include several different component restart combinations, but in our experience the failing cases happened when registry-k8s was one of the restarted components and it was the first one in the row.

OK 	- "After 10 min, TC shows the expected num of interfaces"
FAIL 	- "After 10 min, TC num of interfaces still doesn't match"

Test #1363:
robustness-multiple-component-restart.sh nse + registry-k8s 		FAIL
robustness-multiple-component-restart.sh nsc + forwarder-vpp		OK
robustness-multiple-component-restart.sh nsc + nsmgr			OK

Investigation of "nse + registry-k8s FAIL" scenario

The test environment had 6 worker nodes, "n4" to "n9":

NAME                                 STATUS   ROLES                  AGE    VERSION   INTERNAL-IP
pool1-n185-vpod4-pool1-n4            Ready    worker                 325d   v1.27.1   10.0.40.104
pool1-n185-vpod4-pool1-n5            Ready    worker                 325d   v1.27.1   10.0.40.105
pool1-n185-vpod4-pool1-n6            Ready    worker                 325d   v1.27.1   10.0.40.103
pool1-n185-vpod4-pool1-n7            Ready    worker                 325d   v1.27.1   10.0.40.107
pool1-n185-vpod4-pool1-n8            Ready    worker                 325d   v1.27.1   10.0.40.106
pool1-n185-vpod4-pool1-n9            Ready    worker                 318d   v1.27.1   10.0.40.108

Registry-k8s had 2 replicas, running on nodes "n4" and "n8":

[2024-08-28T02:21:36.242Z] NAME                               READY   STATUS    RESTARTS         AGE    IP                NODE
[2024-08-28T02:21:36.242Z] registry-k8s-cc57559db-h7cqw       2/2     Running   1 (20m ago)      58m    192.168.167.28    pool1-n185-vpod4-pool1-n8
[2024-08-28T02:21:36.242Z] registry-k8s-cc57559db-pt5gj       2/2     Running   1 (23m ago)      58m    192.168.172.49    pool1-n185-vpod4-pool1-n4

There were 6 NSC and 2 NSE endpoints running:

[2024-08-28T02:21:36.497Z] NAME                        READY   STATUS    RESTARTS      AGE   IP                NODE
[2024-08-28T02:21:36.497Z] nsc-8dbb45d97-9d9wp         1/1     Running   0             57m   192.168.172.76    pool1-n185-vpod4-pool1-n6
[2024-08-28T02:21:36.497Z] nsc-8dbb45d97-hxv9s         1/1     Running   0             57m   192.168.5.86      pool1-n185-vpod4-pool1-n9
[2024-08-28T02:21:36.497Z] nsc-8dbb45d97-mbh92         1/1     Running   1 (23m ago)   57m   192.168.172.54    pool1-n185-vpod4-pool1-n4
[2024-08-28T02:21:36.497Z] nsc-8dbb45d97-nhstj         1/1     Running   1 (20m ago)   57m   192.168.167.55    pool1-n185-vpod4-pool1-n8
[2024-08-28T02:21:36.497Z] nsc-8dbb45d97-spgkc         1/1     Running   0             57m   192.168.71.242    pool1-n185-vpod4-pool1-n5
[2024-08-28T02:21:36.497Z] nsc-8dbb45d97-vl9zq         1/1     Running   0             57m   192.168.235.18    pool1-n185-vpod4-pool1-n7
[2024-08-28T02:21:36.497Z] nse-ipv4-77c55d5b94-s8r87   1/1     Running   1 (28m ago)   57m   192.168.172.113   pool1-n185-vpod4-pool1-n6
[2024-08-28T02:21:36.497Z] nse-ipv6-6ccb65cdf7-djvvt   1/1     Running   1 (25m ago)   57m   192.168.167.15    pool1-n185-vpod4-pool1-n8

One NSE and one registry-k8s pod running on the same node were chosen:

[2024-08-28T02:21:37.315Z] robustness-multiple-component-restart.sh: The following pods will be killed: 
nse-ipv6-6ccb65cdf7-djvvt and registry-k8s-cc57559db-h7cqw which are located on the node: pool1-n185-vpod4-pool1-n8

An NSC was also running on the same node, "n8":

[2024-08-28T02:21:37.315Z] robustness-multiple-component-restart.sh: These endpoint pods share the same worker:
[2024-08-28T02:21:37.315Z] nsc-8dbb45d97-nhstj
[2024-08-28T02:21:37.315Z] nse-ipv6-6ccb65cdf7-djvvt


[2024-08-28T02:21:37.315Z] robustness-multiple-component-restart.sh: Starting traffic
[2024-08-28T02:21:38.640Z] robustness-multiple-component-restart.sh: IPv6 server IP in nsc-8dbb45d97-9d9wp: 100:100::7
[2024-08-28T02:21:39.199Z] robustness-multiple-component-restart.sh: kubectl exec -n nsm-ep nsc-8dbb45d97-hxv9s -- ip -o addr show dev nsm-ipv4 scope global
[2024-08-28T02:21:39.474Z] robustness-multiple-component-restart.sh: kubectl exec -n nsm-ep nsc-8dbb45d97-hxv9s -- ip -o addr show dev nsm-ipv6 scope global
[2024-08-28T02:21:39.749Z] robustness-multiple-component-restart.sh: IPv4 server IP in nsc-8dbb45d97-hxv9s: 172.16.1.99
[2024-08-28T02:21:40.314Z] robustness-multiple-component-restart.sh: IPv6 server IP in nsc-8dbb45d97-hxv9s: 100:100::1
[2024-08-28T02:21:42.502Z] robustness-multiple-component-restart.sh: kubectl exec -n nsm-ep nsc-8dbb45d97-nhstj -- ip -o addr show dev nsm-ipv4 scope global
[2024-08-28T02:21:42.757Z] robustness-multiple-component-restart.sh: kubectl exec -n nsm-ep nsc-8dbb45d97-nhstj -- ip -o addr show dev nsm-ipv6 scope global
[2024-08-28T02:21:43.012Z] robustness-multiple-component-restart.sh: IPv4 server IP in nsc-8dbb45d97-nhstj: 172.16.1.101
[2024-08-28T02:21:43.571Z] robustness-multiple-component-restart.sh: IPv6 server IP in nsc-8dbb45d97-nhstj: 100:100::1
[2024-08-28T02:21:45.316Z] robustness-multiple-component-restart.sh: IPv6 server IP in nsc-8dbb45d97-spgkc: 100:100::5
[2024-08-28T02:21:46.946Z] robustness-multiple-component-restart.sh: IPv6 server IP in nsc-8dbb45d97-vl9zq: 100:100::3
...
[2024-08-28T02:21:48.429Z] robustness-multiple-component-restart.sh: sending traffic to nsc-8dbb45d97-9d9wp(100:100::7) from nse-ipv6-6ccb65cdf7-djvvt
[2024-08-28T02:21:48.989Z] robustness-multiple-component-restart.sh: sending traffic to nsc-8dbb45d97-hxv9s(172.16.1.99) from nse-ipv4-77c55d5b94-s8r87
[2024-08-28T02:21:49.549Z] robustness-multiple-component-restart.sh: sending traffic to nsc-8dbb45d97-hxv9s(100:100::1) from nse-ipv6-6ccb65cdf7-djvvt
[2024-08-28T02:21:51.596Z] robustness-multiple-component-restart.sh: sending traffic to nsc-8dbb45d97-nhstj(172.16.1.101) from nse-ipv4-77c55d5b94-s8r87
[2024-08-28T02:21:52.156Z] robustness-multiple-component-restart.sh: sending traffic to nsc-8dbb45d97-nhstj(100:100::1) from nse-ipv6-6ccb65cdf7-djvvt
[2024-08-28T02:21:53.641Z] robustness-multiple-component-restart.sh: sending traffic to nsc-8dbb45d97-spgkc(100:100::5) from nse-ipv6-6ccb65cdf7-djvvt
[2024-08-28T02:21:55.125Z] robustness-multiple-component-restart.sh: sending traffic to nsc-8dbb45d97-vl9zq(100:100::3) from nse-ipv6-6ccb65cdf7-djvvt
...
[2024-08-28T02:21:55.685Z] ctraffic: starting(nsc-8dbb45d97-9d9wp_nsc_86b): ./ctraffic -address [100:100::7]:5003 -server
[2024-08-28T02:21:55.940Z] ctraffic: server(nsc-8dbb45d97-9d9wp_nsc_86b): 2024/08/28 02:21:55 Listen on address;  [100:100::7]:5003
[2024-08-28T02:21:55.940Z] ctraffic: starting(nsc-8dbb45d97-hxv9s_nsc_f43): ./ctraffic -address 172.16.1.99:5003 -server
[2024-08-28T02:21:55.940Z] ctraffic: starting(nsc-8dbb45d97-hxv9s_nsc_1d0): ./ctraffic -address [100:100::1]:5003 -server
[2024-08-28T02:21:56.195Z] ctraffic: server(nsc-8dbb45d97-hxv9s_nsc_f43): 2024/08/28 02:21:55 Listen on address;  172.16.1.99:5003
[2024-08-28T02:21:56.195Z] ctraffic: server(nsc-8dbb45d97-hxv9s_nsc_1d0): 2024/08/28 02:21:56 Listen on address;  [100:100::1]:5003
[2024-08-28T02:21:56.195Z] ctraffic: starting(nsc-8dbb45d97-nhstj_nsc_b2b): ./ctraffic -address 172.16.1.101:5003 -server
[2024-08-28T02:21:56.449Z] ctraffic: starting(nsc-8dbb45d97-nhstj_nsc_01a): ./ctraffic -address [100:100::1]:5003 -server
[2024-08-28T02:21:56.449Z] ctraffic: server(nsc-8dbb45d97-nhstj_nsc_b2b): 2024/08/28 02:21:56 Listen on address;  172.16.1.101:5003
[2024-08-28T02:21:56.449Z] ctraffic: server(nsc-8dbb45d97-nhstj_nsc_01a): 2024/08/28 02:21:56 Listen on address;  [100:100::1]:5003
[2024-08-28T02:21:56.704Z] ctraffic: starting(nsc-8dbb45d97-vl9zq_nsc_2c7): ./ctraffic -address [100:100::3]:5003 -server
[2024-08-28T02:21:56.704Z] ctraffic: starting(nsc-8dbb45d97-spgkc_nsc_5a0): ./ctraffic -address [100:100::5]:5003 -server
[2024-08-28T02:21:56.704Z] ctraffic: server(nsc-8dbb45d97-spgkc_nsc_5a0): 2024/08/28 02:21:56 Listen on address;  [100:100::5]:5003
[2024-08-28T02:21:56.961Z] ctraffic: server(nsc-8dbb45d97-vl9zq_nsc_2c7): 2024/08/28 02:21:56 Listen on address;  [100:100::3]:5003
[2024-08-28T02:21:57.473Z] ctraffic: starting(nse-ipv6-6ccb65cdf7-djvvt_nse_30c): ./ctraffic -address [100:100::7]:5003 -nconn 100 -rate 50 -timeout 240s -monitor -stats all
[2024-08-28T02:21:57.727Z] ctraffic: starting(nse-ipv6-6ccb65cdf7-djvvt_nse_070): ./ctraffic -address [100:100::1]:5003 -nconn 100 -rate 50 -timeout 240s -monitor -stats all
[2024-08-28T02:21:57.982Z] ctraffic: starting(nse-ipv4-77c55d5b94-s8r87_nse_415): ./ctraffic -address 172.16.1.101:5003 -nconn 100 -rate 50 -timeout 240s -monitor -stats all
[2024-08-28T02:21:58.237Z] ctraffic: starting(nse-ipv4-77c55d5b94-s8r87_nse_32e): ./ctraffic -address 172.16.1.101:5003 -nconn 100 -rate 50 -timeout 240s -monitor -stats all
[2024-08-28T02:21:58.497Z] ctraffic: starting(nse-ipv6-6ccb65cdf7-djvvt_nse_0e7): ./ctraffic -address [100:100::1]:5003 -nconn 100 -rate 50 -timeout 240s -monitor -stats all
[2024-08-28T02:21:58.752Z] ctraffic: starting(nse-ipv6-6ccb65cdf7-djvvt_nse_6df): ./ctraffic -address [100:100::5]:5003 -nconn 100 -rate 50 -timeout 240s -monitor -stats all
[2024-08-28T02:21:59.006Z] ctraffic: starting(nse-ipv6-6ccb65cdf7-djvvt_nse_53e): ./ctraffic -address [100:100::3]:5003 -nconn 100 -rate 50 -timeout 240s -monitor -stats all

Kill the chosen pods at the same time:

[2024-08-28T02:22:04.637Z] robustness-multiple-component-restart.sh: kubectl exec nse-ipv6-6ccb65cdf7-djvvt -n nsm-ep -c nse -- kill 1
[2024-08-28T02:22:05.148Z] robustness-multiple-component-restart.sh: kubectl exec registry-k8s-cc57559db-h7cqw -n nsm -c registry-k8s -- bash -c 'kill -9 14'
[2024-08-28T02:22:05.403Z] robustness-multiple-component-restart.sh: kubectl wait --for=condition=ready --timeout=3m pod -n nsm-ep -l endpoint-type=nse
[2024-08-28T02:22:06.679Z] robustness-multiple-component-restart.sh: kubectl wait --for=condition=ready --timeout=3m pod -n nsm -l app=registry-k8s

The NSE was back up 1 second after the kill:

[2024-08-28T02:22:05.658Z] robustness-multiple-component-restart.sh: nse-ipv6-6ccb65cdf7-djvvt latest container was started at null
[2024-08-28T02:22:06.679Z] robustness-multiple-component-restart.sh: nse-ipv6-6ccb65cdf7-djvvt latest container was started at 2024-08-28T02:22:05Z

Registry-k8s was back up 6 seconds after the kill:

[2024-08-28T02:22:11.872Z] pod/registry-k8s-cc57559db-h7cqw condition met
[2024-08-28T02:22:11.872Z] pod/registry-k8s-cc57559db-pt5gj condition met
[2024-08-28T02:22:11.872Z] robustness-multiple-component-restart.sh: registry-k8s-cc57559db-h7cqw latest container was started at 2024-08-28T02:22:05Z

Traffic stopped smoothly:

[2024-08-28T02:22:11.872Z] robustness-multiple-component-restart.sh: Stopping traffic due to NSE/NSC was rebooted, currently the evaluation is not advanced enough


[2024-08-28T02:22:19.125Z] robustness-multiple-component-restart.sh: Executing "check_num_of_endpoint_IPs nsm-ep 2>&1" until return is 0 or 150 second passed
[2024-08-28T02:25:10.516Z] robustness-multiple-component-restart.sh: WARNING: Time is up, giving up (after 11 retries)


[2024-08-28T02:37:38.499Z] DEBUG: nse-ipv6-6ccb65cdf7-djvvt
[2024-08-28T02:37:38.499Z] 100:100::
[2024-08-28T02:37:38.499Z] 100:100::2
[2024-08-28T02:37:38.499Z] 100:100::
[2024-08-28T02:37:38.499Z] 100:100::4
[2024-08-28T02:37:38.499Z] 100:100::2
[2024-08-28T02:37:38.499Z] 100:100::6
[2024-08-28T02:37:38.499Z] 100:100::4
[2024-08-28T02:37:38.499Z] 100:100::8
[2024-08-28T02:37:38.499Z] 100:100::a
[2024-08-28T02:37:38.499Z] 100:100::6


[2024-08-28T02:25:10.516Z] robustness-multiple-component-restart.sh: Errors detected:
[2024-08-28T02:25:10.516Z] nse-ipv6-6ccb65cdf7-djvvt has not the same number of IPs as the number of NSCs (6)

The following shows which IP was added to which interface.
interfaces_nsm-ep_nse-ipv6-6ccb65cdf7-djvvt.txt

3: eth0@if2518: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default 
    link/ether 5e:31:de:48:97:64 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.167.15/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd00:eccd:18:ffff:6c5e:414c:2f14:a70f/128 scope global 
       valid_lft forever preferred_lft forever
46: icmp-respo-9083: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:38:33:ea:99 brd ff:ff:ff:ff:ff:ff
    inet6 100:100::/128 scope global nodad 
       valid_lft forever preferred_lft forever
47: icmp-respo-87f4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:10:63:51:9e brd ff:ff:ff:ff:ff:ff
    inet6 100:100::2/128 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 100:100::/128 scope global nodad 
       valid_lft forever preferred_lft forever
48: icmp-respo-0c07: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:fc:c6:63:4f brd ff:ff:ff:ff:ff:ff
    inet6 100:100::4/128 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 100:100::2/128 scope global nodad 
       valid_lft forever preferred_lft forever
49: icmp-respo-7315: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:de:83:d8:12 brd ff:ff:ff:ff:ff:ff
    inet6 100:100::6/128 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 100:100::4/128 scope global nodad 
       valid_lft forever preferred_lft forever
50: icmp-respo-54b6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:d6:25:5a:87 brd ff:ff:ff:ff:ff:ff
    inet6 100:100::8/128 scope global nodad 
       valid_lft forever preferred_lft forever
51: icmp-respo-d2d2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:84:03:76:9d brd ff:ff:ff:ff:ff:ff
    inet6 100:100::a/128 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 100:100::6/128 scope global nodad 
       valid_lft forever preferred_lft forever

The expectation is to have 6 IP addresses in total, as indicated in the error.

Part 2:

The following NSCs also failed; their interface configurations are shown below:

NSC #1
ctraffic: nse-ipv6-6ccb65cdf7-djvvt_nse_30c was not running
[2024-08-28T02:37:38.499Z] DEBUG: nsc-8dbb45d97-9d9wp
[2024-08-28T02:37:38.499Z] 172.16.1.97
[2024-08-28T02:37:38.499Z] 100:100::b
[2024-08-28T02:37:38.499Z] 100:100::7
[2024-08-28T02:25:10.516Z] nsc-8dbb45d97-9d9wp has not the same number of IPs as the number of NSEs (2)

pnes/endpoint_pod_info//interfaces_nsm-ep_nsc-8dbb45d97-9d9wp_nsc.txt
3: eth0@if8538: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default 
    link/ether be:ec:c2:ca:a5:93 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.172.76/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd00:eccd:18:ffff:552e:f937:bfc6:3cee/128 scope global 
       valid_lft forever preferred_lft forever
13: nsm-ipv4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:4e:43:00:59 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.97/32 scope global nsm-ipv4
       valid_lft forever preferred_lft forever
16: nsm-ipv6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:58:8f:be:41 brd ff:ff:ff:ff:ff:ff
    inet6 100:100::b/128 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 100:100::7/128 scope global nodad 
       valid_lft forever preferred_lft forever

On the nsm-ipv6 interface, address 100:100::7 was receiving traffic from the restarted NSE nse-ipv6-6ccb65cdf7-djvvt.

NSC #2

[2024-08-28T02:37:38.499Z] DEBUG: nsc-8dbb45d97-hxv9s
[2024-08-28T02:37:38.499Z] 172.16.1.99
[2024-08-28T02:37:38.499Z] 100:100::3
[2024-08-28T02:37:38.499Z] 100:100::1
[2024-08-28T02:25:10.516Z] nsc-8dbb45d97-hxv9s has not the same number of IPs as the number of NSEs (2)

pnes/endpoint_pod_info/interfaces_nsm-ep_nsc-8dbb45d97-hxv9s_nsc.txt
3: eth0@if22674: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default 
    link/ether fa:eb:95:7c:33:ac brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.5.86/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd00:eccd:18:ffff:7a1b:ca35:be07:141b/128 scope global 
       valid_lft forever preferred_lft forever
14: nsm-ipv4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:24:98:49:19 brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.99/32 scope global nsm-ipv4
       valid_lft forever preferred_lft forever
17: nsm-ipv6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:d0:db:49:6d brd ff:ff:ff:ff:ff:ff
    inet6 100:100::3/128 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 100:100::1/128 scope global nodad 
       valid_lft forever preferred_lft forever

On the nsm-ipv6 interface, address 100:100::1 was receiving traffic from the restarted NSE nse-ipv6-6ccb65cdf7-djvvt.

NSC #3

[2024-08-28T02:37:38.499Z] DEBUG: nsc-8dbb45d97-spgkc
[2024-08-28T02:37:38.499Z] 172.16.1.107
[2024-08-28T02:37:38.499Z] 100:100::7
[2024-08-28T02:37:38.499Z] 100:100::5
[2024-08-28T02:25:10.516Z] nsc-8dbb45d97-spgkc has not the same number of IPs as the number of NSEs (2)

pnes/endpoint_pod_info/interfaces_nsm-ep_nsc-8dbb45d97-spgkc_nsc.txt
3: eth0@if332: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default 
    link/ether 36:cc:81:b5:81:2e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.71.242/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd00:eccd:18:ffff:4ab8:5e1b:ecf1:cbb2/128 scope global 
       valid_lft forever preferred_lft forever
14: nsm-ipv4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:04:c6:94:7a brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.107/32 scope global nsm-ipv4
       valid_lft forever preferred_lft forever
17: nsm-ipv6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1446 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:88:34:a0:80 brd ff:ff:ff:ff:ff:ff
    inet6 100:100::7/128 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 100:100::5/128 scope global nodad 
       valid_lft forever preferred_lft forever

On the nsm-ipv6 interface, address 100:100::5 was receiving traffic from the restarted NSE nse-ipv6-6ccb65cdf7-djvvt.

NSC #4

[2024-08-28T02:37:38.499Z] DEBUG: nsc-8dbb45d97-vl9zq
[2024-08-28T02:37:38.499Z] 172.16.1.103
[2024-08-28T02:37:38.499Z] 100:100::5
[2024-08-28T02:37:38.499Z] 100:100::3
[2024-08-28T02:25:10.516Z] nsc-8dbb45d97-vl9zq has not the same number of IPs as the number of NSEs (2)

pnes/endpoint_pod_info/interfaces_nsm-ep_nsc-8dbb45d97-vl9zq_nsc.txt
3: eth0@if9490: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default 
    link/ether ba:da:a7:f3:e6:65 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 192.168.235.18/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fd00:eccd:18:ffff:10a:1ba0:a209:2fa8/128 scope global 
       valid_lft forever preferred_lft forever
14: nsm-ipv4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:a2:86:b9:eb brd ff:ff:ff:ff:ff:ff
    inet 172.16.1.103/32 scope global nsm-ipv4
       valid_lft forever preferred_lft forever
17: nsm-ipv6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8946 qdisc mq state UNKNOWN group default qlen 1000
    link/ether 02:fe:5c:47:e1:4a brd ff:ff:ff:ff:ff:ff
    inet6 100:100::5/128 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 100:100::3/128 scope global nodad 
       valid_lft forever preferred_lft forever

On the nsm-ipv6 interface, address 100:100::3 was receiving traffic from the restarted NSE nse-ipv6-6ccb65cdf7-djvvt.

SUMMARY:

On the NSC side, these were the interface addresses that were expected to be removed after the NSE (nse-ipv6-6ccb65cdf7-djvvt) went down:

nsc-8dbb45d97-hxv9s    100:100::1
nsc-8dbb45d97-vl9zq    100:100::3
nsc-8dbb45d97-spgkc    100:100::5
nsc-8dbb45d97-9d9wp    100:100::7

On the NSE side, the expected interface addresses were the following (taken from a successful result in a previous test case):

[2024-08-28T01:24:15.428Z] 100:100::
[2024-08-28T01:24:15.428Z] 100:100::2
[2024-08-28T01:24:15.428Z] 100:100::4
[2024-08-28T01:24:15.428Z] 100:100::6
[2024-08-28T01:24:15.428Z] 100:100::8
[2024-08-28T01:24:15.428Z] 100:100::a

But the IPv6 addresses were neither removed nor replaced.

nse-ipv6-6ccb65cdf7-djvvt ifconfig:
46: icmp-respo-9083: ( 100:100::               )
47: icmp-respo-87f4: ( 100:100::2  100:100::   )
48: icmp-respo-0c07: ( 100:100::4  100:100::2  )
49: icmp-respo-7315: ( 100:100::6  100:100::4  )
50: icmp-respo-54b6: ( 100:100::8               )
51: icmp-respo-d2d2: ( 100:100::a  100:100::6  )

@NikitaSkrynnik NikitaSkrynnik moved this from Blocked to In Progress in Release v1.14.0 Sep 17, 2024
@Ex4amp1e
Contributor

@szvincze Hello, we have gotten some questions:

  1. Is it correct that before the restart the 2 NSEs had 12 interfaces in total, and after it there were still 12?
  2. Is the datapath working correctly for all 12 interfaces?

@szvincze
Contributor Author

In the test case, we had only 1 NSE with 2 network interfaces (IPv4 and IPv6) and 6 NSCs.

So you are right, the number of interfaces was the same, but the number of IP addresses on the interfaces was not correct. The 4 NSCs that were interacting with the restarted NSE had 3 addresses instead of the expected 2 (the number of NSEs).

The datapath worked properly during the test, without interruptions or packet drops.

@Ex4amp1e
Contributor

@szvincze Hello,
Could you answer a few new questions:

  1. Are you facing any problems in the current configuration?
  2. If you want to avoid such behaviour, could you try to pass the environment variable NSM_IPAM_POLICY=strict to the NSE (a minimal sketch follows below)? See the example here: https://github.com/networkservicemesh/deployments-k8s/pull/12259/files#diff-0b132d08281a44a9d5d126bb154725aa05b4a1057c07158fdba858653c513c7cR31-R32
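
A minimal way to apply it, assuming the NSEs run as Deployments named nse-ipv4 and nse-ipv6 in the nsm-ep namespace (adjust the names to your manifests):

    kubectl -n nsm-ep set env deployment/nse-ipv4 NSM_IPAM_POLICY=strict
    kubectl -n nsm-ep set env deployment/nse-ipv6 NSM_IPAM_POLICY=strict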

@szvincze
Contributor Author

Hi @Ex4amp1e,
As far as I know there is no other problem right now, just the discrepancy in the number of addresses on the interfaces. We will give the IPAM policy setting a try and come back with feedback soon.

@Ex4amp1e Ex4amp1e moved this to In Progress in Release v1.14.1 Sep 24, 2024
@denis-tingaikin denis-tingaikin moved this from In Progress to Moved to next release in Release v1.14.0 Sep 24, 2024
@Ex4amp1e
Contributor

Ex4amp1e commented Oct 1, 2024

In the scope of local tests, we have found bugs that blocked us:

  1. Group IPAM conflicts with strict policy cmd-nse-icmp-responder#614
  2. Race condition in client exclude prefixes sdk#1674

@Ex4amp1e
Contributor

Hi @szvincze,
Please use the ex4ample/nse:issue-11371 image for testing; the strict IPAM policy setting is still expected to be configured.

@Ex4amp1e
Contributor

@szvincze Hello
FYI, the image for testing can be taken from v1.14.1-rc.2: ghcr.io/networkservicemesh/ci/cmd-nse-icmp-responder:3ad4703

@Ex4amp1e Ex4amp1e moved this from Blocked to Under Review in Release v1.14.1 Oct 18, 2024
@szvincze
Contributor Author

szvincze commented Nov 5, 2024

Hi @Ex4amp1e,
Is this solution included in the NSM v1.14.1 release? Our tests with that version and the strict IPAM policy still show the problem.

@Ex4amp1e
Contributor

Ex4amp1e commented Nov 5, 2024

Hi @szvincze,
Yes, it is included. Cluster logs are required for further investigation.

@szvincze
Contributor Author

szvincze commented Nov 6, 2024

Here I add some details about the test and attached the logs.

The following pods were killed: nse-ipv4-6957646495-t8xxg and nsmgr-sxx5d, which were located on the node: ceph-n185-vpod2-ceph-n4.

[2024-11-05T10:01:38.411Z] kubectl exec nse-ipv4-6957646495-t8xxg -n nsm-ep -c nse -- kill 1
[2024-11-05T10:01:38.921Z] kubectl exec nsmgr-sxx5d -n nsm -c nsmgr -- pkill -9 nsmgr

Containers were restarted at [2024-11-05T10:01:40.762Z]

  • Container was started at 2024-11-05T10:01:39Z in nse-ipv4-6957646495-t8xxg
  • Container was started at 2024-11-05T10:01:40Z in nsmgr-sxx5d
[2024-11-05T10:01:44.441Z] Restarted pods are ready again

[2024-11-05T10:04:26.617Z] Evaluation:
DEBUG: nse-ipv4-6957646495-t8xxg
172.16.1.98
172.16.1.102
172.16.1.98
172.16.1.102
172.16.1.100
172.16.1.96

After 10 minutes the check gave the same result.

[2024-11-05T10:05:48.011Z] Let's see if the number of interfaces normalizes after some time or not
[2024-11-05T10:05:48.011Z] Executing "check_num_of_endpoint_IPs endpoint nsm-ep 2>&1" until return is 0 or 600 second passed

[2024-11-05T10:16:09.597Z] Evaluation:
DEBUG: nse-ipv4-6957646495-t8xxg
172.16.1.98
172.16.1.102
172.16.1.98
172.16.1.102
172.16.1.100
172.16.1.96

We found 6 IP addresses instead of 4. (For reference, the retry wrapper around this check is sketched below.)
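
For reference, the wrapper around that check is essentially the following retry loop (an illustrative reconstruction; check_num_of_endpoint_IPs is our own helper that compares the endpoint's address count with the number of NSCs):

    retry_until_ok() {
        # retry "$@" until it returns 0 or the timeout (in seconds) expires
        local timeout=$1; shift
        local deadline=$((SECONDS + timeout))
        until "$@"; do
            if (( SECONDS >= deadline )); then
                echo "WARNING: Time is up, giving up"
                return 1
            fi
            sleep 15
        done
    }
    retry_until_ok 600 check_num_of_endpoint_IPs endpoint nsm-ep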

@Ex4amp1e
Contributor

Hello @szvincze,

We previously tested three restart scenarios:

  1. nse + registry-k8s
  2. nsc + forwarder-vpp
  3. nsc + nsmgr

In your last comment you brought up a new combination: nse + nsmgr. That makes sense to me, because it uncovers new problems.

So we have prepared a full component testing plan to verify the restart functionality for all NSM components:

#   Test Case             Status
1   nse + registry-k8s    OK
2   nse + nsmgr           FAIL
3   nse + forwarder-vpp   UNKNOWN
4   nsc + registry-k8s    UNKNOWN
5   nsc + nsmgr           OK
6   nsc + forwarder-vpp   OK

Could you please review/approve the proposed testing plan for component restarts?

@szvincze
Contributor Author

Hi @Ex4amp1e,

In the original post I mentioned "different combinations" of NSM components; however, the attached logs came from only some of them. Let me check with my colleagues whether the test scope covers our cases. Also, I would like to review the latest results, since recently we have been talking about additional IP addresses rather than interfaces.

@szvincze
Contributor Author

Hi @Ex4amp1e,

Here I paste the complete list of the combinations tested on our side:

# Test case
1 NSE + NSMgr
2 NSC + registry-k8s
3 spire-server + spire-agent
4 NSMgr + forwarder-vpp
5 NSE + registry-k8s
6 NSC + spire-agent
7 NSC + forwarder-vpp
8 NSE + spire-agent
9 registry-k8s + NSMgr
10 NSE + forwarder-vpp
11 NSMgr + spire-agent
12 registry-k8s + forwarder-vpp
13 registry-k8s + spire-agent
14 NSC + NSMgr

@Ex4amp1e
Contributor

Current plan: test provided scenarios and provide details

@Ex4amp1e
Contributor

Hi @szvincze,

Could you provide details: were interfaces or IP addresses leaked in the latest run results?

[2024-11-05T10:16:09.597Z] Evaluation:
DEBUG: nse-ipv4-6957646495-t8xxg
172.16.1.98
172.16.1.102
172.16.1.98
172.16.1.102
172.16.1.100
172.16.1.96

@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.15.0 Nov 25, 2024
@szvincze
Contributor Author

Hi @Ex4amp1e,

Could you provide details: were interfaces or IP addresses leaked in the latest run results?

With the recent NSM versions, after several fixes, only IP addresses. (The distinction we check is sketched below.)
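
The distinction we check is roughly the following (the pod name is taken from the last run; icmp-respo is how the NSE names its NSM interfaces):

    # number of NSM interfaces (links) on the endpoint - this count is correct now
    kubectl exec -n nsm-ep nse-ipv4-6957646495-t8xxg -- ip -o link show | grep -c icmp-respo
    # number of global-scope addresses on those interfaces - this is where the leak shows up
    kubectl exec -n nsm-ep nse-ipv4-6957646495-t8xxg -- ip -o addr show scope global | grep -c icmp-respo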

@NikitaSkrynnik NikitaSkrynnik moved this from Blocked to In Progress in Release v1.15.0 Nov 27, 2024
@Ex4amp1e
Contributor

Ex4amp1e commented Nov 27, 2024

Hi @szvincze,
We have found that in the latest provided logs there are forwarder restarts during the test run; is that expected? Can any other component restart influence the test results?
Also, is it possible to attach the full test script that you are using? This would help a lot to avoid misunderstandings during log analysis.

Current statistics of our reproduction attempts:

#    Test Case                      Runs
1    NSE + NSMgr                    500+
2    NSC + registry-k8s             5
3    spire-server + spire-agent     5
4    NSMgr + forwarder-vpp          5
5    NSE + registry-k8s             5
6    NSC + spire-agent              5
7    NSC + forwarder-vpp            5
8    NSE + spire-agent              5
9    registry-k8s + NSMgr           5
10   NSE + forwarder-vpp            5
11   NSMgr + spire-agent            5
12   registry-k8s + forwarder-vpp   5
13   registry-k8s + spire-agent     5
14   NSC + NSMgr                    5

We want to mention that in your NSE + NSMgr test case the forwarders were also restarted, so it is effectively a 3-component restart, while the provided test case list contains only 2-component combinations.

@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.15.0 Nov 28, 2024