After component restart there are more interfaces in NSE than expected #11371
Related to networkservicemesh/sdk#1020
@edwarnicke, @denis-tingaikin: I got the information that the extra allocated IPs are not cleaned up even after 10 minutes.
@szvincze It is not expected and looks unhealthy, so we will look into the problem. Also, it would be nice to have logs from the NSE right after the request and again 10 minutes after the forwarder died, if possible.
I have tried several cluster configurations on Kubernetes v1.27.1 with NSM v1.13.2:
1 NSC 2 NSE
2 NSC 2 NSE
So, the closest reproduction steps for me were:
As a result, there was an additional interface on the NSE, but there were no additional refresh requests for it and it was cleaned up after 10 minutes, which looks like expected behavior. So, we need additional steps to reproduce the case where the leaking interface is not cleaned up after 10 minutes.
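For reference, a minimal sketch of how this 10-minute check could be scripted. It assumes kubectl access to the test cluster, that the `ip` tool is available inside the NSE container, and that the namespace and pod name placeholders below are adjusted to the actual deployment:

```python
import subprocess
import time

NAMESPACE = "default"                  # assumption: namespace of the test deployment
NSE_POD = "nse-ipv4-c9cd8cf77-gwnth"   # example NSE pod name taken from this issue

def nsm_interfaces(pod: str) -> list:
    """Return the non-default interfaces of a pod via `kubectl exec ... ip -o link show`."""
    out = subprocess.run(
        ["kubectl", "exec", "-n", NAMESPACE, pod, "--", "ip", "-o", "link", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    names = []
    for line in out.splitlines():
        # `ip -o link` lines look like "2: eth0@if123: <BROADCAST,...> ..."
        name = line.split(":")[1].strip().split("@")[0]
        if name not in ("lo", "eth0"):
            names.append(name)
    return names

if __name__ == "__main__":
    # Poll for 10 minutes to see whether an extra interface is cleaned up.
    deadline = time.time() + 10 * 60
    while time.time() < deadline:
        ifaces = nsm_interfaces(NSE_POD)
        print(f"{time.strftime('%H:%M:%S')}: {len(ifaces)} NSM interfaces: {ifaces}")
        time.sleep(30)
```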
@Ex4amp1e: I have checked with my colleagues whether the same behavior can be reproduced using the NSM v1.13.2 release. It seems the leaking interface is always removed within 10 minutes. We will have a meeting to clarify and validate the used test cases. I will come back with feedback soon, so please keep this issue open until then.
Hi @Ex4amp1e,
As the test results show, in the vast majority of the successful reproductions the combination of the restarted components was
Hi @szvincze
Here I send the reproduction steps, the respective logs, and an analysis of the issue.
Investigation of the "nse + registry-k8s FAIL" scenario
The test VM had 6 worker nodes, "n4" to "n9":
Registry-k8s had 2 replicas, running on nodes "n4" and "n8":
There were 6 NSC and 2 NSE endpoints running:
One NSE pod and one registry-k8s pod running on the same node were chosen:
The NSC that was also on the same node "n8":
The chosen pods were killed at the same time:
The NSE was back up 1 second after the kill:
Registry-k8s was back up 6 seconds after the kill:
Traffic stopped smoothly:
The following shows which IP was added to which interface.
The expectation is to have 6 IP addresses in total, as indicated in the error.
Part 2: The following NSCs also failed; their ifconfig outputs are shown:
The nsm-ipv6 interface with address 100:100::7 was receiving traffic from the restarted NSE: nse-ipv6-6ccb65cdf7-djvvt
The nsm-ipv6 interface with address 100:100::1 was receiving traffic from the restarted NSE: nse-ipv6-6ccb65cdf7-djvvt
The nsm-ipv6 interface with address 100:100::5 was receiving traffic from the restarted NSE: nse-ipv6-6ccb65cdf7-djvvt
The nsm-ipv6 interface with address 100:100::3 was receiving traffic from the restarted NSE: nse-ipv6-6ccb65cdf7-djvvt
SUMMARY: On the NSC side these were the interfaces that were expected to be removed after the NSE (nse-ipv6-6ccb65cdf7-djvvt) went down
On the NSE side the expected interface addresses were the following (taken from a successful result in the previous test case):
But the IPv6 addresses were neither removed nor replaced.
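For completeness, a small sketch of how the address-to-interface mapping used in this analysis could be collected from the pods. It assumes kubectl access and that `ip` is available in the containers; the namespace is a placeholder, and only the restarted NSE name from this comment is filled in:

```python
import subprocess

NAMESPACE = "default"  # assumption: namespace of the NSC/NSE pods
PODS = [
    "nse-ipv6-6ccb65cdf7-djvvt",  # restarted NSE from the analysis above
    # add the failed NSC pod names here
]

def addr_map(pod: str) -> dict:
    """Return {interface: [addresses]} for a pod, parsed from `ip -o addr show`."""
    out = subprocess.run(
        ["kubectl", "exec", "-n", NAMESPACE, pod, "--", "ip", "-o", "addr", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    result = {}
    for line in out.splitlines():
        # `ip -o addr` fields: index, ifname, family (inet/inet6), address/prefix, ...
        fields = line.split()
        ifname, family, addr = fields[1], fields[2], fields[3]
        if family in ("inet", "inet6"):
            result.setdefault(ifname, []).append(addr)
    return result

for pod in PODS:
    print(pod)
    for ifname, addrs in addr_map(pod).items():
        print(f"  {ifname}: {', '.join(addrs)}")
```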
@szvincze Hello, we have some questions:
In the test case, we had only 1 NSE with 2 network interfaces (IPv4 and IPv6) and 6 NSCs. So, you are right, the number of interfaces was the same, but the number of IP addresses on the interfaces was not correct. The 4 NSCs that were interacting with the restarted NSE had 3 addresses each instead of the expected 2 (the number of NSEs). The datapath worked properly during the test, without interruptions or packet drops.
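A sketch of how this rule (addresses per NSC interface should equal the number of NSEs) could be checked automatically. The namespace, the NSC pod names and the expected count below are placeholders; the interface name nsm-ipv6 is the one seen in the analysis above:

```python
import subprocess

NAMESPACE = "default"               # assumption
EXPECTED = 2                        # one address per NSE in this setup
NSC_PODS = ["nsc-0", "nsc-1"]       # hypothetical names; list the real NSC pods here
IFACE = "nsm-ipv6"                  # NSM interface name seen in the analysis above

def global_ipv6_count(pod: str) -> int:
    """Count global IPv6 addresses on the NSM interface of an NSC pod."""
    out = subprocess.run(
        ["kubectl", "exec", "-n", NAMESPACE, pod, "--",
         "ip", "-o", "-6", "addr", "show", "dev", IFACE, "scope", "global"],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

for pod in NSC_PODS:
    count = global_ipv6_count(pod)
    status = "OK" if count == EXPECTED else "possible leak"
    print(f"{pod}: {count} addresses on {IFACE} (expected {EXPECTED}) -> {status}")
```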
@szvincze Hello,
Hi @Ex4amp1e,
In the scope of local tests we have found bugs that blocked us:
Hi @szvincze,
@szvincze Hello,
Hi @Ex4amp1e,
Hi @szvincze,
Here I add some details about the test and attach the logs. The following pods were killed: nse-ipv4-6957646495-t8xxg and nsmgr-sxx5d, which were located on the node ceph-n185-vpod2-ceph-n4.
Containers were restarted at [2024-11-05T10:01:40.762Z]
After 10 minutes the check gave the same result.
We found 6 IP addresses instead of 4.
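A rough sketch of this restart step for repeated runs, assuming kubectl access and that both pods live in the same namespace. The pod names are the ones from this comment; the namespace, the nse-ipv4- prefix lookup and the expected address count are assumptions to adjust:

```python
import subprocess
import time

NAMESPACE = "default"                                        # assumption
PODS_TO_KILL = ["nse-ipv4-6957646495-t8xxg", "nsmgr-sxx5d"]  # co-located pods from this test
EXPECTED = 4                                                 # expected number of IP addresses

def sh(*args):
    return subprocess.run(list(args), capture_output=True, text=True, check=True).stdout

def find_pod(prefix: str) -> str:
    """Find the (recreated) pod whose name starts with the given prefix."""
    for line in sh("kubectl", "get", "pods", "-n", NAMESPACE, "-o", "name").splitlines():
        name = line.split("/", 1)[1]
        if name.startswith(prefix):
            return name
    raise RuntimeError(f"no pod starting with {prefix!r}")

def count_addresses(pod: str) -> int:
    """Count global-scope addresses on all interfaces except the primary eth0."""
    out = sh("kubectl", "exec", "-n", NAMESPACE, pod, "--",
             "ip", "-o", "addr", "show", "scope", "global")
    return sum(1 for line in out.splitlines() if line.split()[1] != "eth0")

# delete both pods in one call so they go down at (roughly) the same time
sh("kubectl", "delete", "pod", "-n", NAMESPACE, *PODS_TO_KILL)
time.sleep(60)                      # give the controllers time to recreate the pods
nse = find_pod("nse-ipv4-")
print("right after restart:", count_addresses(nse), "addresses, expected", EXPECTED)
time.sleep(10 * 60)
print("after 10 minutes:   ", count_addresses(nse), "addresses, expected", EXPECTED)
```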
Hello @szvincze, We previously tested three restart scenarios:
In the last comment you discovered a new combination: nse + nsmgr. That makes sense to me, because it surfaces new problems. So we have prepared a full component testing plan to verify the restart functionality for all NSM components:
Could you please review/approve the proposed testing plan for component restarts?
Hi @Ex4amp1e, In the original post I mentioned "different combinations" of NSM components; however, the attached logs came from only some of them. Let me check with my colleagues whether the test scope covers our cases. Also, I would like to review the latest results, since recently we have been talking only about additional IP addresses and not interfaces.
Hi @Ex4amp1e, Here I paste the complete list of the combinations tested on our side:
Current plan: test the provided scenarios and provide details.
Hi @szvincze, Could you provide details: were interfaces or IP addresses leaked in the latest run results?
Hi @Ex4amp1e,
With recent NSM versions, after several fixes, only IP addresses are leaked.
Hi @szvincze, Here are the current statistics of our attempts to reproduce on our side:
We want to mention that in your NSE + NSMgr test case the forwarders were also restarted, so it is a 3-component restart, while the provided test case list contains only 2-component combinations.
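To make this distinction visible in future runs, a sketch that records which components actually went down during a test by diffing container restart counts and pod names before and after the run. The namespace is an assumption; note that pods recreated by `kubectl delete pod` show up as new names rather than as increased restart counts:

```python
import json
import subprocess

NAMESPACE = "nsm-system"   # assumption: namespace of the NSM control-plane pods

def pod_restarts() -> dict:
    """Return {pod name: total container restart count} for the namespace."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = {}
    for pod in json.loads(out)["items"]:
        statuses = pod["status"].get("containerStatuses", [])
        counts[pod["metadata"]["name"]] = sum(s.get("restartCount", 0) for s in statuses)
    return counts

before = pod_restarts()
input("Run the restart test case, then press Enter... ")
after = pod_restarts()

for name, count in after.items():
    if name not in before:
        print(f"{name}: newly created (previous pod was deleted)")
    elif count > before[name]:
        print(f"{name}: containers restarted in place ({before[name]} -> {count})")
```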
Expected Behavior
We are testing restarts of different combinations of NSM components while traffic is running.
Besides traffic recovery after the components are up and running again, it would be desirable that the same number of interfaces exist in the NSE.
Current Behavior
I attached logs from a case where we restarted an NSC and a forwarder-vpp in the system, and there was one more interface (5) in the NSE than the expected 4.
We observed the same behavior several times with different combinations of restarted components, but an NSC was always one of them. So, it seems that the NSC restart has an influence on this.
Failure Information (for bugs)
Described above in the Current behavior section.
Steps to Reproduce
The setup is based on the NSM v1.11.2 basic kernel to ethernet to kernel example with 4 worker nodes.
There is an NSC on each node and 2 NSEs (one for IPv4 and one for IPv6).
Traffic was started between the NSCs and NSEs.
An NSC and a forwarder-vpp were restarted. After both came up, there was an additional interface in the NSE (nse-ipv4-c9cd8cf77-gwnth) besides the original 4.
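A sketch of how these steps could be automated for repeated runs. The namespaces, the label selectors (app=nsc-kernel, app=forwarder-vpp, app=nse-ipv4) and the expected interface count are assumptions to be adjusted to the actual example deployment:

```python
import subprocess

APP_NS = "default"       # assumption: namespace of the NSC/NSE pods
NSM_NS = "nsm-system"    # assumption: namespace of the forwarder-vpp pods
EXPECTED_INTERFACES = 4

def sh(*args):
    return subprocess.run(list(args), capture_output=True, text=True, check=True).stdout

def first_pod(ns: str, selector: str) -> str:
    return sh("kubectl", "get", "pods", "-n", ns, "-l", selector,
              "-o", "jsonpath={.items[0].metadata.name}").strip()

# 1. restart one NSC and one forwarder-vpp
sh("kubectl", "delete", "pod", "-n", APP_NS, first_pod(APP_NS, "app=nsc-kernel"))
sh("kubectl", "delete", "pod", "-n", NSM_NS, first_pod(NSM_NS, "app=forwarder-vpp"))

# 2. wait until both are Ready again
sh("kubectl", "wait", "--for=condition=Ready", "pod", "-l", "app=nsc-kernel",
   "-n", APP_NS, "--timeout=180s")
sh("kubectl", "wait", "--for=condition=Ready", "pod", "-l", "app=forwarder-vpp",
   "-n", NSM_NS, "--timeout=180s")

# 3. count the NSM interfaces in the IPv4 NSE (everything except lo and eth0)
nse = first_pod(APP_NS, "app=nse-ipv4")
links = sh("kubectl", "exec", "-n", APP_NS, nse, "--", "ip", "-o", "link", "show")
ifaces = [l.split(":")[1].strip().split("@")[0] for l in links.splitlines()]
ifaces = [name for name in ifaces if name not in ("lo", "eth0")]
print(f"{nse}: {len(ifaces)} NSM interfaces (expected {EXPECTED_INTERFACES}): {ifaces}")
```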
Context
Failure Logs
logs.tar.gz