Traffic disturbance 2 minutes after node restart #11787

Open
ljkiraly opened this issue Apr 12, 2024 · 11 comments

@ljkiraly
Contributor

ljkiraly commented Apr 12, 2024

Expected Behavior

A node restart should not have an impact on traffic between elements running on other nodes.

Current Behavior

Two minutes after a worker restart there was a traffic outage.

Failure Information

I cannot reproduce this on demand, but it fails often in nightly tests. Logs from a failed test run are in
traffic_outage_after_node_reboot_log.tar.gz

The node reboot is at:
[2024-04-05T13:45:03.923Z] robustness-node-restart-test.sh: Rebooting node: worker-pool1-1dn6k2vc-n121-vpod1-pnes8010-ipv4

Traffic was stopped between [2024-04-05T13:47:06.910Z] and [2024-04-05T13:48:12.064Z].
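
The timing can be checked directly from the three timestamps quoted above. This is only a minimal sketch; the variable and function names are illustrative and do not come from the test scripts.

```python
from datetime import datetime

# Timestamp format as it appears in the test log lines quoted above.
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse(ts: str) -> datetime:
    """Parse a log timestamp such as 2024-04-05T13:45:03.923Z."""
    return datetime.strptime(ts, FMT)


reboot = parse("2024-04-05T13:45:03.923Z")
outage_start = parse("2024-04-05T13:47:06.910Z")
outage_end = parse("2024-04-05T13:48:12.064Z")

# Delay between the reboot command and the start of the outage.
delay_s = (outage_start - reboot).total_seconds()
# Duration of the outage itself.
outage_s = (outage_end - outage_start).total_seconds()

print(f"outage began {delay_s:.3f} s after the reboot and lasted {outage_s:.3f} s")
```

So the outage began roughly 123 seconds after the reboot was triggered and lasted about 65 seconds.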

Context

  • NSM Version: v1.13.0-rc2
    The issue can be seen with NSM v1.12.1-rc.1 also.
@denis-tingaikin
Member

> NSM Version: v1.13.0-rc1
> The issue can be seen with NSM v1.12.1-rc.1 also.

Hm, as far as I know, we fixed something similar in v1.13.0.
Have you tried it on v1.13.0? 

@denis-tingaikin denis-tingaikin self-assigned this Apr 12, 2024
@denis-tingaikin denis-tingaikin moved this to In Progress in Release v1.14.0 Apr 12, 2024
@denis-tingaikin denis-tingaikin moved this from In Progress to Blocked in Release v1.14.0 Apr 12, 2024
@denis-tingaikin denis-tingaikin added the bug Something isn't working label Apr 12, 2024
@ljkiraly
Contributor Author

> > NSM Version: v1.13.0-rc1
> > The issue can be seen with NSM v1.12.1-rc.1 also.
>
> Hm, as far as I know, we fixed something similar in v1.13.0. Have you tried it on v1.13.0?

The logs are from a test run with v1.13.0-rc1. Is there a difference between v1.13.0 and v1.13.0-rc1? I only mentioned NSM v1.12.1-rc.1 to clarify that this is not a new bug. It is considered a medium-priority issue.

@denis-tingaikin
Member

> It is considered as a medium priority issue.

OK, good that it's not critical.

> The logs are from a test run with v1.13.0-rc1. Is there a difference between v1.13.0 and v1.13.0-rc1?

Yes, there is a difference. We fixed a few bugs in v1.13.0, such as #11372. v1.13.0-rc.1 doesn't contain that fix; v1.13.0-rc.2 does.

@ljkiraly
Contributor Author

Ah, I missed the version, sorry: the logs are from a test run with NSM v1.13.0-rc2. Fixing in description.

@denis-tingaikin denis-tingaikin moved this from Blocked to Todo in Release v1.14.0 Apr 16, 2024
@NikitaSkrynnik NikitaSkrynnik moved this from Todo to In Progress in Release v1.14.0 Jul 5, 2024
@denis-tingaikin denis-tingaikin removed their assignment Jul 9, 2024
@NikitaSkrynnik
Collaborator

We've checked several NSM versions and all of them have the same problem:

  1. v1.13.2-rc.1
  2. v1.13.0
  3. v1.12.0
  4. v1.11.2

@NikitaSkrynnik
Collaborator

NikitaSkrynnik commented Jul 22, 2024

Current State

We found several problems that may occur after restarting a node:

1. Ping periodically stops working (the periods are of the same length)

This issue is related to some bugs in point2point IPAM. Here is the draft fix for this: networkservicemesh/sdk#1647

2. begin queues requests but never executes them

The bug is the same as in networkservicemesh/cmd-forwarder-vpp#1134

3. A lot of additional unused routes on clients

After several node restarts some of the clients can have additional routes. Example:

default via 10.244.2.1 dev eth0 
10.244.2.0/24 via 10.244.2.1 dev eth0  src 10.244.2.12 
10.244.2.1 dev eth0 scope link  src 10.244.2.12 
172.16.0.34 dev nsm-v4 
172.16.0.40 dev nsm-v4 
172.16.0.56 dev nsm-v4 
172.16.0.92 dev nsm-v4 

Only one of these addresses can be pinged. This problem is also related to point2point IPAM.
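
A quick way to spot this state on a client is to count the host routes per NSM interface in the `ip route` output. The sketch below is a rough check, not part of NSM: the helper name is made up, the `nsm` interface-name prefix is assumed from the example above, and treating more than one host route per interface as suspicious is an assumption based on the observation that only one address answers ping.

```python
from collections import defaultdict

# Route table copied from the example above.
ROUTES = """\
default via 10.244.2.1 dev eth0
10.244.2.0/24 via 10.244.2.1 dev eth0  src 10.244.2.12
10.244.2.1 dev eth0 scope link  src 10.244.2.12
172.16.0.34 dev nsm-v4
172.16.0.40 dev nsm-v4
172.16.0.56 dev nsm-v4
172.16.0.92 dev nsm-v4
"""


def host_routes_by_dev(ip_route_output, dev_prefix="nsm"):
    """Group host routes (no prefix length, no gateway) by device name.

    Hypothetical helper; dev_prefix is an assumption taken from the
    interface name shown in the example.
    """
    result = defaultdict(list)
    for line in ip_route_output.splitlines():
        fields = line.split()
        if "dev" not in fields or "via" in fields:
            continue
        idx = fields.index("dev")
        if idx + 1 >= len(fields):
            continue
        dest, dev = fields[0], fields[idx + 1]
        if dest != "default" and "/" not in dest and dev.startswith(dev_prefix):
            result[dev].append(dest)
    return dict(result)


routes = host_routes_by_dev(ROUTES)
for dev, dests in routes.items():
    # Assumption: a healthy point2point connection has one host route.
    if len(dests) > 1:
        print(f"{dev}: {len(dests)} host routes, expected 1: {dests}")
```

For the table above this reports four host routes on nsm-v4, three of which are presumably leftovers from earlier connections.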

4. Sometimes ping doesn't work for a while after node restart but after some time starts to work again

The cause of this behaviour is still unknown. It might be related to some events being missed after the node is restarted. The investigation is still in progress.

@NikitaSkrynnik
Collaborator

NikitaSkrynnik commented Jul 23, 2024

Found the solution for the fourth bug. It's in point2point IPAM again. Here is the PR that fixes issues 1 and 4 for point2point IPAM: networkservicemesh/sdk#1647

@Ex4amp1e Ex4amp1e moved this from In Progress to Blocked in Release v1.14.0 Jul 23, 2024
@NikitaSkrynnik
Collaborator

NikitaSkrynnik commented Jul 23, 2024

NSE Image with fixes: nikitaxored/cmd-nse-icmp-responder:ipam-fix

@szvincze
Contributor

@NikitaSkrynnik: We tested the node restart scenario with this image and it was successful each time, so it seems this fix solves the problem.

@NikitaSkrynnik
Collaborator

NikitaSkrynnik commented Aug 27, 2024

@szvincze, to check whether this issue is resolved in v1.14.0-rc.1, you can pass the environment variable NSM_IPAM_POLICY=strict to the NSE. See the example here: https://github.com/networkservicemesh/deployments-k8s/pull/12259/files#diff-0b132d08281a44a9d5d126bb154725aa05b4a1057c07158fdba858653c513c7cR31-R32
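
For a deployment that is not managed through the linked example, one possible way to set the variable is a Kubernetes strategic merge patch. The sketch below only builds the patch document; the container name `nse` is an assumption and has to match the container in the actual NSE deployment.

```python
import json

# Assumption: the NSE container is called "nse"; adjust as needed.
CONTAINER_NAME = "nse"

# Strategic merge patch for a Deployment: containers and env entries
# are merged by their "name" key, so other settings stay untouched.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": CONTAINER_NAME,
                        "env": [
                            {"name": "NSM_IPAM_POLICY", "value": "strict"}
                        ],
                    }
                ]
            }
        }
    }
}

print(json.dumps(patch, indent=2))
```

The printed JSON can then be handed to `kubectl patch deployment <name> -p '<json>'`, where `<name>` is the NSE deployment in question.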

@szvincze
Contributor

@NikitaSkrynnik: We have verified it in an environment where we evaluate NSM releases and use the NSE/NSC from those releases. There we had several issues, such as traffic disturbance after a worker node restart once the pods were back, a temporary traffic outage of longer than 30 seconds for one NSE instance, and several outages on the other traffic instances. Based on our tests, we can say that with the latest release we haven't observed these issues.

However, @ljkiraly reported this issue from an environment where we use custom endpoints and clients, and there we unfortunately still experience the same behavior.

Projects
Status: Under review