Increased NSE expiration time might cause traffic disturbance #438

Closed
zolug opened this issue Jul 10, 2023 · 0 comments

zolug commented Jul 10, 2023

Describe the bug
The NSE expiration time calculation has changed in NSM:
With these changes, it is essentially the NSM MaxTokenLifetime that determines the lifetime of an NSE.
The ExpirationTime parameter of the NetworkServiceEndpoint structure, which is used during registration, is honored only until the first refresh; after that it is ignored and the token lifetime is used instead.

This can cause traffic disturbances, for example when a node hosting an LB-FE is rebooted.
The related NSE Custom Resource can then remain in etcd much longer than before: NSM_MAX_TOKEN_LIFETIME defaults to 10 minutes, while the default expiration time used to be 1 minute.
Until the NSE of the unavailable LB expires, proxies consider it a valid egress next hop (assuming datapath monitoring is off).
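To make the size of the problem concrete, here is a rough sketch of the arithmetic. It assumes a simplified model in which every refresh pushes the NSE expiry to "refresh time + lifetime" and the LB fails some time after its last refresh. The function and constant names are illustrative only and do not come from NSM or Meridio.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Values quoted in this issue.
	legacyExpiration     = 1 * time.Minute  // previous default NSE expiration time
	defaultTokenLifetime = 10 * time.Minute // default of NSM_MAX_TOKEN_LIFETIME
)

// staleWindow estimates how long a dead LB's NSE stays registered
// (and therefore usable as a next hop when datapath monitoring is off).
// lifetime is the NSE lifetime granted at the last refresh, and
// sinceRefresh is how long after that refresh the LB became unavailable.
// This is a simplified model, not NSM's actual refresh logic.
func staleWindow(lifetime, sinceRefresh time.Duration) time.Duration {
	remaining := lifetime - sinceRefresh
	if remaining < 0 {
		return 0
	}
	return remaining
}

func main() {
	// Worst case: the node goes down right after a refresh.
	fmt.Println("legacy worst case: ", staleWindow(legacyExpiration, 0))
	fmt.Println("current worst case:", staleWindow(defaultTokenLifetime, 0))
}
```

Under this model the worst-case window grows from 1 minute to 10 minutes with the default settings.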

To Reproduce
Steps to reproduce the behavior:

  1. Use a recent NSM version that contains the changes mentioned above (e.g. NSM 1.9).
  2. Check the expiration time of the related LB NSE in etcd.
  3. Reboot a node with an LB-FE.
  4. The proxy is informed about the loss of the LB-FE only after a severe delay (when the CR expires). Until then, the proxy keeps the next hop route for the rebooted LB.

Expected behavior
We should be able to control the NSE expiration time independently of MaxTokenLifetime.

A possible way forward could be to introduce a custom chain element that sets the expiration time to 1 minute, thus ensuring backward compatibility.
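A minimal sketch of what such a chain element could look like is shown below. To stay self-contained it uses simplified stand-in types (`Endpoint`, `RegistryServer`) instead of the real NSM registry API, and the names `capExpiration`, `storeServer` and `effectiveLifetime` are hypothetical. Where the element has to sit in the real NSM chain so that its value is not overwritten again is not covered here.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const (
	maxNSELifetime = 1 * time.Minute  // desired cap, matching the old default
	tokenLifetime  = 10 * time.Minute // default of NSM_MAX_TOKEN_LIFETIME
)

// Endpoint is a simplified stand-in for NSM's NetworkServiceEndpoint.
type Endpoint struct {
	Name           string
	ExpirationTime time.Time
}

// RegistryServer is a simplified stand-in for an NSE registry chain element.
type RegistryServer interface {
	Register(ctx context.Context, nse *Endpoint) (*Endpoint, error)
}

// capExpiration is a hypothetical chain element that limits the expiration
// time of every registration (and refresh) to now + limit.
type capExpiration struct {
	limit time.Duration
	now   func() time.Time // injectable clock, eases testing
	next  RegistryServer
}

func (c *capExpiration) Register(ctx context.Context, nse *Endpoint) (*Endpoint, error) {
	latest := c.now().Add(c.limit)
	// Shorten an unset or too distant expiry; keep shorter ones untouched.
	if nse.ExpirationTime.IsZero() || nse.ExpirationTime.After(latest) {
		nse.ExpirationTime = latest
	}
	return c.next.Register(ctx, nse)
}

// storeServer terminates the chain and remembers the last registration.
type storeServer struct{ last *Endpoint }

func (s *storeServer) Register(_ context.Context, nse *Endpoint) (*Endpoint, error) {
	s.last = nse
	return nse, nil
}

// effectiveLifetime registers an endpoint that asks for the requested
// lifetime and returns the lifetime that reaches the end of the chain.
func effectiveLifetime(requested, limit time.Duration) time.Duration {
	start := time.Unix(0, 0)
	store := &storeServer{}
	chain := &capExpiration{limit: limit, now: func() time.Time { return start }, next: store}
	nse := &Endpoint{Name: "lb-fe", ExpirationTime: start.Add(requested)}
	if _, err := chain.Register(context.Background(), nse); err != nil {
		return 0
	}
	return store.last.ExpirationTime.Sub(start)
}

func main() {
	fmt.Println("requested:", tokenLifetime, "effective:", effectiveLifetime(tokenLifetime, maxNSELifetime))
}
```

Capping rather than unconditionally overwriting keeps any shorter expiration time that an endpoint requests, while still bounding the lifetime at 1 minute.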

Check whether datapath monitoring between Proxy and LB should be enabled (e.g. through an env variable).

Context

  • Network Service Mesh: v1.9
  • Meridio: 1.0.7

Logs
NA

@zolug zolug added the kind/bug Something isn't working label Jul 10, 2023
@zolug zolug added this to Meridio Jul 10, 2023
@zolug zolug self-assigned this Jul 10, 2023
@zolug zolug moved this to 🏗 In progress in Meridio Jul 10, 2023
@zolug zolug moved this from 🏗 In progress to 👀 In review in Meridio Jul 10, 2023
@zolug zolug moved this from 👀 In review to ✅ Done in Meridio Jul 10, 2023
@zolug zolug closed this as completed Sep 20, 2023