Proxy process restart might cause traffic outage when using v1.1.2 #542

Closed
zolug opened this issue Sep 19, 2024 · 0 comments

zolug commented Sep 19, 2024

Describe the bug
Since Meridio 1.1.2, the proxy can recover its former connections with the LBs after a proxy process crash.
Because the former NSM connections are continued without a reselect request, the related interfaces in the LB and the proxy remain intact.
While this might sound like an improvement, the NSM connection between any TAPA and the affected proxy is still repaired by NSM heal, which orders a reselect because no datapath monitoring is enabled. The reselect replaces the interfaces of the TAPA connection, which changes the MAC address but normally keeps the old IPs. Thus, the LB might end up with outdated neighbor entries for the Targets.
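
To illustrate the symptom, here is a minimal sketch of how an outdated entry could be spotted in the LB. It assumes the usual `ip neigh show` line layout (`<ip> dev <if> lladdr <mac> <state>`). The helper names, the IP and the MAC values are hypothetical and only serve the example.

```shell
# Print the link-layer address cached for IP $1.
# Reads `ip neigh show` formatted lines from stdin.
neigh_mac() {
    awk -v ip="$1" '
        $1 == ip {
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") { print $(i + 1); exit }
        }
    '
}

# Exit 0 when the MAC cached for IP $1 differs from the expected MAC $2.
# Exit non-zero when the entry matches or when no entry with a MAC exists.
# The expected MAC must be lowercase, as printed by ip.
neigh_is_outdated() {
    cached=$(neigh_mac "$1")
    [ -n "$cached" ] && [ "$cached" != "$2" ]
}

# Possible usage inside the LB POD (hypothetical Target IP and MAC):
#   ip neigh show | neigh_is_outdated 172.16.0.5 02:00:00:00:00:02 && echo "outdated entry"
```

The expected MAC would be the one of the Target's new interface after the reselect, for example taken from `ip link show` in the Target POD.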

To Reproduce
Steps to reproduce the behavior:

  1. Deploy a fully fledged trench with 1 conduit, stream, flow etc. Configure Attractor to have 1 replica.
  2. Deploy target application with 1 replica.
  3. Note the proxy running on the same node as the single target POD.
  4. Send external traffic using ctraffic for 10 seconds to the VIP address hosted by the Target.
  5. Monitor neighbor cache and links in the single LB POD:
    ip monitor neigh|while read -r line; do echo "=> $(date) | $(hostname) | $line"; done &
    ip monitor link|while read -r line; do echo "=> $(date) | $(hostname) | $line"; done &
  6. Kill the proxy process and wait until the proxy recovers.
  7. Run ctraffic again shortly. Some outage may be observed because the old, invalid neighbor entry is still around.
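
The monitoring in step 5 could be complemented with a small filter that reports only MAC address changes per IP. This is a rough sketch, assuming the `ip monitor neigh` lines carry the IP in the first field and the MAC after `lladdr`. The function name `watch_mac_changes` is made up for this example.

```shell
# Read neighbor events from stdin and report when the MAC seen for an IP changes.
watch_mac_changes() {
    awk '
        # Skip deletion events, they carry no new MAC for the entry.
        $1 == "Deleted" { next }
        {
            mac = ""
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") mac = $(i + 1)
            # Entries without a link-layer address (e.g. FAILED) are ignored.
            if (mac == "") next
            if (($1 in seen) && seen[$1] != mac)
                print "MAC change for " $1 ": " seen[$1] " -> " mac
            seen[$1] = mac
        }
    '
}

# Possible usage inside the LB POD:
#   ip monitor neigh | watch_mac_changes
```

If no MAC change is reported for the Target IP after the proxy has recovered while the Target interface was replaced, the LB is likely still holding the old entry.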

Expected behavior
No invalid neighbor entry should linger in the LB after a proxy process restart.

Context

  • Network Service Mesh: v1.13.2
  • Meridio: v1.1.2
    ...

Logs
NA

Projects
Status: 👀 In review