Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[IMPROVED] Routing: reduce chances of duplicate implicit routes #5602

Merged
merged 1 commit into from
Jul 22, 2024

Conversation

kozlovic
Copy link
Member

This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could still see duplicate routes (up to 125 in one of the matrix), and still had a data race (that could have easily be fixed). The main issue is that the increment happens in connectToRoute, which is running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result of way too many gossip protocols. Suppose that you have servers A and B already connected. C connects to A. A gossips to B that it should connect to C. When that happened, B would gossip to A the server C and C would gossip to A the server B, which all that was unnecessary. It would grow quite fast with the size of the cluster (that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic ivan@synadia.com

This is an alternate approach to the PR #5484 from @wjordan.

Using the code in that PR with the test added in this PR, I could
still see duplicate routes (up to 125 in one of the matrix), and
still had a data race (that could have easily be fixed). The main
issue is that the increment happens in connectToRoute, which is
running from a go routine, so there were still chances for duplicates.

Instead, I took the approach that those duplicates were the result
of way too many gossip protocols. Suppose that you have servers A
and B already connected. C connects to A. A gossips to B that it
should connect to C. When that happened, B would gossip to A the
server C and C would gossip to A the server B, which all that was
unnecessary. It would grow quite fast with the size of the cluster
(that is, several thousands for a cluster size of 15 or so).

Resolves #5483

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
@kozlovic kozlovic requested a review from a team as a code owner June 27, 2024 17:47
@kozlovic kozlovic requested a review from derekcollison June 27, 2024 17:48
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit 5b5857f into main Jul 22, 2024
3 checks passed
@derekcollison derekcollison deleted the routes_gossip branch July 22, 2024 15:37
bruth added a commit that referenced this pull request Jul 30, 2024
Includes the following:

* #5602
* #5672
* #5668
* #5607
* #5687
* #5695
* #5697
* #5704
* #5706
* #5709
* #5710
* #5713
* #5719

Some PRs specifically excluded:

* #5707 — based on a 2.11 NRG
PR
* #5690 — continue to allow
Go 1.20 for 2.10.x
* Various other NRG PRs which are higher risk and destined for 2.11
instead
@neilalexander
Copy link
Member

@kozlovic We noticed that TestStressChainedSolicitWorks fails pretty reliably on our local machines now and have bisected it back to this change, would you mind taking a look?

kozlovic added a commit that referenced this pull request Aug 3, 2024
The PR#5602 solved the issue for typical seed setups, but it was
found that the test `TestStressChainedSolicitWorks` would sometimes
fail. This is a situation where we have S4->S3->S2->S1 and all
servers start at the same type. This is not a typical setup, but
still the regular gossip would allow the creation of the full mesh,
so this PR helps in those situations.

Related to #5602

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
derekcollison added a commit that referenced this pull request Aug 3, 2024
#5746)

The PR#5602 solved the issue for typical seed setups, but it was found
that the test `TestStressChainedSolicitWorks` would sometimes fail. This
is a situation where we have S4->S3->S2->S1 and all servers start at the
same type. This is not a typical setup, but still the regular gossip
would allow the creation of the full mesh, so this PR helps in those
situations.

Related to #5602

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>
neilalexander added a commit that referenced this pull request Aug 6, 2024
Includes the following:

- #5690
- #5725
- #5729
- #5734
- #5735
- #5736
- #5743
- #5744
- #5751
- #5755
- #5754
- #5732
- #5750
- #5756

The following were reverted before this PR:
- #5602

Signed-off-by: Neil Twigg <neil@nats.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'Duplicate Route' connection floods on implicit routes
3 participants