
Update probe-interval and stale contact point timeout calculation #2601

Merged: 7 commits merged into akkadotnet:dev on Jul 1, 2024

Conversation


@Arkatufus Arkatufus commented Jun 27, 2024

Fixes #2571

Changes

  • Change the probe-interval default value to 5s
  • Move all effective value calculation out of the settings file
  • Remove the stale-contact-point-timeout HOCON setting
  • Change the stale contact point timeout calculation inside BootstrapCoordinator to (ProbeInterval + ProbeFailureTimeout); previously it was ProbeFailureTimeout alone. A worked example follows this list.
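
A minimal worked example of the new calculation (a sketch, not code from this PR; the 3s probing-failure-timeout default is taken from the discussion below):

```csharp
using System;

// Sketch: with this PR's new default probe-interval (5s) and the existing
// probing-failure-timeout (3s), the stale contact point timeout is 5s + 3s = 8s.
var probeInterval = TimeSpan.FromSeconds(5);
var probingFailureTimeout = TimeSpan.FromSeconds(3);
var staleContactPointTimeout = probeInterval + probingFailureTimeout;

Console.WriteLine(staleContactPointTimeout); // 00:00:08
```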

@Aaronontheweb (Member) left a comment


Question - if we have a bad probe result between intervals, e.g.:

  1. 0:05 - hit node, it's good
  2. 0:10 - hit node, it's good
  3. 0:15 - no response from node
  4. 0:16 - join decider runs

Will that bad contact in step 3 invalidate the good contacts from earlier? It should; otherwise this will damage the algorithm. What happens there?

An inline review comment was left on this portion of the diff (an intermediate revision that doubled the combined value):

```csharp
}
else
{
    _staleContactPointTimeout = new TimeSpan((cps.ProbeInterval.Ticks + cps.ProbingFailureTimeout.Ticks) * 2);
}
```
Member inline comment: LGTM

@Arkatufus (Contributor, Author)

First of all, ContactPoint HTTP probing is done in parallel, so the first and second probes would happen very close to each other, with the failure following.

Second, it depends on the RequiredContactPointNr and ContactWithAllContactPoints settings (see the sketch after this list):

  • If RequiredContactPointNr is 2 and ContactWithAllContactPoints is false, the third failure is ignored and the cluster forms immediately after the first two probes succeed.
  • If RequiredContactPointNr is 2 and ContactWithAllContactPoints is true, the third failure blocks cluster formation until that entry is removed, either for being stale or by the next round of Discovery resolution.
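
A hypothetical sketch of that decision (illustrative names only, not the actual BootstrapCoordinator code):

```csharp
using System;

// Hypothetical sketch of the join decision described above; the real logic
// lives in the joining decider, and all of these names are illustrative.
static bool CanFormCluster(
    int confirmedContactPoints,      // probes that answered successfully
    int totalContactPoints,          // all contact points currently known
    int requiredContactPointNr,      // RequiredContactPointNr setting
    bool contactWithAllContactPoints // ContactWithAllContactPoints setting
)
{
    // Never form a cluster below the required threshold.
    if (confirmedContactPoints < requiredContactPointNr)
        return false;

    // If we must hear from every contact point, one unresponsive entry blocks
    // formation until it is pruned as stale or dropped by the next discovery
    // round; otherwise meeting the threshold alone is enough.
    return !contactWithAllContactPoints || confirmedContactPoints == totalContactPoints;
}

Console.WriteLine(CanFormCluster(2, 3, 2, false)); // True:  third failure is ignored
Console.WriteLine(CanFormCluster(2, 3, 2, true));  // False: failed probe blocks formation
```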

@Arkatufus

Join decider ticks, discovery resolution ticks, and HTTP probing ticks all happen on their own timers:

  • The decider tick and HTTP probing tick timer intervals are set to contact-point.probe-interval,
  • decider ticks run on the BootstrapCoordinator,
  • HTTP probing runs in the HttpContactPointBootstrap,
  • HTTP probing has a failure timeout of 3s.

The discovery resolution tick timer interval is set to contact-point-discovery.interval; it runs in the BootstrapCoordinator.
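
For reference, the keys involved, as a HOCON sketch (paths follow Akka.Management's cluster bootstrap config; the 1s discovery interval is an illustrative assumption, not something stated in this thread):

```hocon
akka.management.cluster.bootstrap {
  contact-point-discovery {
    # discovery resolution tick interval, run in the BootstrapCoordinator
    interval = 1s  # illustrative value
  }
  contact-point {
    # decider tick + HTTP probing tick interval (default changed to 5s by this PR)
    probe-interval = 5s
    # HTTP probe failure timeout referenced above
    probing-failure-timeout = 3s
  }
}
```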

@Aaronontheweb

> the third failure blocks cluster formation until that entry is removed, either for being stale or by the next round of Discovery resolution.

The question I'm really asking is: if a node somehow dies shortly after being successfully pinged, before a cluster is formed, is it possible to form a cluster below RequiredContactPointNr due to a stale entry?

@Arkatufus

I guess this is where our implementation and the Scala implementation diverge with this update.
In the Scala implementation there is no stale contact point timeout setting; a possibly dead node entry is guarded and pruned by a TTL equal to the probe-failure-timeout value. If an HTTP probe dies without updating its entry in the contact point list, the coordinator is guaranteed to remove the entry at the same interval as the probe timeout.

Here, we're lengthening that value, so there is actually a possibility of a race condition where a contact point is dead but not yet pruned from the contact point list: with the new defaults, a dead entry can linger for up to 8s (5s + 3s) rather than 3s.

@Aaronontheweb

> a possibly dead node entry is guarded and pruned by a TTL equal to the probe-failure-timeout value. If an HTTP probe dies without updating its entry in the contact point list, the coordinator is guaranteed to remove the entry at the same interval as the probe timeout.

So we should redesign this feature then: no need for a separate stale value, just always slide the timeout with the interval (i.e., it's never a hard timeout, it's always a +n value). That'd be even simpler.
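
A rough sketch of that sliding-deadline idea (hypothetical names; this is not the merged implementation):

```csharp
using System;

var entry = new ContactPointEntry(TimeSpan.FromSeconds(5), TimeSpan.FromSeconds(3));
Console.WriteLine(entry.IsStale); // False: freshly touched, deadline is now + 8s

// Hypothetical sketch of a "sliding" stale timeout: every report from a
// contact point's probe pushes its deadline forward by (probe interval + n),
// so staleness is always measured from the last observation rather than
// against a fixed wall-clock deadline.
sealed class ContactPointEntry
{
    private readonly TimeSpan _slide;
    private DateTime _deadline;

    public ContactPointEntry(TimeSpan probeInterval, TimeSpan n)
    {
        _slide = probeInterval + n;
        Touch();
    }

    // Called whenever the probe reports in, success or failure alike.
    public void Touch() => _deadline = DateTime.UtcNow + _slide;

    // The coordinator prunes entries whose probe has stopped reporting.
    public bool IsStale => DateTime.UtcNow > _deadline;
}
```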

@Arkatufus commented Jun 28, 2024

We'd need to remove 1.5.26 from NuGet then.
Never mind, I somehow mixed up Akka.Management and Akka.NET again... geeze.

@Arkatufus

ok, done

@Aaronontheweb (Member) left a comment

LGTM

The final revision replaces the doubled calculation with a simple sum:

```diff
-{
-    _staleContactPointTimeout = new TimeSpan((cps.ProbeInterval.Ticks + cps.ProbingFailureTimeout.Ticks) * 2);
-}
+_staleContactPointTimeout = cps.ProbeInterval + cps.ProbingFailureTimeout;
```
Member inline comment: LGTM

@Aaronontheweb merged commit c7328c6 into akkadotnet:dev on Jul 1, 2024 (3 checks passed).
@Arkatufus deleted the update-probe-interval branch on July 15, 2024.
Linked issue: Cluster.Bootstrap causes network socket port exhaustion due to socket leak during cluster formation (#2571)