-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] possible fix for #2492 #2498
Conversation
It sounds super! ... if we actually can verify it somehow :) I've seen, that people from #2492 have shown a possible repro steps for this issue. Can we verify it somehow? |
@Horusiath in progress of reproducing this case |
also Akka.Cluster.Tests.ClusterSpec.A_cluster_must_complete_LeaveAsync_task_upon_being_removed racy spec |
@maxim-s Have you tried to implement the whole PR from Akka Scala? akka/akka#21859 |
@alexvaluyskiy I didn't see this PR, but looks like we have the same issue with random gossip |
@Aaronontheweb looks like test Akka.Cluster.Tests.ClusterSpec.A_cluster_must_complete_LeaveAsync_task_upon_being_removed is not correct now, could you review it? |
namespace Akka.Util.Internal.Collections | ||
{ | ||
public static class ListExtensions | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Internal should be internal, not public. Use InternalVisibleTo attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've requested some individual changes line-by-line, but a few high level notes:
Akka.Cluster.Tests.ClusterSpec.A_cluster_must_complete_LeaveAsync_task_upon_being_removed racy spec
- if this is still failing, it means that #2492 is still a problem. That spec fails when the node doesn't receive a MemberRemoved
event for itself when it gracefully leaves the cluster. So that test remains valid.
I'd follow @alexvaluyskiy's suggestion and implement akka/akka#21859 while you're at it. Seems to address some of the same issues we're observing.
You've already said you're working on a spec to help verify and I think that's one of the other things I'd add.
@@ -935,7 +936,7 @@ internal class ClusterCoreDaemon : UntypedActor, IRequiresMessageQueue<IUnbounde | |||
/// TBD | |||
/// </summary> | |||
protected readonly UniqueAddress SelfUniqueAddress; | |||
private const int NumberOfGossipsBeforeShutdownWhenLeaderExits = 3; | |||
private const int NumberOfGossipsBeforeShutdownWhenLeaderExits = 5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the purpose of increasing this value? Not questioning the decision to do it, but just want to understand the reason for the change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I get it now - increases the number of random nodes we might gossip to when the leader exits.
if (!IsSingletonCluster && n > 0) | ||
{ | ||
var localGossip = _latestGossip; | ||
var possibleTargets = new List<UniqueAddress>(localGossip.Members.Where(m => ValidNodeForGossip(m.UniqueAddress)).Select(m => m.UniqueAddress)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just call ToList
here
@@ -2164,7 +2180,7 @@ public void ReapUnreachableMembers() | |||
_cluster.SelfAddress, nonExiting.Select(m => m.ToString()).Aggregate((a, b) => a + ", " + b), string.Join(",", _cluster.SelfRoles)); | |||
|
|||
if (!exiting.IsEmpty) | |||
_log.Warning("Marking exiting node(s) as UNREACHABLE [{0}]. This is expected and they will be removed.", | |||
_log.Warning("Cluster Node [{0}] - Marking exiting node(s) as UNREACHABLE [{1}]. This is expected and they will be removed.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Aaronontheweb Can we do this one as INFO if it is a normal exit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxcherednik @maxim-s yep, I think that's a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Akka (JVM) does not have "Cluster Node" parameter in this log entry. Maybe we could just remove the parameter, instead of extending the log message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine
{ | ||
public static class ListExtensions | ||
{ | ||
public static List<T> Shuffle<T>(this List<T> @this) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which shuffle algorithm are you using here? Fischer-Yates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not yet, just extracted class
public static List<T> Shuffle<T>(this List<T> @this) | ||
{ | ||
var list = new List<T>(@this); | ||
var rng = new Random(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use ThreadLocalRandom
here - avoids allocations and also spreads out the seeds (as ThreadLocalRandom
gets used gradually across available threads)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you but I thought to replace with Math.round and Math.random as Java implementation
@Aaronontheweb racy spec I mean, that some sort of race in the LeaveAsync task, MemberRemoved has been received success, problem with callback and I'm working on it |
@maxim-s oh I see what you mean. I must have misunderstood. So you think it's an issue with how the callback is implemented and not the spec itself? |
@maxim-s The better way is: implement akka/akka#21859 first in a one PR. And only after merge try to fix #2492 |
@Aaronontheweb yes this PR #2501 should solve problem with |
@@ -2164,7 +2183,7 @@ public void ReapUnreachableMembers() | |||
_cluster.SelfAddress, nonExiting.Select(m => m.ToString()).Aggregate((a, b) => a + ", " + b), string.Join(",", _cluster.SelfRoles)); | |||
|
|||
if (!exiting.IsEmpty) | |||
_log.Warning("Marking exiting node(s) as UNREACHABLE [{0}]. This is expected and they will be removed.", | |||
_log.Warning("Cluster Node [{0}] - Marking exiting node(s) as UNREACHABLE [{1}]. This is expected and they will be removed.", | |||
_cluster.SelfAddress, exiting.Select(m => m.ToString()).Aggregate((a, b) => a + ", " + b)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could simplify this code and use something like this
_log.Info("Marking exiting node(s) as UNREACHABLE [{0}]. This is expected and they will be removed.", string.Join(", ", exiting));
@@ -1125,6 +1126,7 @@ private void TryingToJoin(object message, Address joinWith, Deadline deadline) | |||
else if (message is InternalClusterAction.JoinSeedNodes) | |||
{ | |||
var js = message as InternalClusterAction.JoinSeedNodes; | |||
BecomeUninitialized(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
I approve the changes. @alexvaluyskiy how do you feel? Are there changes you still want to see made? |
@Aaronontheweb I've already reviewed the code and approved it. |
No description provided.