Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] possible fix for #2492 #2498

Merged
merged 3 commits into from
Feb 19, 2017
Merged

[WIP] possible fix for #2492 #2498

merged 3 commits into from
Feb 19, 2017

Conversation

maxim-s
Copy link
Contributor

@maxim-s maxim-s commented Feb 2, 2017

No description provided.

@Horusiath
Copy link
Contributor

It sounds super! ... if we actually can verify it somehow :) I've seen, that people from #2492 have shown a possible repro steps for this issue. Can we verify it somehow?

@maxim-s
Copy link
Contributor Author

maxim-s commented Feb 2, 2017

@Horusiath in progress of reproducing this case

@maxim-s
Copy link
Contributor Author

maxim-s commented Feb 2, 2017

also Akka.Cluster.Tests.ClusterSpec.A_cluster_must_complete_LeaveAsync_task_upon_being_removed racy spec

@alexvaluyskiy
Copy link
Contributor

@maxim-s Have you tried to implement the whole PR from Akka Scala? akka/akka#21859

@maxim-s
Copy link
Contributor Author

maxim-s commented Feb 2, 2017

@alexvaluyskiy I didn't see this PR, but looks like we have the same issue with random gossip

@maxim-s
Copy link
Contributor Author

maxim-s commented Feb 2, 2017

@Aaronontheweb looks like test Akka.Cluster.Tests.ClusterSpec.A_cluster_must_complete_LeaveAsync_task_upon_being_removed is not correct now, could you review it?

namespace Akka.Util.Internal.Collections
{
public static class ListExtensions
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal should be internal, not public. Use InternalVisibleTo attribute

Copy link
Member

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've requested some individual changes line-by-line, but a few high level notes:

Akka.Cluster.Tests.ClusterSpec.A_cluster_must_complete_LeaveAsync_task_upon_being_removed racy spec - if this is still failing, it means that #2492 is still a problem. That spec fails when the node doesn't receive a MemberRemoved event for itself when it gracefully leaves the cluster. So that test remains valid.

I'd follow @alexvaluyskiy's suggestion and implement akka/akka#21859 while you're at it. Seems to address some of the same issues we're observing.

You've already said you're working on a spec to help verify and I think that's one of the other things I'd add.

@@ -935,7 +936,7 @@ internal class ClusterCoreDaemon : UntypedActor, IRequiresMessageQueue<IUnbounde
/// TBD
/// </summary>
protected readonly UniqueAddress SelfUniqueAddress;
private const int NumberOfGossipsBeforeShutdownWhenLeaderExits = 3;
private const int NumberOfGossipsBeforeShutdownWhenLeaderExits = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the purpose of increasing this value? Not questioning the decision to do it, but just want to understand the reason for the change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I get it now - increases the number of random nodes we might gossip to when the leader exits.

if (!IsSingletonCluster && n > 0)
{
var localGossip = _latestGossip;
var possibleTargets = new List<UniqueAddress>(localGossip.Members.Where(m => ValidNodeForGossip(m.UniqueAddress)).Select(m => m.UniqueAddress));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just call ToList here

@@ -2164,7 +2180,7 @@ public void ReapUnreachableMembers()
_cluster.SelfAddress, nonExiting.Select(m => m.ToString()).Aggregate((a, b) => a + ", " + b), string.Join(",", _cluster.SelfRoles));

if (!exiting.IsEmpty)
_log.Warning("Marking exiting node(s) as UNREACHABLE [{0}]. This is expected and they will be removed.",
_log.Warning("Cluster Node [{0}] - Marking exiting node(s) as UNREACHABLE [{1}]. This is expected and they will be removed.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Aaronontheweb Can we do this one as INFO if it is a normal exit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxcherednik @maxim-s yep, I think that's a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Akka (JVM) does not have "Cluster Node" parameter in this log entry. Maybe we could just remove the parameter, instead of extending the log message?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine

{
public static class ListExtensions
{
public static List<T> Shuffle<T>(this List<T> @this)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which shuffle algorithm are you using here? Fischer-Yates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not yet, just extracted class

public static List<T> Shuffle<T>(this List<T> @this)
{
var list = new List<T>(@this);
var rng = new Random();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use ThreadLocalRandom here - avoids allocations and also spreads out the seeds (as ThreadLocalRandom gets used gradually across available threads)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you but I thought to replace with Math.round and Math.random as Java implementation

@maxim-s
Copy link
Contributor Author

maxim-s commented Feb 3, 2017

@Aaronontheweb racy spec I mean, that some sort of race in the LeaveAsync task, MemberRemoved has been received success, problem with callback and I'm working on it

@Aaronontheweb
Copy link
Member

@maxim-s oh I see what you mean. I must have misunderstood. So you think it's an issue with how the callback is implemented and not the spec itself?

@alexvaluyskiy
Copy link
Contributor

@maxim-s The better way is: implement akka/akka#21859 first in a one PR. And only after merge try to fix #2492
We have to very carefully to add something which is not exists in Akka JVM

@maxim-s
Copy link
Contributor Author

maxim-s commented Feb 6, 2017

@Aaronontheweb yes this PR #2501 should solve problem with Akka.Cluster.Tests.ClusterSpec.A_cluster_must_complete_LeaveAsync_task_upon_being_removed

@@ -2164,7 +2183,7 @@ public void ReapUnreachableMembers()
_cluster.SelfAddress, nonExiting.Select(m => m.ToString()).Aggregate((a, b) => a + ", " + b), string.Join(",", _cluster.SelfRoles));

if (!exiting.IsEmpty)
_log.Warning("Marking exiting node(s) as UNREACHABLE [{0}]. This is expected and they will be removed.",
_log.Warning("Cluster Node [{0}] - Marking exiting node(s) as UNREACHABLE [{1}]. This is expected and they will be removed.",
_cluster.SelfAddress, exiting.Select(m => m.ToString()).Aggregate((a, b) => a + ", " + b));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could simplify this code and use something like this

_log.Info("Marking exiting node(s) as UNREACHABLE [{0}]. This is expected and they will be removed.",  string.Join(", ", exiting));

@@ -1125,6 +1126,7 @@ private void TryingToJoin(object message, Address joinWith, Deadline deadline)
else if (message is InternalClusterAction.JoinSeedNodes)
{
var js = message as InternalClusterAction.JoinSeedNodes;
BecomeUninitialized();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@Aaronontheweb
Copy link
Member

I approve the changes. @alexvaluyskiy how do you feel? Are there changes you still want to see made?

@alexvaluyskiy
Copy link
Contributor

@Aaronontheweb I've already reviewed the code and approved it.

@Aaronontheweb Aaronontheweb merged commit e628f73 into akkadotnet:dev Feb 19, 2017
@Aaronontheweb Aaronontheweb added this to the 1.1.4 milestone Feb 19, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants