-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster startup hard block #5515
base: dev
Are you sure you want to change the base?
Conversation
@@ -140,7 +140,11 @@ public Cluster(ActorSystemImpl system) | |||
_readView = new ClusterReadView(this); | |||
|
|||
// force the underlying system to start | |||
_clusterCore = GetClusterCoreRef().Result; | |||
// and hard block the current thread | |||
var clusterCoreTask = Task.Run(GetClusterCoreRef); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any flags we should consider adding here? i.e. would it be better to do Task.Factory.StartNew
with LongRunning
and DenyChildAttach
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_clusterCore = GetClusterCoreRef().Result
was always locking and I think its still is.
I think what happend is that after the Ask got improved
the current thread gets used for the actorcell dispatcher.
I don't know how/where exactly, but the child-actors of cluster-daemon are using it.
´Task.Run(GetClusterCoreRef)´ should force the ask on a different thread,
this is the only thing that matters, that it the current thread does not leak.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what happend is that after the Ask got improved the current thread gets used for the actorcell dispatcher.
Interesting. Would there be other implications if this is true?
cc: @Aaronontheweb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actor system extensions implementation and the locking cluster constructor are both just anti-patterns
and we need to rework it in the future.
#5447
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe its just the "ConfigureAwait(false)" of
akka.net/src/core/Akka.Cluster/Cluster.cs
Line 155 in 44c51f6
return await _clusterDaemons.Ask<IActorRef>(new InternalClusterAction.GetClusterCoreRef(this), timeout).ConfigureAwait(false); |
The Cluster Extension is called by the ClusterActorRefProvider and it is created in the ActorSystem Startup.
The thread is the same as of the ActorSystem creator and with the normal ForkJoinExecutor they will never mix.
But with the ChannelExecutor that uses the normal ThreadPool of dotnet, the awaiting thread can be used for an ActorCell.
var task = Task.Run(...);
task.Wait();
I hope that simply resolve the problem.
task.Result
blocked at .net 4.5 for sure,
I don't know why/how the thread gets now reused.
I have not had time to review this - hopefully I'll be able to come up for air next week. |
{ | ||
case "internal-dispatcher": | ||
case "default-remote-dispatcher": | ||
Priority = TaskSchedulerPriority.High; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These default values should be baked in the default HOCON config, not in code; i.e. they have to be transparent to the user, user should only read the config file to figure out the default value of a setting, not dig through the source code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are the default in the config.
But the testkit does not load the default config
@@ -120,8 +119,26 @@ internal sealed class ChannelExecutorConfigurator : ExecutorServiceConfigurator | |||
{ | |||
public ChannelExecutorConfigurator(Config config, IDispatcherPrerequisites prerequisites) : base(config, prerequisites) | |||
{ | |||
var cfg = config.GetConfig("channel-executor"); | |||
Priority = (TaskSchedulerPriority)Enum.Parse(typeof(TaskSchedulerPriority), cfg.GetString("priority", "normal"), true); | |||
var priorityName = config.GetString("channel-executor.priority", "None") ?? "None"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless you're accessing ActorSystem.Settings.Config
directly, always assume that a Config object passed into a method can be null. Need to check for null config instance here.
@@ -120,8 +119,26 @@ internal sealed class ChannelExecutorConfigurator : ExecutorServiceConfigurator | |||
{ | |||
public ChannelExecutorConfigurator(Config config, IDispatcherPrerequisites prerequisites) : base(config, prerequisites) | |||
{ | |||
var cfg = config.GetConfig("channel-executor"); | |||
Priority = (TaskSchedulerPriority)Enum.Parse(typeof(TaskSchedulerPriority), cfg.GetString("priority", "normal"), true); | |||
var priorityName = config.GetString("channel-executor.priority", "None") ?? "None"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changes does not have anything to do with making cluster startup to block, is it? Can you remove these and move it to a new PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is required to use the channel-executor in the spec.
I added one spec that simple starts the cluster with the channel-executor and discovered this problems.
The spec itself will not trigger the original failure, because the testkit is using a SynchronisationContext
for the akka system startup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this PR can easily be broken into 2 PR, one depending to the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be a problem anymore, fixed in #5568
config.GetInt("parallelism-min"), | ||
config.GetDouble("parallelism-factor", 1.0D), // the scalar-based factor to scale the threadpool size to | ||
config.GetInt("parallelism-max")); | ||
config?.GetInt("parallelism-min", 4) ?? 4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here, these should be placed in a different PR
Fixes #5498
Fix startup with missing config for channel-executor
Changes
The constructor thread of cluster startup got reused by the default TaskScheduler used by the ChannelExecutor.