It takes Cluster Singleton 1 minute to move to another node #23

Open
vasily-kirichenko opened this issue Mar 27, 2018 · 1 comment

@vasily-kirichenko
Contributor

  1. Consul discovery, the settings are:
akka.cluster {
  discovery {
    provider = akka.cluster.discovery.consul
    consul {
      listener-url = "http://127.0.0.1:8500"
      class = "Akka.Cluster.Discovery.Consul.ConsulDiscoveryService, Akka.Cluster.Discovery.Consul"
      dispatcher = "consul-dispatcher"
      alive-interval = 10s
      alive-timeout = 1m
      refresh-interval = 1m
      join-retries = 3
      lock-retry-interval = 250ms
      datacenter = "dc"
      token = ""
      wait-time = 30s
    }
  }               
}
  2. Three-node cluster; a singleton is running on one of the nodes.
  3. Kill the node on which the singleton is running.
  4. A new singleton is launched after a delay of ~1 minute, which is unacceptable; the docs promise that it should take a few seconds at most.
@Horusiath
Owner

Cluster singleton migration time depends on how quickly a down node is detected. If a node is merely unreachable, we cannot assume it is dead: it may be a temporary network issue, and we don't want to end up with 2 singletons. Therefore we need to determine whether the node is actually down:

  • In a graceful scenario it's fast (the node that is going down can announce this to the others).
  • In a hard failure it's slow, since the rest of the cluster must determine whether the node is actually dead or has just disconnected for some reason and will come back up shortly. This takes time.

The docs probably refer to the time required to migrate once a down node has been detected. With Consul cluster discovery, you can play with the alive-timeout and refresh-interval settings to try to shorten that time frame. However, if I'm right, Consul itself requires at least 30-60s to detect an unhealthy node.
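
As a rough sketch of that tuning, the fragment below overrides only alive-timeout and refresh-interval from the configuration posted above. The values are illustrative assumptions, not tested recommendations, and keeping alive-timeout above the unchanged alive-interval of 10s is also an assumption about how the two interact. Whatever values are chosen, Consul's own detection delay may still set a lower bound on the failover time.

```
akka.cluster.discovery.consul {
  # Assumed value: shorter than the original 1m, still above alive-interval (10s)
  alive-timeout = 20s
  # Assumed value: refresh more often than the original 1m
  refresh-interval = 10s
}
```

All other keys keep the values from the original settings; it is worth measuring the failover time again after each change rather than assuming a proportional improvement.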
