Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the use and behavior of the --join flag #3395

Closed
8 of 9 tasks
rmloveland opened this issue Jul 17, 2018 · 8 comments · Fixed by #7893
Closed
8 of 9 tasks

Clarify the use and behavior of the --join flag #3395

rmloveland opened this issue Jul 17, 2018 · 8 comments · Fixed by #7893
Assignees
Labels
O-external Origin: Issue comes from external users. T-incorrect-or-unclear-info T-missing-info

Comments

@rmloveland
Copy link
Contributor

rmloveland commented Jul 17, 2018

We need to clarify:

  • Why do we recommend specifying 3-5 join addresses?

    • This should answer what happens when one of the join target nodes goes down or can't be reached (What happens if --join host goes down? #320), i.e., no problem; we just try the next node in the list. From @bdarnell: "it's useful in environments where new nodes can be started automatically but it's inconvenient to change the command line. (this may sound like an unusual scenario, but google's borg works this way). You can set the command line up with multiple hosts so that new nodes will be able to find someone to talk to even if the original node has gone away."
  • What is the best practice for --join lists in a multi-region cluster? Include a few nodes from each region?

  • Why is it bad to specify too many nodes in --join? (see below)

  • What to do if you don't know fixed IP addresses on startup?

  • On restart, is --join required (no) and why not? As of 19.2, --join is always required when using cockroach start.

    • This should explain how, once a node has connected to an existing cluster, it gets the list of nodes already in the cluster via gossip, stores the list locally, and uses that list on subsequent startups.
  • When is a new cluster initialized on node startup. And what exactly happens during initialization.

    • I think this is probably already clear in the docs, but we should verify.
    • Ideally, this would cover updates to the cockroach init docs to include more detail from the init RFC.
  • Prior to 20.1, if you run cockroach start without a --join
    flag, and it does not find the bootstrapping info it needs on
    disk, it will create a brand new cluster.

    • If it does find the info on disk, the node will rejoin a
      cluster on node startup even if you restart it without a
      --join flag.
  • In 20.1+ cockroach start without --join will no longer
    create new clusters, once
    cli: remove auto-init with cockroach start without --join cockroach#44112 is merged.

  • If you point a --join flag at a load balancer, it's not good,
    for the following reasons:

    • Using a LB will defeat CockroachDB's internal algorithm to
      minimize the number of peer-to-peer connections for gossip.

    • It will create a dependency cycle between the load balancer
      configuration and the database, which is not good.

Previous description:

In
Start a Node
we do mention that

When starting a multi-node cluster for the first time, set this flag
to the addresses of 3-5 of the initial nodes. Then run the
cockroach init command …

However, apparently some weird/bad behavior can occur related to the
resulting gossip network graph. A doc update that fixes this issue
should probably:

  1. Make it clear that you shouldn't/don't need to do this (add every
    node on the --join flag)

  2. Clarify why not/what that weird/bad behavior is that can result
    (at a high-ish level ideally)

Related:

@rmloveland
Copy link
Contributor Author

@bdarnell do you (1) agree that we should update the docs in this spirit, and if so (2) have a recommendation for what we should tell users about why they should avoid doing more than 3-5 nodes in the --join args?

@bdarnell
Copy link
Contributor

  1. Yeah, we should probably update the docs.
  2. We recommend using a limited number of nodes in the join flag because the more join hosts are given, the longer it can take the cluster to reach a stable state on startup. By limiting the join flag to a few hosts, those hosts can act as brokers to distribute the rest of the addresses and help the cluster get into a connected and balanced state faster. Note that it is important that the same join hosts be given to every node; if each node is given a different set of join hosts then the cluster may not be able to converge on a stable configuration.

@jseldess jseldess changed the title Clarify that specifying too many nodes in the --join flag is not good Clarify the use and purpose of the --join flag Sep 21, 2018
@jseldess jseldess changed the title Clarify the use and purpose of the --join flag Clarify the use and behavior of the --join flag Sep 21, 2018
@jseldess jseldess added O-external Origin: Issue comes from external users. A-ops&tools labels Nov 9, 2018
@lnhsingh lnhsingh added the P-3 Low priority; nice-to-have label Nov 14, 2018
@rmloveland rmloveland self-assigned this Jan 30, 2020
@jseldess jseldess assigned taroface and unassigned rmloveland Jul 23, 2020
@jseldess jseldess added A-deploy-ops and removed A-io P-3 Low priority; nice-to-have labels Jul 23, 2020
@jseldess
Copy link
Contributor

@taroface, @johnrk, this is an older issue that was mis-assigned. I think we should address this one soonish.

@a-entin
Copy link

a-entin commented Jul 23, 2020

I understand the --join flag is mandatory since 19.2 (per doc) and it's a good thing. Since a node joining a cluster needs to find just 1 node already in the cluster (from general raft standpoint, not our implementation knowledge) and since we need redundancy, it seem to make sense in a multi-region setup to include 2 nodes from each region (presuming enough nodes and cluster configured to survive a region failure). Speaking more generally... the redundancy level is determined [by user] based on examination of all failure permutations/scenarios they want the cluster to tolerate. Shouldn't the --join list be constructed based on exactly same exercise? Pardon a trivial point - every operation in the cluster has to be at the same level of FT to eliminate the weakest link.

It's somewhat counter-intuitive why limiting #nodes in --join would be a good practice, perhaps it's our implementation caveat.

The latter, and more general question - what's a good order / pattern of entries in the --join list (doesn't matter, interleaving in some fashion, my-region first, other) - begs understanding how we walk the --join list. E.g. always start with first on the list or perhaps look at all entries an pick ones to connect first on the same subnet? And if fails to connect to an existing node in --join list - what's the timeout before next is tried?

@bdarnell
Copy link
Contributor

Including two nodes per region is my recommendation as well (for multi-region clusters. For single-region, adding a few more nodes (3-5 total) might make sense.

Currently it looks like we walk the join list in a random order, trying one address per second. This is not good for the first-time startup of a large cluster - it takes a while for all the nodes to converge onto the same gossip network (we saw some real-world issues with this a long time ago). So this is why we recommend limiting the number of addresses used here (there are certainly things we could do here to make it work better with large join lists, but using fewer join addresses is a much simpler workaround)

@taroface
Copy link
Contributor

Including two nodes per region is my recommendation as well (for multi-region clusters.

@bdarnell I think our multi-region Kubernetes docs currently set 3 --join addresses per cluster, corresponding to the 3 pods created by each StatefulSet. Will this be OK, or does the --join flag need to have just 2 addresses per region in this case?

@bdarnell
Copy link
Contributor

There's not a precise numeric requirement here; 2 vs 3 nodes per region won't make a difference. The general guidelines are:

  • Use more than one node
  • Unless your cluster is small, don't list all the nodes
  • Pick nodes that are spread out across failure domains
  • Use the same list for all nodes' join flags

For a 9-node multi-region cluster, you're on the edge of "small". It would be fine to just put all 9 in the join flag as long as you stay that size, but as you scale you'll probably want to stop doing that so it's probably a good idea to limit yourself to 2 nodes per region at the start.

@taroface
Copy link
Contributor

taroface commented Aug 6, 2020

Closed with #7893.

@taroface taroface closed this as completed Aug 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-external Origin: Issue comes from external users. T-incorrect-or-unclear-info T-missing-info
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants