core: support copysets or replication "teams" of servers #25194

Open
rytaft opened this issue Apr 30, 2018 · 9 comments
Labels
A-kv-distribution: Relating to rebalancing and leasing.
A-kv-replication: Relating to Raft, consensus, and coordination.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-community: Originated from the community
T-kv: KV Team

Comments

@rytaft
Collaborator

rytaft commented Apr 30, 2018

As described in this forum post, better fault tolerance for extremely large clusters can be achieved by creating replication "teams" of servers.

This phenomenon has been described in the literature as "copysets", and is summarized here: http://hackingdistributed.com/2014/02/14/chainsets/. Although copysets can be manually approximated today with our replication zones feature, we should consider explicitly supporting copysets in the future in order to better support large deployments.
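To make the intuition concrete, here is a back-of-the-envelope illustration of my own (not from the forum post or the linked article): with fully random placement and enough ranges, essentially every 3-node combination ends up hosting some range, so any simultaneous 3-node failure loses data; with disjoint copysets, only a failure that lands entirely inside a single copyset can.

```go
// Illustrative only: compares how many simultaneous 3-node failure
// combinations can lose data under random placement vs. disjoint copysets.
// The numbers (9 nodes, replication factor 3) are hypothetical.
package main

import "fmt"

// choose computes the binomial coefficient C(n, k).
func choose(n, k int) int {
	r := 1
	for i := 0; i < k; i++ {
		r = r * (n - i) / (i + 1)
	}
	return r
}

func main() {
	const nodes, rf = 9, 3
	combos := choose(nodes, rf) // all possible 3-node failures: 84
	copysets := nodes / rf      // disjoint copysets of size rf: 3

	// Random placement: with many ranges, nearly every 3-node subset
	// holds some range, so nearly all failure combos can lose data.
	fmt.Printf("random: ~%d/%d fatal failure combos\n", combos, combos)

	// Copysets: only a failure matching a copyset exactly loses data.
	fmt.Printf("copysets: %d/%d fatal failure combos (%.1f%%)\n",
		copysets, combos, 100*float64(copysets)/float64(combos))
}
```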

Jira issue: CRDB-5725

@rytaft rytaft added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-community Originated from the community A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 30, 2018
@rytaft rytaft added this to the Later milestone Apr 30, 2018
@BramGruneir
Member

This is VERY old now, but copysets were discussed in the rebalancing RFC: https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160503_rebalancing_v2.md#copysets

@BramGruneir
Member

Actually, after re-reading the RFC, I think a lot of that still holds true.

@nvanbenschoten
Member

@a-robinson following up on what I mentioned in the reading group: can this be supported in the static case, where no nodes are added or removed, by using zone configs? If zone configs supported disjunction expressions, then I think we could emulate the effect of copysets with the following two steps:

  • add an n<N> attribute to each node in a cluster
  • define replica constraints in zone configs as a disjunction of all acceptable copysets. For instance, we could define valid replica placements as: [[n1, n2, n3], [n4, n5, n6], [n7, n8, n9], [n1, n4, n7], [n2, n5, n8], [n3, n6, n9]] (not valid zone constraint syntax, of course).

If this all worked then we could create a simple tool that generates one of these copyset zone configs given a certain number of nodes and a desired scatter width.
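If zone configs grew that capability, the generator itself would be simple. Here is a hypothetical sketch using the random-permutation scheme from the copysets paper; the function name, the scatter-width arithmetic, and the plain integer output are my own inventions, and a real tool would emit actual zone config constraints:

```go
// Hypothetical sketch of the generator described above: given a node
// count, replication factor, and target scatter width, emit a list of
// copysets via repeated random permutations. The output format is
// illustrative, not valid zone config syntax.
package main

import (
	"fmt"
	"math/rand"
)

func generateCopysets(nodes, rf, scatterWidth int) [][]int {
	// Each permutation pass adds rf-1 distinct neighbors to every node's
	// scatter width, so we need ceil(scatterWidth / (rf-1)) passes.
	passes := (scatterWidth + rf - 2) / (rf - 1)
	var copysets [][]int
	for p := 0; p < passes; p++ {
		perm := rand.Perm(nodes) // random ordering of node IDs 0..nodes-1
		for i := 0; i+rf <= len(perm); i += rf {
			set := make([]int, rf)
			copy(set, perm[i:i+rf])
			copysets = append(copysets, set)
		}
	}
	return copysets
}

func main() {
	for _, cs := range generateCopysets(9, 3, 4) {
		fmt.Println(cs) // e.g. [3 7 1] -> constraint [n4, n8, n2]
	}
}
```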

cc. @nstewart

@a-robinson
Contributor

You could take those two steps to allow manual control over which copysets exist, but you'd still have to implement the rebalancing logic to actually respect the copysets and to try balancing ranges across them. And you'd be really tightly coupling the constraints/attributes on nodes with the constraints in the zone config by statically listing all the nodes in the zone config.

@nvanbenschoten
Member

you'd still have to implement the rebalancing logic to actually respect the copysets and to try balancing ranges across them

I'm not sure I follow, which is probably due to a lack of expertise in the area of zone config constraints and their effect on rebalancing decisions. Why wouldn't current rebalancing logic take the manual "copysets" into account? My understanding is that it would try to evenly distribute replicas across nodes randomly while still obeying these constraints. If that's the case then wouldn't a scatter probabilistically fill the copysets in a roughly even manner?

And you'd be really tightly coupling the constraints/attributes on nodes with the constraints in the zone config by statically listing all the nodes in the zone config.

Sure, this wouldn't work for all deployments, but this isn't really a blocker for a POC.

@a-robinson
Contributor

I didn't get that you wanted to try to use the existing Constraints logic for it. That works in theory, but a disjunction of per-replica constraints is way beyond what the allocator implementation for Constraints currently supports (and even further beyond what we've seen any need to support).

@petermattis petermattis removed this from the Later milestone Oct 5, 2018
@github-actions

github-actions bot commented Jun 6, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@erikgrinaker
Contributor

erikgrinaker commented Jun 11, 2021

We don't need to support full copysets either; even a low number of node groups (e.g. via locality tags) significantly lowers the probability of range loss in large clusters. The analytical solution got hairy, but I wrote a small script to simulate it (rangeloss.go). When losing 2 nodes in a 50-node cluster with 3 replicas and 1000 ranges, the probabilities of losing at least 1 range are:

  • 1 partition: 92%
  • 3 partitions: 67%
  • 6 partitions: 34%
  • 9 partitions: 23%
  • 12 partitions: 17%

Many users already use 3 partitions due to cloud AZ locality tags, but there are significant gains to be had by adding just a few more partitions.

I think it'd be very worthwhile to add this as one of the allocation heuristics -- of course weighed up against other concerns like load balancing.

UPDATE: The initial numbers here were too low due to a bug that biased range allocations. This also showed that simply adding 12 localities and allocating randomly doesn't work; there needs to be stronger coupling between ranges. The simulation above uses contiguous partition groups the same size as the number of replicas, i.e. with 3 replicas, ranges are allocated to partitions 1-3, 4-6, 7-9, etc.
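The real script is rangeloss.go, linked above; for a sense of the shape of the simulation, here is a rough Monte Carlo sketch of the grouped-partition scheme described in the update (my own reconstruction, not rangeloss.go itself; it covers the grouped cases of 3 or more partitions, since a single partition degenerates to random placement):

```go
// Rough re-creation of the experiment above: nodes are split round-robin
// into partitions, partitions are grouped into contiguous runs of
// `replicas`, and each range places one replica on a random node in each
// partition of a randomly chosen group. With 3 replicas, losing 2 of
// them loses quorum for the range.
package main

import (
	"fmt"
	"math/rand"
)

func lossProbability(nodes, replicas, ranges, partitions, trials int) float64 {
	// byPartition[p] lists the node IDs assigned to partition p.
	byPartition := make([][]int, partitions)
	for n := 0; n < nodes; n++ {
		byPartition[n%partitions] = append(byPartition[n%partitions], n)
	}
	groups := partitions / replicas // contiguous groups of `replicas` partitions
	losses := 0
	for t := 0; t < trials; t++ {
		// Place every range: pick a group, then one node per partition in it.
		placement := make([][]int, ranges)
		for r := range placement {
			g := rand.Intn(groups)
			for i := 0; i < replicas; i++ {
				p := byPartition[g*replicas+i]
				placement[r] = append(placement[r], p[rand.Intn(len(p))])
			}
		}
		// Fail 2 random distinct nodes; a range is lost if quorum is lost.
		failed := rand.Perm(nodes)[:2]
		down := map[int]bool{failed[0]: true, failed[1]: true}
		for _, replicaNodes := range placement {
			dead := 0
			for _, n := range replicaNodes {
				if down[n] {
					dead++
				}
			}
			if dead > replicas/2 {
				losses++
				break // at least one range lost in this trial
			}
		}
	}
	return float64(losses) / float64(trials)
}

func main() {
	for _, p := range []int{3, 6, 9, 12} {
		fmt.Printf("%2d partitions: ~%.0f%% chance of losing a range\n",
			p, 100*lossProbability(50, 3, 1000, p, 2000))
	}
}
```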

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@lunevalex lunevalex added the A-kv-distribution Relating to rebalancing and leasing. label Oct 6, 2021
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
