This repository has been archived by the owner on Nov 9, 2022. It is now read-only.

Replica forests: Suboptimal initial layout + incorrect layout on rebootstrap after scaleout #616

Closed
tdiepenbrock opened this issue May 25, 2016 · 12 comments

Comments

@tdiepenbrock

Replica forests have some issues:

  • The replica forest layout on an initial bootstrap places the replicas of Node N's forests on Node N + 1; the replica forests from the last node in the cluster are placed on the first node. This is not the recommended way to lay out replica forests, because if a node goes down, one surviving node takes on double its normal workload. The better approach is to distribute each node's replica forests among the remaining cluster members, so that if a node goes down, its workload is spread across all surviving members. (The sketch after this list spells out the rule currently in effect.)
  • When adding a node to an existing cluster, a rebootstrap will cause the new node's replica forests to be placed (again) on Node 1, so Node 1 becomes increasingly overburdened with each scale-out event.
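
To make the first point concrete, the rule currently in effect boils down to something like the following sketch (purely illustrative, not actual Roxy code): every replica of every forest on host N lands on host N + 1, wrapping back to Host 1.

  xquery version "1.0-ml";

  (: Illustrative only, not Roxy code: under the current layout every replica
     of every forest on host $hostnr lands on the next host, wrapping around
     to host 1 at the end of the cluster. :)
  declare function local:current-replica-host(
    $hostnr as xs:integer,
    $host-count as xs:integer
  ) as xs:integer
  {
    ($hostnr mod $host-count) + 1
  };

  (: In a 3-host cluster: host 1 -> 2, host 2 -> 3, host 3 -> 1 :)
  for $hostnr in (1 to 3)
  return local:current-replica-host($hostnr, 3)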

We have been implementing a fix for this on our project in Roxy. The updated code does the following:

  • Replica forests from any cluster node are round-robin distributed over the next N cluster members, where N is the number of primary forests on the node. If there are fewer than N remaining cluster members, the forest distribution wraps around, so some cluster members will host more than one replica forest. (A sketch of this rule follows the list below.)
  • The existing replica forest configuration tags in ml-config.xml are used.
  • Replica forest names now have an additional "A" or "B" suffix. This is used during rebootstrapping after scaleout--see below.
  • Scaleout: When new nodes are added to the cluster, Roxy will do the following at rebootstrap time:
  1. If the current replica forest names have an "A" suffix, lay out new replica forests with a "B" suffix across all cluster members. If the current replica forests have a "B" suffix, lay out new replica forests with an "A" suffix. If there are no replicas, lay out replica forests with an "A" suffix.
  2. Once the primary forests and the new replica forests are in open/sync replicating mode, delete the old "A" or "B" replica forests.
  3. Internal forests get blindly replicated across all cluster nodes.
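
A rough sketch of the distribution rule from the first bullet (illustrative names only, not the actual implementation): the replica of primary forest f on host h goes to the host f positions further along, wrapping around the cluster.

  xquery version "1.0-ml";

  (: Illustrative sketch of the proposed rule, not the actual implementation:
     the replica of the $forestnr-th primary forest on host $hostnr goes to
     the host $forestnr positions further along, wrapping around the cluster. :)
  declare function local:proposed-replica-host(
    $hostnr as xs:integer,
    $forestnr as xs:integer,
    $host-count as xs:integer
  ) as xs:integer
  {
    (($hostnr - 1 + $forestnr) mod $host-count) + 1
  };

  (: Host 1 in a 3-host cluster with 2 forests per host:
     forest 1 -> host 2, forest 2 -> host 3 :)
  for $forestnr in (1 to 2)
  return local:proposed-replica-host(1, $forestnr, 3)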

I believe the scaleout process involves multiple Roxy commands that need to be performed manually, once inspection of the admin console/forest status shows the cluster is ready for the next step; I will verify with Joe M., who is working on this.

We will fork and do a pull request to commit the code back to the project for review. In the meantime please feel free to comment on the forest layout strategy.

@grtjn
Contributor

grtjn commented May 25, 2016

Looking forward very much to the PR for this! I agree that it is not the best algorithm, but it is smarter than you might think (unless it is broken). There have been a few tweaks fairly recently, so make sure you check out the latest version. The general idea with forest replication was:

  1. You have content forests spread across all hosts in the cluster or group.
  2. You do a relatively simple round robin, putting the first replica on hostnr + 1.
  3. Any extra replica goes on hostnr + 2, etc.

I ran various checks, and it should spread replicas evenly across all hosts.

If you are talking about replicas for Modules and such, yes, that is likely sub-optimal, but you can also target forests to specific hosts manually in ml-config.

PS: scale out is indeed a bit tricky, but we came to the conclusion that as long as you scale up and down in FILO style (first in, last out), you should be relatively safe..

@dmcassel
Collaborator

This sounds like different views of the current functionality. @tdiepenbrock, please make sure your analysis is based on the most recent code.

@tdiepenbrock
Author

tdiepenbrock commented May 25, 2016

Have these changes been merged to the master branch? We are working off of
the 1.7.3 master branch.

td

@grtjn
Contributor

grtjn commented May 25, 2016

Looks like all my work should be in master, and that would indeed be 1.7.3. The round-robin formula should be in this line:

https://github.com/marklogic/roxy/blob/master/deploy/lib/xquery/setup.xqy#L1179

... xdmp:host-name($hosts[($hostnr + $pos - 1) mod count($hosts) + 1]) ...

The fragile part of this is that $hosts and $hostnr can be unpredictable if cluster size changes..
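
Spelled out as a standalone function, the selection part amounts to roughly this (a sketch, assuming $hostnr is the 1-based position of the primary's host in $hosts and $pos is the 1-based replica number):

  xquery version "1.0-ml";

  (: Sketch of the existing selection: the chosen host depends only on the
     position of the primary's host ($hostnr) and the replica number ($pos),
     not on which forest on that host is being replicated. :)
  declare function local:replica-host(
    $hosts as xs:unsignedLong*,
    $hostnr as xs:integer,
    $pos as xs:integer
  ) as xs:unsignedLong
  {
    $hosts[($hostnr + $pos - 1) mod fn:count($hosts) + 1]
  };

  (: With at least two hosts, the first replica (pos 1) of anything on the
     first host goes to the second host :)
  xdmp:host-name(local:replica-host(xdmp:hosts(), 1, 1))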

@grtjn
Contributor

grtjn commented May 25, 2016

By the way, I wasn't saying everything should already be working as described by Thomas, but it does spread across hosts in the cluster.

Maybe it helps to take a literal example, look at what Roxy makes out of it, and then pinpoint flaws in that? I can imagine that if N drops out, N+1 can get more extra load than any of the other nodes in the cluster, but an example would help visualize it and think about improvements..

@tdiepenbrock
Author

tdiepenbrock commented May 25, 2016

Hm, Geert, using 1.7.3 we see all forests from host N being replicated to
host N + 1. All primary forests from host N + 1 are replicated to host N +
2. All primary forests from host N + 2 are replicated to host N. To give
a specific example:

For a three-node cluster, the layout Roxy 1.7.3 creates looks like this:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-003-1
App-content-rep1-003-2

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-001-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-002-1
App-content-rep1-002-2

All of Host 1's forests are replicated to Host 2, all of Host 2's forests
are replicated to Host 3, and all of Host 3's forests are replicated to
Host 1. This is not the preferred way to lay out replica forests because a
failure of one node results in the doubling of the load of one other node
in the cluster while leaving the other node at its normal load. We would
like them to be laid out like this:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-002-2
App-content-rep1-003-1

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-003-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-001-2
App-content-rep1-002-1

In the event of a failure of one node, this results in each remaining node
seeing its load increase by only 50%.

In Roxy 1.7.3, scaling this cluster by adding one node and re-bootstrapping
results in this forest layout, which puts a double load of replica forests
on Host 1 and no replica forests on Host 4:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-003-1
App-content-rep1-003-2
App-content-rep1-004-1
App-content-rep1-004-2

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-001-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-002-1
App-content-rep1-002-2

Host 4:
App-content-004-1
App-content-004-2

Even under the 1.7.3 forest layout scheme, re-bootstrapping really should
make the forest layout look like this:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-004-1
App-content-rep1-004-2

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-001-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-002-1
App-content-rep1-002-2

Host 4:
App-content-004-1
App-content-004-2
App-content-rep1-003-1
App-content-rep1-003-2

I believe the line of code you are pointing to essentially says that if we
want multiple replicas of the primary forests, they will be round-robined
to the cluster hosts. In a five node cluster, we could then tolerate the
failure of two nodes. I believe Roxy 1.7.3 will lay out the forests like
this in a notional five-node cluster:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-005-1
App-content-rep1-005-2
App-content-rep2-004-1
App-content-rep2-004-2

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-001-2
App-content-rep2-005-1
App-content-rep2-005-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-002-1
App-content-rep1-002-2
App-content-rep2-001-1
App-content-rep2-001-2

Host 4:
App-content-004-1
App-content-004-2
App-content-rep1-003-1
App-content-rep1-003-2
App-content-rep2-002-1
App-content-rep2-002-2

Host 5:
App-content-005-1
App-content-005-2
App-content-rep1-004-1
App-content-rep1-004-2
App-content-rep2-003-1
App-content-rep2-003-2

I think what we would really like to see in this case is this:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep2-002-2
App-content-rep2-003-1
App-content-rep1-004-2
App-content-rep1-005-1

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep2-003-2
App-content-rep2-004-1
App-content-rep1-005-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-001-2
App-content-rep1-002-1
App-content-rep2-004-2
App-content-rep2-005-1

Host 4:
App-content-004-1
App-content-004-2
App-content-rep2-001-1
App-content-rep1-002-2
App-content-rep1-003-1
App-content-rep2-005-2

Host 5:
App-content-005-1
App-content-005-2
App-content-rep2-001-2
App-content-rep2-002-1
App-content-rep1-003-2
App-content-rep1-004-1

This layout still lets us lose two nodes, but in the failure case evenly
distributes the load across the three remaining cluster nodes. As with the
other example, scaling this cluster should result in an even distribution
of replica forests.

The code that Joe M. is writing lays out the forests this way and also
properly handles scaleout without losing HA.
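
For what it's worth, the assignment rule implied by the examples above can be
written down compactly. The sketch below (illustrative only, not Joe's code)
enumerates the desired 5-node, 2-forest, 2-replica layout:

  xquery version "1.0-ml";

  (: Illustrative only, not the actual implementation: replica copy $rep of
     primary forest $forestnr on host $hostnr is offset by
     ($rep - 1) * forests-per-host + $forestnr positions, wrapping around. :)
  let $host-count := 5
  let $forests-per-host := 2
  let $replica-count := 2
  for $hostnr in (1 to $host-count)
  for $forestnr in (1 to $forests-per-host)
  for $rep in (1 to $replica-count)
  let $target :=
    (($hostnr - 1 + ($rep - 1) * $forests-per-host + $forestnr)
     mod $host-count) + 1
  return
    (: the "00" prefix works here because host numbers are single digits :)
    fn:concat("App-content-rep", $rep, "-00", $hostnr, "-", $forestnr,
      " lives on Host ", $target)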

@grtjn
Contributor

grtjn commented May 26, 2016

Yes, the issue with scaling out is that you would have to literally migrate forests to different hosts, at least the replicated ones. Currently Roxy does calculate the layout correctly, but most of the rep forests already exist, and Roxy won't reassign them to a different host. In fact that is not allowed, other than deleting and recreating them. I was kinda hesitant going that way..

Re your example, yeah I think you are right. If you have multiple forests on one host, each first rep goes to N+1. My formula doesn't include forestnr, but that should be a relatively small change I think..
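
Something along these lines, I guess (variable names are assumptions, not the actual Roxy code):

  xquery version "1.0-ml";

  (: Sketch of that change, with assumed names: fold the forest number into
     the offset so replicas of different forests on the same host fan out
     over different target hosts. :)
  declare function local:replica-host(
    $hosts as xs:unsignedLong*,
    $hostnr as xs:integer,
    $forestnr as xs:integer,
    $forests-per-host as xs:integer,
    $pos as xs:integer
  ) as xs:unsignedLong
  {
    $hosts[($hostnr - 1 + ($pos - 1) * $forests-per-host + $forestnr)
           mod fn:count($hosts) + 1]
  };

  (: With enough hosts, forest 1 and forest 2 of the first host now land
     on different hosts :)
  for $forestnr in (1 to 2)
  return local:replica-host(xdmp:hosts(), 1, $forestnr, 2, 1)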

Anyhow, still looking forward to a PR!

@tdiepenbrock
Author

tdiepenbrock commented May 26, 2016

PR coming soon; Joe M. is working part time on this, so he will fork the
next time he picks it up.

Yes, ML does not allow you to re-assign forests to other hosts. The plan
for scaling out is to create a second set of replica forests, laid out
properly across the cluster. Once they are in place the original set of
replicas will be deleted. We plan to do this with a naming convention,
using an "A" or "B" suffix. If the "A"'s exist at scaleout, a "B" set of
replicas will be created and the "A"'s will be deleted, and vice versa.
This will work well as long as there is sufficient hardware to temporarily
handle the second set of replicas--we will test to see how this works in
practice. The alternative strategy in the case of insufficient hardware is
to delete the replicas, run without HA temporarily, and re-create the
replicas.
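
As a sketch of the suffix toggle (treating the exact "-A"/"-B" naming as an
assumption until the implementation settles it):

  xquery version "1.0-ml";

  (: Sketch only; the exact suffix convention is an assumption. Given the
     names of the existing replica forests, pick the suffix for the new set. :)
  declare function local:next-replica-suffix(
    $existing-replica-names as xs:string*
  ) as xs:string
  {
    if (fn:empty($existing-replica-names)) then "A"
    else if (some $name in $existing-replica-names
             satisfies fn:ends-with($name, "-A")) then "B"
    else "A"
  };

  (: The "A" set exists, so the new set gets a "B" suffix :)
  local:next-replica-suffix(("App-content-rep1-001-1-A", "App-content-rep1-001-2-A"))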

@grtjn
Contributor

grtjn commented May 26, 2016

Keep in mind you probably only need to delete/recreate rep forests on nodes 1 and 2: just those that wrap around from the last host, because the new host becomes the new last one..

@tdiepenbrock
Author

tdiepenbrock commented May 26, 2016

I think it depends on how many forests per host--but I'll mention the idea
to Joe and see if he can include the concept in his implementation.

@grtjn
Contributor

grtjn commented May 26, 2016

Yeah, true. Still, it probably only affects a limited number of forests..

@RobertSzkutak RobertSzkutak added this to the 1.7.5 milestone Feb 3, 2017
@RobertSzkutak
Contributor

Fixed in #684
