This repository has been archived by the owner on Nov 9, 2022. It is now read-only.

Replica forests: Suboptimal initial layout + incorrect layout on rebootstrap after scaleout #616

Closed
tdiepenbrock opened this issue May 25, 2016 · 12 comments

Comments

@tdiepenbrock

Replica forests have some issues:

  • The replica forest layout on an initial bootstrap places the replicas of Node N's forests on Node N + 1; the replica forests from the last node in the cluster are placed on the first node. This is not the recommended way to lay out replica forests, because if a node goes down, one surviving node takes on double its normal workload. The better approach is to distribute each node's replica forests among the remaining cluster members, so that if a node goes down, its workload is spread across all surviving members. (The sketch after this list spells out the rule currently in effect.)
  • When adding a node to an existing cluster, a rebootstrap will cause the new node's replica forests to be placed (again) on Node 1, so Node 1 becomes increasingly overburdened with each scale-out event.
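
To make the first point concrete, the rule currently in effect boils down to something like the following sketch (purely illustrative, not actual Roxy code): every replica of every forest on host N lands on host N + 1, wrapping back to Host 1.

  xquery version "1.0-ml";

  (: Illustrative only, not Roxy code: under the current layout every replica
     of every forest on host $hostnr lands on the next host, wrapping around
     to host 1 at the end of the cluster. :)
  declare function local:current-replica-host(
    $hostnr as xs:integer,
    $host-count as xs:integer
  ) as xs:integer
  {
    ($hostnr mod $host-count) + 1
  };

  (: In a 3-host cluster: host 1 -> 2, host 2 -> 3, host 3 -> 1 :)
  for $hostnr in (1 to 3)
  return local:current-replica-host($hostnr, 3)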

We have been implementing a fix for this on our project in Roxy. The updated code does the following:

  • Replica forests from any cluster node are round-robin distributed over the next N cluster members, where N is the number of primary forests on the node. If there are fewer than N remaining cluster members, the forest distribution wraps around, so some cluster members will host more than one replica forest. (A sketch of this rule follows the list below.)
  • The existing replica forest configuration tags in ml-config.xml are used.
  • Replica forest names now have an additional "A" or "B" suffix. This is used during rebootstrapping after scaleout--see below.
  • Scaleout: When new nodes are added to the cluster, Roxy will do the following at rebootstrap time:
  1. If the current replica forest names have an "A" suffix, lay out new replica forests with a "B" suffix across all cluster members. If the current replica forests have a "B" suffix, lay out new replica forests with an "A" suffix. If there are no replicas, lay out replica forests with an "A" suffix.
  2. Once the primary forests and the new replica forests are in open/sync replicating mode, delete the old "A" or "B" replica forests.
  3. Internal forests get blindly replicated across all cluster nodes.
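
A rough sketch of the distribution rule from the first bullet (illustrative names only, not the actual implementation): the replica of primary forest f on host h goes to the host f positions further along, wrapping around the cluster.

  xquery version "1.0-ml";

  (: Illustrative sketch of the proposed rule, not the actual implementation:
     the replica of the $forestnr-th primary forest on host $hostnr goes to
     the host $forestnr positions further along, wrapping around the cluster. :)
  declare function local:proposed-replica-host(
    $hostnr as xs:integer,
    $forestnr as xs:integer,
    $host-count as xs:integer
  ) as xs:integer
  {
    (($hostnr - 1 + $forestnr) mod $host-count) + 1
  };

  (: Host 1 in a 3-host cluster with 2 forests per host:
     forest 1 -> host 2, forest 2 -> host 3 :)
  for $forestnr in (1 to 2)
  return local:proposed-replica-host(1, $forestnr, 3)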

I believe the scaleout process involves multiple Roxy commands that need to be performed manually, once inspection of the admin console/forest status shows the cluster is ready for the next step; I will verify with Joe M., who is working on this.

We will fork and do a pull request to commit the code back to the project for review. In the meantime please feel free to comment on the forest layout strategy.

@grtjn
Contributor

grtjn commented May 25, 2016

Looking forward very much to the PR for this! I agree that it is not the best algorithm, but it is smarter than you might think (unless it is broken). There have been a few tweaks fairly recently, so make sure you check out the latest version. The general idea with forest replication was:

  1. You have content forests spread across all hosts in the cluster or group.
  2. You do a relatively simple round robin, putting the first replica on hostnr + 1.
  3. Any extra replica goes on hostnr + 2, etc.

I ran various checks, and it should spread replicas evenly across all hosts.

If you are talking about replicas for Modules and such, yes, that is likely sub-optimal, but you can also target forests to specific hosts manually in ml-config.

PS: scale out is indeed a bit tricky, but we came to the conclusion that as long as you scale up and down in FILO style (first in, last out), you should be relatively safe..

@dmcassel
Collaborator

This sounds like different views of the current functionality. @tdiepenbrock, please make sure your analysis is based on the most recent code.

@tdiepenbrock
Author

tdiepenbrock commented May 25, 2016

Have these changes been merged to the master branch? We are working off of
the 1.7.3 master branch.

td

@grtjn
Contributor

grtjn commented May 25, 2016

Looks like all my work should be in master, and that would indeed be 1.7.3. The round-robin formula should be in this line:

https://github.com/marklogic/roxy/blob/master/deploy/lib/xquery/setup.xqy#L1179

... xdmp:host-name($hosts[($hostnr + $pos - 1) mod count($hosts) + 1]) ...

The fragile part of this is that $hosts and $hostnr can be unpredictable if cluster size changes..
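
Spelled out as a standalone function, the selection part amounts to roughly this (a sketch, assuming $hostnr is the 1-based position of the primary's host in $hosts and $pos is the 1-based replica number):

  xquery version "1.0-ml";

  (: Sketch of the existing selection: the chosen host depends only on the
     position of the primary's host ($hostnr) and the replica number ($pos),
     not on which forest on that host is being replicated. :)
  declare function local:replica-host(
    $hosts as xs:unsignedLong*,
    $hostnr as xs:integer,
    $pos as xs:integer
  ) as xs:unsignedLong
  {
    $hosts[($hostnr + $pos - 1) mod fn:count($hosts) + 1]
  };

  (: With at least two hosts, the first replica (pos 1) of anything on the
     first host goes to the second host :)
  xdmp:host-name(local:replica-host(xdmp:hosts(), 1, 1))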

@grtjn
Contributor

grtjn commented May 25, 2016

By the way, I wasn't saying everything should already be working as described by Thomas, but it does spread across hosts in the cluster.

Maybe it helps to take a literal example, look at what Roxy makes out of it, and then pinpoint flaws in that? I can imagine that if N drops out, N+1 can get more extra load than any of the other nodes in the cluster, but an example would help visualize it and think about improvements..

@tdiepenbrock
Author

tdiepenbrock commented May 25, 2016

Hm, Geert, using 1.7.3 we see all forests from host N being replicated to
host N + 1. All primary forests from host N + 1 are replicated to host N +
2. All primary forests from host N + 2 are replicated to host N. To give
a specific example:

For a three-node cluster, the layout Roxy 1.7.3 creates looks like this:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-003-1
App-content-rep1-003-2

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-001-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-002-1
App-content-rep1-002-2

All of Host 1's forests are replicated to Host 2, all of Host 2's forests
are replicated to Host 3, and all of Host 3's forests are replicated to
Host 1. This is not the preferred way to lay out replica forests because a
failure of one node results in the doubling of the load of one other node
in the cluster while leaving the other node at its normal load. We would
like them to be laid out like this:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-002-2
App-content-rep1-003-1

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-003-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-001-2
App-content-rep1-002-1

In the event of a failure of one node, this results in each remaining node
seeing its load increase by only 50%.

In Roxy 1.7.3, scaling this cluster by adding one node and re-bootstrapping
results in this forest layout, which puts a double load of replica forests
on Host 1 and no replica forests on Host 4:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-003-1
App-content-rep1-003-2
App-content-rep1-004-1
App-content-rep1-004-2

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-001-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-002-1
App-content-rep1-002-2

Host 4:
App-content-004-1
App-content-004-2

Even under the 1.7.3 forest layout scheme, re-bootstrapping really should
make the forest layout look like this:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-004-1
App-content-rep1-004-2

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-001-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-002-1
App-content-rep1-002-2

Host 4:
App-content-004-1
App-content-004-2
App-content-rep1-003-1
App-content-rep1-003-2

I believe the line of code you are pointing to essentially says that if we
want multiple replicas of the primary forests, they will be round-robined
to the cluster hosts. In a five node cluster, we could then tolerate the
failure of two nodes. I believe Roxy 1.7.3 will lay out the forests like
this in a notional five-node cluster:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep1-005-1
App-content-rep1-005-2
App-content-rep2-004-1
App-content-rep2-004-2

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep1-001-2
App-content-rep2-005-1
App-content-rep2-005-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-002-1
App-content-rep1-002-2
App-content-rep2-001-1
App-content-rep2-001-2

Host 4:
App-content-004-1
App-content-004-2
App-content-rep1-003-1
App-content-rep1-003-2
App-content-rep2-002-1
App-content-rep2-002-2

Host 5:
App-content-005-1
App-content-005-2
App-content-rep1-004-1
App-content-rep1-004-2
App-content-rep2-003-1
App-content-rep2-003-2

I think what we would really like to see in this case is this:

Host 1:
App-content-001-1
App-content-001-2
App-content-rep2-002-2
App-content-rep2-003-1
App-content-rep1-004-2
App-content-rep1-005-1

Host 2:
App-content-002-1
App-content-002-2
App-content-rep1-001-1
App-content-rep2-003-2
App-content-rep2-004-1
App-content-rep1-005-2

Host 3:
App-content-003-1
App-content-003-2
App-content-rep1-001-2
App-content-rep1-002-1
App-content-rep2-004-2
App-content-rep2-005-1

Host 4:
App-content-004-1
App-content-004-2
App-content-rep2-001-1
App-content-rep1-002-2
App-content-rep1-003-1
App-content-rep2-005-2

Host 5:
App-content-005-1
App-content-005-2
App-content-rep2-001-2
App-content-rep2-002-1
App-content-rep1-003-2
App-content-rep1-004-1

This layout still lets us lose two nodes, but in the failure case evenly
distributes the load across the three remaining cluster nodes. As with the
other example, scaling this cluster should result in an even distribution
of replica forests.

The code that Joe M. is writing lays out the forests this way and also
properly handles scaleout without losing HA.
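
For what it's worth, the assignment rule implied by the examples above can be
written down compactly. The sketch below (illustrative only, not Joe's code)
enumerates the desired 5-node, 2-forest, 2-replica layout:

  xquery version "1.0-ml";

  (: Illustrative only, not the actual implementation: replica copy $rep of
     primary forest $forestnr on host $hostnr is offset by
     ($rep - 1) * forests-per-host + $forestnr positions, wrapping around. :)
  let $host-count := 5
  let $forests-per-host := 2
  let $replica-count := 2
  for $hostnr in (1 to $host-count)
  for $forestnr in (1 to $forests-per-host)
  for $rep in (1 to $replica-count)
  let $target :=
    (($hostnr - 1 + ($rep - 1) * $forests-per-host + $forestnr)
     mod $host-count) + 1
  return
    (: the "00" prefix works here because host numbers are single digits :)
    fn:concat("App-content-rep", $rep, "-00", $hostnr, "-", $forestnr,
      " lives on Host ", $target)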

@grtjn
Contributor

grtjn commented May 26, 2016

Yes, the issue with scaling out is that you would have to literally migrate forests to different hosts, at least the replicated ones. Currently Roxy does calculate the layout correctly, but most of the rep forests already exist, and Roxy won't reassign them to a different host. In fact that is not allowed, other than deleting and recreating them. I was kinda hesitant going that way..

Re your example, yeah I think you are right. If you have multiple forests on one host, each first rep goes to N+1. My formula doesn't include forestnr, but that should be a relatively small change I think..
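
Something along these lines, I guess (variable names are assumptions, not the actual Roxy code):

  xquery version "1.0-ml";

  (: Sketch of that change, with assumed names: fold the forest number into
     the offset so replicas of different forests on the same host fan out
     over different target hosts. :)
  declare function local:replica-host(
    $hosts as xs:unsignedLong*,
    $hostnr as xs:integer,
    $forestnr as xs:integer,
    $forests-per-host as xs:integer,
    $pos as xs:integer
  ) as xs:unsignedLong
  {
    $hosts[($hostnr - 1 + ($pos - 1) * $forests-per-host + $forestnr)
           mod fn:count($hosts) + 1]
  };

  (: With enough hosts, forest 1 and forest 2 of the first host now land
     on different hosts :)
  for $forestnr in (1 to 2)
  return local:replica-host(xdmp:hosts(), 1, $forestnr, 2, 1)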

Anyhow, still looking forward to a PR!

@tdiepenbrock
Author

tdiepenbrock commented May 26, 2016

PR coming soon; Joe M. is working part time on this, so he will fork the
next time he picks it up.

Yes, ML does not allow you to re-assign forests to other hosts. The plan
for scaling out is to create a second set of replica forests, laid out
properly across the cluster. Once they are in place the original set of
replicas will be deleted. We plan to do this with a naming convention,
using an "A" or "B" suffix. If the "A"'s exist at scaleout, a "B" set of
replicas will be created and the "A"'s will be deleted, and vice versa.
This will work well as long as there is sufficient hardware to temporarily
handle the second set of replicas--we will test to see how this works in
practice. The alternative strategy in the case of insufficient hardware is
to delete the replicas, run without HA temporarily, and re-create the
replicas.
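
As a sketch of the suffix toggle (treating the exact "-A"/"-B" naming as an
assumption until the implementation settles it):

  xquery version "1.0-ml";

  (: Sketch only; the exact suffix convention is an assumption. Given the
     names of the existing replica forests, pick the suffix for the new set. :)
  declare function local:next-replica-suffix(
    $existing-replica-names as xs:string*
  ) as xs:string
  {
    if (fn:empty($existing-replica-names)) then "A"
    else if (some $name in $existing-replica-names
             satisfies fn:ends-with($name, "-A")) then "B"
    else "A"
  };

  (: The "A" set exists, so the new set gets a "B" suffix :)
  local:next-replica-suffix(("App-content-rep1-001-1-A", "App-content-rep1-001-2-A"))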

@grtjn
Contributor

grtjn commented May 26, 2016

Keep in mind you probably only need to delete/recreate rep forests on nodes 1 and 2: just those that wrap around from the last host, because the new host becomes the new last one..

@tdiepenbrock
Author

tdiepenbrock commented May 26, 2016

I think it depends on how many forests per host--but I'll mention the idea
to Joe and see if he can include the concept in his implementation.

@grtjn
Contributor

grtjn commented May 26, 2016

Yeah, true. Still, it probably only affects a limited number of forests..

@RobertSzkutak RobertSzkutak added this to the 1.7.5 milestone Feb 3, 2017
@RobertSzkutak
Contributor

Fixed in #684
