Improve Legion Lord election algorithm #100
Fix situations where nodes/members have the same valor: the initial legion subsystem specs leave the case of nodes with the same valor undefined. The new implementation will also set a UUID on each node (if uuid support is available). When 2 (or more) nodes with the same valor pop up, the one with the higher UUID (compared with uuid_compare()) wins (see the sketch below the list).
Quorum: it is a pretty decent way to reduce split-brain problems. Each legion will expect (by default) a quorum of 1, which means nodes will not vote for the election of a lord. A quorum value higher than 1 will enable vote mode: each node will send an additional announce with its chosen lord, and when a potential new lord receives at least Q (quorum) announces for itself it will become the new lord.
Multiple lords? Are multiple lords useful in some specific context?
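A minimal sketch of the same-valor tie-break from the first point above, assuming libuuid is available; the struct and function names here are illustrative, not the actual uWSGI internals:

```c
#include <stdint.h>
#include <uuid/uuid.h>   /* libuuid: uuid_t, uuid_generate(), uuid_compare() */

/* illustrative node descriptor, not the real struct uwsgi_legion_node */
struct legion_node {
    uint64_t valor;
    uuid_t uuid;         /* generated once at startup with uuid_generate() */
};

/* returns 1 if candidate 'a' beats candidate 'b' in the election */
static int node_wins(const struct legion_node *a, const struct legion_node *b) {
    if (a->valor != b->valor)
        return a->valor > b->valor;              /* higher valor wins */
    return uuid_compare(a->uuid, b->uuid) > 0;   /* same valor: higher UUID wins */
}
```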
The quorum you described means that I need to have N >= quorum nodes with …? There is one great feature in uWSGI that I use, statelessness: only the FastRouter nodes are "static", all backends can be moved around, added and removed without touching a single line of configuration (and it should be easy to use multicast for subscriptions, so I could make those more dynamic as well). So I am not a fan of putting … Maybe quorum could be implemented as the minimal number of joined nodes (like with cman and other tools)? So that if I use …
BTW - please add a legion API so that plugins can interact with it easily. Right now, if I don't want to reimplement (which is a fancy term for copy&paste) all …
Quorum means that a lord is not elected until Q nodes agree on it. This is required for avoiding (well, reducing) split brain, and it will be optional (by default no quorum is needed). Regarding statelessness, the relevant "new part" is the one allowing you to NOT specify the valor of a node. The valor will be somewhat random so you do not need to worry about it, and you can be sure only one node per legion will run "things".
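A rough sketch of the vote counting this implies (the structures and names are made up for illustration): each node announces the lord it voted for, and a candidate only promotes itself once at least Q announces name it.

```c
#include <string.h>

/* hypothetical view of a single announce received by this node */
struct lord_vote {
    char voter_uuid[37];    /* UUID string of the voting node */
    char chosen_uuid[37];   /* UUID string of the lord it voted for */
};

/* return 1 when at least 'quorum' nodes voted for 'my_uuid'
 * (quorum = 1 keeps the old behaviour: no real voting needed) */
static int quorum_reached(const struct lord_vote *votes, int n_votes,
                          const char *my_uuid, int quorum) {
    int count = 0, i;
    for (i = 0; i < n_votes; i++) {
        if (!strcmp(votes[i].chosen_uuid, my_uuid))
            count++;
    }
    return count >= quorum;
}
```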
Regarding the API, you only need to add the legion name to your options. In my mind a --legion-cron would look something like this: --legion-cron foobar -1 -1 -1 -1 -1 my_command which means: run the cron only if the node is the lord of the 'foobar' legion. In your plugin you would have a check at the top of run_cron(mycron) (see the sketch below).
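A sketch of how a plugin could consume such an API once the legion name is attached to its options; uwsgi_legion_i_am_the_lord() and the simplified cron struct are assumptions here, not a confirmed interface:

```c
#include <uwsgi.h>   /* assumed to declare uwsgi_legion_i_am_the_lord() */

/* simplified stand-in for the plugin's cron structure */
struct my_cron {
    char *legion;    /* legion name, NULL when the cron is not legion-bound */
    char *command;   /* command to run */
};

static void run_cron(struct my_cron *c) {
    /* legion-bound crons only run on the node that is currently the lord */
    if (c->legion && !uwsgi_legion_i_am_the_lord(c->legion))
        return;
    /* ... spawn c->command as a normal cron job here ... */
}
```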
Great, everything is clear now.
This is pretty much what I was toying with in https://github.com/prymitive/uwsgi/blob/clustered_crons/plugins/clustered_crons/clustered_crons.c
Yes, I think you will be able to simplify it a lot as soon as the API is ready.
Could legion nodes also advertise the ip:port of their uWSGI sockets to other nodes? This way a new node would join the legion cluster, and once it has joined and the cluster is quorate it could ask some other node (preferably the master) for a cache sync. This would allow for cache syncing without the need to use …
This is part of the "legion scroll" system. You will find references to it in the current code. Basically you can add a "blob": --legion-scroll = mylegion "foobar=test,youraddr=127.0.0.1:4040" It is still incomplete but will be ready for sure in time for 1.5, as it is the base for the upcoming clustered-emperor support.
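A hedged sketch of how such a configuration could look once the scroll system is complete; the legion line follows the format used elsewhere in the legion subsystem, while the address, valor and secret are made-up values:

```ini
[uwsgi]
; join the legion: name, multicast/unicast address, valor, encryption algo:secret
legion = mylegion 225.1.1.1:4242 98 bf-cbc:hello
; attach an arbitrary blob ("scroll") other nodes can read, e.g. our own address
legion-scroll = mylegion foobar=test,youraddr=127.0.0.1:4040
```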
Ok, scroll will be a great addition. Could we also consider allowing a user/plugin to provide some code that dynamically alters a node's valor? Without the ability to alter the valor, we might end up with all tasks on a single node. If those tasks are memory or CPU heavy then we might kill that node.
Altering the valor is pretty easy (all of the math is done from scratch every time in the legion subsystem, so you can trick it in various ways). Something like uwsgi_legion_set/inc/dec_valor(struct uwsgi_legion *legion, uint64_t new_valor); should be good (and without the need for locking, as we are changing a single numeric value).
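A minimal sketch of what those helpers could look like; the struct layout shown is only an illustrative subset and these functions are a proposal from the comment above, not existing uWSGI API:

```c
#include <stdint.h>

/* illustrative subset of struct uwsgi_legion */
struct uwsgi_legion {
    char *legion;      /* legion name */
    uint64_t valor;    /* this node's current valor */
};

/* plain assignments: no locking needed, only a single numeric value changes */
void uwsgi_legion_set_valor(struct uwsgi_legion *legion, uint64_t new_valor) {
    legion->valor = new_valor;
}

void uwsgi_legion_inc_valor(struct uwsgi_legion *legion, uint64_t amount) {
    legion->valor += amount;
}

void uwsgi_legion_dec_valor(struct uwsgi_legion *legion, uint64_t amount) {
    legion->valor -= amount;
}
```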
Is "legion scroll" going to be in 1.9 or later? Once scroll is ready and there is a way to sync uWSGI cache from legion master node, than we will have basic legion crons plugin. I'm not rushing or anything, I currently use redis script for locking jobs and I won't move from it anytime soon, but others might find that useful and start if so, we will get few extra legion testers ;) |
It should be (eventually it will be in 1.9.1).
I don't think there are any issues with the current legion implementation; the per-run uid feature is present and working, and scroll is ready and working. Closing as resolved.
There are 2 things that qualify for this issue that I would like to discuss, so I'll reopen it to keep the discussion in one place. I'm thinking about replacing keepalived with legion for virtual IP management on 2 of my fastrouter nodes. I was thinking about what could go wrong and, as always, split brain can occur (as with all clustering tools when there are only 2 nodes), so maybe we could add legion node(s) that can't become The Lord and use quorum=2 (or bigger, with multiple arbiters). What this would give us: if one node can't communicate with the other, currently there is no way to check whether the issue is on the local or the remote node. But if the local node can talk to the arbiter nodes and the quorum is reached, then we can decide who should become the lord. The second thing that could help is health checks: run something, and if it fails lower the valor of the local node or prevent it from becoming The Lord. Example (sketched below):
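A purely illustrative sketch of such a health check, meant to run inside a mule so no process is spawned per check; every name here (the getter, the setter, the check itself, the valor values) is hypothetical:

```c
#include <stdint.h>
#include <unistd.h>

/* assumed to exist, following the valor helpers proposed earlier in the thread */
struct uwsgi_legion;
extern void uwsgi_legion_set_valor(struct uwsgi_legion *legion, uint64_t new_valor);
extern struct uwsgi_legion *uwsgi_legion_get_by_name(const char *name); /* hypothetical */

/* hypothetical local health check: returns 0 when the node looks healthy */
static int local_health_check(void) {
    /* e.g. probe the fastrouter port, check load average, ping a backend, ... */
    return 0;
}

/* loop meant to be run from a mule */
void legion_health_loop(void) {
    struct uwsgi_legion *l = uwsgi_legion_get_by_name("mylegion");
    for (;;) {
        if (local_health_check() != 0)
            uwsgi_legion_set_valor(l, 0);    /* check failed: give up lordship */
        else
            uwsgi_legion_set_valor(l, 98);   /* healthy: restore the normal valor */
        sleep(10);
    }
}
```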
+1 for the arbiter, I will work on this soon after 1.9.7. Regarding the dynamic valor, I am thinking about changing it directly from the apps, so you can have a mule making all the checks you need without spawning a process every time.
The support for arbiters has been added. Just add nodes with a valor of 0 (they will be reported in the logs as "arbiter" instead of "node").
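A hedged config sketch of the arbiter setup described above; the legion line format is the one used in the legion subsystem, while the quorum option name and all values are assumptions for illustration:

```ini
[uwsgi]
; ordinary candidate node: non-zero valor, can become the lord
legion = mylegion 225.1.1.1:4242 98 bf-cbc:hello
; require 2 nodes to agree before electing a lord (option name assumed)
legion-quorum = mylegion 2

; on a separate arbiter instance: valor 0, counts for the quorum but never becomes lord
; legion = mylegion 225.1.1.1:4242 0 bf-cbc:hello
```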
Changing the valor from mules is OK, one can add checks there with the uwsgi timer or cron API, but I think there is a use case for more direct manipulation of a node's valor. Example: I need to do some maintenance on the lord node and I would prefer to move the lord to another node before stopping the lord's uWSGI instance (in mongodb I would call stepDown()). I would need to talk to the local instance, probably over the uWSGI socket (?), it would then set the valor to 30, and that would trigger a new election (or we need to force an update in the legion cluster). This would allow me to stop the lord node more gracefully.
We could also add something like stepDown() into uwsgi_reload(): when I need to stop the lord, I would like it to first pass its lord status to another node as gracefully as possible, so that in the virtual IP case there is as little downtime as possible. I'm not sure what happens when a uWSGI instance joined to a legion dies/reloads; does it send any message to the other nodes that it should be marked as dead? Or does it just die while the other nodes think it's up until legion-tolerance is exceeded? It looks like the second case, which has pros and cons:

- we can quickly reload an instance without the other nodes in the legion noticing it (in case valors differ and runtime uuids are not used for lord election) - in such a case the lord title stays on the reloaded node
- if the node does not reload correctly (an upgrade that went wrong) it will not come back up, so if it was the lord we need to wait until the other nodes notice that it is down

I think it's just a matter of (optionally) sending an announce with valor = 0 when we need to reload; we can make it an option so the user can pick the desired behavior. (?)
Ok, I think it is time to deal with it. What to do when an instance is reloaded (or stopped)? I suppose we need to allow users to choose what to do for each legion. In some cases it could be good to leave ASAP, in others it would be better to wait for the whole cluster to choose the best lord.
Ok, I think we can close it. The latest commit sends a "death announce" whenever the server exits or is reloaded. There is no need to allow the user to choose, as the uuid of the node is changed on startup, so effectively the old node is always lost.
The first purpose of the legion subsystem was IP takeover. In fact the protocol is very similar to the CARP one.
During the "future of the uWSGI clustering" discussions/RFC it emerged that a better master ('lord' in legion subsystem jargon) election system is needed.
Inspirations are the Paxos algorithm and the redis-sentinel implementation. The objectives are reducing split-brain cases and increasing reliability.