
[WIP] #1

Merged: 18 commits from the wip branch into master on Nov 10, 2016

Conversation


@tgross tgross commented Jul 8, 2016

@yosifkit @misterbisson I'm opening this PR with the work in the wip branch so that we have a place for review and comment.

 - the master node in mongo will grab the primary lock in consul, or grab it first when the replica set is not yet started (see the sketch below)
 - the master node is the only node that updates the replica set config
 - the replica set config is automatically updated from consul's list of mongo nodes
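
A minimal sketch of that lock acquisition, assuming python-consul and an illustrative key name (the real logic lives in manage.py and may differ):

import socket

import consul  # python-consul

def acquire_primary_lock(key='mongodb/primary-lock', ttl=120):
    """Try to become the primary: create a consul session and atomically
    write our hostname to the lock key. Returns True only if this node
    won the acquire."""
    c = consul.Consul()
    session_id = c.session.create(name='mongodb-primary', ttl=ttl)
    return c.kv.put(key, socket.gethostname(), acquire=session_id)
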
CMD [ \
"containerpilot", \
"mongod", \
"--replSet=joyent" \
]
@tgross (Author)

We probably should show this CMD (specifically the --replSet) being overridden in the docker-compose.yml file so that it's clear to someone reading through this that you can create multiple replSets by having multiple service blocks.

@yosifkit (Contributor)

Done (push coming shortly); also updating to include a minimal README.md on how to run.

@yosifkit (Contributor)

The two most recent commits address the following:

  • comments for mongo_update_replset_config
  • update readme with some usage
  • expose consul port 8500 in local development
  • only one ContainerPilot/Consul "state" for mongodb to prevent ContainerPilot reloads
  • use replSetGetStatus instead of is_primary
  • use consul key with non-expiring token to determine "initdb"
  • add comment to code about scaling above 7 nodes


# ---------------------------------------------------------

#class ContainerPilot(object):
@tgross (Author), Jul 21, 2016

If we're not using this we should just cut it rather than commenting it out.

Edit: nevermind, I see you've got a TODO about this below.

 - add a pre_stop method to containerpilot
@yosifkit (Contributor)

Improved mongo_update_replset_config; added a preStop function so that a primary can step down.

log.debug(e)
try:
    # force the stepdown even if no secondary has caught up
    local_mongo.admin.command('replSetStepDown', 60, force=True)
@yosifkit (Contributor)

After some more thought, we should drop the force=True part. I think it will be safer if the user explicitly kills the container or does a forced stepdown themselves.

That is, if the container gets a SIGTERM and can't step down from being primary, it should fail to stop. The stepdown fails if there is no secondary, or if the secondaries were too far behind and still didn't catch up in the 8-second catch-up period.

Does that sound right to you?
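
A minimal sketch of the non-forced variant being discussed, assuming pymongo (the function name and return convention are illustrative):

import logging

from pymongo.errors import AutoReconnect, OperationFailure

log = logging.getLogger(__name__)

def try_step_down(local_mongo, step_down_secs=60):
    """Ask the primary to step down without force=True. Returns False if
    mongod refuses (e.g. no electable secondary has caught up), so the
    pre_stop hook can exit non-zero and the stop is not treated as clean."""
    try:
        local_mongo.admin.command('replSetStepDown', step_down_secs)
        return True
    except AutoReconnect:
        # a successful stepdown closes all client connections, so a
        # dropped connection here means it worked
        return True
    except OperationFailure as ex:
        log.debug(ex)
        return False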

@tgross (Author)

I think it will be safer if the user explicitly kills the container

If a container is stopped via docker stop it receives SIGTERM, and after the timeout it gets SIGKILL anyway. Will we be in a sane state if that happens?

@yosifkit (Contributor)

I think the current forced stepdown and a SIGKILL of the primary node will leave us in about the same state (after an attempted stepdown): a new master would be elected (if there are enough nodes left for a vote) and there could likely be a rollback. So it seems better to let the signals tell us to "clean up" for shutdown (SIGTERM), and if we can't step down from primary, let the user do the kill on purpose. Maybe we add documentation around using docker kill -s SIGTERM rather than docker stop? Or maybe recommend higher timeouts for docker stop and ContainerPilot? Can we adjust the stopTimeout in ContainerPilot from the Python code so that it can be set via env?

@tgross (Author), Jul 26, 2016

Or maybe recommend higher timeouts for docker stop and ContainerPilot?

Keep in mind that lots of people will be using some kind of higher-level scheduler so telling them not to use docker stop might not be relevant. Setting the timeouts is the right way to go.

Can we adjust the stopTimeout in ContainerPilot from the Python code so that it can be set via env?

You can interpolate it in the ContainerPilot config from the env. See https://www.joyent.com/containerpilot/docs/configuration#template-configuration

@yosifkit (Contributor)

I've started making these timeouts configurable from the environment, but I am unable to set stopTimeout in the ContainerPilot JSON since the Go code expects an int and not a string. Is there another way around it? Maybe use the ContainerPilot class that we still have and write out the updated config using the environment variables?

// works
"stopTimeout": 9,
// fails
"stopTimeout": "{{.STOP_TIMEOUT}}",

@tgross (Author)

This should work just fine, right? "stopTimeout": {{.STOP_TIMEOUT}},

@yosifkit (Contributor)

I had tried it and gotten parse errors, but after another rebuild it works fine now. It seems I did not save the Dockerfile with the ENV set. 😮 😞

@yosifkit (Contributor) commented Aug 2, 2016

So I have a case that I think we need to handle:

  1. have a replica set of 3 or more nodes
  2. docker kill -s TERM current-primary
  3. primary does stepdown and the other nodes begin election
  4. primary removes itself from consul and exits
  5. container pilot delivers onChange
    1. there is not yet a primary node so it is ignored
  6. primary is elected
  7. the mongo replica set config still contains the stopped node
    1. this might affect a vote later (since a majority of voting members is needed to elect a primary)

So the question is, what should we do about the mongo config being out of sync with what is in consul?

The worst case is someone scaling their cluster down from, say, 5 nodes to 3, where one of the stopped containers is the primary and the other is stopped while there is still no elected primary; the end result is a 3-node cluster that has 5 nodes in its config. If another node goes down, it will fail to have a primary anymore. If a new node is added, or a node is removed while there is an active primary, then the replica set config will become consistent with the config in consul.

One solution is to just add mongo_update_replset_config to the health check. A second solution is to add a periodic task that runs mongo_update_replset_config on the primary node, just as onChange does, but with a longer period (see the sketch below).
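
A minimal sketch of the second option, assuming pymongo and the mongo_update_replset_config helper discussed in this PR (the signature used here is an assumption):

def reconcile_replset_config(local_mongo, consul):
    """Periodic task: re-sync the replica set config from consul's list
    of mongo nodes, but only when this node is currently the primary."""
    status = local_mongo.admin.command('replSetGetStatus')
    self_member = next(m for m in status['members'] if m.get('self'))
    if self_member['stateStr'] != 'PRIMARY':
        return
    # mongo_update_replset_config is the existing helper in manage.py;
    # its exact signature here is illustrative
    mongo_update_replset_config(local_mongo, consul)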

@misterbisson (Contributor)

what should we do about the mongo config being out of sync with what is in consul? The worst case is if someone is scaling down [...] and one of the containers stopped is the primary...

Good question. Can we add a preStop behavior that blocks if the instance is the primary while it demotes the primary and waits for a new primary to be elected? If we set the stop timeout in Docker to a relatively long value (ten minutes, an hour?), we can give the cluster plenty of time to reconcile itself in this condition.
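
A minimal sketch of what that blocking preStop wait might look like, assuming pymongo (names and timeouts are illustrative):

import time

def wait_for_new_primary(local_mongo, timeout=600, poll_interval=2):
    """Block until some other replica set member reports itself as PRIMARY,
    or give up after `timeout` seconds. Intended to run in pre_stop after
    this node has stepped down."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = local_mongo.admin.command('replSetGetStatus')
        for member in status['members']:
            if member['stateStr'] == 'PRIMARY' and not member.get('self'):
                return True
        time.sleep(poll_interval)
    return False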

It's also maybe fair to ticket that as a bug and return to it. This is shaping up nicely and looks close (or nearly close) enough to MVP to be ready for merging.

@yosifkit (Contributor) commented Aug 15, 2016

@misterbisson, I implemented a "wait for election" after the primary node steps down (all within the pre_stop function, which I will push up shortly). But the problem is that the other mongodb containers get the on_change only ~200 milliseconds after the primary's pre_stop is called, before it has finished, so the election is also still unfinished. From my perspective, it looks like Container Pilot calls the pre_stop asynchronously and then immediately deregisters from consul. Am I doing something wrong or unexpected with pre_stop?

On a related note, I set stopTimeout to 5 for Container Pilot and I am not sure that it changed anything; my single mongodb instance still took almost a minute for just pre_stop. Is the stopTimeout supposed to limit how long pre_stop takes?

@tianon (Contributor) commented Aug 15, 2016

The conversation in and around TritonDataCenter/containerpilot#200 seems to imply that deregistration prior to invoking preStop is intentional, and unlikely to change. Given that the service is still running at the moment of preStop, this seems kind of strange -- if the service is still running, shouldn't it still be registered? Could this behavior perhaps become configurable so that preStop can do useful things like ensuring that a new "primary" is elected before all the other cluster nodes get their onChange triggered by the deregistration of the current "primary"?

Additionally, given that SIGTERM is a signal, and by itself isn't actually terminal, wouldn't it also make sense for preStop to be able to exit non-zero and cancel the termination of the service? This would mean that users who are simply using docker stop blindly would be potentially killing live services, but only in the rare edge case that the new primary election resulted in this node itself being re-elected (which is the main error case it'd be prudent to be able to cancel for here). Thus, users concerned strongly with the consistent, safe state of their cluster would use docker kill -s TERM instead of docker stop for giving their containers the chance to stop gracefully, but still allowing for the possibility that such a thing isn't currently possible and might require manual intervention of some kind (without Container Pilot force-killing their primary node simply because they asked it to shut down gracefully).

@misterbisson (Contributor)

I set stopTimeout to 5 for Container Pilot and I am not sure that it changed anything; my single mongodb instance still took almost a minute for just pre_stop.

stopTimeout is supposed to start after preStop completes and ContainerPilot attempts to SIGTERM the main process. That's implemented in ContainerPilot's core/app.go.

Your description (preStop doesn't time out) doesn't sound like a bug, but it probably needs clarification in the docs. Also to clarify: the total of preStop + main process stop + postStop must all complete before the daemon sends the SIGKILL, so docker stop -t <seconds> is required for anything that takes time to stop here.

Could [preStop service deregistration] behavior perhaps become configurable so that preStop can do useful things like ensuring that a new "primary" is elected before all the other cluster nodes get their onChange triggered by the deregistration of the current "primary"?

This is a good use case to consider and it deserves more thought/discussion.

wouldn't it also make sense for preStop to be able to exit non-zero and cancel the termination of the service [in the situation where the user calls docker stop naively]?

I'm putting words in your mouth with "naively" there, but is it a fair distillation of the concern? How does the application enforce the refusal to shut down back to the scheduler/daemon/infrastructure? Is there any way to tell docker stop to allow a container to talk back like that?

@tgross (Author) commented Aug 16, 2016

if the service is still running, shouldn't it still be registered?

In stateful applications in particular this means client applications will still be sending writes until their TTL expires. The preStop is your opportunity to wait until writes stop coming so that you don't have a cluster in an inconsistent state before you do an election. It does mean that the client applications need to know what to do if they have no place to send writes temporarily.

Could this behavior perhaps become configurable

I'm certainly open to changing the behavior if we think we've got the wrong one, but another configuration option at this point has to really sell itself.

Should the "how should ContainerPilot behave?" (vs the "how does ContainerPilot behave?") portion of this discussion get moved over to a ContainerPilot issue?

@yosifkit (Contributor)

It seems fair to bring the discussion about consul registration elsewhere, though what it means for this image is that if you docker kill -s TERM your primary node, the mongodb replica config will be in an inconsistent state until there is another event that would generate an on_change on the primary node.

The way to work around this is to first docker exec -it primary-node mongo --eval 'rs.stepDown()' before shutting down the primary node.

As for what is left here, there are two major parts I can think of:

  • documentation for all of the possible ENVs (and maybe stepdown caveats)
  • backups to manta

I'll see if I can get the docs up this afternoon and then we can stick the backups into a new issue (especially since the recommended backup strategies are MongoDB Inc's "Cloud Manager", "Ops Manager", or file system snapshots).

@yosifkit (Contributor)

Pushed my docs changes (might want a technical writer to make it better 😉). Let me know if you want anything else in this PR.

One more thing to note: do not scale down below a majority of your configured voting members, or you will lose quorum and the nodes will fail to elect a primary (it looks like a netsplit). In other words, if you have 7 nodes and want only 3, you must first scale to 5, make sure the replica set config has been updated to just those 5 nodes, and only then scale to 3. It is probably best to scale down one at a time, and if the current primary is the node you want to destroy, you should do the stepdown first.

So maybe we need safe scaling down steps:

  • docker exec -it node-to-kill mongo --quiet --eval 'rs.stepDown()'
    • "network error" or "not primary so can't step down" is good output

      2016-08-22T23:40:27.475+0000 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:27017'  :
      DB.prototype.runCommand@src/mongo/shell/db.js:135:1
      DB.prototype.adminCommand@src/mongo/shell/db.js:153:16
      rs.stepDown@src/mongo/shell/utils.js:1182:12
      @(shell eval):1:1
      
      # or
      { "ok" : 0, "errmsg" : "not primary so can't step down", "code" : 10107 }
  • docker kill -s TERM node-to-kill
  • docker rm or whatever you want now

@jonatanblue (Contributor)

This is awesome. Thank you for working on this!

At the MongoDB build step when running docker-compose -f local-compose.yml up, pip install is failing when building cryptography:

build/temp.linux-x86_64-2.7/_openssl.c:429:30: fatal error: openssl/opensslv.h: No such file or directory
 #include <openssl/opensslv.h>
                              ^
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Adding libssl-dev to the packages installed by apt-get resolves it.

add libssl-dev to fix pip cryptography build issue
@yosifkit (Contributor)

@misterbisson and @tgross, this looks fine to me. Let me know if there is anything else you would like me to add.

@tianon (Contributor) commented Sep 26, 2016

@tgross @misterbisson anything else you'd like @yosifkit to update here? 😄 😇

@jasonpincin (Contributor)

The use of socket.gethostname() and relying on that same hostname being embedded in the Consul service name is creating problems when scaling to more than one replica set member on Triton. For example:

hostname (as reported by socket.gethostname()): cae35330f7c7
consul service name: mongodb-replicaset-cae35330f7c7

The manage.py script's consul_to_mongo_hostname function will convert this to the member cae35330f7c7:27017, which will resolve locally within the docker container but will not resolve anywhere else.
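
An illustrative reconstruction of the behavior being described (the real consul_to_mongo_hostname in manage.py may differ; the prefix and port here are assumptions based on the example above):

def consul_to_mongo_hostname(service_name, prefix='mongodb-replicaset-', port=27017):
    """Turn a consul service name like 'mongodb-replicaset-cae35330f7c7'
    into a replica set member address like 'cae35330f7c7:27017'. The bare
    container hostname only resolves inside that container."""
    if not service_name.startswith(prefix):
        return None
    return '%s:%d' % (service_name[len(prefix):], port)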

If the Triton account has CNS enabled, there will be DNS records similar to: 31cfb857-c7a6-4c46-9d73-a8c26c3076ee.inst.<account id>.us-east-1.{cns.joyent.com,triton.zone}

This obviously cannot be inferred from the shortened version within the zone. Is there a suitable way to determine this FQDN from the current Mongo master prior to adding it to the replica set?

The alternative is to use the host's IP address, but Mongo specifically discourages this for many reasons. In the event the IP changes, the replica set will not recover optimally.

Thoughts?

@jasonpincin merged commit a862b1d into master on Nov 10, 2016
@yosifkit deleted the wip branch on Nov 11, 2016