
[WIP] #1

Merged: 18 commits from the wip branch into master on Nov 10, 2016

Conversation


@tgross tgross commented Jul 8, 2016

@yosifkit @misterbisson I'm opening this PR with the work in the wip branch so that we have a place for review and comment.

 - the master node in mongo will grab the primary lock in consul, or grab it first when the replica set is not yet started (see the sketch below)
 - the master node is the only node that updates the replica set config
 - the replica set config is automatically updated from consul's list of mongo nodes
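
A minimal sketch of that lock acquisition, assuming python-consul and an illustrative key name (the real logic lives in manage.py and may differ):

import socket

import consul  # python-consul

def acquire_primary_lock(key='mongodb/primary-lock', ttl=120):
    """Try to become the primary: create a consul session and atomically
    write our hostname to the lock key. Returns True only if this node
    won the acquire."""
    c = consul.Consul()
    session_id = c.session.create(name='mongodb-primary', ttl=ttl)
    return c.kv.put(key, socket.gethostname(), acquire=session_id)
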
CMD [ \
"containerpilot", \
"mongod", \
"--replSet=joyent" \
]
@tgross (Author)

We probably should show this CMD (specifically the --replSet) being overridden in the docker-compose.yml file so that it's clear to someone reading through this that you can create multiple replSets by having multiple service blocks.

@yosifkit (Contributor)

Done (push coming shortly); also updating to include a minimal README.md on how to run.

@yosifkit (Contributor)

The two most recent commits address the following:

  • comments for mongo_update_replset_config
  • update readme with some usage
  • expose consul port 8500 in local development
  • only one ContainerPilot/Consul "state" for mongodb to prevent ContainerPilot reloads
  • use replSetGetStatus instead of is_primary
  • use consul key with non-expiring token to determine "initdb"
  • add comment to code about scaling above 7 nodes


# ---------------------------------------------------------

#class ContainerPilot(object):
@tgross (Author), Jul 21, 2016

If we're not using this we should just cut it rather than commenting it out.

Edit: nevermind, I see you've got a TODO about this below.

 - add a pre_stop method to containerpilot
@yosifkit (Contributor)

Improved mongo_update_replset_config; added a preStop function so that a primary can step down.

log.debug(e)
try:
    # force the stepdown even if no secondary has caught up
    local_mongo.admin.command('replSetStepDown', 60, force=True)
@yosifkit (Contributor)

After some more thought, we should drop the force=True part. I think it will be safer if the user explicitly kills the container or does a forced stepdown themselves.

That is, if the container gets a SIGTERM and can't step down from being primary, it should fail to stop. The stepdown fails if there is no secondary, or if the secondaries were too far behind and still didn't catch up in the 8-second catch-up period.

Does that sound right to you?
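
A minimal sketch of the non-forced variant being discussed, assuming pymongo (the function name and return convention are illustrative):

import logging

from pymongo.errors import AutoReconnect, OperationFailure

log = logging.getLogger(__name__)

def try_step_down(local_mongo, step_down_secs=60):
    """Ask the primary to step down without force=True. Returns False if
    mongod refuses (e.g. no electable secondary has caught up), so the
    pre_stop hook can exit non-zero and the stop is not treated as clean."""
    try:
        local_mongo.admin.command('replSetStepDown', step_down_secs)
        return True
    except AutoReconnect:
        # a successful stepdown closes all client connections, so a
        # dropped connection here means it worked
        return True
    except OperationFailure as ex:
        log.debug(ex)
        return False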

@tgross (Author)

I think it will be safer if the user explicitly kills the container

If a container is stopped via docker stop it receives SIGTERM, and after the timeout it gets SIGKILL anyway. Will we be in a sane state if that happens?

@yosifkit (Contributor)

I think the current forced stepdown and a SIGKILL of the primary node will leave us in about the same state (after an attempted stepdown): a new master would be elected (if there are enough nodes left for a vote) and there could likely be a rollback. So it seems better to let the signals tell us to "clean up" for shutdown (SIGTERM), and if we can't step down from primary, let the user do the kill on purpose. Maybe we add documentation around using docker kill -s SIGTERM rather than docker stop? Or maybe recommend higher timeouts for docker stop and ContainerPilot? Can we adjust the stopTimeout in ContainerPilot from the Python code so that it can be set via env?

@tgross (Author), Jul 26, 2016

Or maybe recommend higher timeouts for docker stop and ContainerPilot?

Keep in mind that lots of people will be using some kind of higher-level scheduler so telling them not to use docker stop might not be relevant. Setting the timeouts is the right way to go.

Can we adjust the stopTimeout in ContainerPilot from the Python code so that it can be set via env?

You can interpolate it in the ContainerPilot config from the env. See https://www.joyent.com/containerpilot/docs/configuration#template-configuration

@yosifkit (Contributor)

I've started making these timeouts configurable from the environment, but I am unable to set stopTimeout in the ContainerPilot JSON since the Go code expects an int and not a string. Is there another way around it? Maybe use the ContainerPilot class that we still have and write out the updated config using the environment variables?

// works
"stopTimeout": 9,
// fails
"stopTimeout": "{{.STOP_TIMEOUT}}",

@tgross (Author)

This should work just fine, right? "stopTimeout": {{.STOP_TIMEOUT}},

@yosifkit (Contributor)

I had tried it and gotten parse errors, but after another rebuild it works fine now. It seems I did not save the Dockerfile with the ENV set. 😮 😞

@yosifkit (Contributor) commented Aug 2, 2016

So I have a case that I think we need to handle:

  1. have a replica set of 3 or more nodes
  2. docker kill -s TERM current-primary
  3. primary does stepdown and the other nodes begin election
  4. primary removes itself from consul and exits
  5. container pilot delivers onChange
    1. there is not yet a primary node so it is ignored
  6. primary is elected
  7. the mongo replica set config still contains the stopped node
    1. this might affect a vote later (since a majority of voting members is needed to elect a primary)

So the question is, what should we do about the mongo config being out of sync with what is in consul?

The worst case is someone scaling their cluster down from, say, 5 nodes to 3, where one of the stopped containers is the primary and the other is stopped while there is still no elected primary; the end result is a 3-node cluster that has 5 nodes in its config. If another node goes down, it will fail to have a primary anymore. If a new node is added, or a node is removed while there is an active primary, then the replica set config will become consistent with the config in consul.

One solution is to just add mongo_update_replset_config to the health check. A second solution is to add a periodic task that runs mongo_update_replset_config on the primary node, just as onChange does, but with a longer period (see the sketch below).
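
A minimal sketch of the second option, assuming pymongo and the mongo_update_replset_config helper discussed in this PR (the signature used here is an assumption):

def reconcile_replset_config(local_mongo, consul):
    """Periodic task: re-sync the replica set config from consul's list
    of mongo nodes, but only when this node is currently the primary."""
    status = local_mongo.admin.command('replSetGetStatus')
    self_member = next(m for m in status['members'] if m.get('self'))
    if self_member['stateStr'] != 'PRIMARY':
        return
    # mongo_update_replset_config is the existing helper in manage.py;
    # its exact signature here is illustrative
    mongo_update_replset_config(local_mongo, consul)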

@misterbisson (Contributor)

what should we do about the mongo config being out of sync with what is in consul? The worst case is if someone is scaling down [...] and one of the containers stopped is the primary...

Good question. Can we add a preStop behavior that blocks if the instance is the primary while it demotes the primary and waits for a new primary to be elected? If we set the stop timeout in Docker to a relatively long value (ten minutes, an hour?), we can give the cluster plenty of time to reconcile itself in this condition.
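
A minimal sketch of what that blocking preStop wait might look like, assuming pymongo (names and timeouts are illustrative):

import time

def wait_for_new_primary(local_mongo, timeout=600, poll_interval=2):
    """Block until some other replica set member reports itself as PRIMARY,
    or give up after `timeout` seconds. Intended to run in pre_stop after
    this node has stepped down."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = local_mongo.admin.command('replSetGetStatus')
        for member in status['members']:
            if member['stateStr'] == 'PRIMARY' and not member.get('self'):
                return True
        time.sleep(poll_interval)
    return False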

It's also maybe fair to ticket that as a bug and return to it. This is shaping up nicely and looks close (or nearly close) enough to MVP to be ready for merging.

@yosifkit (Contributor) commented Aug 15, 2016

@misterbisson, I implemented a "wait for election" after the primary node steps down (all within the pre_stop function, which I will push up shortly). But the problem is that the other mongodb containers get the on_change only ~200 milliseconds after the primary's pre_stop is called, before it has finished, so the election is also still unfinished. From my perspective, it looks like Container Pilot calls the pre_stop asynchronously and then immediately deregisters from consul. Am I doing something wrong or unexpected with pre_stop?

On a related note, I set stopTimeout to 5 for Container Pilot and I am not sure that it changed anything; my single mongodb instance still took almost a minute for just pre_stop. Is the stopTimeout supposed to limit how long pre_stop takes?

@tianon (Contributor) commented Aug 15, 2016

The conversation in and around TritonDataCenter/containerpilot#200 seems to imply that deregistration prior to invoking preStop is intentional, and unlikely to change. Given that the service is still running at the moment of preStop, this seems kind of strange -- if the service is still running, shouldn't it still be registered? Could this behavior perhaps become configurable so that preStop can do useful things like ensuring that a new "primary" is elected before all the other cluster nodes get their onChange triggered by the deregistration of the current "primary"?

Additionally, given that SIGTERM is a signal, and by itself isn't actually terminal, wouldn't it also make sense for preStop to be able to exit non-zero and cancel the termination of the service? This would mean that users who are simply using docker stop blindly would be potentially killing live services, but only in the rare edge case that the new primary election resulted in this node itself being re-elected (which is the main error case it'd be prudent to be able to cancel for here). Thus, users concerned strongly with the consistent, safe state of their cluster would use docker kill -s TERM instead of docker stop for giving their containers the chance to stop gracefully, but still allowing for the possibility that such a thing isn't currently possible and might require manual intervention of some kind (without Container Pilot force-killing their primary node simply because they asked it to shut down gracefully).

@misterbisson (Contributor)

I set stopTimeout to 5 for Container Pilot and I am not sure that it changed anything; my single mongodb instance still took almost a minute for just pre_stop.

stopTimeout is supposed to start after preStop completes and ContainerPilot attempts to SIGTERM the main process. That's implemented in ContainerPilot's core/app.go.

Your description (preStop doesn't time out) doesn't sound like a bug, but it probably needs clarification in the docs. Also to clarify: the total of preStop + main process stop + postStop must all complete before the daemon sends the SIGKILL, so docker stop -t <seconds> is required for anything that takes time to stop here.

Could [preStop service deregistration] behavior perhaps become configurable so that preStop can do useful things like ensuring that a new "primary" is elected before all the other cluster nodes get their onChange triggered by the deregistration of the current "primary"?

This is a good use case to consider and it deserves more thought/discussion.

wouldn't it also make sense for preStop to be able to exit non-zero and cancel the termination of the service [in the situation where the user calls docker stop naively]?

I'm putting words in your mouth with "naively" there, but is it a fair distillation of the concern? How does the application enforce the refusal to shut down back to the scheduler/daemon/infrastructure? Is there any way to tell docker stop to allow a container to talk back like that?

@tgross (Author) commented Aug 16, 2016

if the service is still running, shouldn't it still be registered?

In stateful applications in particular this means client applications will still be sending writes until their TTL expires. The preStop is your opportunity to wait until writes stop coming so that you don't have a cluster in an inconsistent state before you do an election. It does mean that the client applications need to know what to do if they have no place to send writes temporarily.

Could this behavior perhaps become configurable

I'm certainly open to changing the behavior if we think we've got the wrong one, but another configuration option at this point has to really sell itself.

Should the "how should ContainerPilot behave?" (vs the "how does ContainerPilot behave?") portion of this discussion get moved over to a ContainerPilot issue?

@yosifkit (Contributor)

It seems fair to bring the discussion about consul registration elsewhere, though what it means for this image is that if you docker kill -s TERM your primary node, the mongodb replica config will be in an inconsistent state until there is another event that would generate an on_change on the primary node.

The way to work around this is to first docker exec -it primary-node mongo --eval 'rs.stepDown()' before shutting down the primary node.

As for what is left here, there are two major parts I can think of:

  • documentation for all of the possible ENVs (and maybe stepdown caveats)
  • backups to manta

I'll see if I can get the docs up this afternoon and then we can stick the backups into a new issue (especially since the recommended backup strategies are MongoDB Inc's "Cloud Manager", "Ops Manager", or file system snapshots).

@yosifkit (Contributor)

Pushed my docs changes (might want a technical writer to make it better 😉). Let me know if you want anything else in this PR.

One more thing to note: do not scale down below a majority of your configured voting members, or you will lose quorum and the nodes will fail to elect a primary (it looks like a netsplit). In other words, if you have 7 nodes and want only 3, you must first scale to 5, make sure the replica set config has been updated to just those 5 nodes, and only then scale to 3. It is probably best to scale down one at a time, and if the current primary is the node you want to destroy, you should do the stepdown first.

So maybe we need safe scaling down steps:

  • docker exec -it node-to-kill mongo --quiet --eval 'rs.stepDown()'
    • "network error" or "not primary so can't step down" is good output

      2016-08-22T23:40:27.475+0000 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:27017'  :
      DB.prototype.runCommand@src/mongo/shell/db.js:135:1
      DB.prototype.adminCommand@src/mongo/shell/db.js:153:16
      rs.stepDown@src/mongo/shell/utils.js:1182:12
      @(shell eval):1:1
      
      # or
      { "ok" : 0, "errmsg" : "not primary so can't step down", "code" : 10107 }
  • docker kill -s TERM node-to-kill
  • docker rm or whatever you want now

@jonatanblue (Contributor)

This is awesome. Thank you for working on this!

At the MongoDB build step when running docker-compose -f local-compose.yml up, pip install is failing when building cryptography:

build/temp.linux-x86_64-2.7/_openssl.c:429:30: fatal error: openssl/opensslv.h: No such file or directory
 #include <openssl/opensslv.h>
                              ^
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Adding libssl-dev to the packages installed by apt-get resolves it.

add libssl-dev to fix pip cryptography build issue
@yosifkit (Contributor)

@misterbisson and @tgross, this looks fine to me. Let me know if there is anything else you would like me to add.

@tianon (Contributor) commented Sep 26, 2016

@tgross @misterbisson anything else you'd like @yosifkit to update here? 😄 😇

@jasonpincin (Contributor)

The use of socket.gethostname() and relying on that same hostname being embedded in the Consul service name is creating problems when scaling to more than one replica set member on Triton. For example:

hostname (as reported by socket.gethostname()): cae35330f7c7
consul service name: mongodb-replicaset-cae35330f7c7

The manage.py script's consul_to_mongo_hostname function will convert this to the member cae35330f7c7:27017, which will resolve locally within the docker container but will not resolve anywhere else.
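
An illustrative reconstruction of the behavior being described (the real consul_to_mongo_hostname in manage.py may differ; the prefix and port here are assumptions based on the example above):

def consul_to_mongo_hostname(service_name, prefix='mongodb-replicaset-', port=27017):
    """Turn a consul service name like 'mongodb-replicaset-cae35330f7c7'
    into a replica set member address like 'cae35330f7c7:27017'. The bare
    container hostname only resolves inside that container."""
    if not service_name.startswith(prefix):
        return None
    return '%s:%d' % (service_name[len(prefix):], port)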

If the Triton account has CNS enabled, there will be DNS records similar to: 31cfb857-c7a6-4c46-9d73-a8c26c3076ee.inst.<account id>.us-east-1.{cns.joyent.com,triton.zone}

This obviously cannot be inferred from the shortened version within the zone. Is there a suitable way to determine this FQDN from the current Mongo master prior to adding it to the replica set?

The alternative is to use the host's IP address, but Mongo specifically discourages this for many reasons. In the event the IP changes, the replica set will not recover optimally.

Thoughts?

@jasonpincin merged commit a862b1d into master on Nov 10, 2016
@yosifkit deleted the wip branch on Nov 11, 2016