
non existing stolon-proxy members in kubernetes #438

Closed
vbasem opened this issue Feb 12, 2018 · 6 comments

vbasem commented Feb 12, 2018

Submission type

  • Bug report
  • Request for enhancement (RFE)

Environment

  • Server Version: version.Info{Major:"1", Minor:"7+", GitVersion:"v1.7.11", GitCommit:"b13f2fd682d56eab7a6a2b5a1cab1a3d2c8bdd55", GitTreeState:"dirty", BuildDate:"2017-12-06T15:23:13Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

  • consul backend

Stolon version

  • 0.8.0

Additional environment information if useful to understand the bug

Expected behavior you didn't see

When stolon-proxy crashes and the new one gets a fresh UID, the old proxy should be cleaned up and no longer be part of the cluster.

Unexpected behavior you saw

Occasionally when stolon-proxy crashes, the old proxy UID is not removed and I end up with multiple stolon proxies that do not exist any longer.
Stolon sentinel usually ends up outputting:
waiting for proxies to close connections to old master

I end up manually deleting the non-existing proxy from consul; only then does traffic flow again.
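For reference, the manual cleanup looks roughly like this (a sketch only; the `stolon/cluster/<clustername>/proxies/` key prefix and placeholders are assumptions based on my setup and may differ depending on your store prefix and cluster name):

```sh
# List the proxy entries stolon has registered in consul.
# (Key prefix is an assumption; adjust to your store prefix / cluster name.)
consul kv get -recurse stolon/cluster/<clustername>/proxies/

# Delete the entry whose UID belongs to a proxy that no longer exists.
consul kv delete stolon/cluster/<clustername>/proxies/<stale-proxy-uid>
```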

Steps to reproduce the problem

  • Sadly it's very random; it only happens occasionally, when crashes occur.

Enhancement Description

  • allow defining UIDs for stolon-proxy
  • a stolonctl option to update proxy members and clean up stale ones -> could perhaps be used in a kubernetes lifecycle hook
sgotti (Member) commented Feb 12, 2018

@vbasem This looks similar to #397 but there it happened when consul was under load and misbehaving.

This shouldn't really happen since the proxy key is set with a TTL and should expire. My impression is that the related proxy key in consul isn't expiring for some reason. Can you check this? I tried but wasn't able to reproduce this problem.

BTW, in current stolon master we added a different way to handle stale proxies so we don't rely on the store key TTL. This change will help future work and, as a side effect, can work around your reported behavior. You could give it a try.
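One way to check it is something along these lines (just a sketch, assuming a recent consul CLI and the same placeholder key prefix as above; consul implements key TTLs through sessions attached to the key):

```sh
# Show the key with its metadata; the "Session" field tells you which
# session (and therefore which TTL) the key is tied to.
consul kv get -detailed stolon/cluster/<clustername>/proxies/<proxy-uid>

# Inspect that session; if it is still alive long after the proxy died,
# the TTL/session is not being invalidated as expected.
curl -s http://127.0.0.1:8500/v1/session/info/<session-id>
```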

vbasem (Author) commented Feb 12, 2018

@sgotti thanks a lot for the quick reply. As soon as I get the error again, I will make sure to get the TTL information out. In the meantime I will work on the update to 0.9.0.
Currently my workaround is to add a shutdown command in my kubernetes setup that removes the proxy key from consul.
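For illustration, such a shutdown (preStop) command could look roughly like this (a sketch only; the consul address, key prefix and the way the proxy UID is obtained are placeholders, not something stolon provides out of the box):

```sh
# Run as a preStop hook in the stolon-proxy pod: remove this proxy's
# registration key from consul before the pod goes away.
# <proxy-uid> is a placeholder -- stolon generates it at startup, so it has
# to be captured from the proxy's own state/logs first.
curl -s -X DELETE \
  "http://consul.default.svc:8500/v1/kv/stolon/cluster/<clustername>/proxies/<proxy-uid>"
```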

sgotti (Member) commented Feb 14, 2018

@vbasem please let us know if this happens again because I'd like to understand whether this is a consul server bug, a consul client bug, or a libkv bug (I wasn't able to reproduce it here).

Having a new proxy and sentinel UID at every start, and not being able to set them, is by design: it avoids duplicated UIDs (which would cause many problems) and makes it possible to distinguish when a new proxy starts and stops.

vbasem (Author) commented Feb 16, 2018

@sgotti I have had this setup running on prod for a few months now and here is my feedback:

  • consul on kubernetes with non-dynamic IPs is a disaster.
  • once consul fails because a majority of its servers are down, stolon goes nuts
  • if a sentinel or keeper gets restarted during the consul outage, weird things start to happen
  • we have had issues including:
    • total loss of data in the keepers (the postgres directory completely gone)
    • keepers couldn't decide who the master is
    • the proxy seems to have a master even though there isn't one
    • stolon sentinel aborts with an exception -> this shouldn't happen!!

Sadly the issue and the symptoms are not deterministic. But one thing I am sure of is that, at its core, consul seems to be the culprit: recovery from multiple node failures is not as clean as it should be.
I am considering redoing the setup with etcd, but the investment is not small.

sgotti (Member) commented Feb 16, 2018

@vbasem regarding your last comment, this issue probably isn't the best place for it. Can you start a discussion on the stolon mailing list? When consul is down the behavior should be well defined, and some of your points really shouldn't happen with just consul down (like the total loss of data in the keepers, with the postgres directory completely gone). But I'd need the logs from when this happened and to understand how you are deploying stolon (how are your keeper pods' persistent volumes configured?).
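Something like this is usually enough to collect that information (pod and PVC names below are just placeholders):

```sh
# Logs of the current and the previously crashed container of a keeper pod.
kubectl logs stolon-keeper-0
kubectl logs stolon-keeper-0 --previous

# How the keeper persistent volume claims are configured and bound.
kubectl get pvc -o wide
kubectl describe pvc <keeper-pvc-name>
```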

vbasem (Author) commented Feb 18, 2018

@sgotti I apologize it took me this long to reply. My setup on kubernetes is as follows:
1- 3x statefulset consul
2- 3x statefulset stolon-keeper
3- 2x deployment stolon-sentinel
4- 2x deployment stolon-proxy
5- 1x job to init the cluster

Both consul and the keepers use volumeClaimTemplates backed by cinder. My setup runs on a 9-node kubernetes 1.7 cluster. PostgreSQL itself is 9.6.

As I mentioned above, the cluster init is run as a kubernetes job and once it is done, we never touch it again. So the expected behavior is that when consul is having trouble electing a new leader, the keepers shouldn't touch the data at all.

I will move the discussion to gitter, and next time I will keep some of the logs for further insight.

vbasem closed this as completed Feb 18, 2018