
non existing stolon-proxy members in kubernetes #438

Closed
vbasem opened this issue Feb 12, 2018 · 6 comments

vbasem commented Feb 12, 2018

Submission type

  • Bug report
  • Request for enhancement (RFE)

Environment

  • Server Version: version.Info{Major:"1", Minor:"7+", GitVersion:"v1.7.11", GitCommit:"b13f2fd682d56eab7a6a2b5a1cab1a3d2c8bdd55", GitTreeState:"dirty", BuildDate:"2017-12-06T15:23:13Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

  • consul backend

Stolon version

  • 0.8.0

Additional environment information if useful to understand the bug

Expected behavior you didn't see

When stolon-proxy crashes and the new one gets a fresh UID, the old proxy should be cleaned up and no longer be part of the cluster.

Unexpected behavior you saw

Occasionally when stolon-proxy crashes, the old proxy UID is not removed and I end up with multiple stolon proxies that do not exist any longer.
Stolon sentinel usually ends up outputting:
waiting for proxies to close connections to old master

I end up manually deleting the non-existing proxy from consul; only then does traffic flow again.
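For reference, the manual cleanup looks roughly like this (a sketch only; the `stolon/cluster/<clustername>/proxies/` key prefix and placeholders are assumptions based on my setup and may differ depending on your store prefix and cluster name):

```sh
# List the proxy entries stolon has registered in consul.
# (Key prefix is an assumption; adjust to your store prefix / cluster name.)
consul kv get -recurse stolon/cluster/<clustername>/proxies/

# Delete the entry whose UID belongs to a proxy that no longer exists.
consul kv delete stolon/cluster/<clustername>/proxies/<stale-proxy-uid>
```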

Steps to reproduce the problem

  • Sadly it's very random; it only happens occasionally, when crashes occur.

Enhancement Description

  • allow defining UIDs for stolon-proxy
  • a stolonctl option to update proxy members and clean up stale ones -> could perhaps be used in a kubernetes lifecycle hook
sgotti (Member) commented Feb 12, 2018

@vbasem This looks similar to #397 but there it happened when consul was under load and misbehaving.

This shouldn't really happen since the proxy key is set with a TTL and should expire. My impression is that the related proxy key in consul isn't expiring for some reason. Can you check this? I tried but wasn't able to reproduce this problem.

BTW, in current stolon master we added a different way to handle stale proxies so we don't rely on the store key TTL. This change will help future work and, as a side effect, can work around your reported behavior. You could give it a try.
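One way to check it is something along these lines (just a sketch, assuming a recent consul CLI and the same placeholder key prefix as above; consul implements key TTLs through sessions attached to the key):

```sh
# Show the key with its metadata; the "Session" field tells you which
# session (and therefore which TTL) the key is tied to.
consul kv get -detailed stolon/cluster/<clustername>/proxies/<proxy-uid>

# Inspect that session; if it is still alive long after the proxy died,
# the TTL/session is not being invalidated as expected.
curl -s http://127.0.0.1:8500/v1/session/info/<session-id>
```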

vbasem (Author) commented Feb 12, 2018

@sgotti thanks a lot for the quick reply. As soon as I get the error again, I will make sure to get the TTL information out. In the meantime I will work on the update to 0.9.0.
Currently my workaround is to add a shutdown command in my kubernetes setup that removes the proxy key from consul.
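For illustration, such a shutdown (preStop) command could look roughly like this (a sketch only; the consul address, key prefix and the way the proxy UID is obtained are placeholders, not something stolon provides out of the box):

```sh
# Run as a preStop hook in the stolon-proxy pod: remove this proxy's
# registration key from consul before the pod goes away.
# <proxy-uid> is a placeholder -- stolon generates it at startup, so it has
# to be captured from the proxy's own state/logs first.
curl -s -X DELETE \
  "http://consul.default.svc:8500/v1/kv/stolon/cluster/<clustername>/proxies/<proxy-uid>"
```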

sgotti (Member) commented Feb 14, 2018

@vbasem please let us know if this happens again because I'd like to understand whether this is a consul server bug, a consul client bug, or a libkv bug (I wasn't able to reproduce it here).

Having a new proxy and sentinel UID at every start, and not being able to set them, is by design: it avoids duplicated UIDs (which would cause many problems) and makes it possible to distinguish when a new proxy starts and stops.

vbasem (Author) commented Feb 16, 2018

@sgotti I have had this setup running on prod for a few months now and here is my feedback:

  • consul on kubernetes with non-dynamic IPs is a disaster.
  • once consul fails because a majority of its servers are down, stolon goes nuts
  • if a sentinel or keeper gets restarted during the consul outage, weird things start to happen
  • we have had issues including:
    • total loss of data in the keepers (the postgres directory completely gone)
    • keepers couldn't decide who the master is
    • the proxy seems to have a master even though there isn't one
    • stolon sentinel aborts with an exception -> this shouldn't happen!!

Sadly the issue and the symptoms are not deterministic. But one thing I am sure of is that, at its core, consul seems to be the culprit: recovery from multiple node failures is not as clean as it should be.
I am considering redoing the setup with etcd, but the investment is not small.

sgotti (Member) commented Feb 16, 2018

@vbasem regarding your last comment, this issue probably isn't the best place for it. Can you start a discussion on the stolon mailing list? When consul is down the behavior should be well defined, and some of your points really shouldn't happen with just consul down (like the total loss of data in the keepers, with the postgres directory completely gone). But I'd need the logs from when this happened and to understand how you are deploying stolon (how are your keeper pods' persistent volumes configured?).
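Something like this is usually enough to collect that information (pod and PVC names below are just placeholders):

```sh
# Logs of the current and the previously crashed container of a keeper pod.
kubectl logs stolon-keeper-0
kubectl logs stolon-keeper-0 --previous

# How the keeper persistent volume claims are configured and bound.
kubectl get pvc -o wide
kubectl describe pvc <keeper-pvc-name>
```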

vbasem (Author) commented Feb 18, 2018

@sgotti I apologize it took me this long to reply. My setup on kubernetes is as follows:
1- 3x statefulset consul
2- 3x statefulset stolon-keeper
3- 2x deployment stolon-sentinel
4- 2x deployment stolon-proxy
5- 1x job to init the cluster

Both consul and the keepers use volumeClaimTemplates backed by cinder. My setup runs on a 9-node kubernetes 1.7 cluster. PostgreSQL itself is 9.6.

As I mentioned above, the cluster init is run as a kubernetes job and once it is done, we never touch it again. So the expected behavior is that when consul is having trouble electing a new leader, the keepers shouldn't touch the data at all.

I will move the discussion to gitter, and next time I will keep some of the logs for further insight.

vbasem closed this as completed Feb 18, 2018