Non-existing stolon-proxy members in Kubernetes #438
@vbasem This looks similar to #397, but there it happened when consul was under load and misbehaving. This shouldn't really happen, since the proxy key is set with a TTL and should expire. My impression is that the related proxy key in consul isn't expiring for some reason. Can you check this? I tried but wasn't able to reproduce this problem. BTW, in current stolon master we added a different way to handle stale proxies, so we no longer rely on the store key TTL. This change will help future work and, as a side effect, can work around your reported behavior. You could give it a try.
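For reference, a minimal Go sketch of how one might inspect the proxy keys and the TTL of the session each key is attached to, using the Consul API client. The key prefix (`stolon/cluster/mycluster/proxies/`) is an assumption and may differ in your deployment:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default: 127.0.0.1:8500).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// ASSUMPTION: adjust this prefix to the store prefix and cluster
	// name actually used by your stolon deployment.
	prefix := "stolon/cluster/mycluster/proxies/"

	pairs, _, err := client.KV().List(prefix, nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, p := range pairs {
		fmt.Printf("key=%s session=%s\n", p.Key, p.Session)
		if p.Session == "" {
			// No session attached: the key will never expire on its own.
			continue
		}
		// Look up the attached session to see its TTL and behavior.
		se, _, err := client.Session().Info(p.Session, nil)
		if err != nil {
			log.Fatal(err)
		}
		if se == nil {
			fmt.Println("  session no longer exists (key should be released/removed)")
			continue
		}
		fmt.Printf("  session TTL=%s behavior=%s\n", se.TTL, se.Behavior)
	}
}
```

If a stale proxy key shows up with no session, or with a session whose TTL never expires it, that would point at the store layer rather than stolon itself.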
@sgotti thanks a lot for the quick reply. As soon as I get the error again, I will make sure to capture the TTL information. In the meantime I will work on updating to 0.9.0.
@vbasem please let us know if this happens again, because I'd like to understand whether this is a consul server bug, a consul client bug, or a libkv bug (I wasn't able to reproduce it here). Having a new proxy and sentinel UID at every start, and not being able to set them manually, is by design: it avoids duplicated UIDs, which would cause many problems, and makes it possible to distinguish when a new proxy starts and stops.
@sgotti I have had this setup running in prod for a few months now and here is my feedback:
Sadly, the issue and the symptoms are not deterministic. But one thing I am sure of is that, at its core, consul seems to be the culprit. Recovery from multiple node failures is not as clean as it should be.
@vbasem regarding your last comment, this issue probably isn't the best place. Can you start a discussion on the stolon mailing list? When consul is down the behavior should be well defined, and some of your points shouldn't really happen with just consul down (like …).
@sgotti I apologize it took me this long to reply. My setup in Kubernetes is as follows: both consul and the keepers use volumeClaimTemplates backed by Cinder, running on a 9-node Kubernetes 1.7 cluster, and PostgreSQL itself is 9.6. As I mentioned above, the cluster init is run as a Kubernetes job and, once it is done, we never touch it again. So the expected behavior is that when consul is having trouble electing a new leader, the keepers shouldn't touch the data at all. I will move the discussion to gitter, and next time I will keep some of the logs for further insights.
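As a purely diagnostic idea (not something from this issue), a small Go sketch for checking whether Consul currently has a raft leader before concluding that the keepers misbehaved:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Status().Leader() returns the address of the current raft leader,
	// or an empty string / error while the cluster has no leader.
	leader, err := client.Status().Leader()
	if err != nil {
		log.Fatalf("consul unreachable: %v", err)
	}
	if leader == "" {
		fmt.Println("consul has no leader; stolon components cannot safely make progress")
		return
	}
	fmt.Printf("consul leader: %s\n", leader)
}
```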
Submission type
Environment
Server Version: version.Info{Major:"1", Minor:"7+", GitVersion:"v1.7.11", GitCommit:"b13f2fd682d56eab7a6a2b5a1cab1a3d2c8bdd55", GitTreeState:"dirty", BuildDate:"2017-12-06T15:23:13Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
consul backend
Stolon version
Additional environment information if useful to understand the bug
Expected behavior you didn't see
When stolon-proxy crashes and the new one gets a fresh UID, the old proxy should be cleaned up and no longer be part of the cluster.
Unexpected behavior you saw
Occasionally when stolon-proxy crashes, the old proxy UID is not removed and I end up with multiple stolon proxies that no longer exist.
Stolon sentinel usually ends up outputting:
waiting for proxies to close connections to old master
I end up manually deleting the non-existing proxy from consul; only then does traffic flow again.
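A hedged Go sketch of that manual cleanup, using the Consul API client. The key path is an assumption; double-check the exact proxy info key under your cluster's prefix before deleting anything:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// ASSUMPTION: replace with the actual key of the stale proxy, as listed
	// under the cluster's proxies/ prefix in the Consul KV store.
	staleProxyKey := "stolon/cluster/mycluster/proxies/info/<stale-proxy-uid>"

	// Deleting the stale key lets the sentinel stop waiting for a proxy
	// that no longer exists.
	if _, err := client.KV().Delete(staleProxyKey, nil); err != nil {
		log.Fatal(err)
	}
	fmt.Println("deleted", staleProxyKey)
}
```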
Steps to reproduce the problem
Enhancement Description