consul doesn't expire proxy keys #397
Here's
After doing
so the
It also seems that it would be helpful if
@nh2 I'm not sure what the real issue is here. This is a common state during a new election and should converge once the newly elected master is ready. Is this not happening? I can see different errors talking to consul:
Anyway, it'll be useful if you can also provide longer leader sentinel and keeper logs, since the systemctl one is only a few lines.
P.S. I'm on vacation so I won't be able to give fast answers for the next days.
Please file a new issue.
At that point we have:
And the keeper of node-2 is still following the
The final
The final consul contents are:
I have now restarted all services on all machines and am still stuck.
@sgotti This leader loss was ~30 minutes before the other log outputs, though; since then the sentinel has made no further output.
OK!
@sgotti OK, here they are: node-1-keeper-log.txt node-1-sentinel-log.txt There were some consul failures before, but with those recovered we should still expect stolon to converge, is that right?
With restarts not helping, I'm now trying to manually fix the clusterdata to get rid of the 2-masters situation. I first tried this:
This apparently brings the
Tried now with:
But node-3 logs still say
even though the clusterdata says that there's only 1 master now, so I'm not sure why it says
@nh2 Changing cluster data is a bad thing (also because it wasn't wrong). It'll be better if you reinitialize the cluster with initmode existing. I think that the issue is the sentinel waiting for proxies, but the proxy key has a TTL and it looks like it isn't expiring (a consul issue?). You could try removing the proxy keys by hand. I'll take a look at the logs in the next days.
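In case it helps, a minimal sketch of what removing the proxy keys by hand could look like, assuming stolon's default `stolon/cluster/<cluster-name>` key prefix in Consul (the exact prefix and paths are assumptions; check your own `consul kv get -recurse` output first):

```sh
# List the proxy info keys stolon keeps in Consul (prefix assumed;
# compare against your own recursive dump before deleting anything).
consul kv get -recurse stolon/cluster/mycluster/proxies/info

# Delete a single stale proxy key by its full path.
consul kv delete stolon/cluster/mycluster/proxies/info/STALE_PROXY_UID
```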
I've wiped the clusterdata now and recreated the DB contents from a backup I took just before that (luckily my DB in this case is small enough that I can restore it very quickly).
@nh2 I took a look at the logs. The main events are these:
here node1's sentinel (the leader sentinel) didn't receive updates from node1's keeper and elected node3's keeper as the new master keeper. Then it started waiting for proxies to close connections. This continued also after you restarted all the sentinels:
the new sentinel leader on node3 continued waiting for proxies to close connections.

main problem

Here I think is the main issue: the proxies should read from the store and update their state; this state key has a TTL, so if not updated it should expire after 30 seconds. The sentinel is waiting for the updated state, or for the key to disappear, before continuing to the next state (to be sure that no clients are connecting to the old master). I don't have the proxies' logs, but it looks like for some reason they weren't updating their state yet the key continued to be alive (do you have the proxy logs?). Since we rely on the store to respect the TTL, it looks like something wasn't working correctly inside consul. I know that in etcd, if there's a leader change it will renew its leases (in the etcd v3 api); perhaps consul was behaving in a similar way, continuously electing a new leader (due to the heavy consul load) and so always renewing the keys' TTLs without letting them expire?

workaround

Assuming that there's no bug in the proxies' state update, the problem could lie in your current heavy load plus consul behavior (keys not expiring). A workaround could have been to manually delete the proxy keys from consul and see if the sentinel went ahead.

current cluster data

About your cluster data and the issue title
This is symptomatic of something strange. If you are sure that the clusterdata in consul says that node3 should be a standby, then a possible explanation is that the consul agent on the node is returning stale data even though we are asking for non-stale, consistent reads: https://github.com/docker/libkv/blob/master/store/consul/consul.go#L172-L175
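If this happens again, it may also help to look at the Consul session backing the stale proxy key, since the TTL lives on the session rather than on the key itself. A hedged sketch (key path assumed as above; it also assumes the store layer implements TTLs via sessions, as libkv's consul backend does):

```sh
# Show the key's metadata; the Session field is the session whose TTL
# controls when Consul should invalidate it.
consul kv get -detailed stolon/cluster/mycluster/proxies/info/PROXY_UID

# Inspect that session: TTL, behavior, and which node registered it.
curl -s http://127.0.0.1:8500/v1/session/info/SESSION_ID
```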
@sgotti Thanks for your detailed analysis.

consul

It seems likely to me that the issue is as you describe, with Consul not expiring the TTL. I've found a couple of issues in the past with Consul not expiring TTLs as you'd expect, such as hashicorp/consul#3569. The non-expiring-TTL-due-to-Consul-leader-switch under high load also seems likely to me. As mentioned in Consul's docs on sessions:
And as mentioned in the Consul PR that implemented TTLs:
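To observe Consul's TTL behavior in isolation from stolon, a small probe along these lines could be useful (a local agent on the default port and `jq` availability are assumptions):

```sh
# Create a session with a short TTL whose invalidation deletes its keys.
SESSION=$(curl -s -X PUT -d '{"TTL": "15s", "Behavior": "delete"}' \
  http://127.0.0.1:8500/v1/session/create | jq -r .ID)

# Attach a key to that session.
curl -s -X PUT "http://127.0.0.1:8500/v1/kv/ttl-probe?acquire=$SESSION" -d probe

# Without renewals Consul should invalidate the session within roughly
# one to two TTL periods; if the key is still there much later, TTL
# expiry is not working as expected.
sleep 60
consul kv get ttl-probe || echo "key expired as expected"
```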
logs
node-1-proxy-log.txt.gz

Please let me know if the time ranges for these logs are good, otherwise I can extend them.

two masters
Ah, I see. You correctly identified that I had the assumption that stolon would only ever indicate one node as master at any given time. Your explanation makes sense to me. Can you explain a bit more though: even though there can be times where 2 nodes are indicated as postgres replication masters, Stolon will still ensure that the proxies allow only one of them to be connected to by clients, to ensure consistency, is that right? In that case, where in the clusterdata do I have to look to see which master that is?

clusterdata
What do you mean by "if you are sure"? I pasted some of the consul kv contents that should show exactly what the clusterdata is -- or do you mean clusterdata from a time that I have not posted? Regarding stale data, I have run
As we can see, not all invocations run in
As per the consul docs, those that don't use
I don't buy the
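For reference, Consul's HTTP API makes the consistency mode explicit per request, which can help rule stale reads in or out; a quick sketch (a local agent and a hypothetical clusterdata key path are assumed):

```sh
# Default mode: reads go to the leader but are not quorum-verified,
# so a brief window of stale data is possible around leader changes.
curl -s http://127.0.0.1:8500/v1/kv/stolon/cluster/mycluster/clusterdata

# ?consistent forces a quorum-verified, linearizable read.
curl -s "http://127.0.0.1:8500/v1/kv/stolon/cluster/mycluster/clusterdata?consistent"

# ?stale lets any server answer, even without a leader.
curl -s "http://127.0.0.1:8500/v1/kv/stolon/cluster/mycluster/clusterdata?stale"
```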
From the logs it looks like they read that they shouldn't proxy to any node and should have updated their status. The problem is that one of them appears to have a state in consul that is behind by one generation (293 instead of 294), and it's probably the one the sentinel is waiting for:
But from the proxies' logs it looks like they're correctly updating their state (no errors). So either they are reading an old proxy generation from the cluster data (stale data), or it's coming from a fourth proxy (unfortunately the logs don't report the proxy uid, which is generated at every process start). Another strange thing is that after you restarted all the proxies it's already there:
It looks like consul isn't expiring it (or it's a fourth proxy).
Right.
In the cluster object status:
I mean if you are sure that the modified cluster data was correctly written to consul and not lost for whatever reason.
Good catch, that's definitely a bug in libkv where the List method doesn't set
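To answer the earlier question about where to look: a hedged sketch of pulling the current master out of the clusterdata, assuming the field layout of stolon's clusterdata JSON (a `cluster.status.master` DB UID and a `dbs` map keyed by DB UID); verify the field names against your own dump before relying on this:

```sh
# Fetch the clusterdata blob (key path assumed) and extract which DB
# the cluster object currently marks as master, plus its keeper.
consul kv get stolon/cluster/mycluster/clusterdata \
  | jq '{masterDb: .cluster.status.master,
         masterKeeper: .dbs[.cluster.status.master].spec.keeperUID}'
```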
This I remember from the time: even though I used only 3 machines with
The 4th proxy already appears in my consul dump from #397 (comment), but not in the one further up in the issue description. Should the 4th proxy not have appeared? What method does stolon use to identify proxies across restarts? Maybe this, too, can be attributed to consul keeping some keys for very long.
The 4th proxy was only a guess, but if you say that you don't have other proxies and it disappeared later, then it was probably a key that took more time to expire. Though I don't understand why it was reporting a proxy generation of 293 instead of 294.
Every time a proxy process starts it generates a unique id, so at every start it's like a new proxy. Keys of old proxies in the store should then expire after the TTL (we could also improve the proxy to remove its keys at shutdown).
@sgotti Right now I hit the same problem again (interestingly, roughly 1 month after last time). This time I found the issue much faster.
Like last time
As a result,
Restarting consul on all nodes immediately fixes the problem.
I'm thinking it would be very useful if
This should make it much easier to debug these issues, and would make it obvious when it's not Stolon's fault at all that things aren't working (but with the
What do you think about this idea?
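Until something like that exists, a few quick checks can already tell you whether Consul itself is the stuck component; these are standard Consul CLI/HTTP calls against the local agent:

```sh
# Does the cluster currently have a raft leader?
curl -s http://127.0.0.1:8500/v1/status/leader
consul operator raft list-peers

# Is the KV store writable? A hanging or failing put points at Consul
# rather than at stolon.
consul kv put stolon-debug/ping "$(date -u +%FT%TZ)" && consul kv get stolon-debug/ping
```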
I just got into this issue again with consul hanging. This time I also learned that
Judging from that, it might be this consul issue: hashicorp/consul#3700
I tried to obtain debug goroutine stacktraces from consul as done in hashicorp/consul#3700, but I think I didn't manage to do so in a comparable way because
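For what it's worth, Consul only serves its pprof endpoints when debugging is enabled, which may be why the traces were hard to get; a sketch of the usual approach (the `enable_debug` agent setting and default HTTP port are assumptions about the setup):

```sh
# The agent must run with enable_debug = true (or -enable-debug) for
# these endpoints to be exposed.
curl -s "http://127.0.0.1:8500/debug/pprof/goroutine?debug=2" > consul-goroutines.txt

# Newer Consul releases can also collect this automatically:
consul debug -duration=30s -interval=10s
```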
After restarting consul and stolon this time, the
After the restart of consul, with the consul DB writable again, this was the output:
Both
At that point,
After
So here we have the situation again where consul is likely not expiring keys: we have 4 entries for sentinels and proxies even though my cluster has only 3 nodes, and no node can run more than one sentinel/proxy/keeper process as those are managed by systemd.
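A quick way to spot such stale entries is to list the info keys and compare the count against the processes actually running (key prefix assumed, as before):

```sh
# Each running sentinel/proxy keeps exactly one info key alive; any
# extra key that never disappears is a candidate stale entry.
consul kv get -recurse stolon/cluster/mycluster/sentinels/info | cut -d: -f1
consul kv get -recurse stolon/cluster/mycluster/proxies/info | cut -d: -f1
```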
Even after a reboot of all machines involved, stolon did not recover from only
@nh2 If the sentinel is waiting forever for proxies to close connections, then it's consul not expiring the key of an old proxy. You should gather some debug info (if any) and then just remove the key to make your cluster work again. BTW, in current master, #427 changed the logic to not rely on store TTLs.
@sgotti I'm trying to fix the cluster this way this time. I'm using
and it did something, but nodes 1 and 3 did not rejoin the cluster:
After this, nodes 1 and 3 have
Did I use
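For comparison, a hedged sketch of what re-initializing with the existing init mode usually looks like (cluster name, store flags, and the keeper UID are placeholders; check the stolon docs for your version before running it):

```sh
# Re-create the clusterdata, pointing it at the keeper whose data
# directory should become the master ("node_2" is only an example UID).
stolonctl --cluster-name mycluster --store-backend=consul init \
  '{ "initMode": "existing", "existingConfig": { "keeperUID": "node_2" } }'
```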
Why didn't you just remove the proxy key that consul hasn't expired? I need the sentinel logs to see what's happening. P.S. Gitter is probably better for support like this...
OK, with @sgotti's help on Gitter, I have this time recovered from the issue without having to wipe the DB:
After running the
The stale proxy key is the last entry; I also confirmed this by checking the UIDs against
With one
the system recovered to
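For anyone landing here later, the recovery boiled down to deleting the single stale proxy key and letting the sentinel proceed; a sketch of the kind of commands involved (paths and UID are placeholders):

```sh
# Remove the stale proxy info key that Consul never expired.
consul kv delete stolon/cluster/mycluster/proxies/info/STALE_PROXY_UID

# Then watch the cluster converge back to one healthy master.
stolonctl --cluster-name mycluster --store-backend=consul status
```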
@sgotti That sounds fair to me. Do you think it would make sense to adjust some of the log messages though, to make them more explicit? E.g. to expand or replace
@nh2 I'm open to any improvement, please open new issues or PRs for every proposed change. Closing.
Submission type
Environment
NixOS
Stolon version
0.7.0
Additional environment information if useful to understand the bug
My stolon cluster with synchronous replication went into a state where

- node-1 is master
- node-1 says it is master
- node-3 says it is master
- node-2 says it is standby following ff0af56a, which according to the clusterdata is on node-3
- .../keepers/info/node_1 and .../keepers/info/node_2 are in consul kv, but there is no entry for node-3

Consul state

Here is the entire consul kv get -recurse:

node-1
node-2
node-3

Observations

On node-3, this seems like a problem:

Also, there is an Unexpected response code: 500 (No cluster leader) in the sentinel log on node-2, but note that the timestamp on that is 30 minutes older than the other timestamps, so Consul has already had a leader for a long time by now.

Is it problematic that the sentinel hasn't output any info for the last 30 minutes?