Promotion fails because of missing LSN location #131
Hi,
Sorry for getting back to you so late. I saw your report but managed to forget it under a big pile of TODOs :/
You are right, it looks like a race condition, but not only that. Actually, I wonder if you had a bunch of race conditions in your logs. During a switchover, PAF detects that the current transition is a master migration and is not supposed to trigger an election to compare LSNs. So I suppose the first transition was interrupted, but only after some actions had already occurred... Because the cluster situation changed in the meantime, the newly computed transition "hides" the switchover and an election is triggered.
Looking at how many times the LSNs were set/unset in a few seconds, I suppose you had multiple transition interruptions... and one of them managed to fail before being interrupted, because "pg1-dev" did not set its location fast enough. Which is actually surprising, because PAF waits for each LSN to be set before exiting the notify action. I suppose attrd did not propagate the value fast enough to the other nodes... :/
We could probably check this theory by looking at the logs from all of the nodes and comparing them.
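For what it's worth, one way to check whether the private attribute has actually reached every node is to query each node's local attrd; a rough sketch, assuming the lsn_location attribute name mentioned later in this thread and passwordless SSH between the nodes:
# Each node reports its own attrd's view of the attribute; a node where the
# value has not propagated yet will simply return nothing for it.
for node in pg1-dev pg2-dev pg3-dev; do
    echo "== $node =="
    ssh "$node" attrd_updater --query --all --name lsn_location
done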
Hello @ioguix, thanks for your answer.
I restarted the cluster and fixed the Pacemaker issues until I had a stable cluster status. From here everything is fine, pg1-dev is currently holding the master role.
root@pg2-dev postgresql_cluster:~# date ; crm resource constraints pgsql-ha
Mon Apr 2 09:02:25 UTC 2018
pgsql-master-ip (score=INFINITY, needs role=Master, id=ip-with-master)
* pgsql-ha
: Node pg1-dev (score=5, id=prefer-pg1-dev)
root@pg2-dev postgresql_cluster:~# date ; crm resource move pgsql-ha pg2-dev
Mon Apr 2 09:02:49 UTC 2018
INFO: Move constraint created for pgsql-ha to pg2-dev
The migration then fails, as shown by crm status:
root@pg1-dev postgresql_cluster:~# crm status
Stack: corosync
Current DC: pg1-dev (version 1.1.16-94ff4df) - partition with quorum
Last updated: Mon Apr 2 09:03:53 2018
Last change: Mon Apr 2 09:02:49 2018 by root via crm_resource on pg2-dev
3 nodes configured
7 resources configured
Online: [ pg1-dev pg2-dev pg3-dev ]
Full list of resources:
Master/Slave Set: pgsql-ha [pgsqld]
Masters: [ pg1-dev ]
Slaves: [ pg2-dev pg3-dev ]
pgsql-master-ip (ocf::heartbeat:gcp-vpc-move-ip): Started pg1-dev
fence_pg1-dev (stonith:fence_gce): Started pg3-dev
fence_pg2-dev (stonith:fence_gce): Started pg1-dev
fence_pg3-dev (stonith:fence_gce): Started pg1-dev
Failed Actions:
* pgsqld_promote_0 on pg2-dev 'unknown error' (1): call=233, status=complete, exitreason='Can not get LSN location for "pg1-dev"',
last-rc-change='Mon Apr 2 09:02:50 2018', queued=474ms, exec=189ms
Here are the complete Pacemaker/Corosync/PostgreSQL logs from "2018-04-02 09:02" to "2018-04-02 09:04" on all 3 nodes: paf_#131.tar.gz
Thanks for your help
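To retry the switchover after a failure like this, the failed action and the "move" constraint generally need to be cleared first; a quick sketch using crmsh and the resource name from the output above:
# Forget the failed promote so the cluster re-evaluates the resource
crm resource cleanup pgsql-ha
# Remove the location constraint created by "crm resource move"
crm resource unmove pgsql-ha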
Hello,
Here are additional logs; I set Pacemaker to debug mode.
Starting point :
Here are the complete Pacemaker/Corosync/PostgreSQL logs from "2018-04-13 08:48" to "2018-04-13 08:50" on all 3 nodes: paf_#131_2018-04-13.tar.gz
I've tried to analyze all that, but I don't follow Pacemaker's internal workings. On each node, it clears its own LSN location, then sets it to some value, and then clears it again:
I've noticed the same behaviour on all nodes, so I assume this is expected...
Regards
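One way to follow that set/clear sequence across all three nodes is to collate the attrd entries for the attribute into a single timeline; a rough sketch, with illustrative paths for the unpacked log bundles:
# Merge every lsn_location update from the three nodes into one chronological view
grep -h 'lsn_location' pg1-dev/pacemaker.log pg2-dev/pacemaker.log pg3-dev/pacemaker.log | sort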
Hi,
Is it possible to get the exact same logs from journald for all the nodes, but with microsecond precision?
Thanks,
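For example, assuming the nodes log through systemd's journald, something like the following would capture the same window with microsecond timestamps:
# "short-precise" keeps the full microsecond resolution of the journal entries
journalctl -u pacemaker -u corosync --since "2018-04-13 08:48" --until "2018-04-13 08:50" -o short-precise > journal_$(hostname -s).log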
Here you are: paf_#131_2018-04-13.tar.gz
In case there is any doubt, I also checked that the NTP service is indeed running.
Hi,
Some fresh news. I poked the Pacemaker contributors to get some expert feedback on this. It appears that PAF's attempt to work around the asynchronous behavior of attrd is not enough: it only ensures the value is available locally, not on all nodes :(
In the meantime, I was checking in the PAF source code whether we could bypass the election process during a switchover... and it appears I left a note to myself there as a reminder:
I was probably being too defensive back then, as this comment appears right after we check that the standby received the shutdown checkpoint from the master... which means it cannot possibly be lagging anyway. I'll work on a patch for this later.
However, the election process will still suffer from a race condition because of the asynchronous attrd in other failover situations... which is quite annoying. The Pacemaker devs might come up with a solution in the future, but we probably can't afford to wait for a new feature. I'll try to figure out whether we have some other way to deal with this.
Thanks.
Hello @ioguix,
I've ended up writing a quick and dirty workaround to at least make PAF work: dud225@8a9f369. This seems to do the trick:
I can then move PG without any further issue.
Regards
The loop in _set_priv_attr() is just enough to make sure the private attribute is available locally, but it doesn't mean the attribute has been propagated to the other nodes. This should fix gh issue #131, where lsn_location from the remote node might not be available yet during the promote action. Note that I haven't been able to reproduce the same behavior myself, despite multiple creative ways of making attrd lag...
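Conceptually, the fix amounts to the promote action waiting until the remote value is actually visible instead of failing right away; a rough shell illustration of that idea, reusing the attribute and node names from this thread (this is not the actual patch):
# Poll the local attrd until the old master's lsn_location becomes visible,
# giving attrd some time to propagate the value between nodes.
for i in $(seq 1 20); do
    attrd_updater --query --all --name lsn_location | grep -q pg1-dev && break
    sleep 1
done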
I know it's been loooong overdue, but anyway, I believe the last patch should fix your issue in the next PAF release.
Regards,
Hello
I've set up a new cluster from scratch using the following software stack :
Pacemaker configuration :
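For reference, a typical PAF resource definition in crmsh syntax looks roughly like the sketch below; the paths, timeouts and the order-constraint id are purely illustrative, only the resource names and the ip-with-master colocation id come from this thread:
# PAF multi-state resource (paths and timeouts are examples only)
primitive pgsqld ocf:heartbeat:pgsqlms \
    params pgdata="/var/lib/postgresql/9.6/main" bindir="/usr/lib/postgresql/9.6/bin" \
    op start timeout=60s op stop timeout=60s \
    op promote timeout=30s op demote timeout=120s \
    op monitor interval=15s timeout=10s role=Master \
    op monitor interval=16s timeout=10s role=Slave \
    op notify timeout=60s
ms pgsql-ha pgsqld meta master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true
# Keep the virtual IP on the master and start it only after the promotion
colocation ip-with-master inf: pgsql-master-ip pgsql-ha:Master
order ip-after-promote Mandatory: pgsql-ha:promote pgsql-master-ip:start symmetrical=false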
Status is fine :
Replication looks fine too :
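For example, one way to check that both standbys are streaming is pg_stat_replication on the master (the LSN column names differ between PostgreSQL 9.6 and 10+, so they are left out here):
# Run on the current master; expect one "streaming" row per standby
sudo -u postgres psql -x -c "SELECT application_name, client_addr, state, sync_state FROM pg_stat_replication;"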
Then I'd like to move the PG master to another node :
But Pacemaker fails to do so :
Here is an excerpt from pacemaker.log on pg2-dev :
I'm not sure I properly understand the output, but it seems there is some kind of race?
Regards
dud