
Promotion fails because of missing LSN location #131

Closed
dud225 opened this issue Mar 7, 2018 · 9 comments

dud225 commented Mar 7, 2018

Hello

I've set up a new cluster from scratch using the following software stack:

  • Debian Stretch 9.3
  • pacemaker 1.1.16-1
  • corosync 2.4.2-3
  • resource-agents-paf 2.2.0-2.pgdg90+1
  • crmsh 2.3.2-4
  • postgresql-10 10.3-1.pgdg90+1

Pacemaker configuration:

node 1: pg1-dev \
        attributes master-pgsqld=1000 maintenance=off
node 2: pg2-dev \
        attributes master-pgsqld=990
node 3: pg3-dev \
        attributes master-pgsqld=1001
primitive fence_pg1-dev stonith:fence_gce \
        params pcmk_host_check=static-list pcmk_host_list=pg1-dev project=my_project zone=europe-west1-c
primitive fence_pg2-dev stonith:fence_gce \
        params pcmk_host_check=static-list pcmk_host_list=pg2-dev project=my_project zone=europe-west1-c
primitive fence_pg3-dev stonith:fence_gce \
        params pcmk_host_check=static-list pcmk_host_list=pg3-dev project=my_project zone=europe-west1-c
primitive pgsql-master-ip gcp-vpc-move-ip \
        params ip=192.168.1.1 interface=eth0 vpc_network=vpc route_name=pg-dev \
        op start timeout=180s interval=0 \
        op stop timeout=180s interval=0 \
        op monitor timeout=30s interval=60s
primitive pgsqld pgsqlms \
        params pgdata="/var/lib/postgresql/10/main" bindir="/usr/lib/postgresql/10/bin" pghost="/var/run/postgresql" recovery_template="/etc/postgresql/10/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
        op start timeout=60s interval=0 \
        op stop timeout=60s interval=0 \
        op promote timeout=30s interval=0 \
        op demote timeout=120s interval=0 \
        op monitor interval=15s timeout=10s role=Master \
        op monitor interval=16s timeout=10s role=Slave \
        op notify timeout=60s interval=0
ms pgsql-ha pgsqld \
        meta master-max=1 master-node-max=1 clone-max=3 clone-node-max=1 notify=true maintenance=false
order demote-then-stop-ip Mandatory: pgsql-ha:demote pgsql-master-ip:stop symmetrical=false
location fence_pg1-dev-avoids-self fence_pg1-dev -inf: pg1-dev
location fence_pg2-dev-avoids-self fence_pg2-dev -inf: pg2-dev
location fence_pg3-dev-avoids-self fence_pg3-dev -inf: pg3-dev
colocation ip-with-master inf: pgsql-master-ip pgsql-ha:Master
location prefer-pg1-dev pgsql-ha role=Master 5: pg1-dev
order promote-then-ip Mandatory: pgsql-ha:promote pgsql-master-ip:start symmetrical=false
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.16-94ff4df \
        cluster-infrastructure=corosync \
        cluster-name=pg-dev \
        stonith-enabled=true \
        stonith-action=off \
        last-lrm-refresh=1520431209
rsc_defaults rsc-options: \
        resource-stickiness=10 \
        migration-threshold=5 \
        failure-timeout=1h

Status is fine:

Online: [ pg1-dev pg2-dev pg3-dev ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ pg1-dev ]
     Slaves: [ pg2-dev pg3-dev ]
 pgsql-master-ip        (ocf::heartbeat:gcp-vpc-move-ip):       Started pg1-dev
 fence_pg1-dev  (stonith:fence_gce):    Started pg2-dev
 fence_pg2-dev  (stonith:fence_gce):    Started pg3-dev
 fence_pg3-dev  (stonith:fence_gce):    Started pg2-dev

Replication looks fine too:

root@pg1-dev postgresql_cluster:~# psql -U postgres
psql (10.3 (Debian 10.3-1.pgdg90+1))
Type "help" for help.

postgres=# SELECT client_hostname FROM pg_stat_replication ;
                    client_hostname
-------------------------------------------------------
 pg2-dev.europe-west1-c.c.my_project.internal
 pg3-dev.europe-west1-c.c.my_project.internal
(2 rows)

Then I'd like to move the PG master to another node:

root@pg1-dev postgresql_cluster:~# crm resource move pgsql-ha pg2-dev
INFO: Move constraint created for pgsql-ha to pg2-dev

But Pacemaker fails to do so:

root@pg1-dev postgresql_cluster:~# crm status
Online: [ pg1-dev pg2-dev pg3-dev ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ pg3-dev ]
     Slaves: [ pg1-dev ]
     Stopped: [ pg2-dev ]
 pgsql-master-ip        (ocf::heartbeat:gcp-vpc-move-ip):       Started pg3-dev
 fence_pg1-dev  (stonith:fence_gce):    Started pg2-dev
 fence_pg2-dev  (stonith:fence_gce):    Started pg3-dev
 fence_pg3-dev  (stonith:fence_gce):    Started pg2-dev

Failed Actions:
* pgsqld_promote_0 on pg1-dev 'unknown error' (1): call=243, status=complete, exitreason='pg3-dev is the best candidate to promote, aborting current promotion',
    last-rc-change='Wed Mar  7 15:09:14 2018', queued=383ms, exec=312ms
* pgsqld_promote_0 on pg2-dev 'unknown error' (1): call=239, status=complete, exitreason='Can not get LSN location for "pg1-dev"',
    last-rc-change='Wed Mar  7 15:09:25 2018', queued=289ms, exec=149ms

Here is an excerpt from pacemaker.log on pg2-dev:

Mar 07 15:09:14 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg2-dev]: (null) -> 9#100663296 from pg2-dev
Mar 07 15:09:14 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: (null) -> 9#100665424 from pg3-dev
Mar 07 15:09:14 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: (null) -> 9#100663296 from pg1-dev
Mar 07 15:09:14 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg2-dev]: 9#100663296 -> (null) from pg2-dev
Mar 07 15:09:14 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: 9#100663296 -> (null) from pg1-dev
Mar 07 15:09:14 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: 9#100665424 -> (null) from pg3-dev
Mar 07 15:09:21 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg2-dev]: (null) -> 9#100663296 from pg2-dev
Mar 07 15:09:21 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: (null) -> 9#100663296 from pg1-dev
Mar 07 15:09:21 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: (null) -> 9#100665424 from pg3-dev
Mar 07 15:09:21 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg2-dev]: 9#100663296 -> (null) from pg2-dev
Mar 07 15:09:21 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: 9#100665424 -> (null) from pg3-dev
Mar 07 15:09:21 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: 9#100663296 -> (null) from pg1-dev
Mar 07 15:09:23 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg2-dev]: (null) -> 9#100663296 from pg2-dev
Mar 07 15:09:23 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: (null) -> 9#100663296 from pg1-dev
Mar 07 15:09:23 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: (null) -> 9#100665424 from pg3-dev
Mar 07 15:09:23 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: 9#100663296 -> (null) from pg1-dev
Mar 07 15:09:23 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg2-dev]: 9#100663296 -> (null) from pg2-dev
Mar 07 15:09:24 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: 9#100665424 -> (null) from pg3-dev
Mar 07 15:09:25 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg2-dev]: (null) -> 9#100663296 from pg2-dev
Mar 07 15:09:25 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: (null) -> 9#100665424 from pg3-dev
Mar 07 15:09:25 [25146] pg2-dev       lrmd:   notice: operation_finished:       pgsqld_promote_0:31023:stderr [ ocf-exit-reason:Can not get LSN location for "pg1-dev" ]
Mar 07 15:09:25 [25149] pg2-dev       crmd:   notice: process_lrm_event:        pg2-dev-pgsqld_promote_0:239 [ ocf-exit-reason:Can not get LSN location for "pg1-dev"\n ]
Mar 07 15:09:25 [25144] pg2-dev        cib:     info: cib_perform_op:   +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_0']:  @transition-magic=0:1;16:93:0:8958b5d6-2ffb-4c76-9d76-df018cdc47cc, @call-id=239, @rc-code=1, @op-status=0, @exec-time=149, @queue-time=289, @exit-reason=Can not get LSN location for "pg1-dev"
Mar 07 15:09:25 [25144] pg2-dev        cib:     info: cib_perform_op:   +  /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='pgsqld']/lrm_rsc_op[@id='pgsqld_last_failure_0']:  @transition-key=16:93:0:8958b5d6-2ffb-4c76-9d76-df018cdc47cc, @transition-magic=0:1;16:93:0:8958b5d6-2ffb-4c76-9d76-df018cdc47cc, @call-id=239, @last-run=1520435365, @last-rc-change=1520435365, @exec-time=149, @queue-time=289, @exit-reason=Can not get LSN location for "pg1-dev"
Mar 07 15:09:25 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: (null) -> 9#100663296 from pg1-dev
Mar 07 15:09:25 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg2-dev]: 9#100663296 -> (null) from pg2-dev
Mar 07 15:09:25 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: 9#100665424 -> (null) from pg3-dev
Mar 07 15:09:26 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: 9#100663296 -> (null) from pg1-dev
Mar 07 15:09:27 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: (null) -> 9#100665424 from pg3-dev
Mar 07 15:09:27 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: (null) -> 9#100663296 from pg1-dev
Mar 07 15:09:28 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg1-dev]: 9#100663296 -> (null) from pg1-dev
Mar 07 15:09:28 [25147] pg2-dev      attrd:     info: attrd_peer_update:        Setting lsn_location-pgsqld[pg3-dev]: 9#100665424 -> (null) from pg3-dev

I'm not sure I properly understand the output, but it seems there is some kind of race?

Regards
dud


ioguix commented Mar 31, 2018

Hi,

Sorry for getting back to you so late. I saw your report but managed to bury it under a big pile of TODOs :/

You are right, it looks like a race condition, but not only one. Actually, I wonder if there are several race conditions in your logs.

During a switchover, PAF detects that the current transition is a master migration and is not supposed to trigger an election to compare LSNs. So I suppose the first transition was interrupted, but only after some actions had already occurred... Because the cluster situation changed in the meantime, the newly computed transition "hides" the switchover and an election is triggered.

Looking at how many times the LSNs were set and unset within a few seconds, I suppose you had multiple transition interruptions... and one of them managed to fail before being interrupted, because "pg1-dev" did not set its location fast enough. Which is actually surprising, because PAF waits for each LSN to be set before exiting the notify action. I suppose attrd did not propagate the value fast enough to the other nodes... :/

We could probably check this theory by looking at your logs from all of the nodes and comparing them.
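
To illustrate: while the switchover runs, the private attribute can be polled from any node, for every node, with something like the sketch below (attrd_updater with -p handles private attributes and -N targets a given node; the attribute name matches the logs above):

   # Observation aid only: poll the private lsn_location attribute as attrd
   # sees it for every node. Values appear and disappear as PAF's notify
   # actions set and delete them around each promotion attempt.
   for node in pg1-dev pg2-dev pg3-dev; do
       printf '%s: ' "$node"
       attrd_updater -Q -p -n lsn_location-pgsqld -N "$node" 2>/dev/null || echo "(not set)"
   done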


dud225 commented Apr 2, 2018

Hello @ioguix

Thanks for your answer.

I restarted the cluster and fixed the Pacemaker issues until I had a stable cluster status. From there everything is fine; pg1-dev is currently holding the master role.

root@pg2-dev postgresql_cluster:~# date ; crm resource constraints pgsql-ha
Mon Apr  2 09:02:25 UTC 2018
    pgsql-master-ip          (score=INFINITY, needs role=Master, id=ip-with-master)
* pgsql-ha
  : Node pg1-dev             (score=5, id=prefer-pg1-dev)

root@pg2-dev postgresql_cluster:~# date ; crm resource move pgsql-ha pg2-dev
Mon Apr  2 09:02:49 UTC 2018
INFO: Move constraint created for pgsql-ha to pg2-dev

The migration then fails because of 'Can not get LSN location for "pg1-dev"', and Pacemaker finally rolls the master role back to pg1-dev:

root@pg1-dev postgresql_cluster:~# crm status
Stack: corosync
Current DC: pg1-dev (version 1.1.16-94ff4df) - partition with quorum
Last updated: Mon Apr  2 09:03:53 2018
Last change: Mon Apr  2 09:02:49 2018 by root via crm_resource on pg2-dev

3 nodes configured
7 resources configured

Online: [ pg1-dev pg2-dev pg3-dev ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ pg1-dev ]
     Slaves: [ pg2-dev pg3-dev ]
 pgsql-master-ip        (ocf::heartbeat:gcp-vpc-move-ip):       Started pg1-dev
 fence_pg1-dev  (stonith:fence_gce):    Started pg3-dev
 fence_pg2-dev  (stonith:fence_gce):    Started pg1-dev
 fence_pg3-dev  (stonith:fence_gce):    Started pg1-dev

Failed Actions:
* pgsqld_promote_0 on pg2-dev 'unknown error' (1): call=233, status=complete, exitreason='Can not get LSN location for "pg1-dev"',
    last-rc-change='Mon Apr  2 09:02:50 2018', queued=474ms, exec=189ms

Here are the complete Pacemaker/Corosync/PostgreSQL logs from "2018-04-02 09:02" to "2018-04-02 09:04" on all 3 nodes: paf_#131.tar.gz

Thanks for your help
Regards
dud


dud225 commented Apr 13, 2018

Hello

Here are additional logs.

I set Pacemaker to debug mode:

# crm cluster run "grep ^PCMK_debug /etc/default/pacemaker"
OK: [pg1-dev]
PCMK_debug=yes
OK: [pg2-dev]
PCMK_debug=yes
OK: [pg3-dev]
PCMK_debug=yes

Starting point: pg3-dev is the master.

# crm status
Stack: corosync
Current DC: pg2-dev (version 1.1.16-94ff4df) - partition with quorum
Last updated: Fri Apr 13 08:47:40 2018
Last change: Fri Apr 13 08:36:47 2018 by root via crm_attribute on pg3-dev

3 nodes configured
7 resources configured

Online: [ pg1-dev pg2-dev pg3-dev ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ pg3-dev ]
     Slaves: [ pg1-dev pg2-dev ]
 pgsql-master-ip        (ocf::heartbeat:gcp-vpc-move-ip):       Started pg3-dev
 fence_pg1-dev  (stonith:fence_gce):    Started pg2-dev
 fence_pg2-dev  (stonith:fence_gce):    Started pg1-dev
 fence_pg3-dev  (stonith:fence_gce):    Started pg1-dev

# date ; crm resource constraints pgsql-ha
Fri Apr 13 08:47:58 UTC 2018
    pgsql-master-ip                                                              (score=INFINITY, needs role=Master, id=ip-with-master)
* pgsql-ha
  : Node pg1-dev                                                                 (score=5, id=prefer-pg1-dev)

# date ; crm resource move pgsql-ha pg1-dev
Fri Apr 13 08:48:14 UTC 2018
INFO: Move constraint created for pgsql-ha to pg1-dev

# date ; crm resource constraints pgsql-ha
Fri Apr 13 08:48:18 UTC 2018
    pgsql-master-ip                                                              (score=INFINITY, needs role=Master, id=ip-with-master)
* pgsql-ha
  : Node pg1-dev                                                                 (score=INFINITY, id=cli-prefer-pgsql-ha)
  : Node pg1-dev                                                                 (score=5, id=prefer-pg1-dev)

Here are the complete Pacemaker/Corosync/PostgreSQL logs from "2018-04-13 08:48" to "2018-04-13 08:50" on all 3 nodes: paf_#131_2018-04-13.tar.gz

I've tried to analyze all that, but I don't follow Pacemaker's internal workings. On each node, it clears its own LSN location, then sets it to some value, and then clears it again:

237:Apr 13 08:48:15 [2201] pg1-dev      attrd:    debug: attrd_client_update:   Broadcasting lsn_location-pgsqld[pg1-dev] = (null)
341:Apr 13 08:48:16 [2201] pg1-dev      attrd:    debug: attrd_client_update:   Broadcasting lsn_location-pgsqld[pg1-dev] = 57#134219088
342:Apr 13 08:48:16 [2201] pg1-dev      attrd:     info: attrd_peer_update:     Setting lsn_location-pgsqld[pg1-dev]: (null) -> 57#134219088 from pg1-dev
567:Apr 13 08:48:16 [2201] pg1-dev      attrd:    debug: attrd_client_update:   Broadcasting lsn_location-pgsqld[pg1-dev] = (null)
568:Apr 13 08:48:16 [2201] pg1-dev      attrd:     info: attrd_peer_update:     Setting lsn_location-pgsqld[pg1-dev]: 57#134219088 -> (null) from pg1-dev

I've noticed the same behaviour on all nodes, so I assume this is expected....

Regards
dud


ioguix commented Apr 13, 2018

Hi,

Is it possible to get the exact same logs from journald for all the nodes, but with microsecond precision (-o short-precise)?
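
For example, something along these lines on each node would produce it (a sketch only; adjust the unit names and time window):

   # Dump the relevant window with microsecond timestamps from journald.
   journalctl -o short-precise -u pacemaker -u corosync \
       --since "2018-04-13 08:48:00" --until "2018-04-13 08:50:00" \
       > "/tmp/$(hostname -s)_short-precise.log"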

Thanks,


dud225 commented Apr 13, 2018

Here you are: paf_#131_2018-04-13.tar.gz

Also, in case there is any doubt, I checked that the NTP service is indeed running.


ioguix commented Apr 16, 2018

Hi,

Some fresh news.

I poked the Pacemaker contributors to get some expert feedback on this. It appears PAF's attempt to work around the asynchronous behavior of attrd is not enough: it only ensures that the value is available locally, not on all nodes :(
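
In pseudo-shell, the pattern PAF currently uses is roughly the following (a sketch only, not the actual Perl code; the value is one taken from the logs above):

   # Set the private attribute, then poll the *local* attrd until the value
   # can be read back. This proves local availability only: peers may still
   # see the attribute as (null) for a while, since propagation is async.
   attrd_updater -p -n lsn_location-pgsqld -U "57#134219088"
   until attrd_updater -Q -p -n lsn_location-pgsqld 2>/dev/null | grep -q "57#134219088"; do
       sleep 0.1
   done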

In the meantime, I was checking in the PAF source code whether we could bypass the election process during a switchover... and it appears I left a note to myself there as a reminder:

   # Keep going with the election process in case the switchover was
   # instruct to the wrong node.
   # FIXME: should we allow a switchover to a lagging slave?

I was probably being too defensive back then, as this comment appears right after we check that the standby received the shutdown checkpoint from the master... which means it cannot possibly be lagging anyway. I'll work on a patch in this regard later.

However, the election process will still suffer from a race condition caused by the asynchronous attrd in other failover situations... which is quite annoying. Pacemaker devs might come up with a solution in the future, but we probably can't afford to wait for a new feature. I'll try to figure out whether we have some other way to deal with this.


dud225 commented Apr 17, 2018

Thanks.
Let me know if I can help you any further.


dud225 commented May 9, 2018

Hello @ioguix

I've ended up writing a quick and dirty workaround to at least make PAF work: dud225@8a9f369
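
The gist of it is a retry loop when reading the peers' LSN during the promote action, roughly like this in shell (a sketch of the idea only, not the actual Perl patch; the 20 tries and 100 ms pause mirror what the logs below show):

   # When a peer's private lsn_location is still empty, retry a few times to
   # give attrd time to propagate it, instead of failing on the first attempt.
   get_peer_lsn() {
       local node="$1" value try
       for try in $(seq 1 20); do
           value=$(attrd_updater -Q -p -n lsn_location-pgsqld -N "$node" 2>/dev/null)
           [ -n "$value" ] && { echo "$value"; return 0; }
           sleep 0.1
       done
       return 1   # still missing after 20 tries: give up as before
   }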

This seems to do the trick:

May 09 14:42:12.457355 pg1-dev pgsqlms(pgsqld)[2724]: DEBUG: _delete_priv_attr: delete "lsn_location-pgsqld"...
May 09 14:42:12.480111 pg1-dev pgsqlms(pgsqld)[2729]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" is ""
May 09 14:42:12.593334 pg1-dev pgsqlms(pgsqld)[2756]: DEBUG: _query: SELECT pg_wal_lsn_diff( pg_last_wal_receive_lsn(), '0/0' )
May 09 14:42:12.615355 pg1-dev pgsqlms(pgsqld)[2761]: INFO: Current node TL#LSN: 64#134217728
May 09 14:42:12.617133 pg1-dev pgsqlms(pgsqld)[2762]: DEBUG: _set_priv_attr: set "lsn_location-pgsqld=64#134217728"...
May 09 14:42:12.637486 pg1-dev pgsqlms(pgsqld)[2767]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" is "64#134217728"
May 09 14:42:13.405102 pg1-dev pgsqlms(pgsqld)[2810]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" is "64#134217728"
May 09 14:42:13.407170 pg1-dev pgsqlms(pgsqld)[2811]: DEBUG: pgsql_promote: current node TL#LSN location: 64#134217728
May 09 14:42:13.408899 pg1-dev pgsqlms(pgsqld)[2812]: DEBUG: pgsql_promote: getting lsn_location attribute of node pg3-dev (try 1/20)
May 09 14:42:13.419909 pg1-dev pgsqlms(pgsqld)[2815]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" on "pg3-dev" is "64#134217728"
May 09 14:42:13.421863 pg1-dev pgsqlms(pgsqld)[2816]: DEBUG: pgsql_promote: comparing with "pg3-dev": TL#LSN is 64#134217728
May 09 14:42:13.423507 pg1-dev pgsqlms(pgsqld)[2817]: DEBUG: pgsql_promote: getting lsn_location attribute of node pg2-dev (try 1/20)
May 09 14:42:13.434853 pg1-dev pgsqlms(pgsqld)[2820]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" on "pg2-dev" is ""
May 09 14:42:13.537596 pg1-dev pgsqlms(pgsqld)[2821]: DEBUG: pgsql_promote: getting lsn_location attribute of node pg2-dev (try 2/20)
May 09 14:42:13.548561 pg1-dev pgsqlms(pgsqld)[2824]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" on "pg2-dev" is ""
May 09 14:42:13.651315 pg1-dev pgsqlms(pgsqld)[2825]: DEBUG: pgsql_promote: getting lsn_location attribute of node pg2-dev (try 3/20)
May 09 14:42:13.662464 pg1-dev pgsqlms(pgsqld)[2828]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" on "pg2-dev" is ""
May 09 14:42:13.765369 pg1-dev pgsqlms(pgsqld)[2829]: DEBUG: pgsql_promote: getting lsn_location attribute of node pg2-dev (try 4/20)
May 09 14:42:13.782673 pg1-dev pgsqlms(pgsqld)[2832]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" on "pg2-dev" is "64#134217728"
May 09 14:42:13.785577 pg1-dev pgsqlms(pgsqld)[2833]: DEBUG: pgsql_promote: comparing with "pg2-dev": TL#LSN is 64#134217728
May 09 14:42:14.363121 pg1-dev pgsqlms(pgsqld)[2864]: DEBUG: _delete_priv_attr: delete "lsn_location-pgsqld"...
May 09 14:42:14.386648 pg1-dev pgsqlms(pgsqld)[2869]: DEBUG: _get_priv_attr: value of "lsn_location-pgsqld" is ""

I can then move PostgreSQL without any further issues.

Regards
dud

ioguix added a commit that referenced this issue Nov 27, 2019
The loop in _set_priv_attr() is just enough to make sure the private
attribute is available locally. But it doesn't mean the attribute has
been propagated to the other nodes.

This should fix gh issue #131, where the lsn_location from a remote node
might not be available yet during the promote action.

Note that I haven't been able to reproduce the same behavior myself
despite multiple creative ways of making attrd lag...

ioguix commented Nov 27, 2019

I know it's been long overdue, but anyway, I believe the last patch should fix your issue in the next PAF release.
I'm closing this issue, but feel free to reopen it if needed!

regards,

ioguix closed this as completed Nov 27, 2019