Problem with synchronous standby order when upgrading from v0.9.0 to v0.10.0 #488

Closed
nh2 opened this issue May 13, 2018 · 9 comments

@nh2
Contributor

nh2 commented May 13, 2018

Submission type

  • Bug report

Environment

Stolon version

Upgrade from v0.9.0 to v0.10.0

Additional environment information if useful to understand the bug

When upgrading a cluster from v0.9.0 to v0.10.0, all SQL write queries hung forever after the upgrade because replication was stuck. In ps aux I could see the WAL sender process stuck at the same WAL position, like:

postgres: wal sender process postgres 10.0.0.2(54752) streaming 0/7012D20

The stolon-sentinel log revealed the problem:

won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[0d1a8783 dc15fe22], string=spec, []string=[dc15fe22 0d1a8783])

[0d1a8783 dc15fe22] contains exactly the same entries as [dc15fe22 0d1a8783]; only the order is inverted.

I suspect that the upgrade somehow caused stolon to expect a different ordering.
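
For illustration, here is a minimal standalone sketch (the helper is my own hypothetical stand-in, not stolon's actual comparison code) showing how an element-by-element slice comparison reports these two lists as different even though they contain the same members:

package main

import "fmt"

// orderSensitiveEqual is a hypothetical stand-in for an element-by-element
// slice comparison; it returns false as soon as any position differs.
func orderSensitiveEqual(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	reported := []string{"0d1a8783", "dc15fe22"}
	spec := []string{"dc15fe22", "0d1a8783"}
	// Same members, inverted order: the comparison still reports "different".
	fmt.Println(orderSensitiveEqual(reported, spec)) // prints: false
}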


Full logs

May 13 00:44:38 node-1 stolon-keeper[1055]: 2018-05-13T00:44:38.102Z        INFO        cmd/keeper.go:1423        our db requested role is master
May 13 00:44:38 node-1 stolon-keeper[1055]: 2018-05-13T00:44:38.103Z        INFO        cmd/keeper.go:1451        already master
May 13 00:44:38 node-1 stolon-keeper[1055]: 2018-05-13T00:44:38.106Z        INFO        cmd/keeper.go:1615        postgres parameters not changed
May 13 00:44:38 node-1 stolon-keeper[1055]: 2018-05-13T00:44:38.106Z        INFO        cmd/keeper.go:1626        postgres hba entries not changed
May 13 00:44:40 node-1 stolon-keeper[1055]: 2018-05-13T00:44:40.122Z        INFO        cmd/keeper.go:1423        our db requested role is master
May 13 00:44:40 node-1 stolon-keeper[1055]: 2018-05-13T00:44:40.123Z        INFO        cmd/keeper.go:1451        already master
May 13 00:44:40 node-1 stolon-keeper[1055]: 2018-05-13T00:44:40.126Z        INFO        cmd/keeper.go:1615        postgres parameters not changed
May 13 00:44:40 node-1 stolon-keeper[1055]: 2018-05-13T00:44:40.126Z        INFO        cmd/keeper.go:1626        postgres hba entries not changed
May 13 00:43:48 node-1 stolon-sentinel[1057]: 2018-05-13T00:43:48.642Z        INFO        cmd/sentinel.go:1196        won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[0d1a8783 dc15fe22], string=spec, []string=[dc15fe22 0d1a8783])
May 13 00:43:50 node-1 stolon-sentinel[1057]: 2018-05-13T00:43:50.678Z        INFO        cmd/sentinel.go:1196        won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[0d1a8783 dc15fe22], string=spec, []string=[dc15fe22 0d1a8783])
May 13 00:43:52 node-1 stolon-sentinel[1057]: 2018-05-13T00:43:52.714Z        INFO        cmd/sentinel.go:1196        won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[0d1a8783 dc15fe22], string=spec, []string=[dc15fe22 0d1a8783])
May 13 00:43:54 node-1 stolon-sentinel[1057]: 2018-05-13T00:43:54.742Z        INFO        cmd/sentinel.go:1196        won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[0d1a8783 dc15fe22], string=spec, []string=[dc15fe22 0d1a8783])
May 13 00:43:56 node-1 stolon-sentinel[1057]: 2018-05-13T00:43:56.774Z        INFO        cmd/sentinel.go:1196        won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[0d1a8783 dc15fe22], string=spec, []string=[dc15fe22 0d1a8783])
May 13 00:43:58 node-1 stolon-sentinel[1057]: 2018-05-13T00:43:58.808Z        INFO        cmd/sentinel.go:1196        won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[0d1a8783 dc15fe22], string=spec, []string=[dc15fe22 0d1a8783])
[root@node-1:~]# stolonctl status --cluster-name mystolon-cluster --store-backend=consul
=== Active sentinels ===

ID    LEADER
22a83926  false
6bd8559e  false
b04179aa  true

=== Active proxies ===

ID
10540fcf
37b30d74
cc0d2d39
e823b8d6

=== Keepers ===

UID   HEALTHY PG LISTENADDRESS  PG HEALTHY  PG WANTEDGENERATION PG CURRENTGENERATION
node_1 true  10.0.0.1:5432   true    6     6 
node_2 true  10.0.0.2:5432   true    4     4 
node_3 true  10.0.0.3:5432   true    2     2 

=== Cluster Info ===

Master: node_1

===== Keepers/DB tree =====

node_1 (master)
├─node_3
└─node_2
[root@node-1:~]# stolonctl clusterdata --cluster-name mystolon-cluster --store-backend=consul | jq .
{
  "FormatVersion": 1,
  "Cluster": {
    "uid": "2a7da692",
    "generation": 1,
    "changeTime": "2018-05-11T16:31:29.806422027Z",
    "spec": {
      "sleepInterval": "2s",
      "requestTimeout": "4s",
      "failInterval": "8s",
      "synchronousReplication": true,
      "minSynchronousStandbys": 1,
      "maxSynchronousStandbys": 2,
      "additionalWalSenders": null,
      "additionalMasterReplicationSlots": null,
      "usePgrewind": true,
      "initMode": "new",
      "pgParameters": {
        "datestyle": "iso, mdy",
        "default_text_search_config": "pg_catalog.english",
        "dynamic_shared_memory_type": "posix",
        "lc_messages": "en_US.UTF-8",
        "lc_monetary": "en_US.UTF-8",
        "lc_numeric": "en_US.UTF-8",
        "lc_time": "en_US.UTF-8",
        "log_min_duration_statement": "10",
        "log_timezone": "UTC",
        "logging_collector": "on",
        "max_connections": "100",
        "shared_buffers": "128MB",
        "timezone": "UTC"
      },
      "pgHBA": null
    },
    "status": {
      "phase": "normal",
      "master": "fec76798"
    }
  },
  "Keepers": {
    "node_1": {
      "uid": "node_1",
      "generation": 1,
      "changeTime": "0001-01-01T00:00:00Z",
      "spec": {},
      "status": {
        "healthy": true,
        "lastHealthyTime": "2018-05-13T00:46:47.394375098Z",
        "bootUUID": "fb6350a3-394a-4e80-918a-9c2c9caa51d4"
      }
    },
    "node_2": {
      "uid": "node_2",
      "generation": 1,
      "changeTime": "0001-01-01T00:00:00Z",
      "spec": {},
      "status": {
        "healthy": true,
        "lastHealthyTime": "2018-05-13T00:46:47.394375224Z",
        "bootUUID": "a2332566-7921-4cd7-a170-fbbb260cb16c"
      }
    },
    "node_3": {
      "uid": "node_3",
      "generation": 1,
      "changeTime": "0001-01-01T00:00:00Z",
      "spec": {},
      "status": {
        "healthy": true,
        "lastHealthyTime": "2018-05-13T00:46:47.394375276Z",
        "bootUUID": "d6f94936-ed3d-4a32-b50b-f03361437bca"
      }
    }
  },
  "DBs": {
    "c9157e4f": {
      "uid": "c9157e4f",
      "generation": 2,
      "changeTime": "2018-05-12T21:56:02.001971044Z",
      "spec": {
        "keeperUID": "node_3",
        "requestTimeout": "4s",
        "maxStandbys": 20,
        "usePgrewind": true,
        "additionalWalSenders": 5,
        "additionalReplicationSlots": null,
        "initMode": "none",
        "pgParameters": {
          "datestyle": "iso, mdy",
          "default_text_search_config": "pg_catalog.english",
          "dynamic_shared_memory_type": "posix",
          "lc_messages": "en_US.UTF-8",
          "lc_monetary": "en_US.UTF-8",
          "lc_numeric": "en_US.UTF-8",
          "lc_time": "en_US.UTF-8",
          "log_min_duration_statement": "10",
          "log_timezone": "UTC",
          "logging_collector": "on",
          "max_connections": "100",
          "shared_buffers": "128MB",
          "timezone": "UTC"
        },
        "pgHBA": null,
        "role": "standby",
        "followConfig": {
          "type": "internal",
          "dbuid": "fec76798"
        },
        "followers": [],
        "synchronousStandbys": null,
        "externalSynchronousStandbys": null
      },
      "status": {
        "healthy": true,
        "currentGeneration": 2,
        "listenAddress": "10.0.0.3",
        "port": "5432",
        "systemdID": "6554361905281504843",
        "timelineID": 2,
        "xLogPos": 117518112,
        "timelinesHistory": [
          {
            "timelineID": 1,
            "switchPoint": 84734376,
            "reason": "no recovery target specified"
          }
        ],
        "pgParameters": {
          "datestyle": "iso, mdy",
          "default_text_search_config": "pg_catalog.english",
          "dynamic_shared_memory_type": "posix",
          "lc_messages": "en_US.UTF-8",
          "lc_monetary": "en_US.UTF-8",
          "lc_numeric": "en_US.UTF-8",
          "lc_time": "en_US.UTF-8",
          "log_min_duration_statement": "10",
          "log_timezone": "UTC",
          "logging_collector": "on",
          "max_connections": "100",
          "shared_buffers": "128MB",
          "timezone": "UTC",
          "wal_level": "replica"
        },
        "synchronousStandbys": [],
        "olderWalFile": "000000020000000000000006"
      }
    },
    "dc15fe22": {
      "uid": "dc15fe22",
      "generation": 4,
      "changeTime": "2018-05-12T21:53:29.647722542Z",
      "spec": {
        "keeperUID": "node_2",
        "requestTimeout": "4s",
        "maxStandbys": 20,
        "usePgrewind": true,
        "additionalWalSenders": 5,
        "additionalReplicationSlots": null,
        "initMode": "none",
        "pgParameters": {
          "datestyle": "iso, mdy",
          "default_text_search_config": "pg_catalog.english",
          "dynamic_shared_memory_type": "posix",
          "lc_messages": "en_US.UTF-8",
          "lc_monetary": "en_US.UTF-8",
          "lc_numeric": "en_US.UTF-8",
          "lc_time": "en_US.UTF-8",
          "log_min_duration_statement": "10",
          "log_timezone": "UTC",
          "logging_collector": "on",
          "max_connections": "100",
          "shared_buffers": "128MB",
          "timezone": "UTC"
        },
        "pgHBA": null,
        "role": "standby",
        "followConfig": {
          "type": "internal",
          "dbuid": "fec76798"
        },
        "followers": [],
        "synchronousStandbys": null,
        "externalSynchronousStandbys": null
      },
      "status": {
        "healthy": true,
        "currentGeneration": 4,
        "listenAddress": "10.0.0.2",
        "port": "5432",
        "systemdID": "6554361905281504843",
        "timelineID": 2,
        "xLogPos": 117518112,
        "timelinesHistory": [
          {
            "timelineID": 1,
            "switchPoint": 84734376,
            "reason": "no recovery target specified"
          }
        ],
        "pgParameters": {
          "datestyle": "iso, mdy",
          "default_text_search_config": "pg_catalog.english",
          "dynamic_shared_memory_type": "posix",
          "lc_messages": "en_US.UTF-8",
          "lc_monetary": "en_US.UTF-8",
          "lc_numeric": "en_US.UTF-8",
          "lc_time": "en_US.UTF-8",
          "log_min_duration_statement": "10",
          "log_timezone": "UTC",
          "logging_collector": "on",
          "max_connections": "100",
          "shared_buffers": "128MB",
          "timezone": "UTC",
          "wal_level": "replica"
        },
        "synchronousStandbys": [],
        "olderWalFile": "000000010000000000000002"
      }
    },
    "fec76798": {
      "uid": "fec76798",
      "generation": 6,
      "changeTime": "2018-05-12T21:55:55.906576019Z",
      "spec": {
        "keeperUID": "node_1",
        "requestTimeout": "4s",
        "maxStandbys": 20,
        "synchronousReplication": true,
        "usePgrewind": true,
        "additionalWalSenders": 5,
        "additionalReplicationSlots": null,
        "initMode": "none",
        "pgParameters": {
          "datestyle": "iso, mdy",
          "default_text_search_config": "pg_catalog.english",
          "dynamic_shared_memory_type": "posix",
          "lc_messages": "en_US.UTF-8",
          "lc_monetary": "en_US.UTF-8",
          "lc_numeric": "en_US.UTF-8",
          "lc_time": "en_US.UTF-8",
          "log_min_duration_statement": "10",
          "log_timezone": "UTC",
          "logging_collector": "on",
          "max_connections": "100",
          "shared_buffers": "128MB",
          "timezone": "UTC"
        },
        "pgHBA": null,
        "role": "master",
        "followers": [
          "c9157e4f",
          "dc15fe22"
        ],
        "synchronousStandbys": [
          "dc15fe22",
          "0d1a8783"
        ],
        "externalSynchronousStandbys": []
      },
      "status": {
        "healthy": true,
        "currentGeneration": 6,
        "listenAddress": "10.0.0.1",
        "port": "5432",
        "systemdID": "6554361905281504843",
        "timelineID": 2,
        "xLogPos": 117518112,
        "timelinesHistory": [
          {
            "timelineID": 1,
            "switchPoint": 84734376,
            "reason": "no recovery target specified"
          }
        ],
        "pgParameters": {
          "datestyle": "iso, mdy",
          "default_text_search_config": "pg_catalog.english",
          "dynamic_shared_memory_type": "posix",
          "lc_messages": "en_US.UTF-8",
          "lc_monetary": "en_US.UTF-8",
          "lc_numeric": "en_US.UTF-8",
          "lc_time": "en_US.UTF-8",
          "log_min_duration_statement": "10",
          "log_timezone": "UTC",
          "logging_collector": "on",
          "max_connections": "100",
          "shared_buffers": "128MB",
          "timezone": "UTC",
          "wal_level": "replica"
        },
        "synchronousStandbys": [
          "0d1a8783",
          "dc15fe22"
        ],
        "olderWalFile": "000000010000000000000003"
      }
    }
  },
  "Proxy": {
    "generation": 11,
    "changeTime": "2018-05-12T23:10:16.345113001Z",
    "spec": {
      "masterDbUid": "fec76798",
      "enabledProxies": [
        "10540fcf",
        "37b30d74",
        "cc0d2d39",
        "e823b8d6"
      ]
    },
    "status": {}
  }
}
postgres=# show synchronous_standby_names;
      synchronous_standby_names      
-------------------------------------
 2 (stolon_dc15fe22,stolon_0d1a8783)
(1 row)
postgres=# set synchronous_standby_names = '2 (stolon_0d1a8783,stolon_dc15fe22)';
ERROR:  parameter "synchronous_standby_names" cannot be changed now

Workaround

Wiping the clusterdata (after taking a backup) fixed it:

[root@node-1:~]# stolonctl init --cluster-name mystolon-cluster --store-backend=consul '{"failInterval":"8s","initMode":"new","maxSynchronousStandbys":2,"minSynchronousStandbys":1,"pgParameters":{"log_min_duration_statement":"10","logging_collector":"on"},"requestTimeout":"4s","sleepInterval":"2s","synchronousReplication":true,"usePgrewind":true}'
WARNING: The current cluster data will be removed
WARNING: The databases managed by the keepers will be overwritten depending on the provided cluster spec.
Are you sure you want to continue? [yes/no] yes

The fix worked after I ran the above and restarted all stolon-related services.

@sgotti
Member

sgotti commented May 14, 2018

@nh2 Thanks for the report.

In the sentinel we already sort the sync standby names here

sort.Sort(sort.StringSlice(dbs.SynchronousStandbys))

And it looks like the cluster data contains them correctly sorted:

        "synchronousStandbys": [
          "0d1a8783",
          "dc15fe22"
        ],

So I'm not sure why the log reports them in the wrong sort order since they are read from the clusterdata:

won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[0d1a8783 dc15fe22], string=spec, []string=[dc15fe22 0d1a8783])

I'll try to reproduce it if possible.

@nh2
Contributor Author

nh2 commented May 24, 2018

@sgotti This just happened to me again, this time not during an upgrade but during normal v0.10.0 operation.

@nh2
Contributor Author

nh2 commented May 24, 2018

Here is the stolon-sentinel log from boot, through 5 days of normal operation, to the failure this morning:

-- Reboot --
May 13 22:16:23 node-2 systemd[1]: Started stolon-sentinel.service.
2018-05-13T22:16:23.321Z        INFO        cmd/sentinel.go:1873        sentinel uid        {"uid": "0c47f5ef"}
2018-05-13T22:16:23.323Z        INFO        cmd/sentinel.go:94        Trying to acquire sentinels leadership
2018-05-17T00:13:47.036Z        ERROR        cmd/sentinel.go:114        election loop error        {"error": "failed to read lock: Unexpected response code: 500"}
2018-05-17T00:13:47.036Z        ERROR        cmd/sentinel.go:1727        error retrieving cluster data        {"error": "Unexpected response code: 500"}
2018-05-17T00:13:56.345Z        ERROR        cmd/sentinel.go:1727        error retrieving cluster data        {"error": "Unexpected response code: 500"}
2018-05-17T00:13:57.036Z        INFO        cmd/sentinel.go:94        Trying to acquire sentinels leadership
2018-05-17T00:14:04.262Z        ERROR        cmd/sentinel.go:114        election loop error        {"error": "Unexpected response code: 500 (No cluster leader)"}
2018-05-17T00:14:05.477Z        ERROR        cmd/sentinel.go:1727        error retrieving cluster data        {"error": "Unexpected response code: 500"}
2018-05-17T00:14:14.263Z        INFO        cmd/sentinel.go:94        Trying to acquire sentinels leadership
2018-05-18T13:22:22.965Z        ERROR        cmd/sentinel.go:114        election loop error        {"error": "failed to read lock: Unexpected response code: 500"}
2018-05-18T13:22:22.965Z        ERROR        cmd/sentinel.go:1727        error retrieving cluster data        {"error": "Unexpected response code: 500"}
2018-05-18T13:22:32.965Z        INFO        cmd/sentinel.go:94        Trying to acquire sentinels leadership
2018-05-19T21:46:25.626Z        ERROR        cmd/sentinel.go:1727        error retrieving cluster data        {"error": "Unexpected response code: 500"}
2018-05-24T09:04:34.218Z        ERROR        cmd/sentinel.go:114        election loop error        {"error": "failed to read lock: Unexpected response code: 500"}
2018-05-24T09:04:34.218Z        ERROR        cmd/sentinel.go:1727        error retrieving cluster data        {"error": "Unexpected response code: 500"}
2018-05-24T09:04:44.218Z        INFO        cmd/sentinel.go:94        Trying to acquire sentinels leadership
2018-05-24T09:04:44.233Z        INFO        cmd/sentinel.go:101        sentinel leadership acquired
2018-05-24T09:04:46.352Z        WARN        cmd/sentinel.go:271        no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:46.356Z        INFO        cmd/sentinel.go:1264        adding new synchronous standby in good state trying to reach MaxSynchronousStandbys        {"masterDB": "063ec7b0", "synchronousStandbyDB": "1b09e02a", "keeper": "node_1"}
2018-05-24T09:04:46.356Z        INFO        cmd/sentinel.go:1307        merging current and previous synchronous standbys        {"masterDB": "063ec7b0", "prevSynchronousStandbys": {"d75bc438":{}}, "synchronousStandbys": {"1b09e02a":{},"d75bc438":{}}}
2018-05-24T09:04:46.356Z        INFO        cmd/sentinel.go:1311        adding previous synchronous standby        {"masterDB": "063ec7b0", "synchronousStandbyDB": "d75bc438", "keeper": "node_2"}
2018-05-24T09:04:46.356Z        INFO        cmd/sentinel.go:1318        synchronousStandbys changed        {"masterDB": "063ec7b0", "prevSynchronousStandbys": {"d75bc438":{}}, "synchronousStandbys": {"1b09e02a":{},"d75bc438":{}}}
2018-05-24T09:04:48.385Z        WARN        cmd/sentinel.go:271        no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:50.425Z        WARN        cmd/sentinel.go:271        no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:52.464Z        WARN        cmd/sentinel.go:271        no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:54.504Z        WARN        cmd/sentinel.go:271        no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:54.508Z        INFO        cmd/sentinel.go:989        master db is failed        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:54.508Z        INFO        cmd/sentinel.go:1000        trying to find a new master to replace failed master
2018-05-24T09:04:54.509Z        INFO        cmd/sentinel.go:1026        electing db as the new master        {"db": "d75bc438", "keeper": "node_2"}
2018-05-24T09:04:56.544Z        WARN        cmd/sentinel.go:271        no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:56.551Z        INFO        cmd/sentinel.go:1100        waiting for proxies to be converged to the current generation        {"proxies": ["a5fbd147", "deac06bc", "f186d151"]}
2018-05-24T09:04:58.585Z        WARN        cmd/sentinel.go:271        no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:58.590Z        INFO        cmd/sentinel.go:1141        removing old master db        {"db": "063ec7b0", "keeper": "node_3"}
2018-05-24T09:04:58.590Z        INFO        cmd/sentinel.go:1196        won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[063ec7b0 1b09e02a], string=spec, []string=[1b09e02a 063ec7b0])
2018-05-24T09:05:02.662Z        INFO        cmd/sentinel.go:1196        won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[063ec7b0 1b09e02a], string=spec, []string=[1b09e02a 063ec7b0])

Cropped a bit, here's what happens in the failure scenario:

Trying to acquire sentinels leadership
 sentinel leadership acquired
 no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
  adding new synchronous standby in good state trying to reach MaxSynchronousStandbys        {"masterDB": "063ec7b0", "synchronousStandbyDB": "1b09e02a", "keeper": "node_1"}
  merging current and previous synchronous standbys        {"masterDB": "063ec7b0", "prevSynchronousStandbys": {"d75bc438":{}}, "synchronousStandbys": {"1b09e02a":{},"d75bc438":{}}}
  adding previous synchronous standby        {"masterDB": "063ec7b0", "synchronousStandbyDB": "d75bc438", "keeper": "node_2"}
  synchronousStandbys changed        {"masterDB": "063ec7b0", "prevSynchronousStandbys": {"d75bc438":{}}, "synchronousStandbys": {"1b09e02a":{},"d75bc438":{}}}
 no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
 master db is failed        {"db": "063ec7b0", "keeper": "node_3"}
  trying to find a new master to replace failed master
  electing db as the new master        {"db": "d75bc438", "keeper": "node_2"}
 no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
  waiting for proxies to be converged to the current generation        {"proxies": ["a5fbd147", "deac06bc", "f186d151"]}
 no keeper info available        {"db": "063ec7b0", "keeper": "node_3"}
  removing old master db        {"db": "063ec7b0", "keeper": "node_3"}
  won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones%!(EXTRA string=reported, []string=[063ec7b0 1b09e02a], string=spec, []string=[1b09e02a 063ec7b0])

@nh2
Contributor Author

nh2 commented May 24, 2018

I'm just looking through all assignments/changes to Spec.SynchronousStandbys. Should there be a sort after these appends?

if dbUID != newMasterDB.UID {
	newMasterDB.Spec.SynchronousStandbys = append(newMasterDB.Spec.SynchronousStandbys, dbUID)
} else {
	newMasterDB.Spec.SynchronousStandbys = append(newMasterDB.Spec.SynchronousStandbys, oldMasterdb.UID)
}
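
For example, something along these lines (an untested sketch on top of the snippet above, assuming the sort package is already imported in sentinel.go, as the existing sorting code suggests):

if dbUID != newMasterDB.UID {
	newMasterDB.Spec.SynchronousStandbys = append(newMasterDB.Spec.SynchronousStandbys, dbUID)
} else {
	newMasterDB.Spec.SynchronousStandbys = append(newMasterDB.Spec.SynchronousStandbys, oldMasterdb.UID)
}
// Keep the spec list in a deterministic order, matching the sort done elsewhere.
sort.Sort(sort.StringSlice(newMasterDB.Spec.SynchronousStandbys))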

@nh2
Contributor Author

nh2 commented May 24, 2018

I'm working around this right now with

diff --git a/cmd/sentinel/cmd/sentinel.go b/cmd/sentinel/cmd/sentinel.go
index 14d5649..451b529 100644
--- a/cmd/sentinel/cmd/sentinel.go
+++ b/cmd/sentinel/cmd/sentinel.go
@@ -1196,6 +1196,10 @@ func (s *Sentinel) updateCluster(cd *cluster.ClusterData, pis cluster.ProxiesInf
 					// this way, when we have to choose a new master we are sure
 					// that there're no intermediate changes between the
 					// reported standbys and the required ones.
+
+					// Workaround for https://github.com/sorintlab/stolon/issues/488
+					sort.Sort(sort.StringSlice(masterDB.Spec.SynchronousStandbys))
+
 					if !util.CompareStringSlice(masterDB.Status.SynchronousStandbys, masterDB.Spec.SynchronousStandbys) {
 						log.Infof("won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones", "reported", curMasterDB.Status.SynchronousStandbys, "spec", curMasterDB.Spec.SynchronousStandbys)
 					} else {

@sgotti
Member

sgotti commented May 25, 2018

@nh2 Somewhere, probably in the keeper, the synchronous standbys read from the postgresql.conf file are not returned in the order I was expecting. Sorting them before the check, as you're doing, seems a sane thing to do anyway. Can you open a PR?

@nh2
Contributor Author

nh2 commented May 25, 2018

@sgotti I don't feel fully equipped to do that; while I could certainly put in that line, I would prefer to close this with a real investigation of what's happening rather than just landing my workaround.

Also what you said

So I'm not sure why the log reports them in the wrong sort order since they are read from the clusterdata

is a valid point and I don't know the answer to that yet.

@sgotti
Member

sgotti commented May 28, 2018

I'm just looking through all assignments/changes to Spec.SynchronousStandbys. Should there be a sort after these appends?

Yeah, that's the problem.

I would prefer if we could close this with a real investigation of what's happening vs me landing my workaround.

From your logs it looks like the problem happens only when electing a new master, since the new master UID is swapped with the old one without re-sorting.

The best fix is to ignore the sort order in the check.
Opened PR #494.
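
For the record, the general shape of an order-insensitive check could look like this (a minimal standalone sketch, not the actual change in #494):

package main

import (
	"fmt"
	"sort"
)

// equalIgnoringOrder sorts copies of both slices and then compares them
// element by element, so only the members matter, not their order.
func equalIgnoringOrder(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

func main() {
	reported := []string{"063ec7b0", "1b09e02a"}
	spec := []string{"1b09e02a", "063ec7b0"}
	fmt.Println(equalIgnoringOrder(reported, spec)) // prints: true
}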

@sgotti
Member

sgotti commented May 29, 2018

Fixed in #494

@sgotti sgotti closed this as completed May 29, 2018