Problem with synchronous standby order when upgrading from v0.9.0 to v0.10.0 #488
@nh2 Thanks for the report. In the sentinel we already sort the sync standby names, here: stolon/cmd/sentinel/cmd/sentinel.go, line 308 (commit ca5ca5a).
And it looks like the cluster data contains them correctly sorted, so I'm not sure why the log reports them in the wrong sort order, since they are read from the clusterdata.
I'll try to reproduce it if possible.
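For illustration, here is a minimal Go sketch of that normalization (the variable name and values are hypothetical, not the actual sentinel code): the standby UIDs get sorted before being written into the cluster data, so the spec should always hold them in lexical order.

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Hypothetical standby UIDs, in the order a keeper might report them.
	synchronousStandbys := []string{"dc15fe22", "0d1a8783"}

	// Sort the names before writing them into the cluster data, so the
	// spec always holds them in lexical order.
	sort.Strings(synchronousStandbys)

	fmt.Println(synchronousStandbys) // [0d1a8783 dc15fe22]
}
```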
@sgotti This happened to me again just now, this time not during an upgrade but just during normal v0.10.0 operation.
Here is the log. Cropped a bit, here's what happens in the failure scenario:
I'm just looking through all assignments/changes to stolon/cmd/sentinel/cmd/sentinel.go, lines 1069 to 1073 (commit d07f290).
I'm working around this right now with:

diff --git a/cmd/sentinel/cmd/sentinel.go b/cmd/sentinel/cmd/sentinel.go
index 14d5649..451b529 100644
--- a/cmd/sentinel/cmd/sentinel.go
+++ b/cmd/sentinel/cmd/sentinel.go
@@ -1196,6 +1196,10 @@ func (s *Sentinel) updateCluster(cd *cluster.ClusterData, pis cluster.ProxiesInf
 			// this way, when we have to choose a new master we are sure
 			// that there're no intermediate changes between the
 			// reported standbys and the required ones.
+
+			// Workaround for https://github.com/sorintlab/stolon/issues/488
+			sort.Sort(sort.StringSlice(masterDB.Spec.SynchronousStandbys))
+
 			if !util.CompareStringSlice(masterDB.Status.SynchronousStandbys, masterDB.Spec.SynchronousStandbys) {
 				log.Infof("won't update masterDB required synchronous standby since the latest master reported synchronous standbys are different from the db spec ones", "reported", curMasterDB.Status.SynchronousStandbys, "spec", curMasterDB.Spec.SynchronousStandbys)
 			} else {
@nh2 Somewhere, probably in the keeper, the synchronous standbys read from the postgresql.conf file are not returned in the order I was expecting. Sorting them before the check, like you're doing, seems a sane thing to do anyway. Can you open a PR?
@sgotti I don't feel totally equipped to do that; while I could certainly put in that line, I would prefer if we could close this with a real investigation of what's happening rather than me just landing my workaround. Also, what you said above is a valid point, and I don't know the answer to that yet.
Yeah, that's the problem.
From your logs it looks like the problem only happens when electing a new master, since the new master UID is swapped with the old one without sorting. The best fix would be to ignore the sort order in the check.
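As a minimal sketch of what such an order-insensitive check could look like (a hypothetical helper; the actual fix landed in #494 and may differ), the idea is to sort copies of both slices so the spec and status order is not mutated, then compare element by element:

```go
package main

import (
	"fmt"
	"sort"
)

// equalIgnoringOrder reports whether a and b contain the same elements,
// regardless of order. It sorts copies so the callers' slices (e.g. the
// spec and status lists) are left untouched.
func equalIgnoringOrder(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as := append([]string(nil), a...)
	bs := append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

func main() {
	reported := []string{"0d1a8783", "dc15fe22"} // order from the master's status
	spec := []string{"dc15fe22", "0d1a8783"}     // order from the db spec
	fmt.Println(equalIgnoringOrder(reported, spec)) // true
}
```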
Fixed in #494
Submission type
Environment
Stolon version
Upgrade from v0.9.0 to v0.10.0
Additional environment information if useful to understand the bug
When upgrading a cluster from v0.9.0 to v0.10.0, after the upgrade all write SQL queries hung forever because replication was stuck. In `ps aux` I could see the WAL sender process stuck at 0, like:

The `stolon-sentinel` log revealed the problem: `[0d1a8783 dc15fe22]` is exactly `[dc15fe22 0d1a8783]`, but the order is inverted. I suspect that somehow, through the upgrade, stolon started to expect a different ordering.
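To illustrate why the swapped order matters, here is a small, self-contained Go sketch; the comparison helper is hypothetical, only modelled on an order-sensitive element-wise check rather than the actual util.CompareStringSlice. With the two standby UIDs from the log in opposite order, such a check reports a mismatch even though the set of standbys is identical:

```go
package main

import "fmt"

// compareStringSlice is an illustrative element-wise comparison, modelled
// on what an order-sensitive check would do (not the actual stolon code).
func compareStringSlice(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	spec := []string{"dc15fe22", "0d1a8783"}     // order expected by the db spec
	reported := []string{"0d1a8783", "dc15fe22"} // order reported in the sentinel log

	// Same two standbys, different order: an order-sensitive check reports
	// a mismatch, so the required synchronous standbys are never updated
	// and writes stay blocked.
	fmt.Println(compareStringSlice(spec, reported)) // false
}
```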
Full logs
Workaround
Wiping the clusterdata (after taking a backup) fixed it:
The fix worked after I did the above and restarted all stolon-related services.