
Impossible to add a second node to the Stolon cluster [SOLVED] #168

Closed
afiskon opened this issue Oct 20, 2016 · 6 comments

afiskon commented Oct 20, 2016

Stolon version: 0.3.0 (same issue on master branch) - built from source
Consul version: 0.7.0 (default Arch Linux package)
PostgreSQL version: 9.5.4 (default Arch Linux package)
Go version: 1.7.1 (default Arch Linux package)

I've created 3 virtual machines using VirtualBox. The VMs are in the 10.0.3.0/24 network: 10.0.3.7 (archlinux1), 10.0.3.8 (archlinux2) and 10.0.3.9 (the last one is not used below).

Steps to reproduce.

On archlinux1:

./bin/stolon-sentinel --cluster-name mycluster --store-backend consul --listen-address 10.0.3.7

echo -n 'supasswd123' > ~/.pgsupasswd
echo -n 'replpasswd123' > ~/.pgreplpasswd

./bin/stolon-keeper --data-dir ./postgres1 --id postgres1 --cluster-name mycluster --pg-su-passwordfile ~/.pgsupasswd --pg-repl-username repluser --pg-repl-passwordfile ~/.pgreplpasswd --store-backend consul --pg-listen-address 10.0.3.7

./bin/stolon-proxy --cluster-name mycluster --listen-address 10.0.3.7 --port 25432 --store-backend consul

psql --host localhost --port 5432 -d postgres

CREATE DATABASE test;
CREATE USER test WITH PASSWORD 'qwerty';
GRANT ALL PRIVILEGES ON DATABASE test TO test;

psql --host 10.0.3.7 --port 25432 --user test

Single node configuration works as expected.

On archlinux2:

./bin/stolon-keeper --data-dir ./postgres2 --id postgres2 --cluster-name mycluster --pg-su-passwordfile ~/.pgsupasswd --pg-repl-username repluser --pg-repl-passwordfile ~/.pgreplpasswd --store-backend consul --pg-listen-address 10.0.3.8

./bin/stolon-sentinel --cluster-name mycluster --store-backend consul --listen-address 10.0.3.8

sentinel output:

2016-10-20 18:06:44.186737 [pkg_logger.go:134] I | sentinel: id: 96674841
2016-10-20 18:06:44.188273 [pkg_logger.go:134] I | sentinel: Trying to acquire sentinels leadership

(and nothing more)

keeper output:

2016-10-20 18:06:39.308606 [pkg_logger.go:134] I | keeper: id: postgres2
2016-10-20 18:06:39.313567 [pkg_logger.go:134] I | postgresql: Stopping database
2016-10-20 18:06:39.324793 [pkg_logger.go:134] I | keeper: current pg state: master
2016-10-20 18:06:39.325030 [pkg_logger.go:134] I | keeper: our keeper requested role is not available
2016-10-20 18:06:44.330639 [pkg_logger.go:134] I | keeper: current pg state: master
2016-10-20 18:06:44.330683 [pkg_logger.go:134] I | keeper: our keeper requested role is not available
2016-10-20 18:06:49.336491 [pkg_logger.go:134] I | keeper: current pg state: master

(etc, etc, etc)

keeper --debug output: http://afiskon.ru/s/21/a5c6ef5ab5_keeper.txt
sentinel --debug output: http://afiskon.ru/s/96/5affa784dc_sentinel.txt

Any advice?


sgotti commented Oct 20, 2016

@afiskon I need the logs from the leader sentinel (the one on archlinux1). For now, my guess is that the leader sentinel cannot talk to the keeper on the second node (which listens on port 5431 by default). Firewalled, perhaps?
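A quick way to rule out a firewall (a sketch, assuming the default keeper port 5431 and that nc is installed) is to check from archlinux1 whether the keeper port on archlinux2 accepts connections:

nc -vz 10.0.3.8 5431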


afiskon commented Oct 21, 2016

@sgotti OK, it was my mistake, not a bug. I forgot to specify the --listen-address parameter for the keeper; the fixed command is below. Everything works now: machine shutdowns, netsplits, etc. are handled as expected. Thanks!
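For the record, the keeper command that works on archlinux2 (the only change from my original steps is the added --listen-address flag, so the sentinel can reach the keeper):

./bin/stolon-keeper --data-dir ./postgres2 --id postgres2 --cluster-name mycluster --pg-su-passwordfile ~/.pgsupasswd --pg-repl-username repluser --pg-repl-passwordfile ~/.pgreplpasswd --store-backend consul --pg-listen-address 10.0.3.8 --listen-address 10.0.3.8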

I would like to ask you a few more questions:

  • Does the proxy send read-only requests to standbys, or are all queries always served by the master?
  • When replacing a master, does Stolon consider whether a standby is synchronous or not (assuming synchronous_standby_names has a proper value)?
  • Let's say I have multiple Stolon clusters. Do I need a separate Consul cluster for each Stolon cluster to handle netsplits and other corner cases properly?

Unfortunately the documentation currently doesn't answer these questions. Perhaps you could create a FAQ.md file or something like that.

afiskon changed the title from "BUG: impossible to add a second node to the cluster" to "Impossible to add a second node to the Stolon cluster [SOLVED]" on Oct 21, 2016

sgotti commented Oct 21, 2016

@afiskon Glad it worked!

Does the proxy send read-only requests to standbys, or are all queries always served by the master?

Currently the proxy redirects all requests to the master. Its primary purpose is to avoid connections to a partitioned master, and it does this by just passing data through (it's a layer-4 proxy; no content inspection is done). #132 is a feature request for also using the proxy for standbys, but it's low on the priority list.

When replacing a master, does Stolon consider whether a standby is synchronous or not (assuming synchronous_standby_names has a proper value)?

I'm not sure I got your question right. Currently you can only have async or sync standbys per cluster, not a mix of them.

Let's say I have multiple Stolon clusters. Do I need a separate Consul cluster for each Stolon cluster to handle netsplits and other corner cases properly?

It depends on your architecture and where the different stolon clusters are located. A good choice is to connect to a consul kv store placed in the same "location" as your stolon cluster. So if you have multiple stolon clusters in the same "location", you can use just one store (the key path is based on the stolon cluster name, so they won't conflict). Obviously, if your store goes down, all the clusters will be affected.
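For example, with two hypothetical clusters named cluster-a and cluster-b sharing one Consul, their keys live under separate prefixes. A sketch using the Consul HTTP KV API (assuming the stolon/cluster/ key layout and the default agent port 8500):

curl 'http://localhost:8500/v1/kv/stolon/cluster/?keys&separator=/'

This should return one prefix per stolon cluster, e.g. ["stolon/cluster/cluster-a/", "stolon/cluster/cluster-b/"].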

Unfortunately the documentation currently doesn't answer these questions. Perhaps you could create a FAQ.md file or something like that.

Yes, we are lacking some docs (there are also upcoming changes that will change a lot of things to add new features, so the current docs will be reworked a bit).
Any questions and PRs to add them to the documentation are greatly appreciated (see also #164) since, as always happens, what looks obvious to whoever develops the software is not so obvious to an external user.


afiskon commented Oct 21, 2016

@sgotti thanks a lot for your reply!

Currently the proxy redirects all requests to the master

It's not a big problem in practice (see https://github.com/sorintlab/stolon/issues/132#issuecomment-255372997). That actually leads me to one more little question: by any chance, does Stolon use Consul as a DNS server as well? Unfortunately I didn't find a way to list all the DNS records stored in Consul to figure it out myself.

I'm not sure I got your question right. Currently you can only have async or sync standbys per cluster, not a mix of them.

OK, basically sync mode means:

synchronous_standby_names = 'postgres2,postgres3'

According to PostgreSQL documentation ( https://www.postgresql.org/docs/9.5/static/runtime-config-replication.html ) :

Specifies a comma-separated list of standby names that can support synchronous replication, as described in Section 25.2.8. At any one time there will be at most one active synchronous standby; transactions waiting for commit will be allowed to proceed after this standby server confirms receipt of their data. The synchronous standby will be the first standby named in this list that is both currently connected and streaming data in real-time (as shown by a state of streaming in the pg_stat_replication view). Other standby servers appearing later in this list represent potential synchronous standbys.

Let's say postgres1 dies, postgres2 is the synchronous standby, and postgres3 is a bit out of sync (it's missing a few of the latest WAL records). Does Stolon guarantee that postgres2 will be the new master, or is there a 50% chance it will be postgres2 and a 50% chance it will be postgres3?
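(For context: one can check which standby PostgreSQL currently treats as synchronous by querying the master, e.g. via the superuser connection from my steps above:

psql --host 10.0.3.7 --port 5432 -d postgres -c "SELECT application_name, state, sync_state FROM pg_stat_replication;"

sync_state is 'sync' for the active synchronous standby and 'potential' for the others.)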

Any questions and PRs to add them to the documentation are greatly appreciated

Well, I think you could just copy-paste this discussion into a FAQ section :) IMO these are very important details that everyone should be aware of. Or I'll just send a corresponding PR a bit later.


sgotti commented Oct 21, 2016

by any chance, does Stolon use Consul as a DNS server as well? Unfortunately I didn't find a way to list all the DNS records stored in Consul to figure it out myself.

Currently stolon doesn't need to resolve any DNS names for its internal communication (sentinel -> keeper, stolonctl -> sentinel), though the communication style will probably change. Resolving names when connecting to a store (like etcd) is delegated to the client library and thus to the default Go resolver (whose behavior depends on whether the binary is compiled with or without cgo).

Or did you mean registering a service in Consul? In that case, no: the store (etcd/consul) is only used as a k/v store.
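You can verify that against a local Consul agent (a sketch assuming the default HTTP port 8500): the service catalog contains no stolon entries, while the k/v tree holds all the cluster data:

curl http://localhost:8500/v1/catalog/services
curl 'http://localhost:8500/v1/kv/stolon/?recurse'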

OK, basically sync mode means:

synchronous_standby_names = 'postgres2,postgres3'
According to PostgreSQL documentation ( https://www.postgresql.org/docs/9.5/static/runtime-config-replication.html ) :

This is handled automatically by stolon if you enable synchronous replication: https://github.com/sorintlab/stolon/blob/master/doc/syncrepl.md
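If I recall the doc correctly, it boils down to one command (treat this as a sketch; see the linked page for the exact, version-specific syntax):

stolonctl --cluster-name mycluster --store-backend consul config patch '{ "synchronous_replication" : true }'

stolon then generates and keeps synchronous_standby_names up to date on the master for you.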

Does Stolon guarantee that postgres2 will be the new master, or is there a 50% chance it will be postgres2 and a 50% chance it will be postgres3?

Currently it tries to find the best standby: the one with the xlog location nearest to the master's latest known xlog location. If a master is down, there's no way to know its latest xlog position (stolon fetches and saves it at intervals), so there's no way to guarantee that the chosen standby is not behind, only that the best of the available standbys will be chosen. An option that could be added in the future (I haven't had time) is to specify a maximum lag. But I think it's practically impossible to guarantee no out-of-sync data in some situations.
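For example, you can see how far a standby got by running against it (9.5 function names, using the addresses from your setup):

psql --host 10.0.3.8 --port 5432 -d postgres -c "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"

It's positions like these, as reported by the keepers, that are compared when electing the new master.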


afiskon commented Oct 21, 2016

But I think it's practically impossible to guarantee no out-of-sync data in some situations.

It's impossible in PostgreSQL <= 9.5, but in 9.6 replication to a quorum of standbys was added (see the 9.6 docs for synchronous_standby_names). It guarantees that in case of a netsplit there is at least one synchronous standby among the nodes in the majority. (Naturally, nothing will help when the cluster splits into three equal parts, i.e. a double netsplit. To handle that you need a so-called AP solution like Riak, so it's not our case.)
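For instance, on 9.6 with the three nodes from my setup one could write (standby names here match my keeper IDs, purely for illustration):

synchronous_standby_names = '2 (postgres2, postgres3)'

With two synchronous standbys, every commit is acknowledged by all three nodes, so any two-node majority that survives a netsplit still has the latest committed data.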

It would be very nice to have quorum synchronous replication support in Stolon. The lack of it basically means that Stolon can sometimes lose a few recent changes.

Thanks again for your insightful answers!
