Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test disabling, dropping, and recreating subscriptions for RDS upgrade #2766

Closed
mlissner opened this issue May 18, 2023 · 4 comments
Closed

Comments

@mlissner
Copy link
Member

As part of #2745, I need to recreate our replication chain and test the process to make sure it works before doing it on our clients. Following the approach in #1115, I'll give this a shot and put notes here.

@mlissner
Copy link
Member Author

Set up the replication chain

The first step is to set up three postgresql docker images on port 5432, 5433, and 5434. These will be known by their port numbers, and data will flow from lowest to highest. In other words, 5432 will publish to 5433, and 5433 will publish to 5434.

Do this across three terminals:

 sudo docker run -p 5432 -e POSTGRES_PASSWORD=asdf postgres -p 5432 -c wal_level=logical
 sudo docker run -p 5433 -e POSTGRES_PASSWORD=asdf postgres -p 5433 -c wal_level=logical
 sudo docker run -p 5434 -e POSTGRES_PASSWORD=asdf postgres -p 5434 -c wal_level=logical

Then log into the three databases:

docker container ls | grep postgre
# In another three terminals
docker exec -it bash
psql -U postgres --port {5432|5433|5434}

Then create the DB in each and connect to it:

create database test;
\c test;
-- Make sure to create a table with a PK:
create table test_table(id serial primary key, test_col varchar(10));

Then create a user on all three...

CREATE ROLE user1 WITH REPLICATION LOGIN;
\password user1;

...and grant them permissions:

grant select on test_table to user1;

Create a publication on 5432 and 5433 with:

create publication test_pub for all tables;

Then finally subscribe to 5433 and 5432 from 5434 and 5433 respectively. To get the correct host IPs of the containers, use docker network inspect bridge and docker container ls | grep postgres, then:

CREATE SUBSCRIPTION test_sub CONNECTION 'host=172.17.0.3 port=5433 user=user1 password=pass dbname=test' PUBLICATION test_pub;

And:

CREATE SUBSCRIPTION test_sub CONNECTION 'host=172.17.0.2 port=5432 user=user1 password=pass dbname=test' PUBLICATION test_pub;

@mlissner
Copy link
Member Author

Run the test

Run this while pointing it to the 5432 server:

watch -n 0.25 docker exec -it 500  'psql -U postgres -d test -c "insert into test_table (test_col) values ('\''It works'\'');"

Then check counts in test_table during and after killing the above. They should map across.

If that works, your replication is up and running!

@mlissner
Copy link
Member Author

Now, get real and test dropping/disabling/creating subscriptions

Make sure that you're adding data:

watch -n 0.25 docker exec -it 500  'psql -U postgres -d test -c "insert into test_table (test_col) values ('\''It works'\'');"

Disable the prod (5432) to replica (5433) subscription. From 5433:

test=# select * from pg_subscription;
  oid  | subdbid | subskiplsn | subname  | subowner | subenabled | subbinary | substream | subtwophasestate | subdisableonerr |                          subconninfo                           | subslotname | subsynccommit | subpublications 
-------+---------+------------+----------+----------+------------+-----------+-----------+------------------+-----------------+----------------------------------------------------------------+-------------+---------------+-----------------
 16398 |   16388 | 0/0        | test_sub |       10 | t          | f         | f         | d                | f               | host=172.17.0.2 port=5432 user=user1 password=pass dbname=test | test_sub    | off           | {test_pub}
(1 row)

alter subscription test_sub disable;

Check that the data has flushed to 5434 (lsn_distance should be zero before proceeding):

        SELECT
            slot_name,
            confirmed_flush_lsn,
            pg_current_wal_lsn(),
            (pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
        FROM pg_replication_slots;

Drop subscriptions on 5434 servers:

test=# select * from pg_subscription;
  oid  | subdbid | subskiplsn | subname  | subowner | subenabled | subbinary | substream | subtwophasestate | subdisableonerr |                          subconninfo                           | subslotname | subsynccommit | subpublications 
-------+---------+------------+----------+----------+------------+-----------+-----------+------------------+-----------------+----------------------------------------------------------------+-------------+---------------+-----------------
 16397 |   16388 | 0/0        | test_sub |       10 | t          | f         | f         | d                | f               | host=172.17.0.3 port=5433 user=user1 password=pass dbname=test | test_sub    | off           | {test_pub}

test=# drop subscription test_sub ;
NOTICE:  dropped replication slot "test_sub" on publisher
DROP SUBSCRIPTION

Check slots on 5433:

test=# SELECT
            slot_name,
            confirmed_flush_lsn,
            pg_current_wal_lsn(),
            (pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
        FROM pg_replication_slots;
 slot_name | confirmed_flush_lsn | pg_current_wal_lsn | lsn_distance 
-----------+---------------------+--------------------+--------------
(0 rows)

Great! Upgrade 5433 in RDS (not part of this test).

Then re-create the subscriptions to 5433 from 5434:

CREATE SUBSCRIPTION test_sub CONNECTION 'host=172.17.0.3 port=5433 user=user1 password=pass dbname=test' PUBLICATION test_pub with (copy_data=false);
NOTICE:  created replication slot "test_sub" on publisher
CREATE SUBSCRIPTION

Check slots on 5433 as above. lsn_distance should still be 0.

Finally, re-enable the subscription from 5432 to 5433:

alter subscription test_sub enable;

Stop adding data with CTRL+C on the watch command, then check that all rows have the same count on all three servers:

select count(*) from test_table;

That's it! if the count is the same, the test is successful.

This looks like it works perfectly.

@mlissner
Copy link
Member Author

Note that this worked well, EXCEPT for one very huge problem. I haven' confirmed this yet, but it appears that when you upgrade a database using pg_upgrade (which is what RDS uses), it nukes the LSN location for (disabled?) subscriptions.

This means that when you restart the subscription, it's going to try to copy everything, with disastrous results.

The fix is that instead of (dis/en)abling the subscription, you DROP it at the outset, and then create a new one after, with copy_data=False. Lesson learned.

I asked for more info about this here: https://www.postgresql.org/message-id/CAMp9%3DEzf9OyuQaQUZDHEauTp-Dn_pPdS6xNmcFJHuQM_qsmZvQ%40mail.gmail.com

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

No branches or pull requests

1 participant