Shorten replication chain #1115
Here's my message:
The one response I got:
I asked "How?" and they replied:
I don't know how I feel about this at this point. It seems like there's no easy or good way to do this, which is pretty terrible. If there were a way to start enqueuing changes at the same time as stopping replication, we'd be in business, but I haven't figured that out yet. I'm going to try asking that of the mailing list. We'll see...
OK, we'll see if this gets any responses:
OK, some better conversation on this one. Looks like this will be doable. The trick is to drop the subscription while keeping the replication slot. You can do that with something like:
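In sketch form, with my_sub standing in for whatever the subscription on B is called (the disable has to come before slot_name = NONE, which bites me later):

```sql
-- Run on B. Detaching the slot first means dropping the subscription
-- leaves the replication slot behind on A, where changes keep piling up.
ALTER SUBSCRIPTION my_sub DISABLE;
ALTER SUBSCRIPTION my_sub SET (slot_name = NONE);
DROP SUBSCRIPTION my_sub;
```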
That'll keep the slot on "A" open so that changes pile up there. Then, on C, you can wait for changes to flush from B, then you subscribe C directly to A. And you should be off to the races!
OK, so the plan is to do a test setup not too different from the one I did in #977. I'll set up three docker containers on ports 5432, 5433, and 5434. And I'll set up replication from lowest port to highest. I'll set up a basic table, and I'll write a little script to add content on the low port so we can observe it going to the high port. Then, with ongoing writes to the low port, I'll attempt the approach above. We'll see if it all works.
**Setting up the test environment**

I started by getting three images going with:
I wasn't able to figure out how to
Once inside each of the three images, I created a database with one table with one column. I did this on each of the three images since they don't replicate schemas.

create database test;
\c test;
create table test_table (test_col varchar(10));

Then, on the master publisher, I was able to insert values with:

insert into test_table (test_col) values ('It works!');

The next step is to set up replication from 5432 --> 5433 --> 5434. We start by making users with passwords on each of the publishers:
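A sketch of what that user looks like (user1/asdf match the connection strings used further down; the REPLICATION attribute is the part logical replication actually requires):

```sql
-- On each publisher (5432 and 5433). LOGIN + REPLICATION is the minimum for a
-- logical replication connection; pg_hba.conf also has to allow the connection.
CREATE ROLE user1 WITH LOGIN REPLICATION PASSWORD 'asdf';
-- The role also needs read access to the published table for the initial copy.
GRANT SELECT ON test_table TO user1;
```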
Then we make two publications, one on 5432 and one on 5433:
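In sketch form (the FOR TABLE scope is an assumption; test_pub is the name the later subscriptions point at):

```sql
-- Run on 5432 and again on 5433.
CREATE PUBLICATION test_pub FOR TABLE test_table;
```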
I didn't create a publication on 5434 (it's the end of the chain). Finally we create two subscribers:
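Roughly like the following; the host IPs are whatever Docker handed the containers (that's the lookup mentioned next), I'm assuming each container listens on the default 5432 internally, and test_sub is the subscription name the rest of this issue refers to:

```sql
-- On 5433, subscribing to 5432's publication. By default this creates a
-- replication slot named test_sub on the publisher, which matters later.
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=172.17.0.2 port=5432 user=user1 password=asdf dbname=test'
    PUBLICATION test_pub;

-- On 5434, subscribing to 5433's publication:
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=172.17.0.3 port=5432 user=user1 password=asdf dbname=test'
    PUBLICATION test_pub;
```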
To get the correct host I used

With the above done, I could see data replicating. Whoo! Onwards to the test.
**Running the test**

Ok, this delightful mess will insert a value into our 5432 DB every quarter second (or whatever I tweak it to):
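For reference, one way to get that kind of trickle going from inside the database is a DO loop like this sketch (it assumes PostgreSQL 11+, since the block has to COMMIT each row for it to be visible to replication):

```sql
DO $$
BEGIN
  LOOP
    -- A random 10-character value, since test_col is varchar(10).
    INSERT INTO test_table (test_col) VALUES (left(md5(random()::text), 10));
    COMMIT;  -- PostgreSQL 11+; without it nothing replicates until the loop ends
    PERFORM pg_sleep(0.25);
  END LOOP;
END $$;
```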
(That's being run inside our 5432 DB.) So I'm going to set that up to run pretty frequently and we'll see if anything gets dropped when I shorten the chain.

**Step one, disconnect 5433 from 5432, but maintain changes on 5432**

On the middle server I run:

test=# BEGIN;
BEGIN
test=# alter subscription test_sub disable;
ALTER SUBSCRIPTION
test=# alter subscription test_sub set (slot_name = NONE);
ALTER SUBSCRIPTION
test=# COMMIT;

I don't know if the transaction helps. I hope so, but we're in the woods here. The step to disable had to be done first because without it you get an error that:
After 5433 was unsubscribed from 5432, I was able to check that it and the terminal server (5434) had the same number of items.

**Step two, subscribe 5434 to 5432**

Next, I subscribe 5434 to the publication that was previously sending data to 5433. My first attempt at this went...poorly. I ran this at first:
Which returned:
Fair enough. So I pulled up the CREATE SUBSCRIPTION docs, and then ran:
That worked, but I was getting error messages on 5434 (the server I ran it on) that said:
So.....I did something dumb. I dropped the subscription with:
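It was the plain form, more or less like the sketch below (the subscription name is a stand-in for whichever one I'd just created), and the plain form takes the publisher's slot with it:

```sql
-- DROP SUBSCRIPTION also drops the replication slot on the remote publisher
-- unless the slot was detached first with ALTER SUBSCRIPTION ... SET (slot_name = NONE).
DROP SUBSCRIPTION test_sub2;  -- stand-in name for the newly created subscription
```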
Careful readers will see that that dropped the remote slot that I needed to keep around to prevent my queued changes from going away. If this happened live I'd be very sad, but that's why we run tests. So...now I have to start over a bit. Luckily I have lots of notes.
**Step two, try two, wherein I hope not to mess it up again**

I re-established replication, cleared the data, and got the watch command going again. Then I tried again to subscribe to the newly available slot using this command:
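In sketch form — the same shape as the command in the next attempt, just without copy_data=false (host as reported for the 5432 container):

```sql
-- Sketch: identical to the try-three command below except copy_data is left at
-- its default (true), which is what triggers the extra full COPY.
CREATE SUBSCRIPTION test_sub
    CONNECTION 'host=172.17.0.5 port=5432 user=user1 password=asdf dbname=test'
    PUBLICATION test_pub
    WITH (create_slot = false, slot_name = test_sub);
```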
That almost worked, but I got one lingering problem: it did a full COPY of the data again. When I stop INSERTing rows into 5432 via the

Bah. Let's try one more time for the folks in the back.
**Step two, try three, wherein I feel pretty good about it?**

The one missing bit that I didn't do when I created the subscription last time was to include copy_data=false.
Arg. Let's try it one more time. I'll clear things out, set up the replication chain, set up watch again, and then try the following command:

CREATE SUBSCRIPTION test_sub CONNECTION 'host=172.17.0.5 port=5432 user=user1 password=asdf dbname=test' PUBLICATION test_pub with (create_slot=false, slot_name=test_sub, copy_data=false);

That almost worked! But for some reason I don't have the right number of items. In 5432, I have one fewer item than in 5434 after all is said and done. That means that somehow an item got added twice. I don't know what to make of that. I'll have to think about it more carefully tomorrow.
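Since the table has no primary key yet, a quick way to confirm that the extra row on 5434 really is a duplicate is a grouped count (just a sketch, using the columns from the setup above):

```sql
-- Any value that shows up more than once is a row that got applied twice.
SELECT test_col, count(*)
FROM test_table
GROUP BY test_col
HAVING count(*) > 1;
```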
I need to repeat this test with a primary key on the table.
Another todo:
Using #1115 (comment) I was able to recreate three postgresql servers in a chain. The only tweaks needed were:
Onwards to the test.
Ugh, so the newer postgresql image that I'm running lacks the
Once that's running, I went into the middle server and dropped the subscription using:

BEGIN;
alter subscription test_sub disable;
alter subscription test_sub set (slot_name = NONE);
COMMIT;

Nothing is flowing at this point. My next step was to go to the terminal server (5434) and subscribe it to the initial server (5432), with:

CREATE SUBSCRIPTION test_sub2 CONNECTION 'host=172.17.0.7 port=5432 user=user1 password=asdf dbname=test' PUBLICATION test_pub with (create_slot=false, slot_name=test_sub, copy_data=false);

With that done, the missing changes synced and I confirmed with a count:
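The count itself is just the obvious thing, run on both ends and compared (a sketch):

```sql
-- Run on the initial (5432) and terminal (5434) servers; the two numbers
-- should match once the queued changes have flushed through.
SELECT count(*) FROM test_table;
```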
Finally, I nuked the old subscription on the terminal server (this was just cleanup):

select * from pg_subscription;
alter subscription test_sub disable;
alter subscription test_sub set (slot_name = NONE);
drop subscription test_sub;

And finally, I double checked that everything was working properly still. It was. This was a successful test.
The only missing piece here is to think about when to switch the ports over. I think the answer is to do it after nuking the subscription from the middle to terminal servers. The live servers can be called

**Before starting**

On the

select view_count from search_docket where id=4214664;

Then check on the terminal server with the same command. Reload the page here:

Check each again. On the
We've got the router part set up with a mapping of port 5433 to 5432, so that's in a good place. Brian also suggested that I do a second chain to a second terminal AWS server and test this there, so using the instructions in #932, I set that up. The first copy is going now. Once it completes, I'll be able to do a trial run of this that's even more realistic than the docker-based one.
And, my plan to do a trial run isn't a good one because the topology isn't the same. It's like this:
Trying to use that as a test case would mean severing the tie between root and middle, which we can't do. Ultimately, I've read a lot of documentation and I've done a few trials, but I just don't have the confidence I need to do this properly. I need deeper expertise than I can get from documentation, so I'm going to go see if we can get support from 2nd Quadrant.
On the advice of somebody at pgexperts.com,* the safer approach here is to:
Steps taken:
If good...
If bad...
Cleanup
Later
Lessons learned:
This is finally done, after about six months and a lot of emails, stress, and work. I'm excited to do more routine work going forward.
And I dropped the DB from our old server, finally fixing the last piece of this problem:
Woo, it's good to see a drive go from 87% to 5% full.
As discussed in #1109, the chain of replication currently goes from:
master --> old-master --> AWS server --> Clients
The old-master should be axed from this chain. I have a post on the PostgreSQL mailing list about this. Hopefully there will be some responses soon.