# Reverse direction of logical replication in prod #2060
Organizing my thoughts a bit more about this:
Here's the full process...

### Preparation

Start by creating two publications on
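Roughly, a publication looks like this (the publication and table names below are placeholders, not the actual ones used):

```sql
-- Run on the database that will become the new publisher.
-- Publication and table names are placeholders.
CREATE PUBLICATION migration_pub FOR ALL TABLES;

-- Or, if the two publications split responsibilities, scope one to specific tables:
CREATE PUBLICATION migration_pub_partial FOR TABLE some_table, another_table;
```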
After that, the
The next trick is to create the subscriptions. I think I can test connectivity with a subscription that doesn't do anything:
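Something along these lines would do it; the names and connection details are placeholders:

```sql
-- A subscription that is created but never applies anything: creating it still
-- connects to the publisher and makes a replication slot, which is enough to
-- prove the network path and credentials work.
CREATE SUBSCRIPTION connectivity_test
  CONNECTION 'host=<publisher-ip> port=5432 dbname=<dbname> user=<replication-user> password=<password>'
  PUBLICATION migration_pub
  WITH (enabled = false, copy_data = false);
```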
To make those work I had to get the external IP address of cl-replica and of prod. On cl-replica's security group I had to add an outbound rule, and on prod's I had to add an inbound rule. Once those connections work, I drop the subscriptions (they get properly created again in a moment):
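Dropping them is just (using the placeholder name from the sketch above):

```sql
-- Dropping the subscription also removes its replication slot on the publisher.
DROP SUBSCRIPTION connectivity_test;
```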
**Update**

### Go Time

OK, that's everything we can do with the system up. Everything that follows must be done with it down.
Everything is now stopped. Monitor replication lag on
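One way to watch the lag from the publisher side (a generic query, not necessarily the exact one used here):

```sql
-- Run on the publisher: how far behind each downstream connection is, in bytes of WAL.
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```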
Disable replication:
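Presumably something like this, with the subscription name as a placeholder:

```sql
-- Run on each current subscriber to stop the old flow of changes.
ALTER SUBSCRIPTION old_chain_sub DISABLE;
```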
Reset sequences on the prod DB. THEY ARE NOT REPLICATED. Run this:
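One common way to do this is to generate setval() calls on the old primary and paste the output into prod; as a sketch:

```sql
-- Run on the old primary: prints one setval() call per sequence.
-- Paste the output into a psql session on prod.
SELECT format('SELECT setval(%L, %s);',
              schemaname || '.' || sequencename,
              last_value)
FROM pg_sequences
WHERE last_value IS NOT NULL;
```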
And then paste in the output SQL commands. Finally, create the subscriptions properly:
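A sketch of what that could look like; the names and connection string are placeholders, and the copy_data = false setting is an assumption (the rows are already in sync from the old chain, so the initial copy should be skipped):

```sql
-- Run on each database that should now follow the new primary.
-- copy_data = false skips the initial table copy, since the data is already there.
CREATE SUBSCRIPTION follow_new_primary
  CONNECTION 'host=<new-primary-ip> port=5432 dbname=<dbname> user=<replication-user> password=<password>'
  PUBLICATION migration_pub
  WITH (copy_data = false);
```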
Note that I cannot create these subscriptions in advance with enabled = false or something, because once you create the slots, the changes start piling up. We don't want changes piling up because we're already getting those same changes from the current subscription chain.

### Test

Do a sample write on
Check that it arrived on
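For example, a hypothetical smoke test (the table and column here are made up):

```sql
-- On the new primary: any easy-to-spot write will do.
INSERT INTO replication_smoke_test (note) VALUES ('flip-check');

-- On the downstream DB, a moment later:
SELECT * FROM replication_smoke_test WHERE note = 'flip-check';
```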
### If good...

Great. Re-enable all services in reverse order.

### If bad...

Disable new subscriptions, enable old ones, regroup.

### Clean up

If it's all good at this point, drop unused subscriptions and publications.
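Roughly (names are placeholders again):

```sql
-- On the old subscribers: drop the now-unused subscriptions.
DROP SUBSCRIPTION old_chain_sub;

-- On the old publisher: drop the now-unused publications.
DROP PUBLICATION old_chain_pub;
```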
### Postmortem

Pretty smooth! A couple lessons learned:
OK, so a couple more follow-ups before I wrap this up:
In fixing all this, I deleted content from these tables:
Key:
For the `search_*` tables, I did it manually using some interactive scripts I wrote, doing it bit by bit until it was clean. This took most of the time.

Finally, `Parenthetical`, `ParentheticalGroup`, and `EmailSent` didn't get synced before flipping the flow, so I nuked those tables from our new AWS DB, dumped them from our old on-premise DB using `pg_dump`, and then loaded them into the AWS DB. That load then synced via replication back to the on-prem DB, so I nuked these tables on the on-prem DB too, so replication could restore them there. To load the `Parenthetical` tables, I had to drop the constraints on the tables, load the data, then add the constraints back. It was a pain, but luckily we had migration SQL ready to go in the code base. Here are the dump/load commands:
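As a rough sketch of that kind of dump/load (host names, table names, and connection details are guesses, and the constraint drop/re-add comes from the existing migration SQL rather than anything shown here):

```bash
# Dump only the affected tables from the old on-premise DB (data only, since the
# schema already exists on the AWS side). Table names are guesses.
pg_dump --data-only \
  --table=search_parentheticalgroup \
  --table=search_parenthetical \
  --table=users_emailsent \
  --host=<old-on-prem-host> --username=<user> --dbname=<dbname> \
  --file=parentheticals.sql

# On the AWS DB: drop the constraints (using the migration SQL from the code base),
# load the data, then re-add the constraints.
psql --host=<aws-host> --username=<user> --dbname=<dbname> --file=parentheticals.sql
```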
I think things are stabilized as of now, but I've got new CloudWatch monitoring on the
Oh, and there are more details about this (yes, even more!) in Slack from today, in the #development channel.
Currently we have replication that goes like this:
prod is called that because it's destined to be our new production DB. We need to tweak the replication so it goes like this:
The challenge, of course, is doing this quickly and correctly so that we minimize downtime and don't lose any changes. In #1115 I did a lot of research on a similar topic, so I'll be using that to make a plan here.
Note that once this is done, we'll have our DB hosted in AWS. A very big step forward.