
Reverse direction of logical replication in prod #2060

Closed
mlissner opened this issue May 12, 2022 · 5 comments

@mlissner
Member

Currently we have replication that goes like this:

app code --> on-prem-db --> cl-replica (AWS) --> prod (AWS)

prod is called that because it's destined to be our new production DB.

We need to tweak the replication so it goes like this:

app code --> prod (AWS) --> cl-replica (AWS)
                        `-> on-prem-db

The challenge, of course, is doing this quickly and correctly so that we minimize downtime and don't lose any changes. In #1115 I did a lot of research on a similar topic, so I'll be using that to make a plan here.

Note that once this is done, we'll have our DB hosted in AWS. A very big step forward.

@mlissner mlissner self-assigned this May 12, 2022
@mlissner
Member Author

Organizing my thoughts a bit more about this:

Component                           | Action
App code connection to on-prem      | Redirect to new DB
Celery code connection to on-prem   | Redirect to new DB
on-prem publication to cl-replica   | Drop once confirmed
cl-replica subscription to on-prem  | Disable at first, drop once confirmed
cl-replica publication to prod      | Drop once confirmed
prod subscription to cl-replica     | Disable at first, drop once confirmed
cl-replica subscription to prod     | Create and enable
prod publication to cl-replica      | Create
on-prem subscription to prod        | Create and enable
prod publication to on-prem         | Create
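
To survey the current state on each node before touching anything, the standard catalogs can be queried (these are stock PostgreSQL views, so the same queries work on on-prem, cl-replica, and prod):

-- List this node's subscriptions and whether they're enabled:
SELECT subname, subenabled FROM pg_subscription;

-- List this node's publications:
SELECT pubname FROM pg_publication;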

@mlissner
Member Author

Here's the full process...

Preparation

Start by creating two publications on prod. Have to do a few things:

  1. Give the django user rds_superuser access:

      grant rds_superuser to django;
    
  2. Set the rds.logical_replication parameter to 1 in the parameter group, then reboot. Once rebooted, confirm with show wal_level, which should then show logical.

After that, the django user was able to log into the courtlistener DB, and create the two publications:

create publication to_cl_replica for all tables;
create publication to_on_prem for all tables;
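
A quick sanity check that both publications exist and actually cover all the tables (standard catalog views):

SELECT pubname FROM pg_publication;
SELECT pubname, count(*) AS tables FROM pg_publication_tables GROUP BY pubname;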

The next trick is to create the subscriptions. I think I can test connectivity with a subscription that doesn't do anything:

# On on-prem, connect with:
sudo -u postgres psql --dbname courtlistener

# Then...
CREATE SUBSCRIPTION from_prod_to_on_prem
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener' 
  PUBLICATION to_on_prem with (create_slot=false, slot_name=NONE, copy_data=false, enabled=false);

# On cl-replica, connect with:
psql --host=xxx.rds.amazonaws.com --port=5432 --username=django --dbname courtlistener

# Then...
CREATE SUBSCRIPTION from_prod_to_replica
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener' 
  PUBLICATION to_cl_replica with (create_slot=false, slot_name=NONE, copy_data=false, enabled=false);

To make those work I had to get the external IP address of cl-replica and of prod. On cl-replica's security group I had to add an outbound rule, and on prod's I had to add an inbound rule.
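
CREATE SUBSCRIPTION connects to the publisher at creation time unless connect=false is passed, so a successful create is itself the connectivity test. To confirm the subscriptions are in place and still disabled, something like this works on each subscriber:

SELECT subname, subenabled, subconninfo FROM pg_subscription;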

Once those connections work, I drop the subscriptions (they get properly created again in a moment):

drop subscription from_prod_to_on_prem;  --on-prem
drop subscription from_prod_to_replica;  --cl-replica

Update .env files to have all the needed DB connection information, so it's picked up when things are started.

Go Time

OK, that's everything we can do with the system up. Everything that follows must be done with it down.

  • Stop cron jobs
  • Stop RSS: cd /var/www/cl/docker/nginx/ && sudo docker-compose stop cl-scrape-rss
  • Stop celery: cd /opt/tasks/docker/task-server/ && sudo docker-compose --env-file ../../.env.celery stop
  • Stop python (important so it gets new settings): cd /var/www/cl/docker/nginx/ && sudo docker-compose stop cl-python
  • Stop nginx: cd /var/www/cl/docker/nginx/ && sudo docker-compose stop cl-nginx

Everything is now stopped. Monitor replication lag on on-prem and cl-replica. Both should go to zero:

SELECT 
  slot_name,
  confirmed_flush_lsn, 
  pg_current_wal_lsn(), 
  (pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
FROM pg_replication_slots;

Disable replication:

  • on cl-replica: alter subscription replicasubscription2 disable;
  • on prod: alter subscription from_replica disable;

Reset sequences on the prod DB. THEY ARE NOT REPLICATED. Run this:

\pset pager 0
SELECT 'SELECT SETVAL(' ||
       quote_literal(quote_ident(PGT.schemaname) || '.' || quote_ident(S.relname)) ||
       ', COALESCE(MAX(' ||quote_ident(C.attname)|| '), 1) ) FROM ' ||
       quote_ident(PGT.schemaname)|| '.'||quote_ident(T.relname)|| ';'
FROM pg_class AS S,
     pg_depend AS D,
     pg_class AS T,
     pg_attribute AS C,
     pg_tables AS PGT
WHERE S.relkind = 'S'
    AND S.oid = D.objid
    AND D.refobjid = T.oid
    AND D.refobjid = C.attrelid
    AND D.refobjsubid = C.attnum
    AND T.relname = PGT.tablename
ORDER BY S.relname;

And then paste in the output SQL commands.
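
Each generated line is a plain SETVAL call. For example, for search_docket it comes out looking roughly like this (the sequence name follows Django's default naming, so treat it as illustrative):

SELECT SETVAL('public.search_docket_id_seq', COALESCE(MAX(id), 1)) FROM public.search_docket;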

Finally, create the subscriptions properly:

USE THE SUBSCRIPTION COMMANDS ABOVE, BUT WITH ONLY copy_data=false (see the sketch after the note below)

Note that I cannot create these subscriptions in advance with enabled=false or something, because once you create the slots, the changes start piling up. We don't want changes piling up because we're already getting those same changes from the current subscription chain.
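
For reference, a sketch of those final commands, built from the ones in the preparation section with only copy_data=false in the WITH clause:

-- On on-prem:
CREATE SUBSCRIPTION from_prod_to_on_prem
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener'
  PUBLICATION to_on_prem WITH (copy_data=false);

-- On cl-replica:
CREATE SUBSCRIPTION from_prod_to_replica
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener'
  PUBLICATION to_cl_replica WITH (copy_data=false);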

Test

Do a sample write on prod:

\c courtlistener
select view_count from search_docket where id = 4214664; 
update search_docket set view_count = view_count + 1 where id = 4214664;

Check that it arrived on cl-replica and on on-prem:

select view_count from search_docket where id = 4214664; 

If good...

Great. Re-enable all services in reverse order.

If bad...

Disable new subscriptions, enable old ones, regroup.

Clean up

If it's all good at this point, drop unused subscriptions and publications.
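
A sketch of that cleanup, using the subscription names that appear above; the old publication names on on-prem and cl-replica aren't recorded in this issue, so those are placeholders:

-- On cl-replica: drop the old subscription from on-prem and the old publication to prod
DROP SUBSCRIPTION replicasubscription2;
DROP PUBLICATION old_publication_to_prod;        -- placeholder name

-- On prod: drop the old subscription from cl-replica
DROP SUBSCRIPTION from_replica;

-- On on-prem: drop the old publication to cl-replica
DROP PUBLICATION old_publication_to_cl_replica;  -- placeholder name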

@mlissner
Member Author

Postmortem

Pretty smooth! A couple of lessons learned:

  1. You can't have two subscriptions to one publisher that use the same subscription name, because the slot on the publisher is named after the subscription (a sketch of the workaround follows this list).

  2. Before you start the new subscriptions, all sequences need to be reset on prod. They don't get set by logical replication.

  3. Don't forget to save changes in your settings file before starting things (doh).

  4. Watch logs and Sentry carefully after something like this, and for a while, to ensure no further backups develop.
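
For lesson 1, the escape hatch (if two subscribers ever do need the same subscription name) is to give each one an explicit, distinct slot_name, since the publisher-side slot otherwise defaults to the subscription name. A sketch with a hypothetical slot name:

CREATE SUBSCRIPTION from_prod_to_on_prem
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener'
  PUBLICATION to_on_prem WITH (copy_data=false, slot_name='from_prod_on_prem_slot');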

@mlissner
Member Author

OK, so a couple more follow-ups before I wrap this up:

  1. I didn't reset the sequences properly while doing this, and that created a lot of conflicts during replication: rows created on prod after the reversal got really low PKs (starting from 1), because the sequences were still at their initial values. (A quick check for this is sketched after this list.)

  2. Some of the tables weren't synced to prod before flipping direction, because they were new. This sucked and created a lot of trouble. After flipping the direction, new rows in those tables would get created starting with a PK of 1, and when they replicated back to on-prem they'd conflict with the rows already there. To fix this, I just nuked some of these tables (they only had new, unimportant data so far, like the users_email* tables), or I deleted the conflicting content on on-prem to clear it out of the way of the new stuff.
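
A quick per-table check for the sequence problem is to compare the sequence's current value against the max PK; the sequence name here follows Django's default naming and is illustrative:

SELECT last_value FROM search_docket_id_seq;
SELECT MAX(id) FROM search_docket;
-- If last_value is far below MAX(id), new inserts will collide with existing rows.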

In fixing all this, I deleted content from these tables:

People (gte)
Role (gte)
AttorneyOrganizationAssociation (gte)
users_emailsent (all)
users_emailflag (all)
ProcessingQueue (only new)
users_backoffevent (all)
search_docket (gte)
search_docketentry (gte)
search_recapdocument (gte)

Key:

gte -> I was able to limit the deletion just to new PKs that were conflicting
all -> I deleted everything in the table

For the search_* tables, I did it manually using some interactive scripts I wrote, doing it bit by bit until it was clean. This took most of the time.

Finally, Parenthetical, ParentheticalGroup, and EmailSent didn't get synced before flipping the flow, so I nuked those tables from our new AWS DB, dumped them from our old on-prem DB using pg_dump, and then loaded them into the AWS DB. That then synced via replication back to the on-prem DB, so I nuked these tables on the on-prem DB as well and let replication restore them. To load the Parenthetical tables, I had to drop the constraints on the tables, load the data, then add the constraints back (a sketch of that follows the dump/load commands below). It was a pain, but luckily we had migration SQL ready to go in the code base.

Here's the dump/load commands:

pg_dump -U django --host localhost --port 5432 --format plain --verbose -a --file /tmp/recap_emailprocessingqueue.sql --table public.recap_emailprocessingqueue --table users_emailsent courtlistener --password
psql -h prod.hash.rds.amazonaws.com --username django --dbname courtlistener -f /tmp/recap_emailprocessingqueue.sql  --password
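
And for the Parenthetical tables, the constraint dance looked roughly like this; the constraint and column names below are placeholders, since the real definitions came from the migration SQL in the code base:

-- Drop the FK that blocks the bulk load (placeholder names):
ALTER TABLE search_parentheticalgroup DROP CONSTRAINT some_fk_constraint;

-- ...load the dumped data with psql -f as above, then re-add the constraint:
ALTER TABLE search_parentheticalgroup
  ADD CONSTRAINT some_fk_constraint
  FOREIGN KEY (some_column_id) REFERENCES some_other_table (id);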

I think things are stabilized as of now, but I've got new CloudWatch monitoring on the prod DB, so I'll find out if not.

@mlissner mlissner moved this to Done in AWS Migration May 13, 2022
@mlissner
Member Author

Oh, and there are more details about this (yes, even more!) in Slack from today, in the #development channel.
