
Reverse direction of logical replication in prod #2060

Closed
mlissner opened this issue May 12, 2022 · 5 comments

@mlissner
Member

Currently we have replication that goes like this:

app code --> on-prem-db --> cl-replica (AWS) --> prod (AWS)

prod is called that because it's destined to be our new production DB.

We need to tweak the replication so it goes like this:

app code --> prod (AWS) --> cl-replica (AWS)
                        `-> on-prem-db

The challenge, of course, is doing this quickly and correctly so that we minimize downtime and don't lose any changes. In #1115 I did a lot of research on a similar topic, so I'll be using that to make a plan here.

Note that once this is done, we'll have our DB hosted in AWS. A very big step forward.

@mlissner mlissner self-assigned this May 12, 2022
@mlissner
Member Author

Organizing my thoughts a bit more about this:

Component                           | Action
App code connection to on-prem      | Redirect to new DB
Celery code connection to on-prem   | Redirect to new DB
on-prem publication to cl-replica   | Drop once confirmed
cl-replica subscription to on-prem  | Disable at first, drop once confirmed
cl-replica publication to prod      | Drop once confirmed
prod subscription to cl-replica     | Disable at first, drop once confirmed
cl-replica subscription to prod     | Create and enable
prod publication to cl-replica      | Create
on-prem subscription to prod        | Create and enable
prod publication to on-prem         | Create
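
To survey the current state on each node before touching anything, the standard catalogs can be queried (these are stock PostgreSQL views, so the same queries work on on-prem, cl-replica, and prod):

-- List this node's subscriptions and whether they're enabled:
SELECT subname, subenabled FROM pg_subscription;

-- List this node's publications:
SELECT pubname FROM pg_publication;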

@mlissner
Member Author

Here's the full process...

Preparation

Start by creating two publications on prod. Have to do a few things:

  1. Give the django user rds_superuser access:

      grant rds_superuser to django;
    
  2. Set the rds.logical_replication parameter to 1 in the parameter group, then reboot. Once rebooted, confirm with show wal_level, which should then show logical.

After that, the django user was able to log into the courtlistener DB, and create the two publications:

create publication to_cl_replica for all tables;
create publication to_on_prem for all tables;
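
A quick sanity check that both publications exist and actually cover all the tables (standard catalog views):

SELECT pubname FROM pg_publication;
SELECT pubname, count(*) AS tables FROM pg_publication_tables GROUP BY pubname;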

The next trick is to create the subscriptions. I think I can test connectivity with a subscription that doesn't do anything:

# On on-prem, connect with:
sudo -u postgres psql --dbname courtlistener

# Then...
CREATE SUBSCRIPTION from_prod_to_on_prem
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener' 
  PUBLICATION to_on_prem with (create_slot=false, slot_name=NONE, copy_data=false, enabled=false);

# On cl-replica, connect with:
psql --host=xxx.rds.amazonaws.com --port=5432 --username=django --dbname courtlistener

# Then...
CREATE SUBSCRIPTION from_prod_to_replica
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener' 
  PUBLICATION to_cl_replica with (create_slot=false, slot_name=NONE, copy_data=false, enabled=false);

To make those work I had to get the external IP address of cl-replica and of prod. On cl-replica's security group I had to add an outbound rule, and on prod's I had to add an inbound rule.
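
CREATE SUBSCRIPTION connects to the publisher at creation time unless connect=false is passed, so a successful create is itself the connectivity test. To confirm the subscriptions are in place and still disabled, something like this works on each subscriber:

SELECT subname, subenabled, subconninfo FROM pg_subscription;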

Once those connections work, I drop the subscriptions (they get properly created again in a moment):

drop subscription from_prod_to_on_prem;  --on-prem
drop subscription from_prod_to_replica;  --cl-replica

Update .env files to have all the needed DB connection information, so it's picked up when things are started.

Go Time

OK, that's everything we can do with the system up. Everything that follows must be done with it down.

  • Stop cron jobs
  • Stop RSS: cd /var/www/cl/docker/nginx/ && sudo docker-compose stop cl-scrape-rss
  • Stop celery: cd /opt/tasks/docker/task-server/ && sudo docker-compose --env-file ../../.env.celery stop
  • Stop python (important so it gets new settings): cd /var/www/cl/docker/nginx/ && sudo docker-compose stop cl-python
  • Stop nginx: cd /var/www/cl/docker/nginx/ && sudo docker-compose stop cl-nginx

Everything is now stopped. Monitor replication lag on on-prem and cl-replica. Both should go to zero:

SELECT 
  slot_name,
  confirmed_flush_lsn, 
  pg_current_wal_lsn(), 
  (pg_current_wal_lsn() - confirmed_flush_lsn) AS lsn_distance
FROM pg_replication_slots;

Disable replication:

  • on cl-replica: alter subscription replicasubscription2 disable;
  • on prod: alter subscription from_replica disable;

Reset sequences on the prod DB. THEY ARE NOT REPLICATED. Run this:

\pset pager 0
SELECT 'SELECT SETVAL(' ||
       quote_literal(quote_ident(PGT.schemaname) || '.' || quote_ident(S.relname)) ||
       ', COALESCE(MAX(' ||quote_ident(C.attname)|| '), 1) ) FROM ' ||
       quote_ident(PGT.schemaname)|| '.'||quote_ident(T.relname)|| ';'
FROM pg_class AS S,
     pg_depend AS D,
     pg_class AS T,
     pg_attribute AS C,
     pg_tables AS PGT
WHERE S.relkind = 'S'
    AND S.oid = D.objid
    AND D.refobjid = T.oid
    AND D.refobjid = C.attrelid
    AND D.refobjsubid = C.attnum
    AND T.relname = PGT.tablename
ORDER BY S.relname;

And then paste in the output SQL commands.
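
Each generated line is a plain SETVAL call. For example, for search_docket it comes out looking roughly like this (the sequence name follows Django's default naming, so treat it as illustrative):

SELECT SETVAL('public.search_docket_id_seq', COALESCE(MAX(id), 1)) FROM public.search_docket;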

Finally, create the subscriptions properly:

USE THE SUBSCRIPTION COMMANDS ABOVE, BUT WITH ONLY copy_data=false (see the sketch after the note below)

Note that I cannot create these subscriptions in advance with enabled=false or something, because once you create the slots, the changes start piling up. We don't want changes piling up because we're already getting those same changes from the current subscription chain.
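
For reference, a sketch of those final commands, built from the ones in the preparation section with only copy_data=false in the WITH clause:

-- On on-prem:
CREATE SUBSCRIPTION from_prod_to_on_prem
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener'
  PUBLICATION to_on_prem WITH (copy_data=false);

-- On cl-replica:
CREATE SUBSCRIPTION from_prod_to_replica
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener'
  PUBLICATION to_cl_replica WITH (copy_data=false);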

Test

Do a sample write on prod:

\c courtlistener
select view_count from search_docket where id = 4214664; 
update search_docket set view_count = view_count + 1 where id = 4214664;

Check that it arrived on cl-replica and on on-prem:

select view_count from search_docket where id = 4214664; 

If good...

Great. Re-enable all services in reverse order.

If bad...

Disable new subscriptions, enable old ones, regroup.

Clean up

If it's all good at this point, drop unused subscriptions and publications.
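
A sketch of that cleanup, using the subscription names that appear above; the old publication names on on-prem and cl-replica aren't recorded in this issue, so those are placeholders:

-- On cl-replica: drop the old subscription from on-prem and the old publication to prod
DROP SUBSCRIPTION replicasubscription2;
DROP PUBLICATION old_publication_to_prod;        -- placeholder name

-- On prod: drop the old subscription from cl-replica
DROP SUBSCRIPTION from_replica;

-- On on-prem: drop the old publication to cl-replica
DROP PUBLICATION old_publication_to_cl_replica;  -- placeholder name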

@mlissner
Member Author

Postmortem

Pretty smooth! A couple of lessons learned:

  1. You can't have two subscriptions to one publisher that use the same subscription name, because the slot on the publisher is named after the subscription (a sketch of the workaround follows this list).

  2. Before you start the new subscriptions, all sequences need to be reset on prod. They don't get set by logical replication.

  3. Don't forget to save changes in your settings file before starting things (doh).

  4. Watch logs and Sentry carefully after something like this, and for a while, to ensure no further backups develop.
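
For lesson 1, the escape hatch (if two subscribers ever do need the same subscription name) is to give each one an explicit, distinct slot_name, since the publisher-side slot otherwise defaults to the subscription name. A sketch with a hypothetical slot name:

CREATE SUBSCRIPTION from_prod_to_on_prem
  CONNECTION 'host=prod.db.aws port=5432 password=xxx user=django dbname=courtlistener'
  PUBLICATION to_on_prem WITH (copy_data=false, slot_name='from_prod_on_prem_slot');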

@mlissner
Member Author

OK, so a couple more follow-ups before I wrap this up:

  1. I didn't reset the sequences properly while doing this, and that created a lot of conflicts during replication: rows created on prod after the reversal got really low PKs (starting from 1), because the sequences were still at their initial values. (A quick check for this is sketched after this list.)

  2. Some of the tables weren't synced to prod before flipping direction, because they were new. This sucked and created a lot of trouble. After flipping the direction, new rows in those tables would get created starting with a PK of 1, and when they replicated back to on-prem they'd conflict with the rows already there. To fix this, I just nuked some of these tables (they only had new, unimportant data so far, like the users_email* tables), or I deleted the conflicting content on on-prem to clear it out of the way of the new stuff.
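
A quick per-table check for the sequence problem is to compare the sequence's current value against the max PK; the sequence name here follows Django's default naming and is illustrative:

SELECT last_value FROM search_docket_id_seq;
SELECT MAX(id) FROM search_docket;
-- If last_value is far below MAX(id), new inserts will collide with existing rows.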

In fixing all this, I deleted content from these tables:

People (gte)
Role (gte)
AttorneyOrganizationAssociation (gte)
users_emailsent (all)
users_emailflag (all)
ProcessingQueue (only new)
users_backoffevent (all)
search_docket (gte)
search_docketentry (gte)
search_recapdocument (gte)

Key:

gte -> I was able to limit the deletion just to new PKs that were conflicting
all -> I deleted everything in the table

For the search_* tables, I did it manually using some interactive scripts I wrote, doing it bit by bit until it was clean. This took most of the time.

Finally, Parenthetical, ParentheticalGroup, and EmailSent didn't get synced before flipping the flow, so I nuked those tables from our new AWS DB, dumped them from our old on-prem DB using pg_dump, and then loaded them into the AWS DB. That then synced via replication back to the on-prem DB, so I nuked these tables on the on-prem DB as well and let replication restore them. To load the Parenthetical tables, I had to drop the constraints on the tables, load the data, then add the constraints back (a sketch of that follows the dump/load commands below). It was a pain, but luckily we had migration SQL ready to go in the code base.

Here's the dump/load commands:

pg_dump -U django --host localhost --port 5432 --format plain --verbose -a --file /tmp/recap_emailprocessingqueue.sql --table public.recap_emailprocessingqueue --table users_emailsent courtlistener --password
psql -h prod.hash.rds.amazonaws.com --username django --dbname courtlistener -f /tmp/recap_emailprocessingqueue.sql  --password
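
And for the Parenthetical tables, the constraint dance looked roughly like this; the constraint and column names below are placeholders, since the real definitions came from the migration SQL in the code base:

-- Drop the FK that blocks the bulk load (placeholder names):
ALTER TABLE search_parentheticalgroup DROP CONSTRAINT some_fk_constraint;

-- ...load the dumped data with psql -f as above, then re-add the constraint:
ALTER TABLE search_parentheticalgroup
  ADD CONSTRAINT some_fk_constraint
  FOREIGN KEY (some_column_id) REFERENCES some_other_table (id);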

I think things are stabilized as of now, but I've got new CloudWatch monitoring on the prod DB, so I'll find out if not.

@mlissner mlissner moved this to Done in AWS Migration May 13, 2022
@mlissner
Member Author

Oh, and there are more details about this (yes, even more!) in Slack from today, in the #development channel.
