Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create replication documentation and formalize the process #932

Closed
mlissner opened this issue Feb 28, 2019 · 4 comments
Closed

Create replication documentation and formalize the process #932

mlissner opened this issue Feb 28, 2019 · 4 comments

Comments

@mlissner
Copy link
Member

In a few days, we launch our replication system, and we have one other client so far that's ready to turn it on. This is great! But we need to be better at setting people up. Our first client helped us a lot along the way, our second will probably find more issues, but our third needs it to be smooth, and our second needs to have formal documentation.

So, lets:

  1. Use this issue to write down the commands and processes for this, and
  2. Let's write up some documentation as a word doc or whatever, since this is a nice product and people expect and deserve that.
@mlissner
Copy link
Member Author

mlissner commented Feb 28, 2019

Create a new AWS server & database

  1. Create an AWS server with 1.2TB of general purpose gp2 storage space, as much RAM and CPU as needed. Recommended instance running Postgresql 14 in the Oregon availability zone.

  2. The server needs a database configured as follows:

    • Single DB instance
    • Instance name (a spice, was customers, but no more)
    • maser username: postgres (we'll create the django user later)
    • create and use a password
    • instance type: t3.xlarge
    • Storage:
      • 1200 GB gp2 storage, autoscale to 1500.
    • Connections:
      • Don't connect to EC2
      • VPC of vpc-0d1020b9cea79a3e3
      • Subnet group of default-vpc-0d102...
      • publicly accessible
      • Security group of rds-launch-wizard (yeah, yeah), availability zone of us-west-2b
      • No RDS proxy
      • Certificate authority: default
      • port: 5432
    • Password auth
    • Monitoring:
      • Performance insights on
      • No devops guru
      • Enhanced monitoring 60s.
    • Additional config:
  3. Allow ingress to the server:

    • From your laptop, ping the RDS domain name for the new server.
    • Add the IP of the server to the inbound rules in the security group.

    In AWS this is done through your VPC's security group. The one to edit is: sg-044759e3ba310566a.

  4. Test the connection:

     psql --host replica-for-documentation2.xxx.rds.amazonaws.com \
          --username=postgres --dbname postgres
    

@mlissner
Copy link
Member Author

mlissner commented Feb 28, 2019

Networking

Generally this went well. The general architecture is:

  • Route 53 connecting a domain name to ...
  • Elastic Load Balancer (ELB) connected to...
  • ELB Target Groups, routing data to...
  • An EC2 proxy running...
  • HAProxy, which forwards traffic to...
  • The RDS instance

The trick is to set this all up from bottom to top and to have it all work at the end. You have to do it that way so each piece can connect to the piece before.

The RDS instance

See above. No major tricks here.

EC2 & HAProxy

This is running on an EC2 micro instance that's built from a saved AMI. Just launch that AMI:

  • Instance type: t3.micro
  • Do not create new keys or use existing ones; the aws-replica-keys.pem key is built into the AMI.
  • Set the VPC (free-law-vpc)
  • Subnet: public-2d
  • Assign public IP
  • Set the security group to rds-launch-wizard and replication-proxy
  • Storage is all good go into the proxy settings

Launch!

Once launched, ssh into the instance...

ssh -i .ssh/aws-replica-keys.pem ubuntu@xyz

...and tweak the proxy settings to point to the RDS instance. (see bash history in the AMI for details)

Restart HAProxy for good measure and check its logs. They should say almost nothing.

Check that you can use psql to get into RDS from the CL server through the proxy.

ELB Target Groups

Create a target group on port 5432 and register the EC2 instance as a target:

  • Target type: Instance
  • Name $customer-{5432,5433}-tcp
  • Port: 5432, 5433
  • free-law-vpc
  • Health check protocol: tcp
  • Select instance

Elastic Load Balancer

  • Network load balancer
  • Name: {spice}-replica-elb
  • Scheme: Internet-facing
  • free-law-vpc
  • us-west-2d
  • No security groups
  • Add listeners from above

It may take a moment to provision (note the "State" field). Even once it's live, it can take a minute, but eventually you'll be able to use psql to connect directly to the DNS name.

Route 53

This part is easy. Just set up an A record as an alias to the ELB. Remember that subdomains aren't private, so use codenames here if needed.

Once it's done, you should be able to psql directly to something like spicey.courtlistener.com!

Set up Replication

Replication Ingress and Egress

There are two steps. First, you have to allow the client server's IP address for inbound and outbound traffic. Get their IP and allow it.

Second, you have to allow inbound traffic from the external IP address of the new client RDS server. You can get that IP address by pinging it from your laptop. Something like:

ping some-rds-server-us-west-2.rds.amazonaws.com

The above is done in the Security Group settings.

Set up replication

  1. On the subscriber server, create the django user:

    CREATE USER django CREATEDB NOSUPERUSER;
    \password django
    
  2. Create the latest version of the database schema. Run the following on the maintenance server:

     pg_dump --host $DB_HOST \
       --username django \
       --create \
       --schema-only \
       --no-privileges \
       --no-publications \
       --no-subscriptions courtlistener \
       --password  > | grep -v -i 'trigger' > cl_schema.`date -I`.sql
    

    Then edit the file. Audit references toTRIGGER, GRANT, PUBLICATION, SUBSCRIPTION, EXTENSION, or REPLICA (you might not see all of these, indeed, the last time I did this, the only one I saw was EXTENSION).

  3. Import the schema into the new server:

      psql --host=replica-for-documentation2.xxx.us-west-2.rds.amazonaws.com \
           --port=5432 --username=django --dbname postgres < cl_schema.2019-02-27.sql
    

    Note that the above uses the new django user, not the postgresql admin user to connect. I had an issue here last time with LOCALE_PROVIDER. It is an option for CREATEDB in postgresql 15, but not 14, but 14 seems to output it in the schema.

  4. Create a publication on cl-replica for the user to subscribe to:

    • Log into the publisher with django user.

    • List existing publications:

      select * from pg_publication;
      
    • Make a new publication:

    For search, people DB, and disclosures (2023-08-11):

     CREATE PUBLICATION nutmeg_publication FOR TABLE ONLY
         -- Audio
         audio_audio, audio_audio_panel, 
         -- Disclosures 
         disclosures_agreement, disclosures_debt, disclosures_financialdisclosure, disclosures_gift, disclosures_investment, disclosures_noninvestmentincome, disclosures_position, disclosures_reimbursement, disclosures_spouseincome, 
         -- People
         people_db_abarating, people_db_attorney, people_db_attorneyorganization, people_db_attorneyorganizationassociation, people_db_criminalcomplaint, people_db_criminalcount, people_db_education, people_db_party, people_db_partytype, people_db_person, people_db_person_race, people_db_politicalaffiliation, people_db_position, people_db_race, people_db_role, people_db_school, people_db_source, 
         -- FJC
         recap_fjcintegrateddatabase,
         -- Search
         search_bankruptcyinformation, search_citation, search_claim, search_claim_tags, search_claimhistory, search_court, search_docket, search_docket_panel, search_docket_tags, search_docketentry, search_docketentry_tags, search_opinion, search_opinion_joined_by, search_opinioncluster, search_opinioncluster_non_participating_judges, search_opinioncluster_panel, search_opinionscited, search_opinionscitedbyrecapdocument, search_originatingcourtinformation, search_parenthetical, search_parentheticalgroup, search_recapdocument, search_recapdocument_tags, search_tag
         ;
    

    To create a new list of tables, use \pset pager off then \dt then copy/paste into a spreadsheet. Probably better ways to do this.

  5. Create a new user for the subscriber to subscribe with on cl-replica.

    You can see existing roles with:

     SELECT rolname FROM pg_roles;
    

    On cl-replica do:

     CREATE ROLE some-client-name WITH LOGIN;
     GRANT rds_replication to some-user;
    

    Set the password (note that really long passwords will fail when you try to create the subscription, so be careful about that):

     \password some-user
    

    And grant them select on the needed tables:

    For PACER, disclosures, and people_db:

     GRANT SELECT on table
         -- ALL OF THE TABLES ABOVE...
     TO XXXXX;
    
  6. Set up the subscription on the new server:

    • On subscriber, connect to courtlistener DB as postgres user.

    • Run:

      CREATE SUBSCRIPTION spice_subscription 
        CONNECTION 'host=cl-replica (not subscriber host!) port=5432 password=XXX user=xxx dbname=courtlistener' 
        PUBLICATION opendata_custname;
      

    Subscriptions can be viewed with:

     select * from pg_subscription;
    

@mlissner

This comment has been minimized.

@mlissner
Copy link
Member Author

mlissner commented Sep 9, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant