DB Migration and Replication Post-Mortem (2019-12-27 to 2019-12-30) #1109
Comments
Future Mitigations
Other ideas?
I put a post on pgsql-general with the subject, "How to shorten a chain of logically replicated servers", to address how to get the `old-master` out of the replication chain.
As described in #949, the proxy server is now used by all clients. It was easy, but I've learned a lot since May, when it became a straggling issue. It was hard then.
The symlinked `pg_wal` directory on `old-master` has been moved off the slow disk.
Swap files are gone. Just ran `swapoff` and deleted them.
Final schema updates are landing at customers and our DB.
Shortening the replication chain is spun off into #1115 and the rest of this is complete. Closing for now, thank goodness. All is well with the DBs and, in fact, things are slightly better than when we began, since the proxy implementation is now complete.
This issue documents the various causes, solutions, and problems that came up during the database migration begun on 2019-12-27 and mostly completed on 2019-12-30. The migration, while successful, encountered numerous problems that had to be solved on the fly.
Architecture and goals analysis
The CourtListener replication system uses PostgreSQL logical replication to copy data from our servers to those of our clients. The current architecture has five components:
The servers

- The `master` server. This is our core database that serves all SQL requests and acts as the ground truth. It is an incredibly powerful on-premise server.
- The `old-master` server. This is our previous master server/database. It subscribes to the `master` publisher, and then publishes changes to our `AWS-server`. This server should be removed from the architecture, but we have not figured out how to do so (more below). This is an on-premise server with medium horsepower.
- The `AWS-server`. This server is smaller and is only used to publish changes to our customer servers. It serves no SQL and runs on AWS RDS.
- Our `client-servers`. These servers are in AWS or Google Cloud and are controlled by our clients. They are usually about the same size and power as our `AWS-server`, to which they subscribe, but clients can scale them up and down as needed for their use case. The data and schema of these servers are kept in sync at all times.
The `proxy`

Because AWS RDS does not provide static, externally-accessible IP addresses on RDS instances, we need a `proxy` that does (see #949). This server sits between our AWS server and our clients, so that their replication connections can use a domain name instead of an IP address. Failure to do this causes headaches when the IP of our `AWS-server` changes, which it does from time to time (like when you scale it). This `proxy` is only used by half of our clients because not all have been upgraded to use it at this time. In general, there's no need to think about the proxy server; it just provides a domain name and doesn't do much else.

Data flow
When a data change comes in, it flows from `master` to `old-master`, then to the `AWS-server`, and finally to the `client-servers`.
The plan
There were two goals of this plan:
1. Add several new fields to the `search_opinion` and `search_opinioncluster` tables. These would add support for the Harvard database.
2. Tweak three text fields so that they are no longer nullable, per Django's convention. (Text fields in Django are never NULL. If you lack a value, they're always simply blank, `''`.)

To achieve this, the process is generally to use Django's built-in migration system to migrate the `master` server, then use the SQL it generates to manually migrate all the other servers (`old-master`, `AWS-server`, and `client-servers`).
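For reference, the SQL Django generates for a new text field is roughly of this shape; the table and column names here are illustrative, not the actual ones from the migration:

```sql
-- Approximate shape of Django's generated SQL for a new text field:
-- add the column NOT NULL with a DEFAULT, then drop the DEFAULT.
ALTER TABLE search_opinion ADD COLUMN some_new_field text DEFAULT '' NOT NULL;
ALTER TABLE search_opinion ALTER COLUMN some_new_field DROP DEFAULT;
```

On PostgreSQL versions before 11, that first statement is exactly what triggers the table rewrite described in the next section.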
What went wrong?
Lots.
Postgres < 11, DEFAULT columns, and disk exhaustion
Before version 11 of PostgreSQL, if you add a column to a table, and that column has a default value, the entire table is rewritten. For normal tables, this is fine; what's a few hundred MB among friends? For tables containing the text of millions of legal cases, this will copy hundreds of GB of data around for each column you add.
In Django, by convention, text fields are always blank by default and do not allow nulls. Thus, Django's standard migrations do exactly the wrong thing: they create new columns that have a DEFAULT of `''`. Normally this is fine, but not with big data like we have.

Our master server, due to its power and copious empty disk space, was able to do this migration without much sweat. Next, I began migrating `old-master`, and this is where our first major problem arose.

When I began migrating `old-master` and `aws-server`, they crashed due to disk exhaustion. AWS was easily fixed by scaling up the disk, but that will incur an extra cost forever because disks cannot be scaled down.

On `old-master`, this didn't wreck the server because PostgreSQL kills itself before it fills the disk entirely (I think), but it's bad. With time of the essence, I reached for the simplest solution, which was to move part of the PostgreSQL database to a different (slower) disk on the same server that had more space. When PostgreSQL rewrites tables, it stores temporary data in the `pg_wal` directory. Using the guide here, I moved that directory to a larger disk, symlinked to it, and restarted PostgreSQL and the migrations.

This was a mistake. I should have rewritten the migrations to avoid this problem instead, but that solution would come later, when I realized that the migrations themselves were too slow due to the DEFAULT column additions. Eventually I did realize that, and I rewrote the migrations as described in #1106. Instead of creating columns with DEFAULT values, you create them nullable, then issue an UPDATE command. That UPDATE must be in a separate transaction because, perhaps obviously, you cannot change a schema and then change its data in the same transaction.
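A sketch of the rewritten approach; the column name is illustrative, not the actual field from #1106:

```sql
-- Schema change: add the column as nullable with no DEFAULT.
-- On PostgreSQL < 11 this is a metadata-only change, so there is no table rewrite.
BEGIN;
ALTER TABLE search_opinion ADD COLUMN some_new_field text;
COMMIT;

-- Data change: backfill in a separate transaction.
BEGIN;
UPDATE search_opinion SET some_new_field = '' WHERE some_new_field IS NULL;
COMMIT;
```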
Memory exhaustion
There's another problem though. PostgreSQL moves entire transactions into memory before replicating them over the network. First, it moves them to `pg_replslot`, and then from there, they get transferred into memory.

As a result, if you have very large transactions, like the kind that'd be caused by updating several columns to have blank values instead of null ones, or that otherwise copy the entire table, you can expect that entire thing to go into memory. At least that's my hypothesis of what happened here.
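For what it's worth, the amount of WAL a publisher is retaining for its subscribers can be checked with a query along these lines:

```sql
-- On the publishing server: how much WAL each replication slot is holding back.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```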
After getting the updates on `old-master`, PostgreSQL entered a cycle of boom and bust. Memory would boom, Linux would kill PostgreSQL (bust), PostgreSQL would faithfully restart, the subscriber would request the changes again, and the cycle would repeat.

So the scene is this:
- You have transactions on `old-master` that need to be replicated to the `AWS-server`. Some transactions are easy to re-run by hand, like the UPDATEs that were part of the schema changes, but most of the transactions are unknown due to user and automated updates.
- You don't have enough memory to move those transactions to the subscriber.
I was able to stop the boom-bust cycle by DISABLE'ing the subscription on `aws-server`.
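That is, on `aws-server` (the subscription name here is a placeholder):

```sql
-- Pause replication from old-master without dropping the subscription or its slot.
ALTER SUBSCRIPTION old_master_sub DISABLE;
```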
That paused things so I could think, and I researched many solutions:

1. Delete the transaction log and let the replicated servers be slightly off. (Bad b/c of data integrity, but impossible anyway, because PostgreSQL won't boot if you delete transactions from the log.)
2. Advance the subscriber past the large transaction, thus skipping it. This can be done using the `pg_replication_origin_advance` function; a sketch follows this list. (Better, b/c less data loss, but impossible anyway because it requires superuser permission, which nobody gets on an RDS machine.)
3. Install more memory in the machine. (Impractical b/c how much is enough? Plus, the machine isn't nearby, and getting RAM on the weekend could be difficult. Does it even have open slots? An AWS machine would have been easier to scale here, but remember this is our on-premise `old-master` server.)
4. Use swap space. Set up a disk to serve as additional RAM.
   - Good: Doable!
   - Bad: Will be very slow; will cause disk contention for other processes on the server; how much is enough?
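For the record, option 2 would look roughly like this on the subscriber; the origin name and LSN here are purely illustrative:

```sql
-- Find the replication origin used by the subscription.
SELECT roname FROM pg_replication_origin;

-- Skip the subscriber forward past the problem transaction (requires superuser,
-- which is why this was a dead end on RDS).
SELECT pg_replication_origin_advance('pg_16389', '0/ABCD1234'::pg_lsn);
```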
In the end, I chose the only option available to me and created a 100GB swap file using these instructions. That got things rolling, but 100GB wasn't enough. Luckily, you can create as many swap files as needed. Ultimately it took 300GB of swap space, plus the 64GB of RAM we had, to get things moved. This caused downtime on our front end, as it contended for disk usage.
With the memory and disk exhaustion both solved by our one really large disk, things eventually replicated from `old-master` to `aws-server`. I knew this was complete because `pg_wal` shrank from about 400GB to 2GB, but the swap space continued to be used until I restarted PostgreSQL, at which point about 350GB of swap and 50GB of RAM freed up. I think this was just the kernel keeping data around in case it was needed again.

With the data in AWS, I re-enabled the client subscribers and the data flowed from our `aws-server` to theirs over the next few hours.

Null values in non-nullable fields
The problem, perhaps obviously, with separating schema changes from data changes is that the following can happen:
1. Migrate all servers to the new schema with nullable fields.
2. Issue an UPDATE command to set all null values to blank.
3. New data comes in and sets the field to null, because that's the default.
4. A transaction log entry is written!
5. Disallow NULL values in the column across all servers.
The result of this is that a null value is temporarily created in the column and then put in the transaction log. From there, it gets replicated, but the server to which it is replicating will not allow null values because of step 5. This breaks replication until you allow nulls on that server, let the replication flow through, update the value to a blank, and then make it not-null again.
Ideally, you could prevent this by putting all five steps into a transaction, but remember that you cannot put schema and data changes in the same transaction. The solution, therefore, is to use the following process instead (a SQL sketch follows the list):
1. Disable all subscriptions.
2. In a transaction on `master`:
   - Add columns allowing NULLs and with no DEFAULT.
   - Set DEFAULT to `''` (this must be a separate step, or face a table rewrite!).
   - Set NOT NULL.
3. In a transaction on all other servers:
   - Add columns with no DEFAULT, NOT NULL.
   - Set DEFAULT.
4. Re-enable all subscriptions.
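On `master`, step 2 comes down to statements like these (the column name is illustrative). Note that SET NOT NULL only succeeds once existing rows have been backfilled with `''`, and that backfill UPDATE has to run in its own transaction, as described earlier:

```sql
-- Add the column nullable with no DEFAULT: metadata-only, no rewrite.
ALTER TABLE search_opinion ADD COLUMN some_new_field text;

-- Set the DEFAULT as a separate step: applies only to new rows, no rewrite.
ALTER TABLE search_opinion ALTER COLUMN some_new_field SET DEFAULT '';

-- Forbid NULLs: scans the table but does not rewrite it.
ALTER TABLE search_opinion ALTER COLUMN some_new_field SET NOT NULL;
```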
Further analysis and tasks
Can the intermediate server be removed? Currently the data migrates from `master` to `old-master`, and then to `aws-server`. Is there a way to remove `old-master` from the loop? The trick for pulling this off is to (a SQL sketch of the first and last steps follows the list):
1. Stop the flow of changes from `master` to `old-master`, and start a new PUBLICATION on `master` that will begin queueing changes for `aws-server`.
2. Confirm nothing more is in flight between `master` and `old-master`.
3. Let `old-master` flush to `aws-server`.
4. Drop the subscription between `aws-server` and `old-master`.
5. Point `aws-server` to the new PUBLICATION.

That might work?
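Sketched in SQL, the first and last steps might look like this; the publication, subscription, slot, and connection names are all placeholders, and this is untested:

```sql
-- Step 1, on master: create a publication and a slot so changes start
-- being retained for aws-server before it subscribes.
CREATE PUBLICATION aws_pub FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('aws_slot', 'pgoutput');

-- Step 5, on aws-server: after old-master has flushed and its old
-- subscription is dropped, attach directly to master, reusing the
-- pre-created slot and skipping the initial table copy.
CREATE SUBSCRIPTION aws_sub
    CONNECTION 'host=master.example.com dbname=courtlistener user=replica'
    PUBLICATION aws_pub
    WITH (create_slot = false, slot_name = 'aws_slot', copy_data = false);
```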
Upgrade all clients to use the proxy server. This should be easy. Just DISABLE the connection, ALTER it, and then ENABLE it again.
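On each client server, that amounts to something like the following, with a placeholder subscription name and proxy hostname:

```sql
-- Pause the subscription, repoint it at the proxy's domain name, and resume it.
ALTER SUBSCRIPTION client_sub DISABLE;
ALTER SUBSCRIPTION client_sub
    CONNECTION 'host=replication-proxy.example.com port=5432 dbname=courtlistener user=replica';
ALTER SUBSCRIPTION client_sub ENABLE;
```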
Move `pg_wal` on `old-master` off the slow disk.

Nuke the extra swap files.
Make the fields NOT NULL in various places