DB Migration and Replication Post-Mortem (2019-12-27 to 2019-12-30) #1109
Comments
Future Mitigations
Other ideas?
I put a post on pgsql-general with the subject, "How to shorten a chain of logically replicated servers", to address how to get the `old-master` out of the replication chain.
As described in #949, the proxy server is now used by all clients. It was easy, but I've learned a lot since May, when it became a straggling issue. It was hard then.
The symlinked `pg_wal` directory on `old-master` has been moved off the slow disk.
Swap files are gone. Just ran `swapoff` and deleted them.
Final schema updates are landing at customers and our DB.
Shortening the replication chain is spun off into #1115 and the rest of this is complete. Closing for now, thank goodness. All is well with the DBs and, in fact, things are slightly better than when we began, since the proxy implementation is now complete.
This issue documents the various causes, solutions, and problems that came up during the database migration begun on 2019-12-27 and mostly completed on 2019-12-30. The migration, while successful, encountered numerous problems that had to be solved on the fly.
Architecture and goals analysis
The CourtListener replication system uses PostgreSQL logical replication to copy data from our servers to those of our clients. The current architecture has five components:
The servers

- The `master` server. This is our core database that serves all SQL requests and acts as the ground truth. It is an incredibly powerful on-premise server.
- The `old-master` server. This is our previous master server/database. It subscribes to the `master` publisher, and then publishes changes to our `AWS-server`. This server should be removed from the architecture, but we have not figured out how to do so (more below). This is an on-premise server with medium horsepower.
- The `AWS-server`. This server is smaller and is only used to publish changes to our customer servers. It serves no SQL and runs on AWS RDS.
- Our `client-servers`. These servers are in AWS or Google Cloud and are controlled by our clients. They are usually about the same size and power as our `AWS-server`, to which they subscribe, but clients can scale them up and down as needed for their use case. The data and schema of these servers are kept in sync at all times.
The `proxy`

Because AWS RDS does not provide static, externally-accessible IP addresses on RDS instances, we need a `proxy` that does (see #949). This server sits between our AWS server and our clients, so that their replication connections can use a domain name instead of an IP address. Failure to do this causes headaches when the IP of our `AWS-server` changes, which it does from time to time (like when you scale it). This `proxy` is only used by half of our clients because not all have been upgraded to use it at this time. In general, there's no need to think about the proxy server; it just provides a domain name and doesn't do much else.

Data flow
When a data change comes in, it flows from `master` to `old-master`, then to the `AWS-server`, and finally to the `client-servers`.
The plan
There were two goals of this plan:
1. Add several new fields to the `search_opinion` and `search_opinioncluster` tables. These would add support for the Harvard database.
2. Tweak three text fields so that they are no longer nullable, per Django's convention. (Text fields in Django are never NULL. If you lack a value, they're always simply blank, `''`.)

To achieve this, the process is generally to use Django's built-in migration system to migrate the `master` server, then use the SQL it generates to manually migrate all the other servers (`old-master`, `AWS-server`, and `client-servers`).
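For reference, the SQL Django generates for a new text field is roughly of this shape; the table and column names here are illustrative, not the actual ones from the migration:

```sql
-- Approximate shape of Django's generated SQL for a new text field:
-- add the column NOT NULL with a DEFAULT, then drop the DEFAULT.
ALTER TABLE search_opinion ADD COLUMN some_new_field text DEFAULT '' NOT NULL;
ALTER TABLE search_opinion ALTER COLUMN some_new_field DROP DEFAULT;
```

On PostgreSQL versions before 11, that first statement is exactly what triggers the table rewrite described in the next section.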
What went wrong?
Lots.
Postgres < 11, DEFAULT columns, and disk exhaustion
Before version 11 of PostgreSQL, if you add a column to a table, and that column has a default value, the entire table is rewritten. For normal tables, this is fine; what's a few hundred MB among friends? For tables containing the text of millions of legal cases, this will copy hundreds of GB of data around for each column you add.
In Django, by convention, text fields are always blank by default and do not allow nulls. Thus, Django's standard migrations do exactly the wrong thing: they create new columns that have a DEFAULT of `''`. Normally this is fine, but not with big data like we have.

Our master server, due to its power and copious empty disk space, was able to do this migration without much sweat. Next, I began migrating `old-master`, and this is where our first major problem arose.

When I began migrating `old-master` and `aws-server`, they crashed due to disk exhaustion. AWS was easily fixed by scaling up the disk, but that will incur an extra cost forever because disks cannot be scaled down.

On `old-master`, this didn't wreck the server because PostgreSQL kills itself before it fills the disk entirely (I think), but it's bad. With time of the essence, I reached for the simplest solution, which was to move part of the PostgreSQL database to a different (slower) disk on the same server that had more space. When PostgreSQL rewrites tables, it stores temporary data in the `pg_wal` directory. Using the guide here, I moved that directory to a larger disk, symlinked to it, and restarted PostgreSQL and the migrations.

This was a mistake. I should have rewritten the migrations to avoid this problem instead, but that solution would come later, when I realized that the migrations themselves were too slow due to the DEFAULT column additions. Eventually I did realize that, and I rewrote the migrations as described in #1106. Instead of creating columns with DEFAULT values, you create them nullable, then issue an UPDATE command. That UPDATE must be in a separate transaction because, perhaps obviously, you cannot change a schema and then change its data in the same transaction.
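A sketch of the rewritten approach; the column name is illustrative, not the actual field from #1106:

```sql
-- Schema change: add the column as nullable with no DEFAULT.
-- On PostgreSQL < 11 this is a metadata-only change, so there is no table rewrite.
BEGIN;
ALTER TABLE search_opinion ADD COLUMN some_new_field text;
COMMIT;

-- Data change: backfill in a separate transaction.
BEGIN;
UPDATE search_opinion SET some_new_field = '' WHERE some_new_field IS NULL;
COMMIT;
```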
Memory exhaustion
There's another problem though. PostgreSQL moves entire transactions into memory before replicating them over the network. First, it moves them to `pg_replslot`, and then from there, they get transferred into memory.

As a result, if you have very large transactions, like the kind that'd be caused by updating several columns to have blank values instead of null ones, or that otherwise copy the entire table, you can expect that entire thing to go into memory. At least that's my hypothesis of what happened here.
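For what it's worth, the amount of WAL a publisher is retaining for its subscribers can be checked with a query along these lines:

```sql
-- On the publishing server: how much WAL each replication slot is holding back.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```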
After getting the updates on `old-master`, PostgreSQL entered a cycle of boom and bust. Memory would boom, Linux would kill PostgreSQL (bust), PostgreSQL would faithfully restart, the subscriber would request the changes again, and the cycle would repeat.

So the scene is this:
- You have transactions on `old-master` that need to be replicated to the `AWS-server`. Some transactions are easy to re-run by hand, like the UPDATEs that were part of the schema changes, but most of the transactions are unknown due to user and automated updates.
- You don't have enough memory to move those transactions to the subscriber.
I was able to stop the boom-bust cycle by DISABLE'ing the subscription on `aws-server`.
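That is, on `aws-server` (the subscription name here is a placeholder):

```sql
-- Pause replication from old-master without dropping the subscription or its slot.
ALTER SUBSCRIPTION old_master_sub DISABLE;
```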
That paused things so I could think, and I researched many solutions:

1. Delete the transaction log and let the replicated servers be slightly off. (Bad b/c of data integrity, but impossible anyway, because PostgreSQL won't boot if you delete transactions from the log.)
2. Advance the subscriber past the large transaction, thus skipping it. This can be done using the `pg_replication_origin_advance` function; a sketch follows this list. (Better, b/c less data loss, but impossible anyway because it requires superuser permission, which nobody gets on an RDS machine.)
3. Install more memory in the machine. (Impractical b/c how much is enough? Plus, the machine isn't nearby, and getting RAM on the weekend could be difficult. Does it even have open slots? An AWS machine would have been easier to scale here, but remember this is our on-premise `old-master` server.)
4. Use swap space. Set up a disk to serve as additional RAM.
   - Good: Doable!
   - Bad: Will be very slow; will cause disk contention for other processes on the server; how much is enough?
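For the record, option 2 would look roughly like this on the subscriber; the origin name and LSN here are purely illustrative:

```sql
-- Find the replication origin used by the subscription.
SELECT roname FROM pg_replication_origin;

-- Skip the subscriber forward past the problem transaction (requires superuser,
-- which is why this was a dead end on RDS).
SELECT pg_replication_origin_advance('pg_16389', '0/ABCD1234'::pg_lsn);
```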
In the end, I chose the only option available to me and created a 100GB swap file using these instructions. That got things rolling, but 100GB wasn't enough. Luckily, you can create as many swap files as needed. Ultimately it took 300GB of swap space, plus the 64GB of RAM we had, to get things moved. This caused downtime on our front end, as it contended for disk usage.
With the memory and disk exhaustion both solved by our one really large disk, things eventually replicated from `old-master` to `aws-server`. I knew this was complete because `pg_wal` shrank from about 400GB to 2GB, but the swap space continued to be used until I restarted PostgreSQL, at which point about 350GB of swap and 50GB of RAM freed up. I think this was just the kernel keeping data around in case it was needed again.

With the data in AWS, I re-enabled the client subscribers and the data flowed from our `aws-server` to theirs over the next few hours.

Null values in non-nullable fields
The problem, perhaps obviously, with separating schema changes from data changes is that the following can happen:
1. Migrate all servers to the new schema with nullable fields.
2. Issue an UPDATE command to set all null values to blank.
3. New data comes in and sets the field to null, because that's the default.
4. A transaction log entry is written!
5. Disallow NULL values in the column across all servers.
The result of this is that a null value is temporarily created in the column and then put in the transaction log. From there, it gets replicated, but the server to which it is replicating will not allow null values because of step 5. This breaks replication until you allow nulls on that server, let the replication flow through, update the value to a blank, and then make it not-null again.
Ideally, you could prevent this by putting all five steps into a transaction, but remember that you cannot put schema and data changes in the same transaction. The solution, therefore, is to use the following process instead (a SQL sketch follows the list):
1. Disable all subscriptions.
2. In a transaction on `master`:
   - Add columns allowing NULLs and with no DEFAULT.
   - Set DEFAULT to `''` (this must be a separate step, or face a table rewrite!).
   - Set NOT NULL.
3. In a transaction on all other servers:
   - Add columns with no DEFAULT, NOT NULL.
   - Set DEFAULT.
4. Re-enable all subscriptions.
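On `master`, step 2 comes down to statements like these (the column name is illustrative). Note that SET NOT NULL only succeeds once existing rows have been backfilled with `''`, and that backfill UPDATE has to run in its own transaction, as described earlier:

```sql
-- Add the column nullable with no DEFAULT: metadata-only, no rewrite.
ALTER TABLE search_opinion ADD COLUMN some_new_field text;

-- Set the DEFAULT as a separate step: applies only to new rows, no rewrite.
ALTER TABLE search_opinion ALTER COLUMN some_new_field SET DEFAULT '';

-- Forbid NULLs: scans the table but does not rewrite it.
ALTER TABLE search_opinion ALTER COLUMN some_new_field SET NOT NULL;
```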
Further analysis and tasks
Can the intermediate server be removed? Currently the data migrates from `master` to `old-master`, and then to `aws-server`. Is there a way to remove `old-master` from the loop? The trick for pulling this off is to (a SQL sketch of the first and last steps follows the list):
1. Stop the flow of changes from `master` to `old-master`, and start a new PUBLICATION on `master` that will begin queueing changes for `aws-server`.
2. Confirm nothing more is in flight between `master` and `old-master`.
3. Let `old-master` flush to `aws-server`.
4. Drop the subscription between `aws-server` and `old-master`.
5. Point `aws-server` to the new PUBLICATION.

That might work?
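Sketched in SQL, the first and last steps might look like this; the publication, subscription, slot, and connection names are all placeholders, and this is untested:

```sql
-- Step 1, on master: create a publication and a slot so changes start
-- being retained for aws-server before it subscribes.
CREATE PUBLICATION aws_pub FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('aws_slot', 'pgoutput');

-- Step 5, on aws-server: after old-master has flushed and its old
-- subscription is dropped, attach directly to master, reusing the
-- pre-created slot and skipping the initial table copy.
CREATE SUBSCRIPTION aws_sub
    CONNECTION 'host=master.example.com dbname=courtlistener user=replica'
    PUBLICATION aws_pub
    WITH (create_slot = false, slot_name = 'aws_slot', copy_data = false);
```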
Upgrade all clients to use the proxy server. This should be easy. Just DISABLE the connection, ALTER it, and then ENABLE it again.
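On each client server, that amounts to something like the following, with a placeholder subscription name and proxy hostname:

```sql
-- Pause the subscription, repoint it at the proxy's domain name, and resume it.
ALTER SUBSCRIPTION client_sub DISABLE;
ALTER SUBSCRIPTION client_sub
    CONNECTION 'host=replication-proxy.example.com port=5432 dbname=courtlistener user=replica';
ALTER SUBSCRIPTION client_sub ENABLE;
```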
Move `pg_wal` on `old-master` off the slow disk.

Nuke the extra swap files.
Make the fields NOT NULL in various places