Create new database replication offering #902
I'm working on getting a demo database put together to test this, using AWS to easily spin up a database server. Some notes:
The docs say that after SSL is set up, you have to do some weird stuff with the connection string, but this connection string is now working for me: `psql --host SOMEHOST.rds.amazonaws.com --username xxx --password --dbname xxx`. So I think that server is set up and it's onwards to the CL server itself, to enable replication on it.
OK, continuing along: we've got a pretty good tutorial from DigitalOcean, with a couple of things to note along the way.
Alright! That was very hard, but I've got a replication going. To avoid the pile-up, I need to remove my subscription. That shouldn't be a huge deal. Just have to do it with the standard command:
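(The subscription name below is an assumption, based on the later comments in this thread.)

```sql
-- Subscription name is assumed; this also drops the replication slot on the
-- publisher, as long as the subscriber can still reach it.
DROP SUBSCRIPTION opendata;
```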
Some other notes about what made this so hard:
There are a couple ways to monitor the replication, but I haven't found them super helpful. On the publisher you can run:
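(Presumably the standard view; it shows one row per connected WAL sender.)

```sql
SELECT * FROM pg_stat_replication;
```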
And on the subscriber you can run:
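(Again presumably; this one shows a row per subscription worker.)

```sql
SELECT * FROM pg_stat_subscription;
```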
What I did find useful was running:
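(A sketch of that kind of check; the table name is a placeholder.)

```sql
-- Re-run a row count every couple of seconds with psql's \watch
SELECT count(*) FROM some_replicated_table
\watch
```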
And watching my network monitor, which showed the data flowing in. The only remaining things TBD are:
Good news: I've got this replicating for our first client. No major issues; just had to add them to the hba file, add a new role for their user, and then a new publication for them. On their AWS server, I had to set it up for logical replication, then set up a subscriber.
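For reference, the hba change is just a host line for the client's address (the user and IP below are placeholders):

```
# pg_hba.conf on the publisher
host    courtlistener    client_user    203.0.113.10/32    md5
```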
Note that logical replication uses "replication slots". According to the docs (https://www.postgresql.org/docs/10/warm-standby.html#STREAMING-REPLICATION-SLOTS), a slot ensures the publisher retains the WAL segments a subscriber needs until they have been received, even while the subscriber is disconnected.
So we'll need to monitor disk usage. If a client server goes away without telling us, that will fill up our drive until we kill the publication. Here's an example of this biting somebody and how they fixed it: https://stackoverflow.com/questions/47266359/out-of-space-due-to-wal-files
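Something like this works for keeping an eye on the slots; an inactive slot whose retained WAL keeps growing is a subscriber that went away:

```sql
-- How much WAL each slot is forcing us to keep around
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```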
One thing to note that can be super frustrating: when you create an RDS database with a default security group, you get a default rule that your IP can connect to that server. But, alas, nothing else can. This comes back to bite you later, I tell ya. The fix is to add specific rules for which servers can connect to the RDS server, which I've now done. The only IPs that can connect are our own and those of our clients.
OK, finally launching this for our first client, with this on our replication server:
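(The exact command here is an assumption; whether it was `FOR ALL TABLES` or an explicit table list isn't captured above.)

```sql
CREATE PUBLICATION opendata FOR ALL TABLES;
```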
And create the role on the RDS publisher:

```sql
CREATE ROLE user WITH LOGIN PASSWORD 'long-password';
GRANT rds_replication TO user;
```

And this on their subscriber:

```sql
CREATE SUBSCRIPTION opendata
    CONNECTION 'host=xxx port=5432 password=xxx user=xxx dbname=courtlistener'
    PUBLICATION opendata;
```
And I also had to grant select to the replication user on the replica publisher:
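(A sketch; the schema scope and role name are assumptions.)

```sql
GRANT SELECT ON ALL TABLES IN SCHEMA public TO client_user;
```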
And if you want to see if data has come across the wire, you can also use this:
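(The exact query is a guess, but a subscriber-side check like this one does the job.)

```sql
-- Rough per-table row counts on the subscriber, via the stats collector
SELECT relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
```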
Note that #932 formalizes a lot of what's here (by design).
The documentation for this feature is now live: https://www.courtlistener.com/api/replication/
One of our clients is tired of the API's slowness and wants to get right down to business using a replica of our database. This is something we hadn't considered before, but it fits in nicely with our mission, since it will give unfettered access to the data in a way that was never previously possible.
I believe it will also help us in a business sense to gain contributors to the API, something that hasn't been as successful as I'd like. It's a pretty common tool in business to go for the Goldilocks approach. This approach says that if you are selling two toasters, a premium and a regular, everybody will buy the regular. Add a super-premium to the shelf and everybody will buy what was previously the premium model, even if its price didn't change at all. In other words, by offering three price points, the middle one will be the one people want.
In a sense right now we have two "products" as described above: Our free API and an API where you contribute to FLP. Most people choose the free one when it comes down to it even though their usage is often pretty big. I think that just by adding a premium database replication feature, a lot more people will start contributing for their API usage. On top of this, obviously, some people will want the replication, which is great too since there's really nothing better than unfettered access to the raw SQL. Some people will continue not contributing and that often makes sense too.
Anyway, I've been researching this. So far I've investigated three options:
Amazon DMS. This option was suggested by the client and on the surface it makes a lot of sense. It's a system within AWS for migrating and then replicating a postgres DB. That's cool and we tried making it work over the past couple days. Some problems we ran into:
- It has a whole pile of limitations, including fun ones like not supporting TRUNCATE or setting default column values.
- Setting up the networking using AWS's virtual private cloud offerings looked painful.
- There is just a lot of documentation for things on AWS. After opening ten tabs I started getting pretty annoyed about the black-boxiness of it all.
- I played with getting it all set up nicely and just found it rather frustrating in all kinds of ways. Web GUIs for server administration suck.
But on the other hand, it had some things going for it, too.
pglogical. I looked at this for a while today too. It:

- Con: Doesn't support schema migrations.
- Con: Has pretty bad documentation. It's literally one long HTML page, has typos, etc. It feels like it's written for employees of 2ndQuadrant, who can then walk over to the developer's desk to get clarifications.
- Con: Requires superuser privileges, something that AWS DMS servers don't give you. (A later correction: it's true that you don't get superuser privileges, but pglogical is indeed available in AWS.)
- Con: Requires some kind of postgres extension to be installed, which just feels like perpetual baggage to carry around forever. Not thrilling.
- Pro: It has column and row filtering, which our next option does not. This could be pretty useful if we want to avoid sharing things with clients.
Native logical replication. As of PostgreSQL 9.4, the `pglogical` code has slowly been added to postgres itself. Not all of the features are there, but the ones that are seem solid and the documentation is much better than `pglogical`'s.

- Con: No column filtering. I thought this would be a big deal, but I went back through our models and did an analysis of columns we wouldn't want to share. We can do table filtering. My conclusion was that right now we don't need column filtering. A few of our columns are things we probably don't want to share, like `view_count`, but the rest are fine. I think this is generally OK.
- Con: No schema migration. Of the solutions so far, none seem to have this. I thought AWS could figure it out, but nope. Anyway, I feel like by avoiding third-party tools we can handle this ourselves. We will have to do this by process instead of code, which is a bit of a bummer, but so be it. I did find one blog post that shows how this might look in practice.
- Pros: Built in. Getting enhancements (`TRUNCATE` replication came in postgres 11, for example). Best documentation.

So I'm leaning towards using postgres's native logical replication feature at this point.
The process for setting up this feature will be something like:
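A sketch of what I have in mind, assuming the native route above (all names below are placeholders):

```sql
-- 1. On the publisher: set wal_level = logical in postgresql.conf (plus
--    enough max_replication_slots / max_wal_senders for the expected
--    clients), and add a pg_hba.conf line for the client's IP.
-- 2. Create a role and a publication per client:
CREATE ROLE client_user WITH LOGIN PASSWORD 'long-password';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO client_user;
CREATE PUBLICATION client_pub FOR ALL TABLES;
-- 3. On the client's server: load a matching schema, then subscribe:
CREATE SUBSCRIPTION client_sub
    CONNECTION 'host=xxx dbname=courtlistener user=client_user password=xxx'
    PUBLICATION client_pub;
```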