
Live read replicas #8

Closed · benbjohnson opened this issue Jan 20, 2021 · 37 comments · Fixed by #321
Labels: enhancement (New feature or request)
Milestone: v0.4.0

Comments

@benbjohnson
Owner

Currently, Litestream replicates a database to cold storage (file or S3). It would be nice to provide a way to replicate streams of WAL frames to other live nodes to serve as read replicas.

This would require a network replication endpoint (probably HTTP) as well as figuring out how to apply WAL writes to a live database.
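
To make the idea concrete, here is a minimal, hypothetical sketch in Go (not Litestream's actual design or API) of a primary exposing an HTTP endpoint that tails the SQLite `-wal` file and streams newly appended bytes to a connected replica. A real implementation would track WAL frame boundaries, checksums, and checkpoints rather than raw bytes.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	http.HandleFunc("/wal", func(w http.ResponseWriter, r *http.Request) {
		f, err := os.Open("app.db-wal") // hypothetical database WAL path
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer f.Close()

		flusher, _ := w.(http.Flusher)
		for {
			// Copy any bytes appended since the last read; a real
			// implementation would respect frame boundaries, verify
			// checksums, and handle checkpoint resets.
			if _, err := io.Copy(w, f); err != nil {
				return
			}
			if flusher != nil {
				flusher.Flush()
			}
			select {
			case <-r.Context().Done():
				return
			case <-time.After(100 * time.Millisecond):
			}
		}
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```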

@benbjohnson benbjohnson added the enhancement New feature or request label Jan 20, 2021
@benbjohnson benbjohnson added this to the v0.4.0 milestone Jan 27, 2021
@jackprophesy

Would it be possible to sign frames with a private key? Think of SQLite as a lightweight blockchain database. Read replicas could subscribe to a database by its public key identifier and then grab the latest frames over the p2p transport of their choice. There would be a ton of opportunities to use this for dapps.

@benbjohnson
Owner Author

You could sign WAL frames with a private key but I'm not sure I understand the use case. What's the benefit over streaming the WAL frames from a known source over HTTPS?

@philips

philips commented Feb 12, 2021

Signing the WAL before uploading could provide nice protection against the S3 bucket being compromised.

This should probably be a new issue though.

@benbjohnson
Owner Author

@philips That's a good point. I was thinking about adding encryption too so it would mitigate an S3 compromise.

@jackprophesy

jackprophesy commented Feb 14, 2021

You could sign WAL frames with a private key but I'm not sure I understand the use case. What's the benefit over streaming the WAL frames from a known source over HTTPS?

In the decentralized app world people are working on how to write apps that are decentralized but still have a good user experience without the severe limitations of blockchain apps. Now imagine if hackernews/reddit was open-source Go with a SQLite database, and every database change to hackernews was signed and broadcast. Any person could quickly stand up a near-realtime clone at myhackernews.com. What's powerful is that now the entire community is forkable. If hackernews does something bad, all it takes is myhackernews.com to alter the code/data and essentially fork the entire site.

Or take it down another level. What if, when you created a subreddit, you were actually creating a new SQLite database with a keypair? Now it doesn't matter if reddit censors your subreddit. If users are represented by keypairs and (with this new Litestream feature) all of their collective action is moderated by a keypair, then it allows the person running the app to build out any features they want (without the constraints of traditional blockchains), but it keeps them honest because, baked into the software, anybody can "fork" the entire site and other users can decide to follow the new keypair if they want.

Mastodon and other federated models fix some problems but you are still beholden to the operator of that particular domain. Git/GitHub changed the game by making everything forkable. If entire website communities were forkable, nobody would want to use anything else because it would keep operators honest and truly return ownership of shared data to the community.

@benbjohnson
Owner Author

I added a new issue for encryption and digital signatures for Litestream: #88

@terem42

terem42 commented Jul 14, 2021

Thanks, a very much awaited feature indeed. I'm a long-time user of SQLite on ZFS, and a proponent of and contributor to the ZFS filesystem project, which offers streaming replication and snapshots, and I would really love to see this implemented at the database level rather than the filesystem level.

@simonw

simonw commented Nov 16, 2021

This would require a network replication endpoint (probably HTTP) as well as figuring out how to apply WAL writes to a live database.

Would it be possible to support read replicas just using the existing S3 bucket mechanism?

I'd love to be able to have Litestream push changes up to an S3 bucket (as it does now) and then have a separate read-only Litestream that polls that bucket for changes and uses them to continuously update a copy of that SQLite database file elsewhere.

I guess the concern with this would be cost - but would polling S3 for changes once a minute really be a significant cost over time? I've not done the napkin maths.
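
For illustration, here is a rough sketch of that polling loop in Go using the AWS SDK v2; the bucket name and key prefix are hypothetical, and it only detects that new objects have appeared rather than actually applying WAL segments the way Litestream would.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	var lastKey string
	for range time.Tick(time.Minute) { // one LIST request per minute
		input := &s3.ListObjectsV2Input{
			Bucket: aws.String("my-backup-bucket"), // hypothetical bucket
			Prefix: aws.String("db/wal/"),          // hypothetical key layout
		}
		if lastKey != "" {
			input.StartAfter = aws.String(lastKey) // only list keys after the last one seen
		}
		out, err := client.ListObjectsV2(context.Background(), input)
		if err != nil {
			log.Println("list failed:", err)
			continue
		}
		for _, obj := range out.Contents {
			// A real replica would download and apply this segment here.
			log.Println("new segment:", aws.ToString(obj.Key))
			lastKey = aws.ToString(obj.Key)
		}
	}
}
```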

@benbjohnson
Owner Author

@simonw Yeah, that's definitely something I want to add with read replicas. There are about 43,200 minutes per month and each LIST request costs $0.000005, so polling once a minute comes out to roughly $0.22/month/replica.

There's a separate issue though in that you'd need to check for new generations as well in case replication was interrupted for some reason. That's much less likely so maybe that could check every hour instead to keep costs lower.

@simonw

simonw commented Feb 17, 2022

Another fun opportunity for this is to take advantage of it on Fly.io volumes.

I did a bunch of work recently to get Datasette working with those, but it currently only makes sense if you have a single instance - so you don't get to take advantage of Fly's ability to load balance your app between multiple regions: https://simonwillison.net/2022/Feb/15/fly-volumes/

What I'd really love to do is build a SQLite emulation of the way Fly handles PostgreSQL replication as described in https://fly.io/blog/globally-distributed-postgres/ - so one of my Fly instances would have the SQLite database that accepts writes, but every other instance would maintain a replicated read-only copy via Litestream.

Any time someone tried to write to the DB, I would ensure that their HTTP request went to the location with the write database.
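
As a rough sketch of that routing idea (assuming Fly.io's documented `fly-replay` response header and `FLY_REGION` environment variable; `PRIMARY_REGION` is a convention you would set yourself in fly.toml), a Go middleware could forward writes to the region holding the writable database and serve reads locally:

```go
package main

import (
	"net/http"
	"os"
)

// routeWrites serves read requests from the local replica and asks Fly's
// proxy to replay write requests in the primary region.
func routeWrites(next http.Handler) http.Handler {
	primary := os.Getenv("PRIMARY_REGION") // e.g. "lhr"; set by you in fly.toml
	current := os.Getenv("FLY_REGION")     // set by Fly.io on every instance

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		isRead := r.Method == http.MethodGet || r.Method == http.MethodHead
		if !isRead && current != primary {
			// Tell Fly's proxy to replay this request in the primary
			// region, where the writable SQLite database lives.
			w.Header().Set("fly-replay", "region="+primary)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	http.ListenAndServe(":8080", routeWrites(mux))
}
```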

@anacrolix
Contributor

I think this may have been closed prematurely.

@benbjohnson
Owner Author

This has been merged into main but not released yet. I suppose I could close issues after release instead.

@anacrolix
Contributor

I just mean there's no documentation or anything like that, so it's not usable without reading the code.

@benbjohnson
Owner Author

Yeah, that’s fair. I’ll re-open it until that’s all fixed up. 👍

@benbjohnson benbjohnson reopened this Mar 22, 2022
@gedw99

gedw99 commented Apr 8, 2022

[quoting the earlier exchange between @benbjohnson and @jackprophesy above about signing WAL frames and forkable community databases]

This has some merit. There are some DB systems where the database is versioned and forkable. Under the hood they use SQL.

Here is one in Go: https://github.com/sqlitebrowser/dbhub.io

@benbjohnson
Owner Author

I cut a v0.4.0-beta.2 release, added a guide on the tip docs site, and there's also a litestream-read-replica-example repository if you want to try out the new live replication feature.

@anacrolix
Contributor

I did not expect the initial implementation to add direct HTTP syncing from another Litestream instance; I expected syncing from a backup location like an S3 replica instead. Is the direct HTTP approach intended, or is syncing from S3 not possible (and hence the direct HTTP sync instead)?

@benbjohnson
Owner Author

@anacrolix HTTP syncing was the goal of live read replication as it provides low latency read replicas. Syncing from S3 is also a possibility but it's more expensive. I wrote up an issue for it explaining some of the costs: #357

@gedw99

gedw99 commented Apr 15, 2022

The fly.io setup is fantastic - thanks @benbjohnson

I have a small worry / question regarding reconciliation.

If the node on fly.io (or any cloud) is partitioned, is there a risk that the slaves will not catch up to the master once the network partition heals? I assume it's all fine, because the WAL is stored with the master db, right?

The other question is to do with election of a new master. If the master node is network partitioned, can a slave be elected to master auto-magically? I have seen Go gossip setups that can do this with no other side system needed, but I'm wondering if it's possible or planned / wanted for Litestream.

For example, if I need to kill a master node (for whatever reason), I want to tell all clients (like the app tier, or perhaps GUI clients if I use client-side load balancing) to drain all existing connections / sessions and use the new master for all new connections. This ensures no interruptions for users. Maybe there is a smarter pattern than this, but for gossip-based failover (both automatic and forced) it seems like a logical way to do it.

I see on line 160 of https://github.com/benbjohnson/litestream-read-replica-example/blob/main/main.go that fly.io is responsible for tracking where the primary region is. I guess that we are currently dependent on fly.io's database load balancing then also? It would be interesting to be cloud independent, which would assume we run our own gossip (to know who is master) and our own built-in load balancing system such that a read goes to the nearest slave and a write goes to the master.

This is funnily enough very similar to how NATS failover works. In NATS, the clients are kept up to date with all node endpoints, who is nearest (for them) and who is master. That gives the clients enough knowledge to do their thing. If a server node dies halfway through a client interaction, it times out and the client has the other endpoints to ask for the new topology (nearest read-only / new master). It's sort of like how all p2p networks need a list of initial nodes and a DHT, but much simpler.

I also understand that fly.io's load balancer makes a client stick to the master node for about 1 to 5 seconds to allow the data to replicate to all slaves, thus ensuring consistency; otherwise a client that wrote a mutation to the master and then read from a slave immediately after would NOT see the mutation on the slave. I like how simple and effective a solution it is to a tricky race condition.

thanks again Ben for sticking at this - it’s really a liberating system, as well as easy to use.

@gedw99

gedw99 commented Apr 15, 2022

@anacrolix HTTP syncing was the goal of live read replication as it provides low latency read replicas. Syncing from S3 is also a possibility but it's more expensive. I wrote up an issue for it explaining some of the costs: #357

I am wondering if both options are needed. You could have a nice 3-node setup between London, Sydney, and San Francisco with London being primary, but you also want a backup log going to S3 because sometimes stuff goes wrong.

With the S3 option you don't have to turn it on, but it might be needed by some users because:

- They want an audit log separate from the main db / app servers - something that is always there.
- They want to be able to do a restore in case they mess up a SQL migration, for example where users have created data mutations that you can't recover from.
- They want to maintain an off-cloud log of data to ensure they can bring up a new environment anywhere.
- They want their devs to easily bring up a local copy of the db, independent of production, in order to easily test code changes against it. This is also really useful for CI testing / regression testing.

Of course all of the above (to a degree) can be achieved by tapping into a slave and pulling the data from it. But not everything, like an audit trail.

I am curious to know how others feel about this.

My requirements might be overly paranoid, but the S3 functionality is already there and so it could be toggled on by developers if they want it when running on fly.io.

@mrEarls210

mrEarls210 commented Apr 15, 2022 via email

@benbjohnson
Owner Author

If the node on fly.io (or any cloud) is partitioned, is there a risk that the slaves will not catch up to the master once the network partition heals? I assume it's all fine, because the WAL is stored with the master db, right?

@gedw99 There's no master election in Litestream, so replicas are always behind the primary. That makes it easy to catch up in the event of a network partition. If your primary fails catastrophically and you restore from backup, the new primary will have a different "generation" and that causes replicas to re-snapshot.

The other question is to do with election of a new master. If the master node is network partitioned, can a slave be elected to master auto-magically? I have seen Go gossip setups that can do this with no other side system needed, but I'm wondering if it's possible or planned / wanted for Litestream.

There's no promotion planned for Litestream itself; however, I'm working on a different (but related) system for doing high availability with SQLite. That'll have proper failover. Litestream's goal is simplicity, and adding leader election makes it significantly more complicated. You can probably put together a system for failover using Consul, but you'll need to ensure you only have one primary at a time.
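
A rough sketch of that Consul idea (this is not a Litestream feature; the lock key is hypothetical): only the node holding a Consul lock acts as the write primary, and it must stop writing the moment the lock is lost.

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "service/myapp/primary" is a hypothetical lock key.
	lock, err := client.LockKey("service/myapp/primary")
	if err != nil {
		log.Fatal(err)
	}

	// Blocks until this node acquires the lock. The returned channel is
	// closed if the lock is ever lost, at which point writes must stop.
	lostCh, err := lock.Lock(nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("acquired primary lock; safe to accept writes")

	<-lostCh
	log.Println("lost primary lock; demote to read-only immediately")
}
```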

HTTP syncing was the goal of live read replication as it provides low latency read replicas. Syncing from S3 is also a possibility but it's more expensive. I wrote up an issue for it explaining some of the costs: #357

I am wondering if both options are needed. You could have a nice 3-node setup between London, Sydney, and San Francisco with London being primary, but you also want a backup log going to S3 because sometimes stuff goes wrong.

You can currently enable both S3 replication and live read replication. I agree that it's useful to have a backup to a higher durability system like S3 and that's how I typically run it. I didn't include that in the litestream-read-replica-example because it makes it more complicated.

I think what @anacrolix is looking for is to be able to sync the primary & replica through S3 instead of over HTTP. Setting up a direct connection between two servers isn't always feasible.

@benbjohnson
Owner Author

I’m completely lost I’m new to this

@mrEarls210 I know the feeling! It's a lot to take in & database replication is a difficult topic. There's a Getting Started page on the docs site that runs you through some of the basics. This issue is about a new feature around live replication to read-only replicas so it's quite a bit more advanced. Feel free to open another issue or discussion if you have questions.

@warmchang

I'm working on a different (but related) system for doing high availability with SQLite.

Something like the rqlite or dqlite?

@benbjohnson
Owner Author

benbjohnson commented Apr 16, 2022

I'm working on a different (but related) system for doing high availability with SQLite.

Something like the rqlite or dqlite?

@warmchang Not quite. Those both use Raft to replicate. The system I'm working on uses a two-node synchronous replication setup similar to what you'd find in Postgres or MySQL. I'll have more details in the near future once I get a little further along with it.

@ryanpbrewster

In case anyone else makes the same mistake I did: live read replicas are not supported in litestream 0.3.8 (the latest non-beta release as of 2022-06-26), but they are supported at 0.4.0-beta2 and at main.

@benbjohnson
Owner Author

benbjohnson commented Jul 4, 2022

Hey folks, I wanted to give an update on live replication. The original implementation had some flaws so I reworked it; however, the new implementation still has some limitations and foot-guns that I'm not comfortable with. For example, you can only have one fixed primary. Also, you can corrupt your replicas by writing to them. Unfortunately, this is a consequence of Litestream running as a separate process, so there's only so much it can do to manage those limitations.

As some of you may know, I've been working on a next generation tool for running SQLite on a distributed platform and, after some discussion, I've decided to keep Litestream as a pure disaster recovery tool and not include live replication. It's going to be difficult to maintain it long term—especially when the new tooling will work better for this use case.

The bones of the new tooling are in place and we're moving it to internal testing this week. We're focusing on releasing a small but stable feature set at first and we'll grow it over time. I'll announce more here soon once we finish initial testing.

Thank you to everyone who has given feedback and tested out live replication in Litestream. I'm excited about the future of SQLite tooling and I hope y'all are too!

Update: We’ve open sourced the new replication tool which is called LiteFS. You can find it here: https://github.com/superfly/litefs

@c00kiemon5ter

Thanks for the work Ben. Looking forward to the next-gen tooling. Keep us updated!

@CanRau

CanRau commented Jul 4, 2022

I'm working on a different (but related) system for doing high availability with SQLite.

Something like the rqlite or dqlite?

@warmchang Not quite. Those both use Raft to replicate. The system I'm working on uses a two-node synchronous replication setup similar to what you'd find in Postgres or MySQL. I'll have more details in the near future once I get a little further along with it.

Would that mean write replicas or still read-only replicas?

@gedw99

gedw99 commented Jul 4, 2022

Hey @benbjohnson

Will the new version only work on fly.io? I ask because I figured the live replication will utilise the fly.io infrastructure just like it does for PostgreSQL. I assume then that the backup and restore will be independent of fly.io?

Sorry about the direct questions, but things are a bit of a blur and I'm just trying to plan roadmaps myself.

Congrats on the fly.io deal. Freaking awesome system they have. KISS all the way.

@benbjohnson
Owner Author

@CanRau There will still be a single primary node that accepts writes at any one time, but it is able to hand off that responsibility to another node without downtime during deploys.

@benbjohnson
Owner Author

@gedw99 Good question. I can’t believe I forgot to mention that part in my original comment. The new tooling is not fly.io specific. It is Apache 2 licensed. It will be able to run on any “stateful” server platforms like fly.io, Kubernetes, DigitalOcean, etc. We are also targeting pure serverless platforms like Vercel and Deno in the future too.

@gerhard

gerhard commented Jul 4, 2022

I've decided to keep Litestream as a pure disaster recovery tool and not include live replication. It's going to be difficult to maintain it long term—especially when the new tooling will work better for this use case.

Good call! 👍 to KISS & the UNIX philosophy. This theme continues in shipit.show/61 & shipit.show/62 (not out yet).

New link added to show notes.

@jamra

jamra commented Mar 15, 2023

What is the reasoning behind not supporting leader election? CRDB is doing it really well. Is it due to latency or just overall complexity?

@benbjohnson
Owner Author

@jamra Litestream runs as a separate process so it has a lot less control over the database file. It makes certain things tricky like preventing writes on replicas. Live replication was moved to LiteFS which can tightly control what can get written and when.

@fceruti

fceruti commented Aug 3, 2023

Hey there @benbjohnson, sorry to bother you, but I wanted to ask you something.

I understand you've trimmed the scope of this project into a backup system and moved on to work on LiteFS. My question is: do you believe this approach of having an external process monitoring and sending changes to replicas is inherently flawed and can never work, or did you switch strategies just looking for lower-hanging fruit?

I ask because all other replication systems seem to work by wrapping SQLite with some process (except of course for LiteFS). This repo was quite unique in that regard: it was a parallel process. It could have given us a way of creating large edge systems with custom topologies at a very low cognitive (and $) cost.

If anyone was inclined to continue your work, what words of advice do you have?

@benbjohnson
Owner Author

@fceruti I don't think that having an external process that manages replication is impossible but it makes things a lot harder. One of the benefits to the LiteFS approach is that it can detect every write so it's able to maintain a rolling checksum of the database to ensure consistency. When things go wrong in replication, the checksum is essential for determining when and where things go awry.
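
To illustrate the rolling checksum idea (a sketch of the general technique, not necessarily LiteFS's exact format): keep one checksum per database page and XOR them together, so a rewritten page can be folded into the total without rereading the whole file.

```go
package main

import (
	"fmt"
	"hash/crc64"
)

var table = crc64.MakeTable(crc64.ISO)

// dbChecksum is the XOR of the CRC64 of every page in the database.
// A real implementation would also mix the page number into each page's
// checksum so identical pages at different positions don't cancel out.
type dbChecksum struct {
	sum   uint64
	pages map[uint32]uint64 // page number -> current page checksum
}

func newDBChecksum() *dbChecksum {
	return &dbChecksum{pages: make(map[uint32]uint64)}
}

// updatePage folds a rewritten page into the rolling checksum in O(1),
// without rereading the rest of the database file.
func (c *dbChecksum) updatePage(pgno uint32, data []byte) {
	c.sum ^= c.pages[pgno] // remove the page's old contribution
	pageSum := crc64.Checksum(data, table)
	c.pages[pgno] = pageSum
	c.sum ^= pageSum // add the new contribution
}

func main() {
	c := newDBChecksum()
	c.updatePage(1, []byte("page one v1"))
	c.updatePage(2, []byte("page two v1"))
	c.updatePage(1, []byte("page one v2")) // page rewritten by a transaction
	fmt.Printf("rolling checksum: %016x\n", c.sum)
}
```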

There are a lot of edge cases you need to worry about when the replication is an external process, too. That external process may stop for some reason and then you'll need a way to detect changes since the last time it was up. You can do that quickly for small databases but it gets a lot slower for large ones.

We do have plans to add a VFS option for LiteFS. It's not quite an external process but it's useful for people that want to avoid FUSE.
