kvdb: add postgres #5366

Merged: 3 commits into lightningnetwork:master on Sep 21, 2021

Conversation

@joostjager (Contributor) commented Jun 8, 2021

The Bottlepay team has been working on Postgres storage backend support in lnd.

Advantages of using postgres over the default bbolt storage backend include:

  • Postgres is widely used and has probably survived more battles than bbolt
  • Inspect data while lnd is running
  • On-the-fly database compaction (VACUUM)
  • Database replication
  • Usage of industry-standard tools to manage the stored data

However, in our opinion the greatest advantage of switching to Postgres is that it allows data structures to be migrated to structured SQL tables gradually. The initial version will consist of just a single key-value table that holds all of the data that is currently in bbolt. Once on the Postgres platform, migrations can be carried out that move specific data to dedicated tables when the need arises.

Structured SQL tables reduce the need for custom serialize/deserialize code in lnd itself, freeing up precious developer resources for less mundane logic. They also allow adding indexes and constraints to protect data integrity and improve performance.

One particular bottleneck in the current bbolt implementation is the global write lock. There can only be a single writer active at any time. Postgres offers elaborate locking models that hold great promise to reduce lock contention in high-performance scenarios.

Running

To run with Postgres, add the following command line flags. The specified database needs to be present, but tables will be created if they don't exist already.

lnd --db.backend=postgres --db.postgres.dsn=postgres://lnd:lnd@localhost:45432/lnd?sslmode=disable
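
The same settings can also be placed in lnd.conf instead of being passed on the command line; a minimal sketch, assuming the flags map one-to-one to config entries as in sample-lnd.conf:

```
[db]
db.backend=postgres

[postgres]
db.postgres.dsn=postgres://lnd:lnd@localhost:45432/lnd?sslmode=disable
```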

Implementation

Thanks to the already abstracted kvdb interface, the change set can remain small and no application-level changes are needed. The key-value table has the following format:

id | key | value | parent_id | sequence

Rows representing a bucket have NULL in the value column. Items reference their parent bucket via parent_id. The sequence column tracks the current sequence number for bucket rows.
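
For illustration, a table with this shape could be created roughly as follows; the DDL and helper below are assumptions, not necessarily the exact schema or code in this PR:

```go
package kvsketch // hypothetical package name, for illustration only

import (
	"context"
	"database/sql"
)

// createKVTable creates a key-value table shaped like the layout described
// above. This is an assumed DDL for illustration, not necessarily the exact
// schema created by the PR.
func createKVTable(ctx context.Context, db *sql.DB) error {
	_, err := db.ExecContext(ctx, `
CREATE TABLE IF NOT EXISTS kv (
    id        BIGSERIAL PRIMARY KEY,
    key       BYTEA NOT NULL,
    value     BYTEA,                     -- NULL marks a bucket row
    parent_id BIGINT REFERENCES kv (id), -- NULL for top-level buckets
    sequence  BIGINT,                    -- sequence number for bucket rows
    UNIQUE (parent_id, key)
);`)
	return err
}
```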

Performance

We ran this branch on the Lightning benchmark rig that we released earlier. The benchmark application opens 10 channels between a pair of nodes and continuously executes keysend payments in 100 parallel threads.

Previous runs resulted in 35 tps for bbolt. With postgres, we measured 15 tps. This is a reduction that may or may not be relevant depending on how busy your node is. There is potential for considerable optimization still. Two main areas to look into are:

  • Global write lock: either move to structured SQL tables with granular locking, or switch over to Postgres' serializable locking model for the kv table (#5366 (comment)); a sketch of the latter follows below the list.
  • Round trips: network communication is slow and lnd sometimes needs many queries to reach certain data. By batching those queries, the number of round trips can be reduced greatly (#5366 (comment)).
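
To illustrate the first point, the global lock could in principle be replaced by running each kv transaction at the SERIALIZABLE isolation level and retrying on conflicts. A minimal sketch using Go's database/sql; the helper name is an assumption and error handling is simplified (a real version would only retry on SQLSTATE 40001 serialization failures):

```go
package kvsketch // hypothetical package name, for illustration only

import (
	"context"
	"database/sql"
)

// withSerializableTx runs f inside a SERIALIZABLE transaction and retries a
// few times if the commit fails. A real implementation would inspect the
// error and only retry on serialization failures (SQLSTATE 40001).
func withSerializableTx(ctx context.Context, db *sql.DB,
	f func(tx *sql.Tx) error) error {

	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		tx, err := db.BeginTx(ctx, &sql.TxOptions{
			Isolation: sql.LevelSerializable,
		})
		if err != nil {
			return err
		}

		if err := f(tx); err != nil {
			_ = tx.Rollback()
			return err
		}

		lastErr = tx.Commit()
		if lastErr == nil {
			return nil
		}
	}
	return lastErr
}
```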

Update 2021-09-20: updated PR desc to reflect latest state of this PR and fresh performance numbers.

@guggero (Collaborator) commented Jun 8, 2021

Awesome work, this is really cool! Wouldn't have imagined this could be possible with such a small diff.
This definitely gets a concept ACK from me 💯

Two thoughts after a quick scan of the code:

  • My main concern here is the one single large table. Did you think about dynamically creating named tables for each CreateTopLevelBucket() call? That way we'd probably get much better write performance for smaller tables (such as the channel state) if that doesn't contain all payment/forwarding history as well.
  • Integration testing should be as easy as spinning up a docker image running PostgreSQL from the scripts/itest_part.sh file (so each parallel tranche can have their own instance).

@joostjager (Contributor, Author) commented:

My main concern here is the one single large table. Did you think about dynamically creating named tables for each CreateTopLevelBucket() call? That way we'd probably get much better write performance for smaller tables (such as the channel state) if that doesn't contain all payment/forwarding history as well.

Yes, I described that in the PR description as the 'hybrid approach'. Will be interesting to see what the actual gain is with that.

One hard choice will be to eventually drop bbolt and etcd support. Otherwise it still isn't possible to reap all the benefits. Interested to hear what the Lightning Labs stance is on this.

@guggero (Collaborator) commented Jun 8, 2021

I described that in the PR description

Oh, sorry, I must've skipped that. I think that hybrid approach is probably worth looking into, especially performance-wise. But having data spread out over multiple tables might also have other benefits.

@Crypt-iQ Crypt-iQ added the database Related to the database/storage of LND label Jun 8, 2021
@yaslama (Contributor) commented Jun 10, 2021

  • My main concern here is the one single large table. Did you think about dynamically creating named tables for each CreateTopLevelBucket() call? That way we'd probably get much better write performance for smaller tables (such as the channel state) if that doesn't contain all payment/forwarding history as well.

Indexes on expressions could perhaps be used to reach the same performance as one table per top-level bucket (a sketch follows below).
I also think it's better to clearly isolate DDL queries to initialization rather than issue them from the app itself.
Another advantage of using only one table to replace bbolt is that it makes clear this is "legacy".
When I see the schema in https://github.com/ElementsProject/lightning/blob/76b8eb3afda4905dbeb059f88b2bbe994cbaa50a/wallet/db_postgres_sqlgen.c and compare it with the current bbolt structure, I really hope that one day lnd data will be as clean.
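
For illustration, indexes of that kind on a single kv table could look roughly like this; the table, column names and bucket id are assumptions based on the layout sketched in the PR description:

```go
package kvsketch // hypothetical package name, for illustration only

// exampleIndexes sketches how expression and partial indexes might stand in
// for per-bucket tables on a single kv table. Purely illustrative.
const exampleIndexes = `
-- Composite index for lookups within a bucket.
CREATE INDEX kv_parent_key_idx ON kv (parent_id, key);

-- Partial index covering a single (hypothetical) top-level bucket.
CREATE INDEX kv_graph_edges_idx ON kv (key) WHERE parent_id = 42;

-- Expression index on a key prefix.
CREATE INDEX kv_key_prefix_idx ON kv (substring(key from 1 for 8));
`
```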

@yaslama (Contributor) commented Jun 10, 2021

Performance

We ran this branch on the Lightning benchmark rig that we released earlier. The benchmark application opens 10 channels between a pair of nodes and continuously executes keysend payments in 100 parallel threads.

Previous runs resulted in 35 tps for bbolt and 4 tps for etcd. With postgres, we measured 25 tps. To us this looks like an excellent result given the added overhead for network communication. To take your own measurement, run https://github.com/bottlepay/lightning-benchmark/tree/postgres with ./run.sh lnd-postgres.

This is an excellent result! Using sqlite (for small clients) could perhaps reach the same performance level as bbolt.

@joostjager (Contributor, Author) commented Jun 10, 2021

@guggero I pushed commits that split the top level buckets into distinct tables. Ran the benchmark and observed that there is no additional performance increase with split tables.


I think I do favor this approach though, because when the data is already in separate tables it may be easier/faster to upgrade to a structured table. Not sure if it is a real advantage in practice.

This definitely gets a concept ACK from me

That is good to hear. We are willing to push forward with this PR, but is there sufficient review capacity available to get this landed in 0.14?

@joostjager (Contributor, Author) commented Jun 14, 2021

Downside of split tables is that there are some nasty top-level buckets like chainHash || chanPoint (briefcase.go) that exceed Postgres' maximum table name length of 63 characters.

Some options:

  • Truncate table name and hope for no collisions
  • Hash the table name and encode it in a format that can store 32 bytes in no more than 63 bytes (see the sketch after this list)
  • Hash table name and truncate
  • Use a special single table that stores all top-level buckets with too-long keys; the other buckets each get their own table
  • Perform a migration first to move long top-level buckets into a sub-bucket
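
For the hash-and-encode option, a sketch of mapping bucket names to identifiers that fit within the 63-character limit; all names and the exact scheme are assumptions, purely illustrative:

```go
package kvsketch // hypothetical package name, for illustration only

import (
	"crypto/sha256"
	"encoding/base32"
	"strings"
)

// maxIdentLen is Postgres' identifier length limit (NAMEDATALEN - 1).
const maxIdentLen = 63

// tableName maps a top-level bucket name to a valid table name. Simple names
// that fit are used as-is; long or binary names are replaced by a
// base32-encoded sha256 hash, which fits well within 63 characters.
func tableName(bucket []byte) string {
	name := "kv_" + string(bucket)
	if len(name) <= maxIdentLen && isIdent(bucket) {
		return name
	}

	hash := sha256.Sum256(bucket)
	enc := base32.StdEncoding.WithPadding(base32.NoPadding)

	// 3 + 52 characters, always below the limit and collision resistant.
	return "kv_" + strings.ToLower(enc.EncodeToString(hash[:]))
}

// isIdent reports whether the bucket name only uses characters that are safe
// in an unquoted identifier.
func isIdent(b []byte) bool {
	for _, c := range b {
		if (c < 'a' || c > 'z') && (c < '0' || c > '9') && c != '_' {
			return false
		}
	}
	return true
}
```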

@joostjager force-pushed the postgres branch 4 times, most recently from 9c41c38 to 7f550a0 (June 15, 2021 09:08)
@bhandras self-requested a review (June 15, 2021 10:57)
@joostjager force-pushed the postgres branch 2 times, most recently from 706613b to 69cd6a5 (June 16, 2021 11:32)
@joostjager (Contributor, Author) commented:

Integration tests pass with postgres

@joostjager (Contributor, Author) commented:

Also tried out this branch with sqlite: https://github.com/bottlepay/lnd/tree/sqlite in the single-table model.

Ran the usual benchmark on it and got a stable 18 tps.

Bbolt is undeniably faster at 35 tps, but I doubt that it matters much. In a hypothetical future where lnd users can choose between sqlite and postgres as their backend, power users will likely choose postgres anyway. How many tps does a phone need to handle?

@joostjager (Contributor, Author) commented:

Tested pure-Go sqlite (https://pkg.go.dev/modernc.org/sqlite) - it works as well.

Performance is slightly below the C version of sqlite, at 14 tps.

@guggero self-requested a review (June 25, 2021 18:44)
@joostjager (Contributor, Author) commented Jun 28, 2021

Discussed a possible game plan with @guggero offline:

Steps:

  • Test performance of high-volume payments and graph operations with all data moved over to postgres (not just channel db). See if it holds up then as well. Code: https://github.com/guggero/lnd/tree/lndinit-migrate-db. (@joostjager)

  • Create fully-remote PR based on the branch above (@guggero)

  • Refactor postgres PR for the following database schema: (@joostjager)

    • Split human-readable top level keys into separate tables
    • Aggregate all non-human-readable top level keys (such as chainhash|chanpt) in a single sql table
    • Prefix each table name with its namespace (channeldb, graphdb, macaroon, etc)

    Also:

    • Use the default golang sql infra (database/sql)
    • Abstract backend-specifics behind an interface, use sqlite as the test case to test the abstraction.
  • Create a migration tool to move from bbolt over to postgres. (@guggero + @joostjager)

Timeline: All merged on Jul 23 at the latest. (reminder to never put dates in PRs!)

@joostjager (Contributor, Author) commented:

Ran the bottlepay benchmark with all data moved over to postgres (https://github.com/guggero/lnd/tree/lndinit-migrate-db) and got 22 tps. No surprise that moving walletdb to postgres has some impact on performance, because it is used heavily to retrieve private keys.

@joostjager (Contributor, Author) commented:

Did some rough benchmarking on graph operations for bbolt and postgres.

Setup: lnd 0.13.0-beta connecting to yalls.org on testnet.

Initial graph download: Received 2413 channels from peer. This takes about 2 minutes on my machine. No noticeable difference between bbolt and postgres. I suppose that makes sense because network is the bottleneck here.

Pathfinding: Modified lnd to skip the local liquidity check. If a QueryRoutes command is then issued to a random node on the network (I used yalls.org), it will explore a large part of the graph before it concludes that there is no route. In the case of the test, lnd reports exploring 63 nodes and 774 edges. Here the difference between bbolt and postgres is huge: 25 ms for bbolt vs 800 ms for postgres.

Looks unacceptable.

@joostjager (Contributor, Author) commented Jun 29, 2021

For pathfinding, there doesn't seem to be a way around caching data to improve performance. Already with bbolt, pathfinding is pretty slow on low-end devices, so a cache would be an improvement for all backends.

Going to explore the option to provide a write-through cache on the kvdb level. A sort of adapter that wraps a backend with the same interface and caches (some) data in memory.

A write-through cache on the kvdb level is probably going to be difficult because it needs to support rolling back (large) transactions.

Perhaps there is a simple way to just optimize this part (similar to etcd key prefetching for example).
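
For illustration, such an adapter could look roughly like this; the Getter interface below is a hypothetical stand-in for the read side of a kvdb backend, not the actual kvdb API:

```go
package kvsketch // hypothetical package name, for illustration only

import "sync"

// Getter is a hypothetical stand-in for the read side of a kvdb backend.
type Getter interface {
	Get(key []byte) ([]byte, error)
}

// cachedGetter wraps a backend and serves repeated reads from memory. A real
// write-through cache would also need to update or invalidate entries on
// writes and respect transaction rollbacks, which is the hard part mentioned
// above.
type cachedGetter struct {
	mu      sync.RWMutex
	backend Getter
	cache   map[string][]byte
}

func newCachedGetter(backend Getter) *cachedGetter {
	return &cachedGetter{
		backend: backend,
		cache:   make(map[string][]byte),
	}
}

func (c *cachedGetter) Get(key []byte) ([]byte, error) {
	c.mu.RLock()
	v, ok := c.cache[string(key)]
	c.mu.RUnlock()
	if ok {
		return v, nil
	}

	v, err := c.backend.Get(key)
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.cache[string(key)] = v
	c.mu.Unlock()
	return v, nil
}
```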


@joostjager (Contributor, Author) commented Jun 30, 2021

Also ran a pathfinding test with postgres on mainnet with a larger graph (~50k channels). The 'path not found' test described above gave me a run time of 40s on my machine:

lnd-alice_1 | 2021-06-30 09:33:21.368 [DBG] CRTR: Pathfinding perf metrics: nodes=4831, edges=57105, time=40.880812351s

The same test with a bbolt instance:

lnd-alice_1 | 2021-06-30 09:31:51.674 [DBG] CRTR: Pathfinding perf metrics: nodes=4836, edges=57075, time=1.196043019s

Roughly the same 30x factor as above.

@joostjager (Contributor, Author) commented Jun 30, 2021

One simple way to fix this is to just retrieve the full graph from the database at the start of every pathfinding operation. An extreme version of prefetching. Pathfinding only uses ForEachNodeChannel. This method would need a second implementation that operates on the prefetched data.

At first this may sound bad, but one also must keep in mind that the pathfinding access pattern is very inefficient. It needs to retrieve keys from all over the database. Loading it all in memory in an efficient way at the start cuts out the repeated network and database overhead for all those lookups.

This prefetching would be optional and recommended for remote databases, so low-end devices (CPU, memory) running bbolt would be unaffected.

I did a benchmark on fetching the full graph (complete contents of the nodes, edges and edge-index buckets) on mainnet. This took 200 ms.

Doesn't seem unreasonable, especially in relation to the typical overall payment latency on lightning. For a 'route not found' scenario it is even faster than bbolt (1.2s, see above).

A proper write-through cache for the graph is definitely the preferred solution. But if complications arise (in-memory graph has been an elusive goal so far) or there isn't enough engineering capacity available, this could be a simpler alternative to bridge the gap.
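
For illustration, on the kv layout sketched earlier, the contents of one top-level bucket can be pulled into memory in a single round trip roughly like this; table and function names are assumptions, and nested buckets would need a recursive query:

```go
package kvsketch // hypothetical package name, for illustration only

import (
	"context"
	"database/sql"
)

// fetchBucket loads every key/value pair directly under the given bucket row
// in a single round trip. Nested buckets would need a recursive query (for
// example WITH RECURSIVE); this sketch only covers one level.
func fetchBucket(ctx context.Context, db *sql.DB,
	bucketID int64) (map[string][]byte, error) {

	rows, err := db.QueryContext(ctx,
		`SELECT key, value FROM kv WHERE parent_id = $1`, bucketID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	out := make(map[string][]byte)
	for rows.Next() {
		var key, value []byte
		if err := rows.Scan(&key, &value); err != nil {
			return nil, err
		}
		out[string(key)] = value
	}
	return out, rows.Err()
}
```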

@Roasbeef Roasbeef added this to the v0.14.0 milestone Jul 1, 2021
@guggero (Collaborator) left a review:

Did a high-level pass and left a few comments. Still a lot of work that needs to be done to get this into a mergeable state, but looks very promising!

Review threads (outdated, resolved): kvdb/go.sum, kvdb/postgres/db.go (2), kvdb/postgres/readwrite_bucket.go (2), kvdb/postgres/readwrite_tx.go (3), lntest/itest/lnd_test_list_on_test.go, scripts/itest_part.sh
@guggero (Collaborator) commented Sep 16, 2021

As discussed offline, I think using sub queries for implementing the prefetch in Postgres is a great idea. And seeing that a 4 level deep sub query doesn't even double the execution time but replaces the need for 3 round trips seems excellent.
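
For illustration, a nested sub query of that kind could look roughly like this on the kv layout sketched in the PR description; the names and exact shape are assumptions:

```go
package kvsketch // hypothetical package name, for illustration only

// nestedLookup sketches a single-round-trip lookup of a value at the path
// bucketA/bucketB/bucketC/key, resolving each bucket id with a sub query.
// Purely illustrative; the PR may structure this differently.
const nestedLookup = `
SELECT value FROM kv
WHERE key = $4 AND parent_id = (
    SELECT id FROM kv WHERE key = $3 AND parent_id = (
        SELECT id FROM kv WHERE key = $2 AND parent_id = (
            SELECT id FROM kv WHERE key = $1 AND parent_id IS NULL
        )
    )
);`
```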

Is the idea to add the prefetch right after merging this PR and #5640 and possibly getting it into 0.14.0 or more something for 0.14.1 as a performance boost after seeing things in action for a while?

@joostjager (Contributor, Author) commented:

Is the idea to add the prefetch right after merging this PR and #5640 and possibly getting it into 0.14.0 or more something for 0.14.1 as a performance boost after seeing things in action for a while?

I'd love to, but it doesn't seem to be possible with the kvdb extension that is proposed in #5640. More details in this thread #5640 (comment)

@joostjager force-pushed the postgres branch 3 times, most recently from 394b065 to db5fdd8 (September 17, 2021 13:59)
@bhandras (Collaborator) commented Sep 17, 2021

I am still undecided about adding a column for a hashed bucket id. Without it, it won't be easy to get a nested bucket value in a single roundtrip. This means that optimal performance cannot be achieved.

The direction in which #5640 (comment) goes is important too. If that results in an etcd-only interface, it will be more costly to implement a similar optimization for Postgres.

I think there's a small misunderstanding regarding #5640. Prefetch is optional and it's perfectly fine if the Postgres driver doesn't implement it. We can still add bucket.GetAll() as an additional API unlocking more performance gains for Postgres.

@joostjager (Contributor, Author) commented Sep 17, 2021

I think there's a small misunderstanding regarding #5640. Prefetch is optional and it's perfectly fine if the Postgres driver doesn't implement it. We can still add bucket.GetAll() as an additional API unlocking more performance gains for Postgres.

It was clear from the start that it is optional and that the pg driver doesn't need to implement it. But the thing is that I do want to implement it, to reach the full performance potential on pg as well. Only I can't, because the interface proposed in #5640 isn't compatible with server-side txes. GetAll would be an improvement, but still doesn't minimize round trips.

What I would have done for prefetch is implement a clean batch interface that allows single round-trip queries and is compatible with both etcd and postgres.
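
For illustration, such a batch interface could have roughly the following shape; all names are hypothetical and this is not the interface proposed in #5640:

```go
package kvsketch // hypothetical package name, for illustration only

// BatchReader is a sketch of a minimal batch interface: collect all keys a
// caller needs and resolve them in a single round trip. Names and shape are
// hypothetical.
type BatchReader interface {
	// GetBatch returns the values for the given keys under the given
	// bucket path, in order, with nil entries for keys that do not exist.
	// Implementations are free to resolve the whole batch in one query
	// (Postgres) or one ranged read (etcd).
	GetBatch(bucketPath [][]byte, keys [][]byte) ([][]byte, error)
}
```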

@joostjager force-pushed the postgres branch 2 times, most recently from 60e0fc9 to eb0245d (September 19, 2021 20:49)
@joostjager (Contributor, Author) commented:

It looks like the Postgres itest is finally passing. One last modification is that I increased the remote db async payments timeout from 2 to 3 minutes.

@guggero (Collaborator) left a review:

Great to see this PR pass the itests! Excited to get this in, thanks for your work on this 🎉

Only a few nits left and needs a rebase, but otherwise LGTM!

Review threads (outdated, resolved): docs/postgres.md, sample-lnd.conf, lntest/harness.go
@joostjager force-pushed the postgres branch 3 times, most recently from 2f0fadf to a112ccb (September 20, 2021 12:06)
@Roasbeef (Member) commented:

Coveralls is still down (which is why that unit test is failing): https://status.coveralls.io/incidents/8zbjrwpb4frv

@Roasbeef (Member) left a review:

LGTM 🦚

// cfg is the postgres connection config.
cfg *Config

// prefix is the table name prefix that is used to simulate namespaces.
@Roasbeef (Member) commented on the excerpt above:
Makes sense, had initially missed that!

@Roasbeef Roasbeef merged commit 7970ffc into lightningnetwork:master Sep 21, 2021
@joostjager (Contributor, Author) commented:

Thanks for your thoughts and reviews @guggero @bhandras @Roasbeef!

@djkazic (Contributor) commented Sep 29, 2021

Will the migration tool for etcd -> postgres be in a separate PR?

@joostjager (Contributor, Author) commented:

The migration tool PR is here: #5561

Labels: database (Related to the database/storage of LND), P2 (should be fixed if one has time)

8 participants