Postgres v13.3 or v14.1 backend failing with LND v0.14.1-beta #6103

miketwenty1 · 2021-12-20T04:19:25Z

Background

I'm able to successfully run a node with bbolt/boltdb.
When I try to run a node with postgres, after doing a lncli create I get an error "Unable to process chain backend block connected notification: context canceled".

Note it does successfully create the tables before crashing:

channeldb_kv
decayedlogdb_kv
macaroondb_kv
towerclientdb_kv
towerserverdb_kv
walletdb_kv

Can reproduce this with clean environments. Verified Bitcoind node, postgres db is in working order.

Your environment

version of lnd
v0.14.1-beta
which operating system (uname -a on *Nix)
ubuntu 20.04
version of btcd, bitcoind, or other backend
bitcoind v22.0
any other relevant environment details
postgres version 13.3

Steps to reproduce

lnd.conf setup for postgres

[db]
db.backend=postgres

[postgres]
db.postgres.dsn=postgres://postgres:PASSWORD@hostname:5432/dbname
db.postgres.timeout=60s
db.postgres.maxconnections=1000

lncli create
crash with error in logs

Expected behaviour

No crash

Actual behaviour

crash [ERR] LNWL: Unable to process chain backend block connected notification: context canceled v13.3
or "error loading chain control: unable to create chain control: unable to create wallet: context canceled" v14.1
crash `

Logs when I use bbolt/botdb

2021-12-20 03:09:52.444 [INF] LTND: We're not running within systemd
2021-12-20 03:09:52.446 [INF] LTND: Waiting for chain backend to finish sync, start_height=714882
2021-12-20 03:09:53.479 [INF] LNWL: Started rescan from block 0000000000000000000a2daeed9f01f2ca39cd2839026c3f12ceb0f388f815ce (height 714553) for 0 addresses
2021-12-20 03:09:53.502 [INF] LNWL: Catching up block hashes to height 714562, this might take a while
2021-12-20 03:09:53.576 [INF] LNWL: Done catching up block hashes
2021-12-20 03:09:53.576 [INF] LNWL: Rescanned through block 000000000000000000081f53c492703c785f6fff23a92aa7325ff5603682e4f7 (height 714562)
2021-12-20 03:10:15.447 [INF] LNWL: Catching up block hashes to height 714882, this might take a while
2021-12-20 03:10:15.448 [INF] LNWL: Done catching up block hashes
2021-12-20 03:10:15.448 [INF] LNWL: Finished rescan for 0 addresses (synced to block 00000000000000000004a713a1136faaeb84bf34ff2117b6c5c65e7031772d21, height 714882)
2021-12-20 03:10:15.863 [INF] LTND: Chain backend is fully synced (end_height=714882)!
2021-12-20 03:10:15.863 [WRN] HLCK: check: disk space configured with 0 attempts, skipping it
2021-12-20 03:10:15.863 [WRN] HLCK: check: tls configured with 0 attempts, skipping it
2021-12-20 03:10:15.870 [INF] PRNF: PeerNotifier starting
2021-12-20 03:10:15.870 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.871 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.872 [INF] BRAR: Starting contract observer, watching for breaches.
2021-12-20 03:10:15.874 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.874 [INF] DISC: Authenticated Gossiper is starting
2021-12-20 03:10:15.874 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.877 [INF] CRTR: FilteredChainView starting
2021-12-20 03:10:15.883 [INF] CRTR: Filtering chain using 0 channels active
2021-12-20 03:10:15.886 [INF] CRTR: Prune tip for Channel Graph: height=714882, hash=00000000000000000004a713a1136faaeb84bf34ff2117b6c5c65e7031772d21
2021-12-20 03:10:15.888 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.890 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.890 [INF] HSWC: Starting HTLC Switch
2021-12-20 03:10:15.890 [INF] CHFT: Adding 0 channels to event store
2021-12-20 03:10:15.890 [INF] CHBU: Starting chanbackup.SubSwapper
2021-12-20 03:10:15.890 [INF] NTFN: New block epoch subscription

Logs when using postgres v13.3

2021-12-20 03:22:49.411 [INF] LTND: We're not running within systemd
2021-12-20 03:22:49.413 [INF] LTND: Waiting for chain backend to finish sync, start_height=714884
2021-12-20 03:22:50.446 [INF] LNWL: Started rescan from block 000000000000000000035fba7bdfed50a131d12685d9edb0ba0ae2f95993a0cd (height 714555) for 0 addresses
2021-12-20 03:22:50.458 [CRT] CHDB: Caught unhandled error: context canceled
2021-12-20 03:22:50.458 [INF] CHDB: Sending request for shutdown
2021-12-20 03:22:50.458 [INF] LTND: Received shutdown request.
2021-12-20 03:22:50.458 [INF] LTND: Shutting down...
2021-12-20 03:22:50.458 [INF] LTND: Gracefully shutting down.
2021-12-20 03:22:50.458 [ERR] LNWL: Unable to process chain backend block connected notification: context canceled
2021-12-20 03:22:50.465 [INF] LNWL: Catching up block hashes to height 714562, this might take a while
2021-12-20 03:22:50.566 [INF] LNWL: Done catching up block hashes
2021-12-20 03:22:50.566 [INF] LNWL: Rescanned through block 000000000000000000081f53c492703c785f6fff23a92aa7325ff5603682e4f7 (height 714562)
2021-12-20 03:22:51.420 [INF] RPCS: Stopping RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping RouterRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping InvoicesRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping VersionRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping WatchtowerRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping WatchtowerClientRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping SignRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping WalletKitRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping AutopilotRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping ChainRPC Sub-RPC Server
2021-12-20 03:22:51.421 [INF] LTND: Shutdown complete

Logs when using postgres v14.1

2021-12-20 04:11:56.336 [INF] LTND: We're not running within systemd
2021-12-20 04:11:56.336 [INF] LTND: Waiting for wallet encryption password. Use `lncli create` to create a wallet, `lncli unlock` to unlock an existing wallet, or `lncli changepassword` to change the password of an existing wallet and unlock it.
2021-12-20 04:12:28.182 [INF] LNWL: Opened wallet
2021-12-20 04:12:28.234 [INF] CHRE: Primary chain is set to: bitcoin
2021-12-20 04:12:28.246 [INF] CHRE: Initializing bitcoind backed fee estimator in CONSERVATIVE mode
2021-12-20 04:12:28.246 [INF] LNWL: Started listening for bitcoind transaction notifications via ZMQ on 100.0.10.82:28333
2021-12-20 04:12:28.246 [INF] LNWL: Started listening for bitcoind block notifications via ZMQ on 100.0.10.82:28332
2021-12-20 04:12:28.909 [INF] LNWL: The wallet has been unlocked without a time limit
2021-12-20 04:12:28.920 [CRT] CHDB: Caught unhandled error: context canceled
2021-12-20 04:12:28.921 [INF] CHDB: Sending request for shutdown
2021-12-20 04:12:28.921 [INF] LTND: Received shutdown request.
2021-12-20 04:12:28.921 [INF] LTND: Shutting down...
2021-12-20 04:12:28.921 [INF] LTND: Gracefully shutting down.
2021-12-20 04:12:28.921 [ERR] LTND: unable to create chain control: unable to create wallet: context canceled
2021-12-20 04:12:28.921 [ERR] LTND: Shutting down because error in main method: error loading chain control: unable to create chain control: unable to create wallet: context canceled
2021-12-20 04:12:28.922 [INF] LTND: Shutdown complete

error loading chain control: unable to create chain control: unable to create wallet: context canceled

The text was updated successfully, but these errors were encountered:

joostjager · 2021-12-20T14:13:29Z

I can reproduce this too. Will take a look.

joostjager · 2021-12-20T15:49:47Z

I may have found the issue. A context is cancelled after exiting the function:

lnd/kvdb/postgres/readwrite_tx.go

Line 163 in 088970e

defer cancel()

But I believe that the subsequent Scan call one level up will then fail on the cancelled context.

Workaround for now is to set timeout to 0.

joostjager · 2021-12-20T16:05:41Z

I still have doubts about the timeouts. It is really quite complicated, all the different levels of timeouts. And we also use it in an anti-pattern where we use a global context. See also #5366 (comment)

My proposal for fixing would be to remove the timeout config flag completely. The issue mentioned above can be fixed without too much trouble, but it seems safer to not do timeouts at all. There is no good way to recover from a timeout anyway.

gkrizek · 2021-12-20T18:28:45Z

~~Is this specific to a Bitcoin Core backend? I used LND 0.14.1 with Neutrino on Postgres 14.1 without issue.~~

Nevermind I see it's related to the timeout setting

miketwenty1 · 2021-12-20T19:37:13Z

@joostjager so currently it's recommended to set timeout to 0 if we want to test with pg? or should we hold tight until issue is fixed before continuing to test?

miketwenty1 · 2021-12-20T21:54:51Z

Verified removing "db.postgres.timeout" configuration results in LND not crashing.

joostjager · 2021-12-21T09:02:50Z

@guggero what's your thought on the timeout issue?

guggero · 2021-12-21T12:35:05Z

The problem isn't the timeout itself but the defer in

lnd/kvdb/postgres/readwrite_tx.go

Line 163 in 088970e

defer cancel()

.

If you look at how QueryRow is used, you see that the actual "work" is done in Scan() which requires QueryRow to have returned and the defer to be executed:

	err := b.tx.QueryRow(
		"SELECT id,value FROM "+b.table+" WHERE "+parentSelector(b.id)+
			" AND key=$1", key,
	).Scan(&id, &value)

What we instead need to do is to keep a list of all cancel() functions in the readWriteTx struct and execute them on commit or rollback.

joostjager · 2021-12-21T13:27:06Z

The problem isn't the timeout itself but the defer in

That is indeed what I tried to explain in my comment above, "a context is cancelled after exiting the function". Just setting timeout to zero prevents the problem from occurring because in that case cancel is an empty function. Or maybe we are referring to different things?

A list of all cancels on the tx level will probably solve it, but an alternative that may have less impact is to require a context to be passed into 'our' QueryRow. That context can then be managed and cancelled one level up. Of course still within the pg driver.

A third option could be to fold QueryRow and Scan into a single function with a context surrounding them.

However, don't you think that the timeout setting in general can cause more harm than good? Users setting it to a value that is too low and seeing their node stop or worst case reach an inconsistent state because these error paths are relatively new?

guggero · 2021-12-22T09:09:36Z

Ah yes, we were talking about the same thing then.

I like the first option where the QueryRow would take the context so it can live during both operations. They aren't being used often so it's probably okay to not add another abstraction for it.

I'm not sure about whether we should have the timeout in general or not. Perhaps it's nice to have for now while we're still figuring out how to best run with Postgres but don't need it in the future? But it seems it's hard to even choose a value that makes sense if the graph cache already takes multiple minutes to load...

joostjager · 2021-12-22T10:53:29Z

Ok, will create a pr to pull up the context to the next level. I also saw that it is only used a few times, so no big deal.

The graph cache may take minutes to load, but that won't lead to a timeout because each operation within that loading completes in time. There is no timeout on the transaction itself.

There is a gap there though, because afaik we also don't set a timeout on the transaction BEGIN statement. There is an alternative Begin call that takes a context, but from what I've seen that would apply the timeout also to the whole transaction, which is not what we want.

lnd/kvdb/postgres/readwrite_tx.go

Line 42 in 088970e

tx, err := db.db.Begin()

Yes, we can think a bit longer about the timeout question. Maybe we can advise against setting it to a value for now.

miketwenty1 mentioned this issue Dec 20, 2021

Error when restoring a wallet on a fresh LND install configured with postgresql db backend #6041

Closed

joostjager mentioned this issue Dec 22, 2021

kvdb/postgres: fix context cancellation #6108

Merged

guggero closed this as completed in #6108 Dec 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Postgres v13.3 or v14.1 backend failing with LND v0.14.1-beta #6103

Postgres v13.3 or v14.1 backend failing with LND v0.14.1-beta #6103

miketwenty1 commented Dec 20, 2021

joostjager commented Dec 20, 2021

joostjager commented Dec 20, 2021

joostjager commented Dec 20, 2021 •

edited

Loading

gkrizek commented Dec 20, 2021 •

edited

Loading

miketwenty1 commented Dec 20, 2021

miketwenty1 commented Dec 20, 2021

joostjager commented Dec 21, 2021

guggero commented Dec 21, 2021

joostjager commented Dec 21, 2021 •

edited

Loading

guggero commented Dec 22, 2021

joostjager commented Dec 22, 2021

Postgres v13.3 or v14.1 backend failing with LND v0.14.1-beta #6103

Postgres v13.3 or v14.1 backend failing with LND v0.14.1-beta #6103

Comments

miketwenty1 commented Dec 20, 2021

Background

Your environment

Steps to reproduce

Expected behaviour

Actual behaviour

joostjager commented Dec 20, 2021

joostjager commented Dec 20, 2021

joostjager commented Dec 20, 2021 • edited Loading

gkrizek commented Dec 20, 2021 • edited Loading

miketwenty1 commented Dec 20, 2021

miketwenty1 commented Dec 20, 2021

joostjager commented Dec 21, 2021

guggero commented Dec 21, 2021

joostjager commented Dec 21, 2021 • edited Loading

guggero commented Dec 22, 2021

joostjager commented Dec 22, 2021

joostjager commented Dec 20, 2021 •

edited

Loading

gkrizek commented Dec 20, 2021 •

edited

Loading

joostjager commented Dec 21, 2021 •

edited

Loading