Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres v13.3 or v14.1 backend failing with LND v0.14.1-beta #6103

Closed
miketwenty1 opened this issue Dec 20, 2021 · 11 comments · Fixed by #6108
Closed

Postgres v13.3 or v14.1 backend failing with LND v0.14.1-beta #6103

miketwenty1 opened this issue Dec 20, 2021 · 11 comments · Fixed by #6108

Comments

@miketwenty1
Copy link

Background

I'm able to successfully run a node with bbolt/boltdb.
When I try to run a node with postgres, after doing a lncli create I get an error "Unable to process chain backend block connected notification: context canceled".

Note it does successfully create the tables before crashing:

  • channeldb_kv
  • decayedlogdb_kv
  • macaroondb_kv
  • towerclientdb_kv
  • towerserverdb_kv
  • walletdb_kv

Can reproduce this with clean environments. Verified Bitcoind node, postgres db is in working order.

Your environment

  • version of lnd
    v0.14.1-beta
  • which operating system (uname -a on *Nix)
    ubuntu 20.04
  • version of btcd, bitcoind, or other backend
    bitcoind v22.0
  • any other relevant environment details
    postgres version 13.3

Steps to reproduce

  • lnd.conf setup for postgres
[db]
db.backend=postgres

[postgres]
db.postgres.dsn=postgres://postgres:PASSWORD@hostname:5432/dbname
db.postgres.timeout=60s
db.postgres.maxconnections=1000
  • lncli create
  • crash with error in logs

Expected behaviour

No crash

Actual behaviour

crash [ERR] LNWL: Unable to process chain backend block connected notification: context canceled v13.3
or "error loading chain control: unable to create chain control: unable to create wallet: context canceled" v14.1
crash `

Logs when I use bbolt/botdb

2021-12-20 03:09:52.444 [INF] LTND: We're not running within systemd
2021-12-20 03:09:52.446 [INF] LTND: Waiting for chain backend to finish sync, start_height=714882
2021-12-20 03:09:53.479 [INF] LNWL: Started rescan from block 0000000000000000000a2daeed9f01f2ca39cd2839026c3f12ceb0f388f815ce (height 714553) for 0 addresses
2021-12-20 03:09:53.502 [INF] LNWL: Catching up block hashes to height 714562, this might take a while
2021-12-20 03:09:53.576 [INF] LNWL: Done catching up block hashes
2021-12-20 03:09:53.576 [INF] LNWL: Rescanned through block 000000000000000000081f53c492703c785f6fff23a92aa7325ff5603682e4f7 (height 714562)
2021-12-20 03:10:15.447 [INF] LNWL: Catching up block hashes to height 714882, this might take a while
2021-12-20 03:10:15.448 [INF] LNWL: Done catching up block hashes
2021-12-20 03:10:15.448 [INF] LNWL: Finished rescan for 0 addresses (synced to block 00000000000000000004a713a1136faaeb84bf34ff2117b6c5c65e7031772d21, height 714882)
2021-12-20 03:10:15.863 [INF] LTND: Chain backend is fully synced (end_height=714882)!
2021-12-20 03:10:15.863 [WRN] HLCK: check: disk space configured with 0 attempts, skipping it
2021-12-20 03:10:15.863 [WRN] HLCK: check: tls configured with 0 attempts, skipping it
2021-12-20 03:10:15.870 [INF] PRNF: PeerNotifier starting
2021-12-20 03:10:15.870 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.871 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.872 [INF] BRAR: Starting contract observer, watching for breaches.
2021-12-20 03:10:15.874 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.874 [INF] DISC: Authenticated Gossiper is starting
2021-12-20 03:10:15.874 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.877 [INF] CRTR: FilteredChainView starting
2021-12-20 03:10:15.883 [INF] CRTR: Filtering chain using 0 channels active
2021-12-20 03:10:15.886 [INF] CRTR: Prune tip for Channel Graph: height=714882, hash=00000000000000000004a713a1136faaeb84bf34ff2117b6c5c65e7031772d21
2021-12-20 03:10:15.888 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.890 [INF] NTFN: New block epoch subscription
2021-12-20 03:10:15.890 [INF] HSWC: Starting HTLC Switch
2021-12-20 03:10:15.890 [INF] CHFT: Adding 0 channels to event store
2021-12-20 03:10:15.890 [INF] CHBU: Starting chanbackup.SubSwapper
2021-12-20 03:10:15.890 [INF] NTFN: New block epoch subscription

Logs when using postgres v13.3

2021-12-20 03:22:49.411 [INF] LTND: We're not running within systemd
2021-12-20 03:22:49.413 [INF] LTND: Waiting for chain backend to finish sync, start_height=714884
2021-12-20 03:22:50.446 [INF] LNWL: Started rescan from block 000000000000000000035fba7bdfed50a131d12685d9edb0ba0ae2f95993a0cd (height 714555) for 0 addresses
2021-12-20 03:22:50.458 [CRT] CHDB: Caught unhandled error: context canceled
2021-12-20 03:22:50.458 [INF] CHDB: Sending request for shutdown
2021-12-20 03:22:50.458 [INF] LTND: Received shutdown request.
2021-12-20 03:22:50.458 [INF] LTND: Shutting down...
2021-12-20 03:22:50.458 [INF] LTND: Gracefully shutting down.
2021-12-20 03:22:50.458 [ERR] LNWL: Unable to process chain backend block connected notification: context canceled
2021-12-20 03:22:50.465 [INF] LNWL: Catching up block hashes to height 714562, this might take a while
2021-12-20 03:22:50.566 [INF] LNWL: Done catching up block hashes
2021-12-20 03:22:50.566 [INF] LNWL: Rescanned through block 000000000000000000081f53c492703c785f6fff23a92aa7325ff5603682e4f7 (height 714562)
2021-12-20 03:22:51.420 [INF] RPCS: Stopping RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping RouterRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping InvoicesRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping VersionRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping WatchtowerRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping WatchtowerClientRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping SignRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping WalletKitRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping AutopilotRPC Sub-RPC Server
2021-12-20 03:22:51.420 [INF] RPCS: Stopping ChainRPC Sub-RPC Server
2021-12-20 03:22:51.421 [INF] LTND: Shutdown complete

Logs when using postgres v14.1

2021-12-20 04:11:56.336 [INF] LTND: We're not running within systemd
2021-12-20 04:11:56.336 [INF] LTND: Waiting for wallet encryption password. Use `lncli create` to create a wallet, `lncli unlock` to unlock an existing wallet, or `lncli changepassword` to change the password of an existing wallet and unlock it.
2021-12-20 04:12:28.182 [INF] LNWL: Opened wallet
2021-12-20 04:12:28.234 [INF] CHRE: Primary chain is set to: bitcoin
2021-12-20 04:12:28.246 [INF] CHRE: Initializing bitcoind backed fee estimator in CONSERVATIVE mode
2021-12-20 04:12:28.246 [INF] LNWL: Started listening for bitcoind transaction notifications via ZMQ on 100.0.10.82:28333
2021-12-20 04:12:28.246 [INF] LNWL: Started listening for bitcoind block notifications via ZMQ on 100.0.10.82:28332
2021-12-20 04:12:28.909 [INF] LNWL: The wallet has been unlocked without a time limit
2021-12-20 04:12:28.920 [CRT] CHDB: Caught unhandled error: context canceled
2021-12-20 04:12:28.921 [INF] CHDB: Sending request for shutdown
2021-12-20 04:12:28.921 [INF] LTND: Received shutdown request.
2021-12-20 04:12:28.921 [INF] LTND: Shutting down...
2021-12-20 04:12:28.921 [INF] LTND: Gracefully shutting down.
2021-12-20 04:12:28.921 [ERR] LTND: unable to create chain control: unable to create wallet: context canceled
2021-12-20 04:12:28.921 [ERR] LTND: Shutting down because error in main method: error loading chain control: unable to create chain control: unable to create wallet: context canceled
2021-12-20 04:12:28.922 [INF] LTND: Shutdown complete

error loading chain control: unable to create chain control: unable to create wallet: context canceled
@joostjager
Copy link
Contributor

I can reproduce this too. Will take a look.

@joostjager
Copy link
Contributor

I may have found the issue. A context is cancelled after exiting the function:

defer cancel()

But I believe that the subsequent Scan call one level up will then fail on the cancelled context.

Workaround for now is to set timeout to 0.

@joostjager
Copy link
Contributor

joostjager commented Dec 20, 2021

I still have doubts about the timeouts. It is really quite complicated, all the different levels of timeouts. And we also use it in an anti-pattern where we use a global context. See also #5366 (comment)

My proposal for fixing would be to remove the timeout config flag completely. The issue mentioned above can be fixed without too much trouble, but it seems safer to not do timeouts at all. There is no good way to recover from a timeout anyway.

@gkrizek
Copy link
Contributor

gkrizek commented Dec 20, 2021

Is this specific to a Bitcoin Core backend? I used LND 0.14.1 with Neutrino on Postgres 14.1 without issue.

Nevermind I see it's related to the timeout setting

@miketwenty1
Copy link
Author

@joostjager so currently it's recommended to set timeout to 0 if we want to test with pg? or should we hold tight until issue is fixed before continuing to test?

@miketwenty1
Copy link
Author

Verified removing "db.postgres.timeout" configuration results in LND not crashing.

@joostjager
Copy link
Contributor

@guggero what's your thought on the timeout issue?

@guggero
Copy link
Collaborator

guggero commented Dec 21, 2021

The problem isn't the timeout itself but the defer in

defer cancel()
.

If you look at how QueryRow is used, you see that the actual "work" is done in Scan() which requires QueryRow to have returned and the defer to be executed:

	err := b.tx.QueryRow(
		"SELECT id,value FROM "+b.table+" WHERE "+parentSelector(b.id)+
			" AND key=$1", key,
	).Scan(&id, &value)

What we instead need to do is to keep a list of all cancel() functions in the readWriteTx struct and execute them on commit or rollback.

@joostjager
Copy link
Contributor

joostjager commented Dec 21, 2021

The problem isn't the timeout itself but the defer in

That is indeed what I tried to explain in my comment above, "a context is cancelled after exiting the function". Just setting timeout to zero prevents the problem from occurring because in that case cancel is an empty function. Or maybe we are referring to different things?

A list of all cancels on the tx level will probably solve it, but an alternative that may have less impact is to require a context to be passed into 'our' QueryRow. That context can then be managed and cancelled one level up. Of course still within the pg driver.

A third option could be to fold QueryRow and Scan into a single function with a context surrounding them.

However, don't you think that the timeout setting in general can cause more harm than good? Users setting it to a value that is too low and seeing their node stop or worst case reach an inconsistent state because these error paths are relatively new?

@guggero
Copy link
Collaborator

guggero commented Dec 22, 2021

Ah yes, we were talking about the same thing then.

I like the first option where the QueryRow would take the context so it can live during both operations. They aren't being used often so it's probably okay to not add another abstraction for it.

I'm not sure about whether we should have the timeout in general or not. Perhaps it's nice to have for now while we're still figuring out how to best run with Postgres but don't need it in the future? But it seems it's hard to even choose a value that makes sense if the graph cache already takes multiple minutes to load...

@joostjager
Copy link
Contributor

Ok, will create a pr to pull up the context to the next level. I also saw that it is only used a few times, so no big deal.

The graph cache may take minutes to load, but that won't lead to a timeout because each operation within that loading completes in time. There is no timeout on the transaction itself.

There is a gap there though, because afaik we also don't set a timeout on the transaction BEGIN statement. There is an alternative Begin call that takes a context, but from what I've seen that would apply the timeout also to the whole transaction, which is not what we want.

tx, err := db.db.Begin()

Yes, we can think a bit longer about the timeout question. Maybe we can advise against setting it to a value for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants