
"[CRT] SRVR: Health check: chain backend failed after 3 calls" #658

Closed
HamishMacEwan opened this issue May 13, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@HamishMacEwan

Describe the bug
LND is deciding bitcoind has failed a health check and shutting down gracefully, to be restarted by the systemd service unit.
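For context, both the failed health check and the subsequent systemd restart show up in the journal, e.g. (the lnd unit name is an assumption and may differ on myNode):

sudo journalctl -u lnd --no-pager | grep -E "Health check|shutdown"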

Expected behavior
Just keep running.

MyNode hardware (please complete the following information):

  • Device: Raspi4
  • Version 0.2.55

Additional context
This problem is described and discussed extensively here: lightningnetwork/lnd#4669

It may be worth extending the timeouts for the health check as described here: lightningnetwork/lnd#4669 (comment)

@HamishMacEwan added the bug label on May 13, 2022
@tehelsper
Collaborator

Thanks for letting me know. Have you tried changing the timeout values? If so, what worked for you?

@tehelsper
Collaborator

Could you try these values? If I understand right, this should check every 5 minutes now and allow up to 3 minutes for bitcoin to become available if it happens to be under heavy load.

[healthcheck]
; Number of attempts before failure
healthcheck.chainbackend.attempts=3
; Timeout for a single healthcheck attempt
healthcheck.chainbackend.timeout=30s
; The amount of time we should back off between failed attempts
healthcheck.chainbackend.backoff=30s
; How often chain backend checks should occur
healthcheck.chainbackend.interval=5m
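For reference (my reading of these settings, not verified against the lnd source), the worst case these values tolerate per check works out to:

3 attempts x 30s timeout + 2 x 30s backoff = 150s, about two and a half minutes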

@tehelsper
Collaborator

Those values are in the latest beta.

@HamishMacEwan
Author

HamishMacEwan commented Jun 27, 2022

Since upgrading my "production" myNode instance to the 64-bit OS and 0.2.57, I'm having similar (and sadly continuing) problems to those described above. That is, both my Pi4 and Pi400 instances are regularly (currently a couple of times a day) shutting down due to:

Jun 27 17:59:41 flat6 lnd[19164]: 2022-06-27 17:59:41.312 [CRT] SRVR: Health check: chain backend failed after 3 calls
Jun 27 17:59:41 flat6 lnd[19164]: 2022-06-27 17:59:41.314 [INF] SRVR: Sending request for shutdown

I could extend the "backoff", but unless there's more guidance on what the backend failure is and why it occurs, this seems like fumbling in the dark. The check exists for a purpose; it would be more useful to determine the reason it fails. I'm not seeing anything wrong with bitcoind, and its status looks normal:

2022-06-27T06:08:39Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T06:08:39Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T06:08:39Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T06:08:39Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T06:08:39Z UpdateTip: new best=000000000000000000074b48e193fb4ccc28bd929001aa1838abfda30c020ef4 height=742521 version=0x20002000 log2_work=93.598055 tx=744490128 date='2022-06-27T06:07:58Z' progress=1.000000 cache=41.2MiB(245824txo)
2022-06-27T06:03:20Z UpdateTip: new best=0000000000000000000081b382a03d1cff2bd1e2cc35da144192d44464c7ed95 height=742520 version=0x3fffe000 log2_work=93.598043 tx=744489328 date='2022-06-27T05:54:44Z' progress=0.999998 cache=40.4MiB(239179txo)
2022-06-27T05:57:50Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T05:57:50Z UpdateTip: new best=000000000000000000006d7525f4e7e6ac30df9b901d0cfc02482bc1273d90c1 height=742519 version=0x28dae000 log2_work=93.598030 tx=744488598 date='2022-06-27T05:53:51Z' progress=0.999999 cache=39.5MiB(231692txo)
2022-06-27T05:53:24Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications

@tehelsper
Collaborator

How much load is your device under? The additional txindex messages are unusual; they typically finish and the index doesn't need to catch up any more. Do you think bitcoind could still be doing its txindex processing for some reason?
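One way to check whether txindex is still building (assuming Bitcoin Core 0.21 or later, which has the getindexinfo RPC):

bitcoin-cli getindexinfo txindex
# "synced": true and a best_block_height at the chain tip mean the index is caught up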

@HamishMacEwan
Author

The machine where the problem occurs is a Pi400, the keyboard one, and it has an HDD attached. It does struggle:
(screenshot of system load omitted)

That txindex message has since disappeared, though, and I'm not sure it didn't actually come from the other node; sorry, I got a bit confused.

Here's the log of the most recent HLCK failure. Note it took from:

13:38:23.570 [CRT] SRVR: Health check: chain backend failed after 3 calls
13:38:23.575 [INF] SRVR: Sending request for shutdown
13:38:23.585 [INF] LTND: Received shutdown request.
13:38:23.585 [INF] LTND: Shutting down...
13:38:23.586 [INF] LTND: Gracefully shutting down.
13:38:23.588 [INF] NANN: Channel Status Manager shutting down
13:38:23.588 [INF] HSWC: HTLC Switch shutting down
13:38:23.588 [ERR] CRTR: Resuming payment b7e9b0661567014a088f639b8b0f4a0cc4f5fc12eaaddc0b127c47c0fca505eb failed: htlcswitch shutting down.
13:41:56.940 [INF] CRTR: Processed channels=0 updates=0 nodes=5 in last 5m12.103115522s

till

14:45:12.334 [INF] HLCK: Health monitor shutting down
14:45:12.637 [INF] RPCS: Stopping RPC Server
14:45:12.638 [INF] RPCS: Stopping AutopilotRPC Sub-RPC Server
14:45:12.638 [INF] RPCS: Stopping ChainRPC Sub-RPC Server
14:45:12.638 [INF] RPCS: Stopping WatchtowerClientRPC Sub-RPC Server
14:45:12.641 [INF] RPCS: Stopping WalletKitRPC Sub-RPC Server
14:45:12.647 [INF] RPCS: Stopping RouterRPC Sub-RPC Server
14:45:12.647 [INF] RPCS: Stopping InvoicesRPC Sub-RPC Server
14:45:12.649 [INF] RPCS: Stopping VersionRPC Sub-RPC Server
14:45:12.650 [INF] RPCS: Stopping WatchtowerRPC Sub-RPC Server
14:45:12.656 [INF] RPCS: Stopping SignRPC Sub-RPC Server
14:45:12.656 [INF] TORC: Stopping tor controller
14:45:12.771 [INF] LTND: Shutdown complete
14:45:14 flat6 lnd[11262]: Checking if device is shutting down...
14:45:14 flat6 lnd[11262]: Not shutting down!

Over an hour to complete the shutdown. It did seem to be lagging blocks quite badly, and bitcoind was flat out doing megabytes per second of reads throughout…

flat6.lnd.log
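For what it's worth, disk saturation is easy to confirm with iostat from the sysstat package (a generic example; the interval is arbitrary and the HDD's device name will vary):

iostat -dx 5
# watch %util for the HDD; sustained values near 100% mean the disk is the bottleneck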

I've extended the timeout and backoff in /usr/share/mynode/lnd.conf

healthcheck.chainbackend.backoff=1m0s
healthcheck.chainbackend.timeout=1m0s
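(These keys belong under the [healthcheck] section shown earlier, and lnd has to be restarted to pick them up; on my setup that's the systemd unit, though the unit name may differ on myNode.)

sudo systemctl restart lnd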

@tehelsper
Collaborator

OK, let me know if the timeout change helps. I don't run any nodes with HDDs anymore, and my guess is that the disk is getting to the point where it can't keep up.

@HamishMacEwan
Author

I pushed the timeout and backoff to 2m0s and haven't had any shutdowns due to health check problems since.

flat6 lnd[11275]: 2022-07-02 14:38:30.490 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 2m0s

This one message is all I could find.
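For reference, the values I ended up with (same [healthcheck] keys as above, just bumped to two minutes):

healthcheck.chainbackend.timeout=2m0s
healthcheck.chainbackend.backoff=2m0s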

There is definitely something awry with the system; load averages are well out of whack, IMHO:

load average: 4.88, 5.08, 5.54

But I'll describe that in a separate issue if I find out what's happening.
