
"[CRT] SRVR: Health check: chain backend failed after 3 calls" #658

Closed
HamishMacEwan opened this issue May 13, 2022 · 8 comments
Labels
bug Something isn't working

Comments

@HamishMacEwan

Describe the bug
LND is deciding bitcoind has failed a health check and shutting down gracefully, to be restarted by the systemd service unit.
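For context, both the failed health check and the subsequent systemd restart show up in the journal, e.g. (the lnd unit name is an assumption and may differ on myNode):

sudo journalctl -u lnd --no-pager | grep -E "Health check|shutdown"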

Expected behavior
Just keep running.

MyNode hardware (please complete the following information):

  • Device: Raspi4
  • Version 0.2.55

Additional context
This problem is described and discussed extensively here: lightningnetwork/lnd#4669

It may be worth extending the timeouts for the health check as described here: lightningnetwork/lnd#4669 (comment)

@HamishMacEwan added the bug label on May 13, 2022
@tehelsper
Collaborator

Thanks for letting me know. Have you tried changing the timeout values? If so, what worked for you?

@tehelsper
Collaborator

Could you try these values? If I understand right, this should check every 5 minutes now and allow up to 3 minutes for bitcoin to become available if it happens to be under heavy load.

[healthcheck]
; Number of attempts before failure
healthcheck.chainbackend.attempts=3
; Timeout for a single healthcheck attempt
healthcheck.chainbackend.timeout=30s
; The amount of time we should back off between failed attempts
healthcheck.chainbackend.backoff=30s
; How often chain backend checks should occur
healthcheck.chainbackend.interval=5m
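For reference (my reading of these settings, not verified against the lnd source), the worst case these values tolerate per check works out to:

3 attempts x 30s timeout + 2 x 30s backoff = 150s, about two and a half minutes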

@tehelsper
Collaborator

Those values are in the latest beta.

@HamishMacEwan
Author

HamishMacEwan commented Jun 27, 2022

Since upgrading my "production" myNode instance to the 64-bit OS and 0.2.57, I'm having similar (and sadly continuing) problems to those described above. That is, both my Pi4 and Pi400 instances are regularly (currently a couple of times a day) shutting down due to:

Jun 27 17:59:41 flat6 lnd[19164]: 2022-06-27 17:59:41.312 [CRT] SRVR: Health check: chain backend failed after 3 calls
Jun 27 17:59:41 flat6 lnd[19164]: 2022-06-27 17:59:41.314 [INF] SRVR: Sending request for shutdown

I could extend the "backoff", but unless there's more guidance on what the backend failure is and why it occurs, this seems like fumbling in the dark. The check exists for a purpose; it would be more useful to determine the reason it fails. I'm not seeing anything wrong with bitcoind, and its status looks normal:

2022-06-27T06:08:39Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T06:08:39Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T06:08:39Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T06:08:39Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T06:08:39Z UpdateTip: new best=000000000000000000074b48e193fb4ccc28bd929001aa1838abfda30c020ef4 height=742521 version=0x20002000 log2_work=93.598055 tx=744490128 date='2022-06-27T06:07:58Z' progress=1.000000 cache=41.2MiB(245824txo)
2022-06-27T06:03:20Z UpdateTip: new best=0000000000000000000081b382a03d1cff2bd1e2cc35da144192d44464c7ed95 height=742520 version=0x3fffe000 log2_work=93.598043 tx=744489328 date='2022-06-27T05:54:44Z' progress=0.999998 cache=40.4MiB(239179txo)
2022-06-27T05:57:50Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications
2022-06-27T05:57:50Z UpdateTip: new best=000000000000000000006d7525f4e7e6ac30df9b901d0cfc02482bc1273d90c1 height=742519 version=0x28dae000 log2_work=93.598030 tx=744488598 date='2022-06-27T05:53:51Z' progress=0.999999 cache=39.5MiB(231692txo)
2022-06-27T05:53:24Z BlockUntilSyncedToCurrentChain: txindex is catching up on block notifications

@tehelsper
Collaborator

How much load is your device under? The additional txindex messages are unusual; they typically finish and the index doesn't need to catch up any more. Do you think bitcoind could still be doing its txindex processing for some reason?
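One way to check whether txindex is still building (assuming Bitcoin Core 0.21 or later, which has the getindexinfo RPC):

bitcoin-cli getindexinfo txindex
# "synced": true and a best_block_height at the chain tip mean the index is caught up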

@HamishMacEwan
Author

The machine where the problem occurs is a Pi400, the keyboard one, and it has an HDD attached. It does struggle:
(screenshot of system load omitted)

That txindex message has since disappeared, though, and I'm not sure it didn't actually come from the other node; sorry, I got a bit confused.

Here's the log of the most recent HLCK failure. Note it took from:

13:38:23.570 [CRT] SRVR: Health check: chain backend failed after 3 calls
13:38:23.575 [INF] SRVR: Sending request for shutdown
13:38:23.585 [INF] LTND: Received shutdown request.
13:38:23.585 [INF] LTND: Shutting down...
13:38:23.586 [INF] LTND: Gracefully shutting down.
13:38:23.588 [INF] NANN: Channel Status Manager shutting down
13:38:23.588 [INF] HSWC: HTLC Switch shutting down
13:38:23.588 [ERR] CRTR: Resuming payment b7e9b0661567014a088f639b8b0f4a0cc4f5fc12eaaddc0b127c47c0fca505eb failed: htlcswitch shutting down.
13:41:56.940 [INF] CRTR: Processed channels=0 updates=0 nodes=5 in last 5m12.103115522s

till

14:45:12.334 [INF] HLCK: Health monitor shutting down
14:45:12.637 [INF] RPCS: Stopping RPC Server
14:45:12.638 [INF] RPCS: Stopping AutopilotRPC Sub-RPC Server
14:45:12.638 [INF] RPCS: Stopping ChainRPC Sub-RPC Server
14:45:12.638 [INF] RPCS: Stopping WatchtowerClientRPC Sub-RPC Server
14:45:12.641 [INF] RPCS: Stopping WalletKitRPC Sub-RPC Server
14:45:12.647 [INF] RPCS: Stopping RouterRPC Sub-RPC Server
14:45:12.647 [INF] RPCS: Stopping InvoicesRPC Sub-RPC Server
14:45:12.649 [INF] RPCS: Stopping VersionRPC Sub-RPC Server
14:45:12.650 [INF] RPCS: Stopping WatchtowerRPC Sub-RPC Server
14:45:12.656 [INF] RPCS: Stopping SignRPC Sub-RPC Server
14:45:12.656 [INF] TORC: Stopping tor controller
14:45:12.771 [INF] LTND: Shutdown complete
14:45:14 flat6 lnd[11262]: Checking if device is shutting down...
14:45:14 flat6 lnd[11262]: Not shutting down!

Over an hour to complete the shutdown. It did seem to be lagging blocks quite badly, and bitcoind was flat out doing megabytes per second of reads throughout…

flat6.lnd.log
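For what it's worth, disk saturation is easy to confirm with iostat from the sysstat package (a generic example; the interval is arbitrary and the HDD's device name will vary):

iostat -dx 5
# watch %util for the HDD; sustained values near 100% mean the disk is the bottleneck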

I've extended the timeout and backoff in /usr/share/mynode/lnd.conf

healthcheck.chainbackend.backoff=1m0s
healthcheck.chainbackend.timeout=1m0s
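(These keys belong under the [healthcheck] section shown earlier, and lnd has to be restarted to pick them up; on my setup that's the systemd unit, though the unit name may differ on myNode.)

sudo systemctl restart lnd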

@tehelsper
Collaborator

OK, let me know if the timeout change helps. I don't run any nodes with HDDs anymore, and my guess is that the disk is getting to the point where it can't keep up.

@HamishMacEwan
Author

I pushed the timeout and backoff to 2m0s and haven't had any shutdowns due to health check problems since.

flat6 lnd[11275]: 2022-07-02 14:38:30.490 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 2m0s

This one message is all I could find.
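For reference, the values I ended up with (same [healthcheck] keys as above, just bumped to two minutes):

healthcheck.chainbackend.timeout=2m0s
healthcheck.chainbackend.backoff=2m0s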

There is definitely something awry with the system; load averages are well out of whack, IMHO:

load average: 4.88, 5.08, 5.54

But I'll describe that in a separate issue if I find out what's happening.
