
Health check: chain backend failed after 3 calls. How to avoid the shutdown? #4669

Closed
alevchuk opened this issue Oct 3, 2020 · 32 comments · Fixed by #4755

@alevchuk commented Oct 3, 2020

Background

My LND node keeps shutting down. It seems to be triggered by a short network failure that prevents the node from reaching bitcoind.

Your environment

  • lnd version 0.11.99-beta commit=v0.11.0-beta-255-g10a84f2c75c9de958c9dd07de830f0546a43d05a
  • which operating system: Debian GNU/Linux 10 4.19.0-8-amd64
  • version of bitcoind: Bitcoin Core version v0.20.1
  • lnd and bitcoind are on different servers

Steps to reproduce

Restart LND. The issue reproduces every 1 to 2 days.

Expected behaviour

Just keep running.

Actual behaviour

2020-10-03 02:18:26.258 [INF] NTFN: New block: height=651006, sha=000000000000000000059e4ba6945311b11cf30cea4389ca932ebd0c89f5186c
2020-10-03 02:18:26.258 [INF] UTXN: Attempting to graduate height=651006: num_kids=0, num_babies=0
2020-10-03 02:18:58.221 [INF] CRTR: Processed channels=0 updates=554 nodes=1 in last 59.999815896s

2020-10-03 02:19:24.910 [CRT] SRVR: Health check: chain backend failed after 3 calls  <------------

2020-10-03 02:19:24.910 [INF] SRVR: Sending request for shutdown

Question

How to avoid the shutdown?

@guggero (Collaborator) commented Oct 4, 2020

There seems to be a problem with the health check; we are investigating some false positives.
For now, you can disable the health check with --healthcheck.chainbackend.attempts=0.

@Roasbeef (Member) commented Oct 4, 2020

Do you have any more logs of the failed attempts? The logic is relatively simple: every N minutes we perform a backend health check; if that fails 3 times, we shut down with the assumption that lnd will be restarted by some hypervisor-type system. Also, the response period before we mark an attempt as failed is rather generous at 10 seconds or so (the default).
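
For illustration, a rough shell analogue of that loop (not lnd's actual code; it assumes bitcoin-cli can reach the same bitcoind and uses roughly the default values mentioned above):

attempts=3; timeout_s=10; backoff_s=30
for i in $(seq 1 "$attempts"); do
    # one health check call: ask the backend for its best block, bounded by a timeout
    if timeout "${timeout_s}s" bitcoin-cli getbestblockhash >/dev/null; then
        echo "health check passed on call $i"
        exit 0
    fi
    echo "call $i failed: timed out after ${timeout_s}s, backing off for ${backoff_s}s"
    [ "$i" -lt "$attempts" ] && sleep "$backoff_s"
done
# all calls failed: this is the point at which lnd requests a shutdown
echo "health check failed after $attempts calls"
exit 1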

@Roasbeef (Member) commented Oct 4, 2020

The current implementation will call getbestblock for bitcoind backends. I run a few nodes against btcd and haven't yet noticed a spurious shutdown. If you run with the HLCK subsystem at the debug log level, then we should be able to see what error lnd is receiving that causes it to think the health check failed.
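
For example, assuming lnd is started from the command line (the same setting can also go in lnd.conf, as shown further down in this thread):

# enable debug logging for the health check subsystem only
lnd --debuglevel=HLCK=debug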

@Roasbeef (Member) commented Oct 4, 2020

Related: #4671.

@alevchuk (Author) commented Oct 5, 2020

Another log:

2020-09-29 02:31:28.866 [INF] NTFN: New block: height=650455, sha=000000000000000000057e352676f4e0847106aa5c7a4e35d8e0b319cac185c7
2020-09-29 02:31:28.874 [INF] UTXN: Attempting to graduate height=650455: num_kids=0, num_babies=0
2020-09-29 02:31:29.416 [INF] DISC: GossipSyncer(0334ca3166f220e08f3075b406bd009c4d71575d5131bc688a34049651e94b3d70): querying for 500 new channels
2020-09-29 02:32:34.386 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-09-29 02:32:34.386 [INF] SRVR: Sending request for shutdown

Going to try to catch the error message by reproducing again with:

debuglevel=HLCK=debug

@carlaKC (Collaborator) commented Oct 6, 2020

triggered by a short network failure

If you suspect that this is being caused by a temporary networking glitch, there are a few other config settings you could change (an example invocation follows the list):

  • --healthcheck.chainbackend.attempts: increase to allow more tries before lnd shuts down
  • --healthcheck.chainbackend.timeout: increase the amount of time each call is allowed before we fail due to timeout
  • --healthcheck.chainbackend.backoff: increase the amount of time we back off between attempts, which could give the temporary issue more time to resolve itself
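
For example, a start-up invocation along these lines (the values here are illustrative, not recommendations):

lnd --healthcheck.chainbackend.attempts=5 \
    --healthcheck.chainbackend.timeout=30s \
    --healthcheck.chainbackend.backoff=1m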

Going to try to catch the error message by reproducing again

Thanks! Will be much easier to debug once we know what's going on with the check itself.

@alevchuk (Author) commented Oct 7, 2020

A 10-second timeout seems too small for a default.

Average new block arrival time is 600 seconds. Is there any reason not to have the default at 120 seconds?

Here is the debug log:

2020-10-06 04:32:58.985 [INF] NTFN: New block: height=651471, sha=00000000000000000004eff1ca66b45e292a630901cde715222baf4b9aa6f50b
2020-10-06 04:32:58.986 [INF] UTXN: Attempting to graduate height=651471: num_kids=0, num_babies=0
2020-10-06 04:33:00.192 [INF] CRTR: Processed channels=0 updates=360 nodes=1 in last 1m0.000015454s
2020-10-06 04:33:25.091 [INF] DISC: Broadcasting 468 new announcements in 18 sub batches
2020-10-06 04:33:29.381 [INF] DISC: GossipSyncer(022df3659d511f156e3ee91ef608a095ce3399234d374c7105356a36a904469372): applying new update horizon: start=2042-07-14 01:04:00 +0000 UTC, end=2042-07-14 01:04:00 +0000 UTC, backlog_size=0
2020-10-06 04:34:00.192 [INF] CRTR: Processed channels=1 updates=133 nodes=2 in last 59.999820789s
2020-10-06 04:34:55.091 [INF] DISC: Broadcasting 178 new announcements in 18 sub batches
2020-10-06 04:35:00.191 [INF] CRTR: Processed channels=2 updates=131 nodes=0 in last 59.999928167s
2020-10-06 04:36:00.191 [INF] CRTR: Processed channels=0 updates=195 nodes=2 in last 59.999924549s
2020-10-06 04:36:25.091 [INF] DISC: Broadcasting 319 new announcements in 18 sub batches
2020-10-06 04:36:53.710 [INF] DISC: GossipSyncer(02cf93d75855011806318f235a0d7e02735c6d5b961862c8ae3aaf58415d7c24ca): applying new update horizon: start=2020-10-06 04:36:53 +0000 UTC, end=2156-11-12 11:05:08 +0000 UTC, backlog_size=0
2020-10-06 04:36:55.091 [INF] DISC: GossipSyncer(02a2c53bc475cb92e4ab2f38a5bca56df695034ce90ad78c2f47c05911e3f79e41): applying gossipFilter(start=0001-01-01 00:00:00 +0000 UTC, end=0001-01-01 00:00:00 +0000 UTC)
2020-10-06 04:36:55.091 [INF] DISC: GossipSyncer(031015a7839468a3c266d662d5bb21ea4cea24226936e2864a7ca4f2c3939836e0): applying gossipFilter(start=2020-10-06 04:36:55.091711013 +0000 UTC m=+48033.098810450, end=2156-11-12 11:05:10.091711013 +0000 UTC m=+4295015328.098810450)
2020-10-06 04:37:00.191 [INF] CRTR: Processed channels=0 updates=196 nodes=0 in last 59.999873136s
2020-10-06 04:37:55.091 [INF] DISC: Broadcasting 278 new announcements in 18 sub batches
2020-10-06 04:38:00.192 [INF] CRTR: Processed channels=0 updates=203 nodes=3 in last 1m0.000011358s
2020-10-06 04:38:03.950 [ERR] BTCN: Can't accept connection: unable to accept connection from N.N.N.N:64836: EOF
2020-10-06 04:39:00.191 [INF] CRTR: Processed channels=0 updates=177 nodes=7 in last 59.999835524s
2020-10-06 04:39:25.091 [INF] DISC: Broadcasting 314 new announcements in 18 sub batches
2020-10-06 04:40:00.191 [INF] CRTR: Processed channels=0 updates=265 nodes=0 in last 1m0.000007686s
2020-10-06 04:40:04.822 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:44.822 [DBG] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:55.091 [INF] DISC: Broadcasting 459 new announcements in 18 sub batches
2020-10-06 04:41:00.191 [INF] CRTR: Processed channels=0 updates=330 nodes=5 in last 59.999848626s
2020-10-06 04:41:24.823 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-06 04:41:24.823 [INF] SRVR: Sending request for shutdown

@sendbitcoin commented Oct 8, 2020

I'm getting this issue too.
lnd has shut down twice already because of it. The backend is Bitcoin Core 0.18.1 on the same computer.

2020-10-08 08:09:51.192 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-08 08:09:51.192 [INF] SRVR: Sending request for shutdown
2020-10-08 08:09:51.193 [INF] LTND: Received shutdown request.
2020-10-08 08:09:51.193 [INF] LTND: Shutting down...
2020-10-08 08:09:51.193 [INF] LTND: Gracefully shutting down.

I increased healthcheck.chainbackend.attempts to 10 and healthcheck.chainbackend.timeout to 30s, and will keep an eye on it to see if it happens again.

@carlaKC (Collaborator) commented Oct 9, 2020

Is there any reason not to have the default at 120 seconds?

@alevchuk no reason not to bump the default. It does seem bizarre to me that the call is taking 10 seconds to complete, so we're still investigating to make sure nothing is wrong on our side. If we can't figure anything out, will likely bump the default and see how that goes.

I increased the healthcheck.chainbackend.attempts to 10 and healthcheck.chainbackend.timeout to 30s and will keep an eye on it to see if it happens again

Thanks for the report @sendbitcoin! I'd strongly recommend running lnd with debuglevel=HLCK=debug so that we can see the reason your check is failing. The logging was only bumped to info level in 35a2dbc so you may not see the reason otherwise.

@alevchuk (Author) commented:

Relevant portion of my log:

2020-10-06 04:40:04.822 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:44.822 [DBG] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-10-06 04:40:55.091 [INF] DISC: Broadcasting 459 new announcements in 18 sub batches
2020-10-06 04:41:00.191 [INF] CRTR: Processed channels=0 updates=330 nodes=5 in last 59.999848626s
2020-10-06 04:41:24.823 [CRT] SRVR: Health check: chain backend failed after 3 calls
2020-10-06 04:41:24.823 [INF] SRVR: Sending request for shutdown

@alevchuk (Author) commented:

My lnd, now running with --healthcheck.chainbackend.timeout=120s, started doing a lot of continuous disk writes, and some occasional [DBG] HLCK timeouts showed up:

2020-10-10 15:56:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...
2020-10-11 00:10:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...
2020-10-11 01:10:05.738 [DBG] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 2m0s, backing off for: 30s
...

The kernel is consistently maxed out on iowait:

$ vmstat 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 758488  84440 647044    0    0     0   405  263  536  2  0  0 98  0
 0  1      0 758456  84440 647044    0    0     0   388  230  476  2  0  0 98  0
 0  1      0 758172  84444 647044    0    0     0   403  243  508  2  0  0 98  0
 0  1      0 758140  84452 647044    0    0     0   401  252  512  2  0  0 98  0
 0  1      0 758108  84452 647044    0    0     0   416  241  485  2  0  0 98  0

I checked with atop that the writes are coming from lnd only.

@carlaKC (Collaborator) commented Oct 13, 2020

The health checks don't touch disk at all; they just make a GetBestBlock call to bitcoind (although it's possible that GetBestBlock does). I suspect there's something else up with your node. Are you managing normal operation (sending payments, opening channels, etc.)? This could be related to #4689. You could disable the health check entirely and see whether this problem persists, but I imagine that it will.

Are you running lnd and bitcoind on the same hardware? How much memory are you working with?

@alevchuk (Author) commented:

Looks very related to #4689.

I'm running lnd and bitcoind on separate hosts. 2 GB RAM. The problem reproduces almost every day; I'll get back to you on whether turning off the health check has any effect.

If #4689 is not caused by the health check, then it still makes sense that the health check would report LND unhealthy. It's reasonable because bitcoind cannot be reached, and it seems plausible to me because, in other kinds of software, I've seen client timeouts when there is an I/O issue on the client host.

@carlaKC (Collaborator) commented Oct 13, 2020

I'll get back to you on whether turning off the health check has any effect

Thanks that would be great!

If #4689 is not caused by the health check

Most likely not: we disabled health checks by default in the version they're running (0.11.1-beta.rc5) because we were worried about false positives.

If you're happy to send me a few hours of logs (carla on the lnd slack) I can also have a look and see if there's anything that stands out.

@Roasbeef (Member) commented:

It's reasonable because bitcoind cannot be reached, and it seems plausible to me because, in other kinds of software, I've seen client timeouts when there is an I/O issue on the client host.

Sure, but being unreachable for 2 minutes? The code here is also pretty simple: send the request, then wait for it to come back. I don't think the health checks themselves are causing high I/O, as each is just an RPC call, and lnd makes several of these calls to bitcoind on a regular basis for normal routing operation.

@hpbock commented Oct 14, 2020

I have experienced the same issue since upgrading to 0.11.1-beta.rcX. Since upgrading to rc5 it seems to be gone. I'll keep an eye on it.

@carlaKC (Collaborator) commented Oct 14, 2020

I have experienced the same issue since upgrading to 0.11.1-beta.rcX. Since upgrading to rc5 it seems to be gone. I'll keep an eye on it.

Health checks are disabled in rc5, so you wouldn't see any shutdowns. Were you also seeing health check timeouts?

@emplexity commented Oct 14, 2020

Same issue here, after upgrading from 0.10 to 0.11.99-beta on 13 Oct 2020 (following the security warning of last week), without changing any bitcoind or lnd settings.

The issue: lnd shuts down after a few hours of operation, while you would expect it to run continuously and unattended as in previous versions.

  • bitcoind v0.20.0; bitcoind and lnd on the same machine
  • lnd launched with nohup
  • Running on an Ubuntu 16.04 virtual machine with a 1 TB disk and 14 GB RAM

Attached are two extracts of nohup.out:

  1. the occurrences of the health check warning during the last operation period (approx. 4 hrs)
  2. the sequence that triggered the shutdown and a few of its ensuing events.

In the meantime, I will try the workarounds suggested above.

shutDown.txt
healthCheck.txt

Update 2020.10.17: stable operation of lnd 0.11 since applying the workarounds proposed by @guggero and @carlaKC above.

@floundies commented:

My setup is almost exactly the same as @emplexity, and I was getting the same health check shutdowns after upgrading to those versions. I started lnd with --healthcheck.chainbackend.attempts=0 and it has been steady since (I will report back if it shuts down again). I also wanted to point out that lnd seems to have loaded much faster with chainbackend.attempts=0. I am not sure if this was the result of something else or of something being cached.

But from old log with no flag:
2020-10-18 20:09:01.345 [INF] LTND: Database now open (time_to_open=4m4.826767561s)!

Versus with the flag:
2020-10-18 22:21:02.195 [INF] LTND: Database now open (time_to_open=356.557358ms)!

This long load time for lnd started a version or two ago, but I cannot remember if it was the same version where the health checks started shutting down lnd. It also said "this could take a few minutes" so I assumed the long load time was expected.

@Roasbeef (Member) commented:

@floundies That flag shouldn't affect how long it takes to open the database. There's a set of flags that affect that, but all this new flag does is hit an RPC call periodically.

@Roasbeef (Member) commented:

Another question for those running into this: is lnd the only application that's hitting bitcoind's RPC interface?

@emplexity commented:

lnd is the only application hitting bitcoind

@C-Otto (Contributor) commented Nov 7, 2020

I experience the same issue, and I can reproduce very slow responses to (for example) "getbestblockhash" calls:

$ time bitcoin-cli -conf=/etc/bitcoin/bitcoin.conf getbestblockhash
000000000000000000056cfad3dda5c6ebde81a545f10a2a76c65650779a3dd3

real	0m3.893s
user	0m0.002s
sys	0m0.000s

Sometimes the calls block for minutes (!), and I think this is due to other requests waiting (and blocking). I'll investigate.
I don't think lnd can be blamed for this.

@C-Otto (Contributor) commented Nov 7, 2020

Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call "uptime" does not acquire this lock, and it is extremely fast on my machine (without any hiccups).

Example: https://github.com/bitcoin/bitcoin/blob/867dbeba5f91be15ca0d4a7303a71957ff9a37ad/src/rpc/blockchain.cpp#L225

Related discussion: https://bitcoincore.reviews/16426

I have no idea how to fix this issue, but from an lnd perspective I think increasing the backend check values or disabling the feature would help.
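
A quick way to see the difference on an affected node (assuming bitcoin-cli points at the same bitcoind; timings will vary):

time bitcoin-cli -conf=/etc/bitcoin/bitcoin.conf getbestblockhash   # acquires cs_main
time bitcoin-cli -conf=/etc/bitcoin/bitcoin.conf uptime             # does not acquire cs_main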

@carlaKC (Collaborator) commented Nov 8, 2020

Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call "uptime" does not acquire this lock, and it is extremely fast on my machine (without any hiccups).

Thanks for the info @C-Otto! We suspected something like this would be the case, but didn't actually go look in Core itself. I'm pretty surprised that it blocks for minutes though! I'll switch over to a call that doesn't need the lock (we just need one that we also have for neutrino/btcd, but it's not critical that it's the same endpoint imo). Will get this in for 0.12.

@C-Otto (Contributor) commented Nov 8, 2020

The list of relevant endpoints seems to be defined here:
https://github.com/btcsuite/btcwallet/blob/master/chain/interface.go#L31

  • ❌ Start/Stop/WaitForShutdown/BackEnd: not relevant.
  • ❌ GetBestBlock: "getblockchaininfo"
  • ❌ GetBlock - "getblock"
  • ❌ GetBlockHash - "getblockhash"
  • ❌ IsCurrent - uses "GetBestBlock"
  • ❌ FilterBlocks - uses "GetBlock"
  • ❌ BlockStamp - is passive, does not attempt to contact bitcoind
  • ❌ SendRawTransaction - I guess sending a transaction every so often isn't a good idea :)
  • ❌ Rescan - doesn't seem to be a good idea
  • ❌ NotifyReceived - uses "NotifyBlocks"
  • ❌ NotifyBlocks - uses "GetBestBlock"
  • ❌ Notifications - is passive, does not attempt to contact bitcoind

Based on this I think it's best to add a new endpoint (like "uptime" maybe?). In reality, I think it's best to speed up the bitcoind instance. In my case the (new) storage backend seems to be too slow for meaningful bitcoind operation.
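
If slow storage is suspected, a quick check on the bitcoind host can confirm it (assumes the sysstat package is installed; watch the await and %util columns):

iostat -dx 5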

@C-Otto (Contributor) commented Nov 12, 2020

I'm running c7eea13 but still experience issues with the backend health check.

However, I see this when starting lnd: Version: 0.11.99-beta commit=v0.11.0-beta-307-g9e75c3eadede1816d4be19a7923cd5d0b44b78a0, build=development, logging=default, debuglevel=info

2020-11-12 15:56:35.835 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 15:57:15.835 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 16:58:47.772 [INF] HLCK: Health check: chain backend, call: 1 failed with: health check: chain backend timed out after: 10s, backing off for: 30s
2020-11-12 16:59:27.772 [INF] HLCK: Health check: chain backend, call: 2 failed with: health check: chain backend timed out after: 10s, backing off for: 30s

@C-Otto (Contributor) commented Nov 12, 2020

"getblockchaininfo" is used according to ngrep. @carlaKC could you have a look? Is it a compilation issue? Can I help debug this?

@carlaKC (Collaborator) commented Nov 12, 2020

version: 0.11.99-beta commit=v0.11.0-beta-307-g9e75c3eadede1816d4be19a7923cd5d0b44b78a0

This version looks wrong to me; are you sure you're running the latest compiled version at c7eea13f9?

When I checkout that commit I get the following when I check my version:

./lnd-debug --version
version 0.11.99-beta commit=v0.11.0-beta-436-gc7eea13f95cfa1d5362ad65a1502b70b2d5d6aef-dirty

What's tipping me off is that the value after beta- indicates the number of commits since the last tag, and yours looks too low.
Feel free to ping me on lnd slack btw (carla) if it's easier.

@C-Otto (Contributor) commented Nov 12, 2020

I forgot to run make install; I've now verified that "uptime" is used with the current version. Thanks for the quick help :)

@schildbach commented:

This is still an issue with version 0.17.4. bitcoind startup takes longer than the default 3 health checks allow, lnd shuts down (which in my opinion it should avoid at all costs), the wallet locks again, and I can't pay at restaurants, etc.

@guggero (Collaborator) commented Mar 11, 2024

@schildbach you can configure that:

$ lnd --help | grep healthcheck.chainbackend

      --healthcheck.chainbackend.interval=                   How often to run a health check. (default: 1m0s)
      --healthcheck.chainbackend.attempts=                   The number of calls we will make for the check before failing. Set this value to 0 to disable a check. (default: 3)
      --healthcheck.chainbackend.timeout=                    The amount of time we allow the health check to take before failing due to timeout. (default: 30s)
      --healthcheck.chainbackend.backoff=                    The amount of time to back-off between failed health checks. (default: 2m0s)
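
For a bitcoind that is slow to start, widening the window the check covers should avoid the shutdown; for example (values are illustrative):

lnd --healthcheck.chainbackend.attempts=10 \
    --healthcheck.chainbackend.timeout=1m \
    --healthcheck.chainbackend.backoff=2m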
