Health check: chain backend failed after 3 calls. How to avoid the shutdown? #4669
Comments
There seems to be a problem with the health check: we are investigating some false positives.
Do you have any more logs of the failed attempts? The logic is relatively simple: every N minutes we perform a backend health check; if that fails 3 times, we shut down on the assumption that lnd will be restarted by some hypervisor-type system. Also, the response period before we mark an attempt as failed is rather generous at 10 seconds or so (the default).
The current implementation will cal |
Related: #4671. |
another log:
going to try catch the error message by doing repro again with
If you suspect that this is being caused by a temporary networking glitch, there are a few other config settings you could change:
Thanks! Will be much easier to debug once we know what's going on with the check itself. |
A 10-second timeout seems too small for a default. The average new block arrival time is 600 seconds. Is there any reason not to have the default at 120 seconds? Here is the debug log:
I'm getting this issue: 2020-10-08 08:09:51.192 [CRT] SRVR: Health check: chain backend failed after 3 calls. I increased healthcheck.chainbackend.attempts to 10 and healthcheck.chainbackend.timeout to 30s and will keep an eye on it to see if it happens again.
@alevchuk no reason not to bump the default. It does seem bizarre to me that the call is taking 10 seconds to complete, so we're still investigating to make sure nothing is wrong on our side. If we can't figure anything out, will likely bump the default and see how that goes.
Thanks for the report @sendbitcoin! I'd strongly recommend running lnd with |
relevant portion of my log
my lnd now running with
kernel consistency maxed out on iowaiting:
I checked with |
The health checks don't touch disk at all, just make a Are you running |
Looks very related to #4689 I'm running If #4689 is not caused by the health check, then it still makes sense that the healthcheck would report LND unhealthy. It's reasonable because |
Thanks that would be great!
Most likely not, because we disabled health checks by default in the version they're running (0.11.1-beta.rc5), as we were worried about false positives. If you're happy to send me a few hours of logs (carla on the lnd slack), I can also have a look and see if there's anything that stands out.
Sure, but being unreachable for 2 minutes? The code here is also pretty simple: send the request, then wait for it to come back. I don't think the healthchecks themselves are causing high I/O as it's just an RPC call, and |
I experienced the same issue since upgrading to 0.11.1-beta.rcX. Since upgrade to rc5 it seems to be gone. I'll keep an eye on it. |
Health checks are disabled in rc5, so you wouldn't see any shutdowns. Were you also seeing healthcheck timeouts?
Same issue here after upgrading from 0.10 to 0.11.99-beta on 13 Oct 2020 (following last week's security warning), without changing any bitcoin or lnd settings. The issue: lnd shuts down after a few hours of operation, whereas you would expect it to run continuously unattended as in previous versions.
Attached are two extracts of nohup.out:
Meanwhile, I will try the workarounds suggested above. Update 2020.10.17: stable operation of lnd 0.11 since applying the workarounds proposed by @guggero & @carlaKC above.
My setup is almost exactly the same as @emplexity's, and I was getting the same health check shutdowns after upgrading to those versions. I started lnd with --healthcheck.chainbackend.attempts=0 and it has been steady since (will report back if it shuts down again). I also wanted to point out that lnd seems to have loaded much faster with chainbackend.attempts=0. I am not sure if this was the result of something else or of having something cached. But from old log with no flag: Versus with the flag: This long load time for lnd started a version or two ago, but I cannot remember if it was the same version where the health checks started shutting down lnd. It also said "this could take a few minutes", so I assumed the long load time was expected.
@floundies That flag shouldn't affect how long it takes to open the database. There is a set of flags that affects that, but all this new flag does is hit an RPC call periodically.
Other questions for those running into this: is |
lnd is the only application hitting bitcoind |
I experience the same issue, and I can reproduce very slow responses to (for example) "getbestblockhash" calls:
Sometimes the calls block for minutes (!), and I think this is due to other requests waiting (and blocking). I'll investigate. |
Bitcoin Core acquires a lock (cs_main) at the start of every interesting RPC call (getbestblockhash as an example). The RPC call "uptime" does not acquire this lock, and it is extremely fast on my machine (without any hiccups). Related discussion: https://bitcoincore.reviews/16426 I have no idea how to fix this issue, but from a lnd perspective I think increasing the backend check values or disabling the feature would help. |
Thanks for the info @C-Otto! We suspected something like this would be the case, but didn't actually go and look in core itself. I'm pretty surprised that it blocks for minutes, though! I'll switch over to a call that doesn't need the lock (we just need one that exists for neutrino/btcd as well, but it's not critical that it's the same endpoint imo). Will get this in for 0.12.
The list of relevant endpoints seems to be defined here:
Based on this I think it's best to add a new endpoint (like "uptime" maybe?). In reality, I think it's best to speed up the bitcoind instance. In my case the (new) storage backend seems to be too slow for meaningful bitcoind operation. |
I'm running c7eea13 but still experience issues with the backend health check. However, I see this when starting lnd:
"getblockchaininfo" is used according to ngrep. @carlaKC could you have a look? Is it a compilation issue? Can I help debug this? |
This version looks wrong to me, are you sure you're running the latest compiled version at When I checkout that commit I get the following when I check my version:
What's tipping me off is that the value |
I forgot |
This is still an issue, with version 0.17.4. Bitcoind startup takes longer than the default 3 health checks, lnd shuts down (which in my opinion it should avoid at all costs), wallet locks again, can't pay at restaurants etc. |
@schildbach you can configure that:
Background
My LND node keeps shutting down. It seems to be triggered by a short network failure preventing the node from reaching bitcoind.
Your environment
bitcoind: Bitcoin Core version v0.20.1
lnd and bitcoind are on different servers
Steps to reproduce
Restart LND. Repros every 1 to 2 days.
Expected behaviour
Just keep running.
Actual behaviour
Question
How to avoid the shutdown?