Node crashes with "Failed to write response" #6147

Closed
RiccardoM opened this issue Feb 19, 2021 · 18 comments

Comments

@RiccardoM
Contributor

Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source):

tendermint: ""
abci: 0.17.0
blockprotocol: 11
p2pprotocol: 8

ABCI app (name for built-in, URL for self-written if it's publicly available):
Desmos v0.15.1

Environment:

  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="20.04.1 LTS (Focal Fossa)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 20.04.1 LTS"
    VERSION_ID="20.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=focal
    UBUNTU_CODENAME=focal
    
  • Install tools:
  • Others:

What happened:
Yesterday, one of our chain nodes stopped with the error "Failed to write response" for no apparent reason. This is not the first time this has happened, and we have yet to identify why.

What you expected to happen:
The node should not crash

Have you tried the latest version:
No

How to reproduce it (as minimally and precisely as possible):
I don't know yet.

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):
https://pastebin.com/rbaw6FVH

Config (you can paste only the changes you've made):

pruning = "nothing"
@tessr
Contributor

tessr commented Feb 19, 2021

Thanks for this report. One of the first things I notice is that the Tendermint version is missing 🤔 We'll look into that. Did you use tendermint version to get the version?

Also, I see that Desmos v0.15.1 is running Cosmos SDK v0.40. There's a new SDK release series out which fixes some halting bugs. The latest release is v0.41.3. I recommend you try upgrading first and if this is still happening we can try to take a closer look.

@RiccardoM
Contributor Author

Thanks for this report. One of the first things I notice is that the Tendermint version is missing. We'll look into that. Did you use tendermint version to get the version?

I used desmos tendermint version to get the version

Also, I see that Desmos v0.15.1 is running Cosmos SDK v0.40. There's a new SDK release series out which fixes some halting bugs. The latest release is v0.41.3. I recommend you try upgrading first and if this is still happening we can try to take a closer look.

Thanks, we'll be sure to update at our next chain upgrade.

@tac0turtle
Contributor

tac0turtle commented Feb 19, 2021

@RiccardoM
Hey,

if you'd like to get the Tendermint version in here:

tendermint: ""
abci: 0.17.0
blockprotocol: 11
p2pprotocol: 8

add:

VERSION := $(shell go list -m github.com/tendermint/tendermint | sed 's:.* ::')
LD_FLAGS = -X github.com/tendermint/tendermint/version.TMCoreSemVer=$(VERSION)

to your makefile.
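
For anyone building outside of make, the two Makefile lines above could be approximated in plain shell roughly as follows. This is only a sketch: the helper names `extract_version` and `tm_ldflags` are made up for illustration, and the commented `go build` invocation assumes it is run from the application's module root with the Go toolchain installed.

```shell
# `go list -m <module>` prints "<module path> <version>".
# The sed expression from the Makefile keeps only the text after the last space.
extract_version() {
  sed 's:.* ::'
}

# Build the -X linker flag that sets Tendermint's TMCoreSemVer variable
# to the version passed as the first argument.
tm_ldflags() {
  printf '%s%s' '-X github.com/tendermint/tendermint/version.TMCoreSemVer=' "$1"
}

# Example usage (illustrative, requires Go and a module that depends on Tendermint):
#   VERSION=$(go list -m github.com/tendermint/tendermint | extract_version)
#   go build -ldflags "$(tm_ldflags "$VERSION")" ./...
```

If the flag is picked up, the `tendermint:` field in the version output should no longer be empty.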

@tessr
Contributor

tessr commented Feb 19, 2021

Whoa, good tip, @marbar3778 - do we have that documented anywhere?

@tac0turtle
Contributor

Whoa, good tip, @marbar3778 - do we have that documented anywhere?

I think we only put it in the upgrading doc, but doesn't seem like anyone read it 😃

@tac0turtle
Contributor

do we have that documented anywhere?

added here: #6151

@melekes
Contributor

melekes commented Feb 22, 2021

The node should not crash

I don't see any stacktrace. Are you sure the node has crashed and not simply hung?

if crashed

Could you paste the stacktrace of the panic that led to the crash? Usually it's the last line of the log / stdout.

if hung

Do you have a goroutine list? Note that you can get it by killing the frozen node with kill -6 <PID>, or by using tendermint debug kill <pid> </path/to/out.zip> --home=</path/to/app.d> (see https://docs.tendermint.com/master/tools/debugging/pro.html).
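
The kill -6 route could look roughly like the sketch below. A Go binary that receives SIGABRT (signal 6) exits with a dump of its goroutine stacks on standard error, so the dump is only useful if stderr is being captured somewhere. The helper name `abort_for_dump`, the log path, and the way the node is started are assumptions for illustration and do not come from the thread.

```shell
# Send SIGABRT (signal 6) to a hung process.
# A Go program reacts by exiting and writing all goroutine stacks to its stderr.
abort_for_dump() {
  kill -6 "$1"
}

# Example usage (illustrative start command and log path):
#   desmos start 2> "$HOME/desmos-stderr.log" &
#   NODE_PID=$!
#   ... wait until the node hangs ...
#   abort_for_dump "$NODE_PID"
#   tail -n 200 "$HOME/desmos-stderr.log"   # the goroutine dump is at the end
```

If the node runs under a service manager instead, the dump ends up wherever that manager stores the service's stderr.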

@RiccardoM
Contributor Author

@melekes You're right, it hangs. I'm now running tendermint debug kill but it also seems to be hanging 😅

@melekes
Contributor

melekes commented Feb 22, 2021

kill -6 <PID> works all the time

@RiccardoM
Contributor Author

kill -6 <PID> works all the time

# tendermint debug kill 1858129 ~/debug.zip --home=~/.desmos
I[2021-02-22|10:05:50.012] getting node status...                       
I[2021-02-22|10:05:50.021] getting node network info...                 
I[2021-02-22|10:05:50.030] getting node consensus state...              
I[2021-02-22|10:05:50.050] copying node WAL...                          
ERROR: stat ~/.desmos/data/cs.wal/wal: no such file or directory
# ls ~/.desmos/data
application.db  blockstore.db  cs.wal  evidence.db  priv_validator_state.json  snapshots  state.db  tx_index.db
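
One possible reading of the error above, which nobody in the thread confirms, is that the shell passed `--home=~/.desmos` through with a literal `~`: in bash and other POSIX-style shells a tilde is only expanded at the start of a word, and the path in the error message still contains it. A minimal sketch of the difference, with the re-run command shown only as a commented assumption:

```shell
# "~" in the middle of an ordinary argument is not expanded, so the program
# receives the flag exactly as typed (quoted here to keep it literal in any shell).
literal_flag='--home=~/.desmos'

# Using $HOME instead gives the program an absolute path.
expanded_flag="--home=${HOME}/.desmos"

printf '%s\n' "$literal_flag" "$expanded_flag"

# Hypothetical re-run with an explicit home directory:
#   tendermint debug kill <PID> "$HOME/debug.zip" --home="$HOME/.desmos"
```

This would not rule out a separate bug in the debug kill command itself.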

@melekes
Contributor

melekes commented Feb 22, 2021

Sorry, maybe I should've been more specific. I meant the Linux kill command: https://linux.die.net/man/1/kill

@RiccardoM
Contributor Author

Sorry, maybe I should've been more specific. I meant the Linux kill command: https://linux.die.net/man/1/kill

Yeah, I've run the kill -6 command. Then I restarted the node and tried tendermint debug kill, but I got that error.

@melekes
Contributor

melekes commented Mar 1, 2021

Yeah, I've run the kill -6 command.

cool. so what was the stacktrace?

@RiccardoM
Contributor Author

cool. so what was the stacktrace?

I could not get any stacktrace. I only used that command to kill the service, since tendermint debug kill was not working, and it did not give me a stacktrace. The node just stopped without any error, and the only way to resolve it was to kill or restart it.

@alexanderbez
Contributor

Since it hangs, we'll need to see the stacktrace/goroutine list. I fixed the debug kill command, but I'm not sure if that landed in a point release or not.

@tessr
Contributor

tessr commented Mar 3, 2021

Let's figure out if the debug kill command was fixed or not. @alexanderbez can you point me to the commit/PR where you fixed it, and I can see if it was released? Otherwise we can backport and include it in 0.34.9.

@alexanderbez
Contributor

Here is the PR. I didn't add a backport label, so I don't think it exists in any release yet.

@melekes
Contributor

melekes commented Mar 9, 2021

Closing as duplicate of #6184

melekes closed this as completed Mar 9, 2021