Hermes Causes Osmosis Sentry to Crash When Accessing gRPC #1220

Closed
5 tasks
lightiv opened this issue Jul 21, 2021 · 15 comments
Labels
A: question Admin: further information is requested O: usability Objective: cause to improve the user experience (UX) and ease using the product
Comments

@lightiv

lightiv commented Jul 21, 2021

Crate

Hermes Relayer

Summary of Bug

Hermes causes the Osmosis sentry node to crash when accessing gRPC. gRPC is enabled by default and the Osmosis sentry node runs fine, but when I enable Hermes to access the node, it causes the node to crash. I see that a transaction outputs a very long string of data and the node stops reading blocks.

Version

git version 2.25.1

Steps to Reproduce

For me: install the Osmosis sentry and then configure Hermes to access it. The problem occurs within 5 minutes.

Jul 21 05:35:24 osmosis-sentry-02 cosmovisor[2904]: 5:35AM INF Error reconnecting to peer. Trying again addr={"id":"00c328a33578466c711874ec5ee7ada75951f99a","ip":"35.82.201.64","port":26656} err="auth failure: secret conn failed: read tcp 94.16.108.152:40766->35.82.201.64:26656: i/o timeout" module=p2p tries=4
Jul 21 05:35:25 osmosis-sentry-02 cosmovisor[2904]: 5:35AM INF ABCIQuery data=0A087472616E7366657212096368616E6E656C2D301A0B18FFFFFFFFFFFFFFFFFF01 module=rpc path=/ibc.core.channel.v1.Query/PacketAcknowledgements result={"code":0,"codespace":"","height":"436445","index":"0","info":"","key":null,"log":"","proofOps":null,"value":"CjkKCHRyYW5zZmVyEgljaGFubmVsLTAYASIgCPdVftUYJv4Y2EUSvyTsdQAe268hI6R333KgqfNkCnwKOQoIdHJhbnNmZXISCWNoYW5uZWwtMBgKIiAI91V+1Rgm/hjYRRK/JOx1AB7bryEjpHffcqCp82QKfAo5Cgh0cmFuc2ZlchIJY2hhbm5lbC0wGGQiIAj3VX7VGCb+GNhFEr8k7HUAHtuvISOkd99yoKnzZAp8CjoKCHRyYW5zZmVyEgljaGFubmVsLTAY6AciIAj3VX7VGCb+GNhFEr8k7HUAHtuvISOkd99yoKnzZAp8CjoKCHRyYW5zZmVyEgljaGFubmVsLTAYkE4iIAj3VX7VGCb+GNhFEr8k7HUAHtuvISOkd99yoKnzZAp8CjoKCHRyYW5zZmVyEgljaGFubmVsLTAYkU4iIAj3VX7VGCb+GNhFEr8k7HUAHtuvISOkd99yoKnzZAp8CjoKCHRyYW5zZmVyEgljaGFubmVsLTAYkk4iIAj3VX7VGCb+GNhFEr8k7...........
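The `data=` field in the ABCIQuery line above is the protobuf-encoded request, so it can be decoded to see what was asked of the node. The following is a minimal sketch with a hand-written varint parser; the field numbers (port_id = 1, channel_id = 2, pagination = 3 in the request, limit = 3 in the pagination message) are taken from the public IBC and Cosmos SDK proto definitions, and the helper names are made up for illustration.

```python
def read_varint(buf, pos):
    """Decode one protobuf base-128 varint starting at buf[pos]."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def parse_fields(buf):
    """Return {field_number: value} for varint and length-delimited fields."""
    fields, pos = {}, 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            fields[number], pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (string, bytes, sub-message)
            length, pos = read_varint(buf, pos)
            fields[number] = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return fields


# Request bytes copied from the `data=` field of the ABCIQuery log line.
DATA_HEX = "0A087472616E7366657212096368616E6E656C2D301A0B18FFFFFFFFFFFFFFFFFF01"

request = parse_fields(bytes.fromhex(DATA_HEX))
port_id = request[1].decode()         # assumed field 1: port_id
channel_id = request[2].decode()      # assumed field 2: channel_id
pagination = parse_fields(request[3])  # assumed field 3: pagination
limit = pagination[3]                  # assumed field 3: limit

print(port_id, channel_id, limit == 2**64 - 1)
```

Under those assumptions the request targets port `transfer`, channel `channel-0`, with a page limit equal to the largest unsigned 64-bit value, that is, every packet acknowledgement in a single response.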

Acceptance Criteria

Hermes 6 is able to access gRPC for Osmosis without killing the node


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate milestone (priority) applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@adizere
Member

adizere commented Jul 21, 2021

Hello,

Thanks for opening this issue, seems quite critical!

Could you please provide further info, concretely:

  • how do you run Hermes? Is it hermes start, or some other command?
    • it may be useful to paste somewhere the whole Hermes log file so we can see what might have gone wrong
  • can we also see your Hermes config.toml file?

@jackzampolin

I've also experienced this. Requires restart of the node.

@lightiv
Author

lightiv commented Jul 21, 2021

I run Hermes using a service that starts it with hermes start

Do you have an e-mail address I can send the log and config.toml file to?

@andynog
Contributor

andynog commented Jul 22, 2021

> I've also experienced this. Requires restart of the node.

@jackzampolin are you also using cosmovisor to run the node ?

@ancazamfir
Collaborator

gRPC should not crash full nodes so some fix needs to go into abci/SDK for that.
But we should also fix Hermes so that it does not give the full node a reason to crash. I don't see any logs attached here, but the small snippet seems to indicate that the crash happens on the acknowledgment query. Maybe the output is too big. Hermes uses ibc_proto::cosmos::base::query::pagination::all() in the request. I guess Hermes could limit the number of items in each response by splitting the query into multiple requests (pages).
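The multiple-requests idea can be sketched as a loop that asks for a bounded page, then follows the `next_key` returned with each page until it comes back empty. This is only an illustration of key-based pagination and not Hermes code: `PageRequest`, `fetch_all`, and `fake_node_query` are hypothetical names, and the fake node below simply serves slices of an in-memory list.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple


@dataclass
class PageRequest:
    """Hypothetical stand-in for a paginated query request."""
    key: bytes = b""   # empty key means "start from the beginning"
    limit: int = 100   # maximum number of items per response


# (items, next_key); an empty next_key means there are no more pages.
Page = Tuple[List[int], bytes]


def fetch_all(query: Callable[[PageRequest], Page], page_size: int) -> Iterator[int]:
    """Yield every item, requesting at most `page_size` items per call."""
    key = b""
    while True:
        items, next_key = query(PageRequest(key=key, limit=page_size))
        yield from items
        if not next_key:
            return
        key = next_key


# Fake node for the example: 1000 acknowledgement sequence numbers.
ACKS = list(range(1, 1001))
calls: List[int] = []  # records how many items each response carried


def fake_node_query(req: PageRequest) -> Page:
    start = int.from_bytes(req.key, "big") if req.key else 0
    chunk = ACKS[start:start + req.limit]
    end = start + len(chunk)
    calls.append(len(chunk))
    next_key = end.to_bytes(8, "big") if end < len(ACKS) else b""
    return chunk, next_key


result = list(fetch_all(fake_node_query, page_size=100))
print(len(result), len(calls), max(calls))
```

The trade-off is more round trips in exchange for a fixed upper bound on the size of any single response the node has to build and log.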

@adizere adizere added the O: usability Objective: cause to improve the user experience (UX) and ease using the product label Sep 9, 2021
@adizere adizere added this to the 10.2021 milestone Sep 9, 2021
@adizere adizere added the A: question Admin: further information is requested label Sep 9, 2021
@hu55a1n1
Member

Hi @lightiv, can you provide an update on this, please? Is this still a problem with the latest release?

@lightiv
Author

lightiv commented Sep 21, 2021

Hi. I need to update to the latest binary, which I will do today, and then get back to you. Thanks.

@lightiv
Author

lightiv commented Sep 22, 2021

I am running 7.0.1 and gRPC still kills the node, with this output being the final gasp.

[Image: IBC_OSMO_Death]

@lightiv
Author

lightiv commented Sep 22, 2021

Oh, wait. By the time I finished the above post the console had timed out and when I reconnected the node was running. I will let it go for a bit. The above is Osmosis. I will try Akash also.

@romac
Member

romac commented Sep 28, 2021

@lightiv Are you still seeing the same issue with Hermes 0.7.2? Even though we haven't gotten to using pagination yet, this version should put a much lighter load on the nodes, both at startup and during normal operations.

@lightiv
Author

lightiv commented Sep 28, 2021

I have had this running for a couple of hours against the Akash and Osmosis nodes that were having issues, and they have not become unresponsive. The Akash node is running fine and appears to be keeping up with the current block height. The Osmosis node is crawling and continually slowing down. It will not be able to reach the current block height.

@lightiv
Author

lightiv commented Sep 28, 2021

Oh, sorry, I needed to refresh the page. Let me update to v0.7.2 and get back to you. I am currently on v7.0.1.

@lightiv
Author

lightiv commented Oct 18, 2021

I have upgraded to v0.7.3 and I am not having any issues with the Osmosis node crashing. The problem now is that hermes times out accessing gRPC:

Oct 18 03:23:03 hermes[661828]: Oct 18 03:23:03.701 ERROR ThreadId(01) skipped workers for connection connection-13 on chain osmosis-1, reason:
Oct 18 03:23:03 hermes[661828]:    0: relayer error
Oct 18 03:23:03 hermes[661828]:    1: error in underlying transport when making GRPC call
Oct 18 03:23:03 hermes[661828]:    2: transport error: error trying to connect: tcp connect error: Connection timed out (os error 110)
Oct 18 03:23:03 hermes[661828]:    3: error trying to connect: tcp connect error: Connection timed out (os error 110)
Oct 18 03:23:03 hermes[661828]:    4: tcp connect error: Connection timed out (os error 110)
Oct 18 03:23:03 hermes[661828]:    5: Connection timed out (os error 110)
Oct 18 03:23:03 hermes[661828]: Location:
Oct 18 03:23:03 hermes[661828]:    /home/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/flex-error-0.4.3/src/tracer_impl/eyre.rs:31

I have verified that gRPC is accessible externally:

grpcurl -plaintext 0.0.0.0:9090 list

cosmos.auth.v1beta1.Query
cosmos.bank.v1beta1.Query
cosmos.base.reflection.v1beta1.ReflectionService
cosmos.base.tendermint.v1beta1.Service
cosmos.distribution.v1beta1.Query
cosmos.evidence.v1beta1.Query
cosmos.gov.v1beta1.Query
cosmos.params.v1beta1.Query
cosmos.slashing.v1beta1.Query
cosmos.staking.v1beta1.Query
cosmos.tx.v1beta1.Service
cosmos.upgrade.v1beta1.Query
grpc.reflection.v1alpha.ServerReflection
ibc.applications.transfer.v1.Query
ibc.core.channel.v1.Query
ibc.core.client.v1.Query
ibc.core.connection.v1.Query
osmosis.claim.v1beta1.Query
osmosis.epochs.v1beta1.Query
osmosis.gamm.v1beta1.Query
osmosis.incentives.Query
osmosis.lockup.Query
osmosis.mint.v1beta1.Query
osmosis.poolincentives.v1beta1.Query
testdata.Query
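The failure in the Hermes log is a TCP-level connect timeout (os error 110), which happens before any gRPC traffic is exchanged, so it can be checked with a plain socket. Note also that the grpcurl command above targets 0.0.0.0, which on Linux typically reaches the local machine, so it does not by itself show that the port is reachable from another host. The sketch below uses a hypothetical helper, `can_connect`, that takes an address in the `http://host:port` form and reports whether a TCP connection can be opened within a timeout.

```python
import socket
from urllib.parse import urlsplit


def can_connect(addr: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the host:port in `addr` succeeds."""
    parts = urlsplit(addr)
    try:
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False


# Example call (the address is a placeholder, replace it with the node's own):
# can_connect("http://127.0.0.1:9090")
```

Running this on the machine where Hermes runs, against the node's gRPC address as Hermes is configured to see it, separates a network or firewall problem from a problem in the gRPC service itself.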

@adizere
Member

adizere commented Oct 20, 2021

Can you please share with us the relevant section of your Hermes config.toml? It should look similar to this:

https://github.com/informalsystems/ibc-rs/blob/91a4d1d0db84fdcc7f780a775f8f25eb57412cf4/config.toml#L83

Also, what is the output of hermes health-check ?

@adizere adizere modified the milestones: 10.2021, 12.2021 Oct 20, 2021
@lightiv
Author

lightiv commented Oct 20, 2021

The issues I was having have been resolved by installing everything on one box. I am now relaying.

@lightiv lightiv closed this as completed Oct 20, 2021
@adizere adizere modified the milestones: 12.2021, 10.2021 Oct 21, 2021
7 participants