Archive node missing blocks when under heavy gRPC pressure #8602

Closed

RiccardoM opened this issue Feb 17, 2021 · 14 comments
Labels
C: comet C: gRPC Issues and PRs related to the gRPC service and HTTP gateway.

Comments

@RiccardoM
Contributor

RiccardoM commented Feb 17, 2021

Summary of Bug

When under heavy gRPC pressure (a lot of requests being made), a full node can start lagging behind in block validation.

Version

v0.40.1

Steps to Reproduce

  1. Start a full node with pruning = "nothing"
  2. Start performing a lot of gRPC requests (around 100 per block); see the sketch after this list
  3. The node will slowly start to lag behind in block syncing
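
For reference, a minimal Go sketch of step 2, assuming a local node exposing the default gRPC port (9090) and using the staking Validators query as a stand-in for the actual requests:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	"google.golang.org/grpc"
)

func main() {
	// Assumed: node gRPC server listening on the default port.
	conn, err := grpc.Dial("localhost:9090", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := stakingtypes.NewQueryClient(conn)

	for {
		var wg sync.WaitGroup
		for i := 0; i < 100; i++ { // ~100 requests per block
			wg.Add(1)
			go func() {
				defer wg.Done()
				if _, err := client.Validators(context.Background(), &stakingtypes.QueryValidatorsRequest{}); err != nil {
					log.Println("query failed:", err)
				}
			}()
		}
		wg.Wait()
		time.Sleep(6 * time.Second) // roughly one block time
	}
}
```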

Context

We are currently developing BDJuno, a tool that listens to the chain state and parses the data into a PostgreSQL database. In order to do so, it acts in two ways at the same time:

  1. Listens for new blocks
  2. Parses all old blocks

For each block, it then reads the different modules' states and stores them inside the PostgreSQL database. In other words, we take a snapshot of the state at each block and store it. To do so, we use gRPC to get all the data that can change from one block to another (i.e. delegations, unbonding delegations, redelegations, staking commissions, etc.).

As we also need to parse old blocks and get the state at very old heights, we set up an archive node with pruning = "nothing".
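
A minimal sketch of such a height-pinned query, assuming the SDK's x-cosmos-block-height gRPC metadata header and placeholder endpoint/height values:

```go
package main

import (
	"context"
	"fmt"
	"log"

	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

func main() {
	// Assumed archive-node gRPC endpoint.
	conn, err := grpc.Dial("localhost:9090", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Pin the query to an old block height via gRPC metadata (header name assumed).
	ctx := metadata.AppendToOutgoingContext(context.Background(), "x-cosmos-block-height", "100000")

	res, err := stakingtypes.NewQueryClient(conn).Validators(ctx, &stakingtypes.QueryValidatorsRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("validators at height 100000:", len(res.Validators))
}
```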

When we first started our parser, everything was working properly. The node was able to keep up with syncing new blocks and answering gRPC calls.

Recently, however, we noticed that the node started to lag behind the chain state and was over 500 blocks behind. So, we stopped the parser and let the node catch up with the chain state again. Then, we restarted the parser. One week later, the node is once again more than 1,000 blocks behind the current chain height.

Note
I have no idea whether this happens only because pruning is set to nothing. However, I believe this should be investigated, as it might result in some tools (e.g. explorers) stalling nodes in the future if too many requests are made to them. It could even be exploited as a DDoS attack against validator nodes if it turns out to also happen to nodes that have the pruning option set to default or everything.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@tac0turtle
Member

I believe this is a tendermint issue. The RPC is blocking and causes consensus to slow down. This is a known issue and why we recommend validators not expose their rpc to the public network.

@RiccardoM
Contributor Author

I believe this is a tendermint issue. The RPC is blocking and causes consensus to slow down. This is a known issue and why we recommend validators not expose their rpc to the public network.

Are you referring to the RPC or gRPC? Because we noticed this problem only when querying using gRPC. When we only use RPC, it has no problems.

@tac0turtle
Member

All requests in the SDK are routed through Tendermint. Each request goes through the abci_query ABCI method.
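
A minimal sketch of that routing, assuming a local Tendermint RPC endpoint and a staking query as the example: the gRPC method name becomes the abci_query path and the protobuf-encoded request becomes the data.

```go
package main

import (
	"context"
	"fmt"
	"log"

	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	rpchttp "github.com/tendermint/tendermint/rpc/client/http"
)

func main() {
	// Assumed local Tendermint RPC endpoint.
	node, err := rpchttp.New("tcp://localhost:26657", "/websocket")
	if err != nil {
		log.Fatal(err)
	}

	// Protobuf-encode the gRPC request message.
	reqBz, err := (&stakingtypes.QueryValidatorsRequest{}).Marshal()
	if err != nil {
		log.Fatal(err)
	}

	// The same query, sent through Tendermint's abci_query endpoint.
	res, err := node.ABCIQuery(context.Background(), "/cosmos.staking.v1beta1.Query/Validators", reqBz)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("abci_query response code:", res.Response.Code)
}
```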

@RiccardoM
Contributor Author

All requests in the SDK are routed through Tendermint. Each request goes through the abci_query ABCI method.

Ok thanks. Is there an issue opened in Tendermint about this? Maybe we can link it here for future reference

@tac0turtle
Member

There doesn't seem to be one; it's also a mix of multiple issues. Do you want to open an issue that links to this one?

@amaury1093 amaury1093 added C: gRPC Issues and PRs related to the gRPC service and HTTP gateway. C: comet labels Feb 17, 2021
@alexanderbez
Contributor

That still doesn't describe it @marbar3778. Why do RPC and legacy API endpoints work "fine", i.e. no regressions, yet gRPC slows down nodes considerably?

@tac0turtle
Member

I can reproduce this on Tendermint RPC as well. It's a bit harder than with gRPC, but still present. gRPC was built to handle concurrent requests, but I don't think any part of our stack can handle concurrent requests at high volume.

To reproduce with Tendermint:

  • spin up two nodes using the kvstore app.
  • use tm-load-test on the node that isn't the validator
  • you may need two instances of tm-load-test
  • observe the node falling behind

@alexanderbez
Contributor

I'm curious why this is so exacerbated by gRPC then, which is supposed to be more efficient? Why did block explorers and clients never report such issues for RPC and the legacy API?

@tac0turtle
Member

I'm curious why this is so exacerbated by gRPC then, which is supposed to be more efficient?

It would be more efficient in almost every way if Tendermint were not used as a global mutex. Right now all calls are routed through Tendermint, and the known mutex contention when using RPC is being felt.

Why did block explorers and clients never report such issues for RPC and the legacy API?

I am guessing no one was making so many requests per block. This has been a known issue in Tendermint for as long as I can remember. This is one of the core reasons we tell people to not expose their RPC endpoints to the public.

@alexanderbez
Contributor

I am guessing no one was making so many requests per block. This has been a known issue in Tendermint for as long as I can remember. This is one of the core reasons we tell people to not expose their RPC endpoints to the public.

They were, though. Juno, for example, did this w/o slowing down the connected node at all. Block explorers continuously call the RPC to fetch and index data into external data sources.

@aaronc
Member

aaronc commented Mar 9, 2021

Can someone from our team investigate whether there has indeed been a performance regression with gRPC related to these cases? My guess is that it's likely not gRPC per se, but something else in the query handling... Can you triage, @clevinson?

@tac0turtle
Member

I think #10045 may help out. gRPC is natively concurrent, but all the queries are queued behind a single mutex. Tendermint 0.34.13 makes this mutex an RWMutex, but with the mentioned PR, gRPC requests should no longer need to be routed through Tendermint. @RiccardoM would love to see if the PR helps.
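
For illustration, a generic Go sketch of the Mutex-to-RWMutex change (names here are illustrative, not Tendermint's actual internals): read-only queries share a read lock, while commits take the exclusive write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// store is an illustrative state store guarded by an RWMutex.
type store struct {
	mtx  sync.RWMutex
	data map[string][]byte
}

// Query takes only a read lock, so many concurrent queries can proceed in parallel.
func (s *store) Query(key string) []byte {
	s.mtx.RLock()
	defer s.mtx.RUnlock()
	return s.data[key]
}

// Commit takes the write lock, blocking queries only while state is updated.
func (s *store) Commit(key string, value []byte) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.data[key] = value
}

func main() {
	s := &store{data: map[string][]byte{}}
	s.Commit("height", []byte("100"))
	fmt.Println(string(s.Query("height")))
}
```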

@github-actions github-actions bot removed the stale label Sep 3, 2021
@faddat
Contributor

faddat commented Sep 21, 2021

I am almost certain this is related in some way to

cosmos/gaia#972
cosmos/gaia#704

...and I've definitely seen similar behavior to this on any node I've used for relaying.

@tac0turtle
Member

Closing this for now since gRPC is no longer routed through Tendermint.
