Archive node missing blocks when under heavy gRPC pressure #8602

Closed

RiccardoM opened this issue Feb 17, 2021 · 14 comments
Labels
C: comet C: gRPC Issues and PRs related to the gRPC service and HTTP gateway.

Comments

@RiccardoM
Contributor

RiccardoM commented Feb 17, 2021

Summary of Bug

When under heavy gRPC pressure (a lot of requests being made), a full node can start lagging behind in block validation.

Version

v0.40.1

Steps to Reproduce

  1. Start a full node with pruning = "nothing"
  2. Start performing a lot of gRPC requests (around 100 per block); see the sketch after this list
  3. The node will slowly start to lag behind in block syncing
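
For reference, a minimal Go sketch of step 2, assuming a local node exposing the default gRPC port (9090) and using the staking Validators query as a stand-in for the actual requests:

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	"google.golang.org/grpc"
)

func main() {
	// Assumed: node gRPC server listening on the default port.
	conn, err := grpc.Dial("localhost:9090", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := stakingtypes.NewQueryClient(conn)

	for {
		var wg sync.WaitGroup
		for i := 0; i < 100; i++ { // ~100 requests per block
			wg.Add(1)
			go func() {
				defer wg.Done()
				if _, err := client.Validators(context.Background(), &stakingtypes.QueryValidatorsRequest{}); err != nil {
					log.Println("query failed:", err)
				}
			}()
		}
		wg.Wait()
		time.Sleep(6 * time.Second) // roughly one block time
	}
}
```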

Context

We are currently developing BDJuno, a tool that listens to the chain state and parses the data into a PostgreSQL database. In order to do so, it acts in two ways at the same time:

  1. Listens for new blocks
  2. Parses all old blocks

For each block, it then reads the different modules' states and stores them inside the PostgreSQL database. In other words, we take a snapshot of the state at each block and store it. To do so, we use gRPC to get all the data that can change from one block to another (i.e. delegations, unbonding delegations, redelegations, staking commissions, etc.).

As we also need to parse old blocks and get the state at very old heights, we set up an archive node with pruning = "nothing".
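
A minimal sketch of such a height-pinned query, assuming the SDK's x-cosmos-block-height gRPC metadata header and placeholder endpoint/height values:

```go
package main

import (
	"context"
	"fmt"
	"log"

	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

func main() {
	// Assumed archive-node gRPC endpoint.
	conn, err := grpc.Dial("localhost:9090", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Pin the query to an old block height via gRPC metadata (header name assumed).
	ctx := metadata.AppendToOutgoingContext(context.Background(), "x-cosmos-block-height", "100000")

	res, err := stakingtypes.NewQueryClient(conn).Validators(ctx, &stakingtypes.QueryValidatorsRequest{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("validators at height 100000:", len(res.Validators))
}
```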

When we first started our parser, everything was working properly. The node was able to keep up with syncing new blocks and answering gRPC calls.

Recently, however, we noticed that the node started to lag behind the chain state and was over 500 blocks behind. So, we stopped the parser and let the node catch up with the chain state again. Then, we restarted the parser. One week later, the node is once again more than 1,000 blocks behind the current chain height.

Note
I have no idea whether this happens only because pruning is set to nothing. However, I believe this should be investigated, as it might result in some tools (e.g. explorers) stalling nodes in the future if too many requests are made to them. It could even be exploited as a DDoS attack against validator nodes if it turns out to also happen to nodes that have the pruning option set to default or everything.


For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@tac0turtle
Member

I believe this is a tendermint issue. The RPC is blocking and causes consensus to slow down. This is a known issue and why we recommend validators not expose their rpc to the public network.

@RiccardoM
Contributor Author

I believe this is a tendermint issue. The RPC is blocking and causes consensus to slow down. This is a known issue and why we recommend validators not expose their rpc to the public network.

Are you referring to the RPC or gRPC? Because we noticed this problem only when querying using gRPC. When we only use RPC, it has no problems.

@tac0turtle
Member

All requests in the SDK are routed through Tendermint. Each request goes through the abci_query ABCI method.
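
A minimal sketch of that routing, assuming a local Tendermint RPC endpoint and a staking query as the example: the gRPC method name becomes the abci_query path and the protobuf-encoded request becomes the data.

```go
package main

import (
	"context"
	"fmt"
	"log"

	stakingtypes "github.com/cosmos/cosmos-sdk/x/staking/types"
	rpchttp "github.com/tendermint/tendermint/rpc/client/http"
)

func main() {
	// Assumed local Tendermint RPC endpoint.
	node, err := rpchttp.New("tcp://localhost:26657", "/websocket")
	if err != nil {
		log.Fatal(err)
	}

	// Protobuf-encode the gRPC request message.
	reqBz, err := (&stakingtypes.QueryValidatorsRequest{}).Marshal()
	if err != nil {
		log.Fatal(err)
	}

	// The same query, sent through Tendermint's abci_query endpoint.
	res, err := node.ABCIQuery(context.Background(), "/cosmos.staking.v1beta1.Query/Validators", reqBz)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("abci_query response code:", res.Response.Code)
}
```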

@RiccardoM
Contributor Author

All requests in the SDK are routed through Tendermint. Each request goes through the abci_query ABCI method.

Ok thanks. Is there an issue opened in Tendermint about this? Maybe we can link it here for future reference

@tac0turtle
Member

There doesn't seem to be one; it's also a mix of multiple issues. Do you want to open an issue that links to this one?

@amaury1093 amaury1093 added C: gRPC Issues and PRs related to the gRPC service and HTTP gateway. C: comet labels Feb 17, 2021
@alexanderbez
Contributor

That still doesn't describe it @marbar3778. Why do RPC and legacy API endpoints work "fine", i.e. no regressions, yet gRPC slows down nodes considerably?

@tac0turtle
Member

I can reproduce this on Tendermint RPC as well. It's a bit harder than with gRPC, but still present. gRPC was built to handle concurrent requests, but I don't think any part of our stack can handle concurrent requests at high volume.

To reproduce with Tendermint:

  • spin up two nodes using the kvstore app.
  • use tm-load-test on the node that isn't the validator
  • you may need two instances of tm-load-test
  • observe the node falling behind

@alexanderbez
Contributor

I'm curious why this is so exacerbated by gRPC then, which is supposed to be more efficient? Why did block explorers and clients never report such issues for RPC and the legacy API?

@tac0turtle
Member

I'm curious why this is so exacerbated by gRPC then, which is supposed to be more efficient?

It would be more efficient in almost every way if Tendermint were not used as a global mutex. Right now all calls are routed through Tendermint, and the known mutex contention when using RPC is being felt.

Why did block explorers and clients never report such issues for RPC and the legacy API?

I am guessing no one was making so many requests per block. This has been a known issue in Tendermint for as long as I can remember. This is one of the core reasons we tell people to not expose their RPC endpoints to the public.

@alexanderbez
Contributor

I am guessing no one was making so many requests per block. This has been a known issue in Tendermint for as long as I can remember. This is one of the core reasons we tell people to not expose their RPC endpoints to the public.

They were, though. Juno, for example, did this w/o slowing down the connected node at all. Block explorers continuously call the RPC to fetch and index data into external data sources.

@aaronc
Member

aaronc commented Mar 9, 2021

Can someone from our team investigate whether there has indeed been a performance regression with gRPC related to these cases? My guess is that it's likely not gRPC per se, but something else in the query handling... Can you triage, @clevinson?

@tac0turtle
Member

I think #10045 may help out. gRPC is natively concurrent, but all the queries are queued behind a single mutex. Tendermint 0.34.13 makes this mutex an RWMutex, but with the mentioned PR, gRPC requests should no longer need to be routed through Tendermint. @RiccardoM would love to see if the PR helps.
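
For illustration, a generic Go sketch of the Mutex-to-RWMutex change (names here are illustrative, not Tendermint's actual internals): read-only queries share a read lock, while commits take the exclusive write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// store is an illustrative state store guarded by an RWMutex.
type store struct {
	mtx  sync.RWMutex
	data map[string][]byte
}

// Query takes only a read lock, so many concurrent queries can proceed in parallel.
func (s *store) Query(key string) []byte {
	s.mtx.RLock()
	defer s.mtx.RUnlock()
	return s.data[key]
}

// Commit takes the write lock, blocking queries only while state is updated.
func (s *store) Commit(key string, value []byte) {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	s.data[key] = value
}

func main() {
	s := &store{data: map[string][]byte{}}
	s.Commit("height", []byte("100"))
	fmt.Println(string(s.Query("height")))
}
```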

@github-actions github-actions bot removed the stale label Sep 3, 2021
@faddat
Contributor

faddat commented Sep 21, 2021

I am almost certain this is related in some way to

cosmos/gaia#972
cosmos/gaia#704

...and I've definitely seen similar behavior to this on any node I've used for relaying.

@tac0turtle
Member

Closing this for now since gRPC is no longer routed through Tendermint.
