Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default ordering in gRPC responses assumes lexicographic values of Height, not numerical value #1399

Open
adizere opened this issue May 20, 2022 · 6 comments
Labels
client-UX needs discussion Issues that need discussion before they can be worked on type: bug Something isn't working as expected

Comments

@adizere
Copy link

adizere commented May 20, 2022

Anca pointed out that the default ordering in gRPC responses works on lexicographic value of Height, not on their numerical value.

Here is an example (trimmed from an actual query). If the chain has states for heights:
830, 821, 766, 222, 104, 76, 30, 25, 20, 10
The gRPC query (QueryConsensusStatesRequest) returns:
10, 104, 20, 222, 25, 30, 76, 766, 8, 821, 830

Originally posted by @adizere in #798 (comment)

@crodriguezvega
Copy link
Contributor

Thank you, @adizere and Anca!

@crodriguezvega crodriguezvega added type: bug Something isn't working as expected client-UX labels May 20, 2022
@catShaark
Copy link
Contributor

I suggest we change how we index ConsensusState in the store, I've posted an issue to describe my proposal.

@catShaark
Copy link
Contributor

catShaark commented Jun 7, 2022

I suggest we change how we index ConsensusState in the store, I've posted an issue to describe my proposal.

So It turns out we can't do that as @AdityaSripal explain here, he suggested we instead use the metadata stored in tendermint client that does store the consensus state in numbering order. I think that's the best way to kill this problem. Here is what I think we should do in detail :

  • so the metadata that @AdityaSripal mentioned is all the ConsensusState stored in the client store of each of 07-tendermint clients under the key "iterateConsensusStates". This is managed only by 07-tendermint client.
  • add this "iterateConsensusStates" key to keys.go in 02-client/types
  • add necessary methods to 02-client keeper for setting, getting and iterating ConsensusState with this key (like what 07-client is doing currently).
  • remove functions associated with that key in 07-tendermint client ?
  • now 02-client will mange the work of setting ConsensusState with that key instead of 07-tendermint.

plz let me know what you guys think.

@AdityaSripal
Copy link
Member

Thanks for the summary and suggestions @catShaark

add this "iterateConsensusStates" key to keys.go in 02-client/types

I'm a bit hesitant to do this since it's very tendermint client specific. Ideally we get to a point where we can easily upgrade the 24-host keys without major disruptions. (Connection Upgradability 👀 ). So upgrading core IBC to remove this technical debt will eventually be possible but is not a high priority for us at the moment.

Given that, i'd prefer the least disruptive short-term solution to get in for now.

Isn't it feasible to just get the consensus states in lexicographical order and then do a numerical sort in the rpc handler itself? If that's not too difficult, that might be fine to do for now.

If that isn't feasible then we can look to the next "least disruptive" solution.

Also, would be good to understand what if any are the concrete negative effects of this issue.

cc: @colin-axner @crodriguezvega

@crodriguezvega
Copy link
Contributor

Isn't it feasible to just get the consensus states in lexicographical order and then do a numerical sort in the rpc handler itself? If that's not too difficult, that might be fine to do for now.

@catShaark and I discussed this a few weeks ago and the problem is that even if you order in the handler itself the consensusStateHeights that you retrieve from the store here this would only order the results for a given page (assuming that there are many results and you need pagination). So you might order the results returned for the page, but those results are not ordered globally for the whole set of results potentially available.

Also, would be good to understand what if any are the concrete negative effects of this issue.

@adizere can you (or maybe Anca) please explain us how this problem affects hermes? And how bad would it be if this cannot be solved until we have connection upgradability, as @AdityaSripal mentions in his comment above as the preferred solution?

@crodriguezvega crodriguezvega added the needs discussion Issues that need discussion before they can be worked on label Jul 10, 2022
@adizere
Copy link
Author

adizere commented Jul 15, 2022

please explain us how this problem affects hermes?

This affects Hermes in the sense that whenever the relayer does the QueryConsensusStatesRequest it puts a lot of pressure on the full node. This translated into operators needing very expensive machines (to handle this query) and instability of the node itself (when under high query pressure). More details on the production requirements follow.

The relayer needs to:

  1. as part of misbehavior checking: the relayer traverses the consensus states of a client and checks each of them; this traversal of consensus states is usually done in reverse order, starting from the latest one going back towards the oldest state (checking each state that is within the trusting period)

  2. as part of client upgrade and client update CLIs: the relayer needs a trusted height (a height at which that client has a consensus state installed). to resolve this trusted height, it needs to query all of the client's consensus states and traverse the list in reverse order

  3. (most important for production) as part of constructing proofs during normal-case packet relaying: if there are concurrent relayers updating a client, then Hermes cannot use the client's latest height (because another relayer has updated the client in the meantime), and it consenquently has to resolve the trusted height (similar to example 2 above). to resolve the trusted height, it again needs to fetch consensus states and traverse them in reverse order of their height until it finds one that is smaller than X (X is the height at which Hermes wants to construct proofs)

if you find it useful, the requirements in all of these examples are as follows:

  • the relayer needs to fetch consensus states (for example 1)
  • fetch only consensus state heights (for example 2 and 3)
  • whether it is full consensus state or just heights, the data needed is in reverse order of their height (for all 3 examples)
  • never needs consensus states which are outside of trusting period (the relayer cannot use them if returned as part of a query response)
  • needs a single consensus state height which is smaller than a certain value X (for resolving trusted height for examples 2 and 3)

For complete context, I would add that the QueryConsensusStatesRequest gRPC request is one of the most expensive calls that the relayers are doing in production. I did some quick benchmarks (in Sept last year) and this query was one order of magnitude slower than other client queries (it was 10 seconds I think, while most queries were at most 1 second on cheap machines). It is a major cause for pressure on full nodes (and costs, because full nodes have to be high-powered nodes that support high query volume). For reference, one of the most impactful optimization we did which improved the stability of relaying in Sept last year was related to skipping this gRPC request (informalsystems/hermes#1379).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
client-UX needs discussion Issues that need discussion before they can be worked on type: bug Something isn't working as expected
Projects
None yet
Development

No branches or pull requests

4 participants