
Use application latest height in query_status #2021

Merged: 11 commits into master from anca/app_latest_height, Apr 28, 2022
Conversation

@ancazamfir ancazamfir commented Mar 28, 2022

Closes: #1970

Description


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@ancazamfir ancazamfir changed the title Anca/app latest height Use application latest height in query_status Mar 28, 2022
Comment on lines 1234 to 1258
/// Query the application status
fn query_application_status(&self) -> Result<StatusResponse, Error> {
    crate::time!("query_application_status");
    crate::telemetry!(query, self.id(), "query_application_status");

    // query the chain status
    let status = self.chain_status()?;

    // query the application status
    let abci_status = self
        .block_on(self.rpc_client.abci_info())
        .map_err(|e| Error::rpc(self.config.rpc_addr.clone(), e))?;

    // has the application been updated with the latest block?
    let time = if abci_status.last_block_height == status.sync_info.latest_block_height {
        // yes, use the chain's latest block time
        status.sync_info.latest_block_time
    } else {
        // no, retrieve the time of the header at the application's latest height
        self.block_on(self.rpc_client.commit(abci_status.last_block_height))
            .map_err(|e| Error::rpc(self.config.rpc_addr.clone(), e))?
            .signed_header
            .header
            .time
    };
Collaborator Author (ancazamfir)
This is the main change in this PR. It requires 3 RPCs instead of one, and this happens when handling every IBC send-packet event, with an impact on relayer performance and extra churn on the full node. So I am a bit hesitant to move this forward until performance tests are done and/or we can get all the info in one RPC.
I opened tendermint/tendermint#8248 to get input from the Tendermint team.
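The branching in the snippet above (use the chain's latest block time only when the application has caught up with the latest block) can be sketched outside of Hermes. This is an illustrative Python model, not the Hermes API; fetch_header_time stands in for the /commit RPC at a given height:

```python
def query_application_status(chain_height, chain_time, abci_height, fetch_header_time):
    """Pick the (height, time) pair the application has actually committed.

    chain_height/chain_time come from /status, abci_height from /abci_info,
    and fetch_header_time(h) models a /commit query at height h.
    """
    if abci_height == chain_height:
        # The application has caught up with the latest block:
        # the chain's latest block time is valid at this height.
        return abci_height, chain_time
    # Otherwise the application lags behind; use the time of the
    # header at the application's latest height instead.
    return abci_height, fetch_header_time(abci_height)
```

When the heights diverge, the extra RPC for the header time is what adds the cost discussed in this thread.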

Member

I am also wary of increasing the pressure on the node, especially for every SendPacket event.

Are we seeing this only when issuing queries manually, or is it also happening during relaying? If only the former, I would vote for waiting until we can get away with a single RPC call, or until something like this is implemented.

Collaborator Author (ancazamfir)

This is seen during relaying and can cause failures for any query. In one extreme case reported by Mircea (see the issue), Hermes was not relaying at all on a channel, as it was unable to spawn workers due to a connection not found error.
That was addressed with a partial fix, similar to the one in this PR but without the time fix (i.e., we only fixed the height to be the application height). When I ran some tests with custom chains (with increased block execution times), the missing time fix would sometimes cause a header in the future chain error on client updates.

adizere commented Apr 22, 2022

I did some basic benchmarking with two RPC nodes:

  1. ecostake: https://rpc-cosmoshub.ecostake.com/
  2. pupmos: https://rpc.cosmoshub.pupmos.network

Both found from here: https://chain-registry.netlify.app

I looked at these 5 endpoints:

  1. /status
  2. /abci_info
  3. /commit
  4. /header?height=X
  5. /blockchain?minHeight=X&maxHeight=X

These numbers are preliminary and may not reflect the performance of other operators' endpoints.

Basic takeaway: if we can, we should avoid the /commit endpoint, which can be ~30% slower than the others. /abci_info can also be problematic.

Data

The numbers are in milliseconds.

             /status  /abci_info  /commit  /header  /blockchain
pupmos         108       105        134      101       106
ecostake       107       152        143       98        99
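As a rough sanity check on the "~30% slower" takeaway, the relative slowdown of /commit can be recomputed from the table above. This is plain arithmetic on the published numbers, using /status as the baseline:

```python
# Latencies in milliseconds, copied from the table above.
latencies = {
    "pupmos":   {"status": 108, "abci_info": 105, "commit": 134, "header": 101, "blockchain": 106},
    "ecostake": {"status": 107, "abci_info": 152, "commit": 143, "header": 98,  "blockchain": 99},
}

def commit_slowdown_pct(ms: dict) -> float:
    """Percentage by which /commit is slower than /status for one node."""
    return (ms["commit"] - ms["status"]) / ms["status"] * 100

for node, ms in latencies.items():
    print(f"{node}: /commit is {commit_slowdown_pct(ms):.0f}% slower than /status")
```

Averaging the two nodes gives roughly 29%, consistent with the ~30% figure above.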

ancazamfir commented Apr 23, 2022

I did some basic benchmarking with two RPC nodes:

Thanks @adizere for this! For some reason I don't see the header RPC. What is the exact command you used for this?

@adizere adizere added the P-high label Apr 25, 2022
adizere commented Apr 25, 2022

For some reason I don't see the header RPC. What is the exact command you used for this?

Good point. The header endpoint is still a draft PR in tendermint-rs: informalsystems/tendermint-rs#1101.
I gathered the data above by querying with: time curl http:$RPC/$ENDPOINT

@adizere adizere marked this pull request as ready for review April 25, 2022 14:51
@adizere adizere self-requested a review as a code owner April 25, 2022 14:51
@adizere adizere left a comment

We need to reproduce the problem reliably. I haven't yet checked Anca's instructions from here, but that's probably the best place to start from.

let blocks = self
    .block_on(
        self.rpc_client
            .blockchain(abci_info.last_block_height, abci_info.last_block_height),
Member

Note to reviewers:

  • we query /blockchain?minHeight=last_block_height&maxHeight=last_block_height to fetch the metadata for a single block
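For illustration, the single-block query boils down to URL construction like the following. This is a hypothetical helper, not Hermes code; rpc_addr is a placeholder:

```python
def single_block_metadata_url(rpc_addr: str, height: int) -> str:
    """Build the /blockchain RPC URL that returns metadata for one block.

    Setting minHeight == maxHeight == height restricts the response to a
    single block's metadata (including its header time), avoiding the
    heavier /commit endpoint.
    """
    return f"{rpc_addr}/blockchain?minHeight={height}&maxHeight={height}"
```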


// Check that transaction indexing is enabled
if status.node_info.other.tx_index != TxIndexStatus::On {
    return Err(Error::tx_indexing_disabled(chain_id.clone()));
}

// Check that the chain identifier matches the network name
@adizere adizere Apr 25, 2022


Note: this change is not strictly related to the present PR. It is just an additional, non-critical health check that can reveal misconfigurations in config.toml.

romac commented Apr 27, 2022

I believe I am still seeing the same issue when running the script against Gaia v6.0.3 with a patched Tendermint:

Query Chain ID: ibc-0, Conn ID: connection-0, Height: 62, ABCI_Height: 61
2022-04-27T07:44:42.335551Z DEBUG ThreadId(01) Options: QueryConnectionEndCmd { chain_id: ChainId { id: "ibc-0", version: 0 }, connection_id: ConnectionId("connection-0"), height: Some(62) }
Error:
   0: connection not found: connection-0



Query Chain ID: ibc-0, Conn ID: connection-0, Height: 62, ABCI_Height: 62
2022-04-27T07:44:45.386849Z DEBUG ThreadId(01) Options: QueryConnectionEndCmd { chain_id: ChainId { id: "ibc-0", version: 0 }, connection_id: ConnectionId("connection-0"), height: Some(62) }
Success: ConnectionEnd {
    state: Open,
    client_id: ClientId(
        "07-tendermint-0",
    ),
    counterparty: Counterparty {
        client_id: ClientId(
            "07-tendermint-0",
        ),
        connection_id: Some(
            ConnectionId(
                "connection-0",
            ),
        ),
        prefix: ibc,
    },
    versions: [
        Version {
            identifier: "1",
            features: [
                "ORDER_ORDERED",
                "ORDER_UNORDERED",
            ],
        },
    ],
    delay_period: 0ns,
}

romac commented Apr 27, 2022

Detailed instructions for reproducing the issue and results:

  1. cd /tmp

  2. mkdir gaia-test

  3. cd gaia-test

  4. git clone https://github.com/tendermint/tendermint

  5. cd tendermint

  6. git switch --detach v0.34.14

  7. apply

    diff --git a/state/execution.go b/state/execution.go
    index 2cc3db6e7..16f150702 100644
    --- a/state/execution.go
    +++ b/state/execution.go
    @@ -323,6 +323,8 @@ func execBlockOnProxyApp(
            }
        }
    
    +  time.Sleep(time.Millisecond * 500)
    +
        // End block.
        abciResponses.EndBlock, err = proxyAppConn.EndBlockSync(abci.RequestEndBlock{Height: block.Height})
        if err != nil {
  8. cd ..

  9. git clone https://github.com/cosmos/gaia

  10. cd gaia

  11. apply

    diff --git a/go.mod b/go.mod
    index 7465a0be..e3a508c5 100644
    --- a/go.mod
    +++ b/go.mod
    @@ -124,4 +124,5 @@ replace (
        github.com/cosmos/cosmos-sdk => github.com/cosmos/cosmos-sdk v0.44.2
        github.com/gogo/protobuf => github.com/regen-network/protobuf v1.3.3-alpha.regen.1
        google.golang.org/grpc => google.golang.org/grpc v1.33.2
    +   github.com/tendermint/tendermint => /tmp/gaia-test/tendermint
     )
  12. make build

  13. vim ~/.gm/gm.toml

    [global]
    add_to_hermes        = false
    auto_maintain_config = true
    extra_wallets        = 0
    gaiad_binary         = "/private/tmp/gaia-test/gaia/build/gaiad"
    hdpath               = ""
    home_dir             = "$HOME/.gm"
    ports_start_at       = 27040
    
    [global.hermes]
    binary            = "$HOME/Informal/Code/Current/ibc-rs/target/debug/build/hermes"
    config            = "$HOME/.hermes/config.toml"
    log_level         = "info"
    telemetry_enabled = true
    telemetry_host    = "127.0.0.1"
    telemetry_port    = 3001
    
    [ibc-0]
    ports_start_at = 27000
    add_to_hermes  = true
    
    [ibc-1]
    ports_start_at = 27010
    add_to_hermes  = true
  14. gm reset

  15. gm start

  16. gm hermes config

  17. remove filter = false from ~/.hermes/config.toml

  18. Go to ibc-rs working directory

  19. git fetch origin

  20. git switch anca/app_latest_height

  21. git pull

  22. cargo build

  23. gm hermes keys

  24. cargo run -- create channel ibc-0 -c ibc-1 --port-a transfer --port-b transfer --new-client-connection

  25. paste this in query_conn.sh

    #!/bin/bash
    
    addr=$1
    chain=$2
    conn=$3
    for i in {1..100}
    do
        read -r height < <(
            curl -s "$addr"/status | jq -r '.result.sync_info.latest_block_height')
        read -r abci_height < <(
            curl -s "$addr"/abci_info | jq -r '.result.response.last_block_height')
        echo "Query Chain ID: ${chain}, Conn ID: ${conn}, Height: ${height}, ABCI_Height: ${abci_height}"
        ./target/debug/hermes query connection end $chain $conn -H $height
        sleep 1
    done
  26. bash query_conn.sh 'http://localhost:27000' ibc-0 connection-0

@adizere
Copy link
Member

adizere commented Apr 27, 2022

Great testing instructions, Romain, thank you for sharing them!

I was able to see the same issue as you. E.g.:

Query Chain ID: ibc-0, Conn ID: connection-0, Height: 101, ABCI_Height: 100
2022-04-27T09:15:28.911674Z INFO ThreadId(01) using default configuration from '/Users/adi/.hermes/config.toml'
Error: connection not found: connection-0


After digging into it, I realized that hermes query connection end, which we use in this query_conn.sh testing script, does not invoke query_application_status, which is what this PR fixes. So I think it is expected that we encounter this problem when running the query_conn.sh script.

To summarize, I think the purpose of query_conn.sh is to simulate the problem Hermes encounters when querying with the blockchain height (-H $height) instead of the application height (-H $abci_height). I don't understand how this problem relates to the original issue (#1970), however, because I don't see where exactly Hermes relies on the blockchain height in the supervisor or scan code, as these logs show:

2022-03-03T19:40:34.805800Z ERROR ThreadId(01) skipped workers for connection connection-20 on chain gravity-bridge-3, reason: relayer error: connection not found: connection-464
2022-03-03T19:40:34.805834Z INFO ThreadId(01) channel channel-17 on chain gravity-bridge-3 is: OPEN; state on dest. chain (cosmoshub-4) is: UNINITIALIZED

That being said, this PR only replaces the use of the blockchain height with the application height specifically for query_application_status. I'm not convinced this is enough to solve the error reported in the issue (#1970).

@ancazamfir
The script retrieves the blockchain height and then queries the connection at that height. If that height is higher than the application query state height, you see this error. I used this script because in the original issue the command used was query channel ends, which queries the connection at the latest blockchain height.
Also, in Hermes we sometimes query the latest height and then perform subsequent queries and gather proofs at that fixed height.
You can run the tests (e.g., e2e) from master with the repro setup and you will see quite a few failures.
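The failure mode described above can be modeled in a few lines. This is an illustrative Python sketch with invented names, not Hermes or Tendermint code:

```python
def query_at_height(committed_state: dict, app_height: int, query_height: int):
    """Model of a node serving a state query pinned at a given height.

    The node can only serve state it has committed, i.e. up to the
    application height. A query pinned at a newer blockchain height
    therefore fails, mirroring the "connection not found" errors above.
    """
    if query_height > app_height:
        raise KeyError("connection not found")  # state not yet committed
    return committed_state["connection-0"]
```

With app_height = 61 and query_height = 62 (as in Romain's first log excerpt), the query fails; with matching heights it succeeds.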

romac commented Apr 28, 2022

You can try to run tests (e.g. e2e) from master with the repro setup and you will see quite a few failures.

Alright, so I've done just that a couple of times, both against master and against this branch. I did indeed see a bunch of errors on master when querying channel and connection ends, but not a single one with this branch 🎉

To me this is good to merge :)

@adizere adizere left a comment

Thank you Anca for kicking off the fix and your guidance! Amazing work Romain with testing! LGTM 🚀

@adizere adizere merged commit cb33212 into master Apr 28, 2022
@adizere adizere deleted the anca/app_latest_height branch April 28, 2022 15:10
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* Test with abci_status

* Fix status also

* Remove redundant code

* Move changes in chain query_application_status()

* Cargo fmt

* Added /blockchain-based implementation for query_application_status

* Documenting impl of query_application_status.

* changelog

* Cleanup

Co-authored-by: Adi Seredinschi <adi@informal.systems>
Successfully merging this pull request may close these issues.

gRPC connection query at latest blockchain height may fail