Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting "failed to get trusted headers" error #373

Closed
orkunkl opened this issue Jan 17, 2021 · 30 comments · Fixed by #416
Closed

Getting "failed to get trusted headers" error #373

orkunkl opened this issue Jan 17, 2021 · 30 comments · Fixed by #416
Labels
T: Bug 🪲 TYPE: Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@orkunkl
Copy link
Contributor

orkunkl commented Jan 17, 2021

relayer version: 1.0.0-rc
I am trying to connect musselnet-2 and stargate-final networks via IBC. this is the error I am getting at the moment:

rly tx link musselnet-2_stargate-final
I[2021-01-18|00:02:01.812] ★ Clients created: client(07-tendermint-0) on chain[musselnet-2] and client(07-tendermint-0) on chain[stargate-final]
I[2021-01-18|00:02:42.784] failed to get trusted headers: chain(stargate-final): post failed: Post "http://x.x.x.x:26657": EOF
I[2021-01-18|00:02:42.784] retrying transaction...
I[2021-01-18|00:05:22.766] failed to get trusted headers: chain(stargate-final): post failed: Post "http://x.x.x.x:26657": EOF
I[2021-01-18|00:05:22.766] retrying transaction...
I[2021-01-18|00:05:52.751] failed to get trusted headers: chain(stargate-final): post failed: Post "http://x.x.x.x:26657": EOF
I[2021-01-18|00:05:52.751] retrying transaction...

When I followed up the bug I found out this line is causing the error https://github.com/cosmos/relayer/blob/v1.0.0-rc1/relayer/headers.go#L122
which looks like might be related to this issue cosmos/cosmos-sdk#8341

@colin-axner
Copy link
Contributor

If the line you linked is causing the issue then it is probably because of #8327. Looking at the error it seems to me there might be a connection issue (but the returned errors might just be deceiving). What parameters do you have set for historical entries in genesis? Those determine the length of time the relayer will support interacting with past events. The higher number of historical entries, the longer you can interact with past events

@orkunkl
Copy link
Contributor Author

orkunkl commented Jan 18, 2021

@colin-axner it is set to 10000 both in musselnet genesis and stargate-final-4

@colin-axner
Copy link
Contributor

@orkunkl could you try checking out the branch colin/test-upstream-fix, it contains the 8327 and 8341 fixes. Would be good if we could rule those out as possible errors

@orkunkl
Copy link
Contributor Author

orkunkl commented Jan 18, 2021

Tried it with your branch.

rly version
version: 1.0.0-rc1-56-gfa0d02a
commit: fa0d02a07606798d3a67fe0306f20925f9d96835
cosmos-sdk: v0.40.0-rc7.0.20210118130753-646ea5545e2e
go: go1.15.6 linux/amd64

This is the error I am getting now which is different than before. I guess your branch fixed EOF error and now historical entries limit is not enough now.

rly tx link new-path
I[2021-01-18|16:21:31.214] ★ Clients created: client(07-tendermint-0) on chain[musselnet-2] and client(07-tendermint-0) on chain[stargate-final]
I[2021-01-18|16:21:32.106] failed to get trusted headers: chain(musselnet-2): rpc error: code = NotFound desc = historical info for height 99443 not found: key not found
I[2021-01-18|16:21:32.106] retrying transaction...

So I have not run relayer from the beginning of the chain. What can I do to make it run now?

@colin-axner
Copy link
Contributor

what is the current height of musselnet-2?

Go into ~/.relayer/config/config.yaml and delete the client id's that have been set. Then rerun the rly tx link command

This will create new clients, which will rely on recent blocks. My guess is the clients currently created are older than the historical info in state, which results in this error.

@orkunkl
Copy link
Contributor Author

orkunkl commented Jan 18, 2021

/Workspace/relayer on colin/test-upstream-fix λ rly tx link new-path -d -o 3s
I[2021-01-18|18:59:13.063] ★ Clients created: client(07-tendermint-2) on chain[musselnet-2] and client(07-tendermint-2) on chain[stargate-final]

This command got stuck there for a while. What could be the reason?

~/.wasmd λ rly paths list
 1: new-path             -> chns(✔) clnts(✔) conn(✔) chan(✘) (musselnet-2:transfer<>stargate-final:transfer)

@colin-axner
Copy link
Contributor

Did it work? I can only see that the clients were created.

My guess is relayer had a hard time querying the chain. I believe it queries the chains for the clnts conn and chan to see if they are created

@jaybxyz
Copy link

jaybxyz commented Jan 22, 2021

@colin-axner I also confront this failed to get trusted headers: chain(stargate-final): post failed: Post:... error when I tried to link bifrost-2 and stargate-final chains.

@jaybxyz
Copy link

jaybxyz commented Jan 22, 2021

I solved the issue by changing our full node pruning strategy to nothing

@colin-axner
Copy link
Contributor

we should update the error message to indicate that the headers have likely been pruned

@orkunkl
Copy link
Contributor Author

orkunkl commented Jan 22, 2021

Updated wasmd to v0.14.1 (runs cosmos-sdk v0.40.1) and gaiad to v3.0.1. Set pruning option to nothing and launched bare new nodes. relayer is running on master(d9decfa)
Here is the relayer config:

global:
  timeout: 10s
  light-cache-size: 20
chains:
- key: testkey
  chain-id: musselnet-2
  rpc-addr: https://rpc-ibc.musselnet.cosmwasm.com:443
  account-prefix: wasm
  gas-adjustment: 1.1
  gas-prices: 0.01umayo
  trusting-period: 336h
- key: testkey
  chain-id: stargate-final
  rpc-addr: http://188.34.177.78:26657
  account-prefix: cosmos
  gas-adjustment: 1.3
  gas-prices: 0.01umuon
  trusting-period: 336h
paths:
  testpath:
    src:
      chain-id: musselnet-2
      client-id: 
      port-id: transfer
      order: ORDERED
      version: ics20-1
    dst:
      chain-id: stargate-final
      client-id:
      port-id: transfer
      order: ORDERED
      version: ics20-1
    strategy:
      type: naive

I am getting the same error when linking:

rly tx link testpath -d -o 3s
I[2021-01-22|16:56:16.235] - [musselnet-2] -> creating client on musselnet-2 for stargate-final header-height{120236} trust-period(336h0m0s)
I[2021-01-22|16:56:16.596] - [stargate-final] -> creating client on stargate-final for musselnet-2 header-height{176352} trust-period(336h0m0s)
I[2021-01-22|16:56:16.972] ★ Clients created: client(07-tendermint-0) on chain[musselnet-2] and client(07-tendermint-0) on chain[stargate-final]
I[2021-01-22|16:56:18.156] failed to get trusted headers: chain(stargate-final): rpc error: code = NotFound desc = historical info for height 47056 not found: key not found
I[2021-01-22|16:56:18.156] retrying transaction...
I[2021-01-22|16:56:24.102] failed to get trusted headers: chain(stargate-final): rpc error: code = NotFound desc = historical info for height 47056 not found: key not found
I[2021-01-22|16:56:24.102] retrying transaction...
I[2021-01-22|16:56:29.904] failed to get trusted headers: chain(musselnet-2): rpc error: code = NotFound desc = historical info for height 99443 not found: key not found
I[2021-01-22|16:56:29.904] retrying transaction...
I[2021-01-22|16:56:35.803] failed to get trusted headers: chain(stargate-final): rpc error: code = NotFound desc = historical info for height 47056 not found: key not found
I[2021-01-22|16:56:35.803] retrying transaction...
Error: ! Connection failed: [musselnet-2]client{07-tendermint-0}conn{} -> [stargate-final]client{07-tendermint-0}conn{}

@colin-axner
Copy link
Contributor

colin-axner commented Jan 22, 2021

This is very odd, I just looked at the sequence of calls and the code seems to be fine. I have two theories

  1. The time between the clients being created and the connection handshake initiating is enough time for the historical entries to pass and the historical info being pruned. This seems highly unlikely since you said 10000 was the value you set on both chains
  2. The client creation function doesn't actually update the off-chain light clients before creating the on-chain clients (despite calling a function UpdatesWithHeader). Therefore the question is the time between when you created the off-chain light client (rly light init) and when you created the on-chain light client/initiated the connection handshake (rly tx link).

I think 2) is causing the issue, since we see on musselnet-2 at height 120236 a client was created, but shortly after on musselnet-2 we try to query for historical info at height 99443. This is likely the height you ran rly light init -f.

Short term fix try rly light update musselnet-2 and rly light update stargate-final. Then rerun the linking command.

Long term fix if the above works, we should adjust relayer/client-tx.go to update both light clients before attempting to create the clients. I'd be a good opportunity to simplify the code, there are a lot of update functions, but we should have 1 that clearly update off-chain light clients, and another which returns an on-chain update header for on-chain clients (the latter function should likely call the off-chain update first)

@orkunkl
Copy link
Contributor Author

orkunkl commented Jan 25, 2021

@colin-axner I tried to update light clients but got the same error again.

@colin-axner
Copy link
Contributor

@orkunkl thanks for trying. Did the heights in the error message change?

I will try to reproduce your error this week and then continue debugging.

As a side note, you'll need to update your channel order to be UNORDERED. ICS-20 does not support ordered channels

@colin-axner colin-axner added the T: Bug 🪲 TYPE: Inconsistencies or issues which will cause an issue or problem for users or implementors. label Jan 25, 2021
@jaybxyz
Copy link

jaybxyz commented Jan 26, 2021

Same error message again

I had the same error message and solved it by changing pruning strategy to nothing before. Since bifrost-2 requires trusting-period to be lower than unbonding-period which was 24 hours, I set it 23 hours. So, as trusting-period is over, the connection is expired. So I deleted $HOME/.relayer direcotry and tried to start from the beginning; re-initialize config, add configs and link both chains. And, now I get the same error message. I wonder in what cases this error message occurs (I should probably check out the codebase).

I[2021-01-26|11:20:52.917] - [stargate-final] -> creating client on stargate-final for bifrost-2 header-height{97610} trust-period(23h0m0s)
I[2021-01-26|11:20:52.964] - [bifrost-2] -> creating client on bifrost-2 for stargate-final header-height{172718} trust-period(23h0m0s)
I[2021-01-26|11:20:52.997] ★ Clients created: client(07-tendermint-10) on chain[stargate-final] and client(07-tendermint-3) on chain[bifrost-2]
I[2021-01-26|11:20:53.243] failed to get trusted headers: chain(bifrost-2): rpc error: code = NotFound desc = historical info for height 46003 not found: key not found
I[2021-01-26|11:20:53.243] retrying transaction...
I[2021-01-26|11:20:58.498] failed to get trusted headers: chain(bifrost-2): rpc error: code = NotFound desc = historical info for height 46003 not found: key not found
I[2021-01-26|11:20:58.498] retrying transaction...
I[2021-01-26|11:21:03.670] failed to get trusted headers: chain(bifrost-2): rpc error: code = NotFound desc = historical info for height 46003 not found: key not found
I[2021-01-26|11:21:03.670] retrying transaction...
I[2021-01-26|11:21:08.886] failed to get trusted headers: chain(stargate-final): rpc error: code = NotFound desc = historical info for height 117611 not found: key not found
I[2021-01-26|11:21:08.886] retrying transaction...
Error: ! Connection failed: [stargate-final]client{07-tendermint-10}conn{} -> **[bifrost-2]client{07-tendermint-3}conn{}**

Things I've tried

  1. Update light client and try to link both chains
$ rly l update stargate-final
$ rly l update bifrost-2

$ rly tx link <path> -d -o 3s

No luck

  1. Change path name and use fresh new keys for both chains

No luck

  1. Restart both chains

No luck

@colin-axner
Copy link
Contributor

colin-axner commented Jan 26, 2021

@kogisin thanks for the report!! This was very useful. It made me realize I forgot to check if a client is expired or frozen when reusing existing clients.

Since bifrost-2 requires trusting-period to be lower than unbonding-period which was 24 hours

This is a Tendermint light client security requirement. Trusting period must always be less than the unbonding-period

Here is what is happening:

  1. Clients are being created with the default configs
  2. For testing purposes, it is decided to abandon this client and restart
  3. A new linking attempt is made using the same default config
  4. To avoid redundancy, the code checks to see if a client created already matches the necessary configs
  5. What I forgot to check is that the client is not expired.
  6. The code is using the on-chain client to query the trusted height in order to construct proof correctly. The on-chain client has a very old height and thus the historical info is long gone. Code logically follows

Fixes:

  • Check that the client is not expired (there's already a helper function we can use)
  • Update error messages. There's a few things that could have gone wrong here. We should try to pin point exactly where and update the error message so it says "your client is expired" or "check that your node isn't pruning height" etc
  • ensure off-chain clients are being updated before creating on-chain clients

@colin-axner
Copy link
Contributor

colin-axner commented Jan 26, 2021

@akhilkumarpilli if you have bandwidth, these would be some good issues to take on

Edit: I'll handle the first fix of checking for expiration

@akhilkumarpilli
Copy link
Contributor

@akhilkumarpilli if you have bandwidth, these would be some good issues to take on

Edit: I'll handle the first fix of checking for expiration

Sure @colin-axner, will look into this.

@colin-axner
Copy link
Contributor

colin-axner commented Jan 26, 2021

@orkunkl @kogisin I believe the main issue described above should be fixed with the latest commit on master. You will need to delete any existing client identifiers in your .relayer/config/config.yaml. Let me know if it still doesn't work

@cryptophonic
Copy link
Contributor

I have pruning set to nothing on both chains here but if I leave the relayer running I always get this error in the morning. Re-initializing the light clients does not help.

@colin-axner
Copy link
Contributor

I have pruning set to nothing on both chains here but if I leave the relayer running I always get this error in the morning. Re-initializing the light clients does not help.

What version are you using?

@cryptophonic
Copy link
Contributor

I've tried various versions, right now I'm on the latest master commit f201839

@cryptophonic
Copy link
Contributor

It's failing to pull the TrustedLightBlock in GetLightSignedHeaderAtHeight() in the db. not sure why yet. The block header exists on-chain, I can query it with gaiad query block 19407

@cryptophonic
Copy link
Contributor

I[2021-02-25|20:56:26.848] - [gaiad-microtick-testnet]@{32776} - try(1/5) query packet commitment: portID (transfer), channelID (channel-1), sequence (14): packet commitment not found
I[2021-02-25|20:56:27.345] - [gaiad-microtick-testnet]@{32776} - try(2/5) query packet commitment: portID (transfer), channelID (channel-1), sequence (14): packet commitment not found
I[2021-02-25|20:56:28.223] - [gaiad-microtick-testnet]@{32776} - try(3/5) query packet commitment: portID (transfer), channelID (channel-1), sequence (14): packet commitment not found
I[2021-02-25|20:56:29.862] - [gaiad-microtick-testnet]@{32776} - try(4/5) query packet commitment: portID (transfer), channelID (channel-1), sequence (14): packet commitment not found
failed to execute message; message index: 1: receive packet verification failed: couldn't verify counterparty packet commitment: failed packet commitment verification for client (07-tendermint-1): client state height < proof height ({0 19407} < {0 32778}): invalid height: invalid request
failed to execute message; message index: 1: receive packet verification failed: couldn't verify counterparty packet commitment: failed packet commitment verification for client (07-tendermint-1): client state height < proof height ({0 19407} < {0 32778}): invalid height: invalid request
failed to execute message; message index: 1: receive packet verification failed: couldn't verify counterparty packet commitment: failed packet commitment verification for client (07-tendermint-1): client state height < proof height ({0 19407} < {0 32778}): invalid height: invalid request
failed to execute message; message index: 1: receive packet verification failed: couldn't verify counterparty packet commitment: failed packet commitment verification for client (07-tendermint-1): client state height < proof height ({0 19407} < {0 32778}): invalid height: invalid request
failed to execute message; message index: 1: receive packet verification failed: couldn't verify counterparty packet commitment: failed packet commitment verification for client (07-tendermint-1): client state height < proof height ({0 19407} < {0 32778}): invalid height: invalid request
E[2021-02-25|20:56:39.600] microtick-testnet-rc4: err(failed to send packets, see above logs for details)

@colin-axner
Copy link
Contributor

Based on the error logs, it looks like the proof constructed for the receive message is at height 32778, but the on-chain client is at height 19407. It cannot verify a proof from the future. It seems an UpdateMsg is not being sent directly before the receive message or it is being sent with the wrong updateMsg.

Message index 1 tells me the UpdateMsg is there. So I guess the updateMsg is using the wrong height? Maybe the chain heights got swapped? This wouldn't make sense because the updateMsg, updates the off-chain light client before constructing the updateMsg. It could be related to #425 - cc @akhilkumarpilli

I will look deeper into this next week. What commands are you running? Does 19407 height look like the height of the chain which is receiving the transfer?

@cryptophonic
Copy link
Contributor

I will attempt to create a local setup that can reproduce, and better document the steps. I want to leave the chains in question running in case we need to debug on them..

I'll report back later today (morning here and just getting some coffee)

@colin-axner
Copy link
Contributor

#437 should fix this issue

@cryptophonic
Copy link
Contributor

We had another failure using this release. The relayer had been running for 6-7 hours when the failure occurred. Stopping it at that point and restarting resulted in the "Light block not found" error.

$ rly version
version: 1.0.0-rc1-99-gca3c9eb
commit: ca3c9eb3f38c307d1d4b1849466989ecc9d5bc1b
cosmos-sdk: v0.41.3
go: go1.15.6 linux/amd64
I[2021-02-26|21:54:37.260] - [gaiad-microtick-testnet]@{50701} - try(1/5) query packet commitment: portID (transfer), channelID (channel-3), sequence (6): packet commitment not found 
I[2021-02-26|21:54:37.779] - [gaiad-microtick-testnet]@{50701} - try(2/5) query packet commitment: portID (transfer), channelID (channel-3), sequence (6): packet commitment not found 
I[2021-02-26|21:54:38.719] - [gaiad-microtick-testnet]@{50701} - try(3/5) query packet commitment: portID (transfer), channelID (channel-3), sequence (6): packet commitment not found 
I[2021-02-26|21:54:40.402] - [gaiad-microtick-testnet]@{50701} - try(4/5) query packet commitment: portID (transfer), channelID (channel-3), sequence (6): packet commitment not found 
I[2021-02-26|21:54:42.633] • [microtick-testnet-rc4]@{36184} - actions(0:update) hash(A1E809515830F875E9148B727983FA2C9E36C98941E8B4F26BD32A8FEB5FC53A) 
I[2021-02-26|21:54:42.633] • [microtick-testnet-rc4]@{36184} - actions(0:update) hash(3FF23434C59CF2FC55DFE24EAE37D424ACDAB8913D6BE49FB49CC53653ADA1A1) 
I[2021-02-26|21:54:42.633] • [microtick-testnet-rc4]@{36184} - actions(0:update) hash(66D4AE9E5885F3FEE1DB23FA6A969C44277CC364B882E28AA6BD0463006E3EAA) 
I[2021-02-26|21:54:43.701] - failed to construct update message for client on chain microtick-testnet-rc4, retrying: failed to get trusted header, please ensure the connected node is pruning nothing: light block not found 
I[2021-02-26|21:54:44.169] - failed to construct update message for client on chain microtick-testnet-rc4, retrying: failed to get trusted header, please ensure the connected node is pruning nothing: light block not found 
I[2021-02-26|21:54:45.072] - failed to construct update message for client on chain microtick-testnet-rc4, retrying: failed to get trusted header, please ensure the connected node is pruning nothing: light block not found 
I[2021-02-26|21:54:46.712] - failed to construct update message for client on chain microtick-testnet-rc4, retrying: failed to get trusted header, please ensure the connected node is pruning nothing: light block not found 
I[2021-02-26|21:54:47.630] • [microtick-testnet-rc4]@{36185} - actions(0:update) hash(C3CB65BE8EA0B9D9A870BE850D4F9A3622DB4D3DF2F5AA432EC8D3B6F12A3489) 
I[2021-02-26|21:54:50.007] - failed to construct update message for client on chain microtick-testnet-rc4, retrying: failed to get trusted header, please ensure the connected node is pruning nothing: light block not found 
E[2021-02-26|21:54:50.008] microtick-testnet-rc4: err(failed to get trusted header, please ensure the connected node is pruning nothing: light block not found) 
I[2021-02-26|21:54:52.632] • [microtick-testnet-rc4]@{36186} - actions(0:update) hash(1092536A689E347F1DD3A5B1EC3FA63B7C3BF3EC7917DC885224A3E58495C62F) 
I[2021-02-26|21:55:12.705] • [microtick-testnet-rc4]@{36190} - actions(0:update) hash(726F58CCD8AEE9B10603DCFC454237E8E399D5FF7B3086C5208E594057D8D743) 
I[2021-02-26|21:55:47.834] • [microtick-testnet-rc4]@{36197} - actions(0:update) hash(D2A1F145267ED1E9537325542D6FC4C3B8F49232AFD3573C71357F089B71B948) 
I[2021-02-26|21:55:47.836] • [microtick-testnet-rc4]@{36197} - actions(0:update) hash(6F71F406DD95C45943990B4BEB3F8033A2A2D5C08BBE8DE0C0BAEAEA2D342E6D) 
I[2021-02-26|21:55:52.852] • [microtick-testnet-rc4]@{36198} - actions(0:update) hash(796FB7C3268F5BC16CBE8FB235C0DE052B54942B1F89B982495A7674A919A08F) 
I[2021-02-26|21:55:57.881] • [microtick-testnet-rc4]@{36199} - actions(0:update) hash(03172A1FCF0EBB3413263B61ACDBA5ED0BE0D8D7FF9F97510F401D29FF0F72D1) 
^C
microtick@microtick-testnet:~/testnet$ rly start microtick-ibc -d
I[2021-02-26|21:56:30.846] - listening to tx events from gaiad-microtick-testnet... 
I[2021-02-26|21:56:30.846] - listening to block events from gaiad-microtick-testnet... 
I[2021-02-26|21:56:30.848] - listening to tx events from microtick-testnet-rc4... 
I[2021-02-26|21:56:30.848] - listening to block events from microtick-testnet-rc4... 
Error: failed to get trusted header, please ensure the connected node is pruning nothing: light block not found

@cryptophonic
Copy link
Contributor

Release v0.8.2 appears to have fixed this for us. Haven't noticed any issues and have been hitting it pretty hard for a couple hours. Nice work!

@colin-axner
Copy link
Contributor

Amazing! Happy this has been resolved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T: Bug 🪲 TYPE: Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
None yet
5 participants