Support for incremental packet clearing #2167

Merged · adizere merged 31 commits into master from adi/2087 on May 13, 2022
Conversation

adizere (Member) commented on May 2, 2022

Closes: #2087 (Hermes takes a very long time to clear high number of pending packets)

Basic manual tests

This PR fixes an issue with Hermes taking a very long time, becoming unresponsive, or being unable to clear packets in production. We cannot reproduce the production conditions, so the manual tests are somewhat rudimentary.

Setup

  1. Use gm to create and start two networks
    • create a channel between them:
      create channel ibc-0 -c ibc-1 --port-a transfer --port-b transfer --new-client-connection
  2. Run gm hermes config and gm hermes keys
  3. Adjust the gas parameters to allow large transactions (see the sketch after this list)

```diff
- gas_price = { price = 0.001, denom = 'stake' }
+ gas_price = { price = 0.01, denom = 'stake' }
- max_gas = 1000000
+ max_gas = 10000000
```

    • do the above for both of your networks
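For reference, a sketch of what the adjusted chain section might look like after the change; only gas_price and max_gas come from the diff above, the surrounding lines are illustrative:

```toml
[[chains]]
id = 'ibc-0'
# ... other chain settings unchanged ...
gas_price = { price = 0.01, denom = 'stake' }
max_gas = 10000000
```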

Test with this branch

  1. Create 2000 ft token transfers

    • hermes tx raw ft-transfer ibc-0 ibc-1 transfer channel-0 9999 -o 1000 -n 2000
    • this should take 10-20 seconds or so, depending on your setup
  2. Clear the outstanding packets

    • hermes clear packets ibc-1 transfer channel-0
    • every so often, the output will show a progress report

2022-05-02T13:41:10.560540Z INFO ThreadId(01) PacketRecvCmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}: built packet data for 50 events total=2010 left=1810

After all the event data is pulled from the chain, Hermes proceeds to submit the transactions one by one.

2022-05-02T13:43:24.955983Z INFO ThreadId(01) PacketRecvCmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}:relay{odata=2diAXcipxM ->Destination @1-113; len=50}: success
2022-05-02T13:43:24.956192Z INFO ThreadId(01) PacketRecvCmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}: ready to fetch a scheduled op. data with batch of size 50 targeting Destination
2022-05-02T13:43:24.964239Z INFO ThreadId(01) PacketRecvCmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}:relay{odata=-_50M9IdL2 ->Destination @1-113; len=50}: optimistically resolved consensus_state_height upper_bound=1-114 result=1-35
2022-05-02T13:43:24.984645Z INFO ThreadId(01) PacketRecvCmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}:relay{odata=-_50M9IdL2 ->Destination @1-113; len=50}: assembled batch of 51 message(s)
2022-05-02T13:43:25.048935Z INFO ThreadId(21) wait_for_block_commits: waiting for commit of tx hashes(s) 7750973471BC89F98D4C70876E15E79A908FEC4B37CA8DDD4E25DF44B6CD360C, 9DB59FD0337166E4EAA87CB2ED7F5BC6CE9B660E167A12D2F718A74AE613D252 id=ibc-0
2022-05-02T13:43:30.081510Z INFO ThreadId(01) PacketRecvCmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}:relay{odata=-_50M9IdL2 ->Destination @1-113; len=50}: [Sync->ibc-0] result events:

Test with master branch

  1. Create 2000 ft token transfers

    • hermes tx raw ft-transfer ibc-0 ibc-1 transfer channel-0 9999 -o 1000 -n 2000
    • this should take 10-20 seconds or so, depending on your setup
  2. Clear the outstanding packets

    • hermes clear packets ibc-1 transfer channel-0
    • this time, the process takes a long time, because it builds the packet data for all 2000 packets in a single query
    • this is reflected in a very long initial wait. On my setup it takes ~1 minute, but in production it can be on the order of tens of minutes (which can cause subsequent errors)

2022-05-02T13:48:46.040035Z INFO ThreadId(01) using default configuration from '/Users/adi/.hermes/config.toml'
2022-05-02T13:49:32.861348Z INFO ThreadId(01) PacketRecvCmd{src_chain=ibc-1 src_port=transfer src_channel=channel-0 dst_chain=ibc-0}: found unprocessed SendPacket events for [Sequence(2011), Sequence(2012), Sequence(2013), Sequence(2014), Sequence(2015), Sequence(2016), Sequence(2017), Sequence(2018), Sequence(2019), Sequence(2020)] (first 10 shown here; total=2000)


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

lyqingye commented on May 2, 2022

Hi @adizere,

I read your PR; the main change is that you split building the packet data into multiple batches of 50 events each. I tested it on my setup, generating 10000 packets in one tx. The logs show that building packet data for 50 events costs 2.5s on average, which is still quite slow.
When Hermes successfully builds all 10000 packet data entries and then fails in the send-tx step (account sequence mismatch), it rebuilds the non-relayed packets from scratch.
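To make the batching concrete, here is a minimal Rust sketch of the chunked approach; `PacketEvent` and `query_send_packet_events` are simplified stand-ins, not the actual Hermes types:

```rust
// Sketch only: chunked building of packet data, as described above.
const QUERY_BATCH_SIZE: usize = 50;

#[derive(Debug)]
struct PacketEvent {
    sequence: u64,
}

// Stand-in for one tx-search query covering `sequences`.
fn query_send_packet_events(sequences: &[u64]) -> Vec<PacketEvent> {
    sequences.iter().map(|&s| PacketEvent { sequence: s }).collect()
}

fn build_packet_data_incrementally(sequences: &[u64]) {
    let total = sequences.len();
    let mut left = total;
    for chunk in sequences.chunks(QUERY_BATCH_SIZE) {
        let events = query_send_packet_events(chunk);
        left -= chunk.len();
        println!(
            "built packet data for {} events total={} left={}",
            events.len(),
            total,
            left
        );
        // Each batch can now be scheduled for relaying while the next
        // chunk is fetched, instead of waiting for all data up front.
    }
}

fn main() {
    let sequences: Vec<u64> = (1..=2010).collect();
    build_packet_data_incrementally(&sequences);
}
```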

lyqingye commented on May 2, 2022

I tried adding an event store (to improve the performance of query_txs) on my branch; with it, clearing 10000 packets takes only 2 minutes.
However, storing events takes a lot of space: it takes 20m for 20000 packets, including send-packet and write-ack events. So I wrote a command to clear the already-relayed events!

adizere (Member, author) commented on May 2, 2022

Thanks for the feedback!

> When Hermes successfully builds all 10000 packet data entries and then fails in the send-tx step (account sequence mismatch), it rebuilds the non-relayed packets from scratch.

It's not clear whether this is the expected behavior or something unexpected (i.e., a problem). Could you clarify?

> I tried adding an event store (to improve the performance of query_txs) on my branch; with it, clearing 10000 packets takes only 2 minutes. However, storing events takes a lot of space: it takes 20m for 20000 packets, including send-packet and write-ack events. So I wrote a command to clear the already-relayed events!

This is impressive! Do you use this approach in production? Also: how much space and time does it take exactly?

  • Is it 20 megabytes for 10,000 packets?
  • How much time?

lyqingye commented on May 2, 2022

> It's not clear whether this is the expected behavior or something unexpected (i.e., a problem). Could you clarify?

> This is impressive! Do you use this approach in production? Also: how much space and time does it take exactly?
>   • Is it 20 megabytes for 10,000 packets?
>   • How much time?

The process of clearing packets is as follows:
query_unreceived_packets -> build_packet_events -> generate_msgs_from_events -> send_msgs
When a send_msgs call fails, the whole process restarts (sketched below). I don't know why the sequence mismatch error occurs.
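A minimal Rust sketch of that flow and its failure mode, with simplified stand-ins for the four stages (not the actual Hermes functions):

```rust
// Simplified stand-ins for the four stages named above.
fn query_unreceived_packets() -> Vec<u64> {
    vec![1, 2, 3]
}
// In reality this step issues the expensive tx-search queries.
fn build_packet_events(seqs: &[u64]) -> Vec<u64> {
    seqs.to_vec()
}
fn generate_msgs_from_events(events: &[u64]) -> Vec<String> {
    events.iter().map(|s| format!("MsgRecvPacket(seq={s})")).collect()
}
fn send_msgs(_msgs: Vec<String>) -> Result<(), String> {
    Ok(())
}

// On any send failure the loop restarts from the first query,
// re-building packet data even for packets already relayed.
fn clear_packets() {
    loop {
        let seqs = query_unreceived_packets();
        let events = build_packet_events(&seqs);
        let msgs = generate_msgs_from_events(&events);
        match send_msgs(msgs) {
            Ok(()) => break,
            Err(_) => continue, // e.g. account sequence mismatch
        }
    }
}
```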

The event store takes 20 MB to store 20000 packets (10000 send_packet + 10000 write_ack).
It takes about 1-2s (or faster) to query 10000 packets from the store.

You can use my branch for testing.

The chain config for my setup is:

```toml
max_msg_num = 100
# enable the event store
enable_event_store = true
# RPC poll interval used by the event store to fetch events
event_store_rpc_poll_interval = "500ms"
# wallet balance decimal places, used for the wallet balance metrics
decimal_place = 6
```
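As a toy illustration of the event-store idea (everything below is a hypothetical sketch, not the code on that branch): cache send_packet / write_acknowledgement events locally as a background task polls the RPC, let packet clearing read from the local store instead of issuing slow tx-search queries, and prune entries once relayed:

```rust
use std::collections::BTreeMap;

// Toy in-memory event store (a real one would persist to disk).
// Keyed by packet sequence; values are raw event payloads.
#[derive(Default)]
struct EventStore {
    send_packets: BTreeMap<u64, Vec<u8>>,
    write_acks: BTreeMap<u64, Vec<u8>>,
}

impl EventStore {
    // Called from a background task polling the chain RPC,
    // e.g. every `event_store_rpc_poll_interval`.
    fn record_send_packet(&mut self, seq: u64, payload: Vec<u8>) {
        self.send_packets.insert(seq, payload);
    }
    fn record_write_ack(&mut self, seq: u64, payload: Vec<u8>) {
        self.write_acks.insert(seq, payload);
    }

    // Packet clearing reads from the local store instead of tx-search.
    fn send_packets_for(&self, seqs: &[u64]) -> Vec<(u64, &[u8])> {
        seqs.iter()
            .filter_map(|s| self.send_packets.get(s).map(|p| (*s, p.as_slice())))
            .collect()
    }

    // The pruning command mentioned above: drop events once relayed.
    fn prune_relayed(&mut self, relayed: &[u64]) {
        for s in relayed {
            self.send_packets.remove(s);
            self.write_acks.remove(s);
        }
    }
}
```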

relayer-cli/src/conclude.rs (resolved review thread)

```rust
// quoted review context (excerpt)
let (commit_sequences, _) = commitments_on_chain(
    // ...
    path: &PathIdentifiers,
) -> Result<(Vec<u64>, Height), Error> {
```

Contributor:

Can we update the doc comment for this method to detail the new return type? It would also be helpful if the doc comment explained why the Height is returned in the Ok case.

Member:

Perhaps even introduce a struct in place of the tuple, and a newtype for the u64 while we are at it?

Member (author):

Good suggestions, thanks! I'm on it.

Collaborator:

> Can we update the doc comment for this method to detail the new return type? It would also be helpful if the doc comment explained why the Height is returned in the Ok case.

fixed

> Perhaps even introduce a struct in place of the tuple, and a newtype for the u64 while we are at it?

Maybe we do it in a different PR?

Member (author) — adizere commented on May 13, 2022:

Agreed, let's do it in a separate PR.
I started working on this, but it wasn't as easy as it sounds. Two complications:

  1. A struct that could do the job already exists, but it needs to be moved from the relayer-cli crate to the relayer crate, and some adaptations are necessary:
    https://github.com/informalsystems/ibc-rs/blob/f7cbe1cfea7f4b7aeb1bf0e99a1791f46abb2e89/relayer-cli/src/commands/query/packet/acks.rs#L16-L19

  2. The method unreceived_packets has multiple callers, and we would have to adapt all of them if we change its signature. (A sketch of the suggested refactor follows below.)
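For illustration only, a hypothetical sketch of the suggested refactor: a named struct in place of the (Vec<u64>, Height) tuple and a newtype for the sequence number. The names `PacketCommitments` and `Sequence` are placeholders, and `Height` here stands in for the real ibc type:

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Sequence(pub u64);

// Stand-in for the real ibc Height type.
#[derive(Debug, Clone, Copy)]
pub struct Height {
    pub revision_number: u64,
    pub revision_height: u64,
}

#[derive(Debug)]
pub struct PacketCommitments {
    /// Sequences of packets whose commitments are still on chain.
    pub sequences: Vec<Sequence>,
    /// Height at which the commitment query was performed (the field
    /// whose rationale the reviewer asked to document).
    pub height: Height,
}
```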

Other resolved review threads: relayer/src/chain/counterparty.rs (×2), relayer/src/foreign_client.rs (×2)
lyqingye commented on May 4, 2022

This is the scenario of relaying 2000 packets (using the event store). In the generate-opdata phase we already check whether the packets have been handled, so why are they checked again in the refresh schedule?

(screenshot of the relaying logs)

adizere (Member, author) commented on May 4, 2022

> why are they checked again in the refresh schedule?

Currently, the "generate" and "refresh" phases are completely decoupled, which means several minutes may elapse between the two. For this reason, the "refresh" phase double-checks that a packet was not relayed in the meantime (by another relayer instance) and that it did not time out (since it was generated).

Does this make sense?

This is a place where we can optimize things, and I can imagine we can skip the "refresh" in some cases. But we have not experimented with this idea yet.
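A schematic sketch of that double-check, with simplified stand-ins rather than the Hermes implementation:

```rust
// Minutes may pass between "generate" and "refresh", so refresh drops
// packets that were relayed by another instance or that timed out.
struct OpData {
    sequences: Vec<u64>,
}

fn query_unreceived_on_chain() -> Vec<u64> {
    vec![] // stub: on-chain truth at refresh time
}

fn packet_timed_out(_seq: u64) -> bool {
    false // stub: compare timeout height/timestamp with chain state
}

fn refresh(mut odata: OpData) -> OpData {
    let still_unreceived = query_unreceived_on_chain();
    odata
        .sequences
        .retain(|s| still_unreceived.contains(s) && !packet_timed_out(*s));
    odata
}
```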

ancazamfir (Collaborator) left a review:

I left some review comments and addressed them in 2775ed0 if you want to take a look. I tested quite a bit with this and it looks good; I still encounter issues with sequence mismatch during simulation, but they are not made worse by these changes.

Other resolved review threads: relayer/src/foreign_client.rs (×4), relayer/src/link/relay_path.rs (×2), relayer/src/link/cli.rs (×1)
adizere (Member, author) commented on May 12, 2022

@ancazamfir the branch is now even with master and I undid the foreign_client.rs changes.

adizere requested a review from ancazamfir on May 12, 2022
ancazamfir (Collaborator) left a review:

lgtm, great stuff @adizere!!!
Tested all the scenarios I could think of and things look good for this issue. There are still things to improve, but not here.
I opened #2211 for the redundant client updates, which cause not only the warning message but also more fees to be paid by Hermes.

Also, there is the issue of fragmenting into batches even earlier, at the packet commitment query, as this might also take a long time (I couldn't find the issue for this, but I believe there is one?).

@adizere adizere merged commit 39606d2 into master May 13, 2022
@adizere adizere deleted the adi/2087 branch May 13, 2022 13:02
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* Added docs for unreceived_acknowledgements and unreceived_packets

* Flipped order of sentences

* First iteration for informalsystems#2087; packet-recv works, packet-ack left

* Incremental processing of packet-ack relaying

* Added consensus_state_height_bounded with optimistic path

* Nits

* Simplified progress tracking

* Comment cleanup

* fmt and better logs

* Document the alternative path in consensus_state_height_bounded

* renamed method to make it more explicit h/t SeanC

* better var name

* Added workaround to tonic::code inconsistencies.

Ref:
informalsystems#1971 (comment)

* refactored the packet query methods

* Refactored CLIs for incremental relaying

* Experiment

* Fix for packet send for non-zero delay

* Using app status instead of network status

* fmt

* Undoing unnecessary changes on foreign_client.rs

* Removed outdated comments

* Apply suggestions from code review

Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com>

* Small refactor in cli.rs

* Address review comments

* Remove duplicate code

* Undo one-chain change

* Fix documentation

* Changelog

Co-authored-by: Anca Zamfir <zamfiranca@gmail.com>
Co-authored-by: Anca Zamfir <ancazamfir@users.noreply.github.com>