
Fast start for chains configured with an allow list #1705

Merged on Jan 26, 2022 (29 commits)

Conversation


@romac romac commented Dec 21, 2021

Closes: #1536

Description

  • Split the worker spawning code into two parts:
    • a) scan the chains for the clients, connections and channels
    • b) spawn the appropriate workers based on the resulting scan
  • When scanning a chain with filtering enabled and an allow list, skip scanning all the clients and instead query the allowed channels directly, resulting in far fewer queries and a faster start.
  • Add a --full-scan option to hermes start to opt out of the fast start mechanism and do a full scan.
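The split described above can be sketched roughly as follows. This is a hypothetical Rust sketch, not the actual ibc-rs API: `ChannelEntry`, `ChainScan`, `scan_chain`, and `spawn_workers` are illustrative names, and the query counts are stand-ins for RPC round trips.

```rust
// Hypothetical sketch of the two-phase start. All names are illustrative,
// not the real ibc-rs API; query costs are stand-ins for RPC round trips.

#[derive(Debug, Clone, PartialEq)]
pub struct ChannelEntry {
    pub port: String,
    pub channel: String,
}

#[derive(Debug)]
pub struct ChainScan {
    pub channels: Vec<ChannelEntry>,
    pub queries_issued: usize,
}

// Phase (a): scan. With filtering enabled and an allow list, query only the
// listed channels directly; otherwise walk clients -> connections -> channels.
pub fn scan_chain(
    all_channels: &[ChannelEntry],
    allow_list: Option<&[ChannelEntry]>,
) -> ChainScan {
    match allow_list {
        // Fast start: roughly one direct query per allowed channel.
        Some(allowed) => ChainScan {
            channels: allowed.to_vec(),
            queries_issued: allowed.len(),
        },
        // Full scan: each channel is discovered via several queries
        // (client state, connection, channel, ...).
        None => ChainScan {
            channels: all_channels.to_vec(),
            queries_issued: all_channels.len() * 3,
        },
    }
}

// Phase (b): spawn the appropriate workers from the scan result only.
pub fn spawn_workers(scan: &ChainScan) -> Vec<String> {
    scan.channels
        .iter()
        .map(|c| format!("packet::{}/{}", c.channel, c.port))
        .collect()
}
```

The point of the split is that phase (b) never touches the network: all queries happen during the scan, so reducing the scan directly reduces startup cost.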

PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).


@adizere adizere left a comment


Did a quick run, only aesthetic/log suggestions so far, otherwise looking good!

Will continue with more in-depth tests.

@adizere adizere self-requested a review January 19, 2022 14:07
@seanchen1991

Would we want to set fast starting as the default Hermes start up behavior?


adizere commented Jan 19, 2022

Two further suggestions:

  1. It is hard to tell exactly which workers Hermes starts.

    • concretely, I have a partially open connection connection-370 and a partially open channel channel-213 on chain vega-testnet; the channel channel-213 is configured with allow
    • I set enabled = false for both [mode.connections] and [mode.channels] so we don't complete handshakes
    • then we see these log lines, and it's confusing whether any worker is effectively starting up
    • 2022-01-19T15:51:26.342253Z DEBUG ThreadId(01) connection connection-370 on chain vega-testnet is: Open, state on dest. chain (agoricdev-7) is: TryOpen
      2022-01-19T15:51:26.342560Z DEBUG ThreadId(01) done spawning workers for connection connection-370 on chain vega-testnet
      2022-01-19T15:51:26.342785Z DEBUG ThreadId(01) channel channel-213 on chain vega-testnet is: INIT; state on dest. chain (agoricdev-7) is: UNINITIALIZED
      2022-01-19T15:51:26.342965Z DEBUG ThreadId(01) done spawning workers for chain vega-testnet and channel channel-213

    • I think the behavior of Hermes is correct: not starting any workers to complete the conn & chan handshakes, though these log lines seem to be saying "done spawning workers for connection connection-370 on chain vega-testnet" as if something started. Maybe we can improve this somehow, e.g. with spans?
  2. I think there is a situation where we're aborting a packet worker, but the corresponding packet_cmd worker lingers on.

    • we have a packet waiting to be relayed, but the problem is that the client is expired
    • Hermes will launch the corresponding packet and packet_cmd workers as expected and try to handle the packet relaying
    • 2022-01-19T15:20:07.005129Z INFO ThreadId(01) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: spawning
      2022-01-19T15:20:07.005687Z INFO ThreadId(01) packet{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: spawning

    • then the packet_cmd worker aborts
    • 2022-01-19T15:20:24.514649Z DEBUG ThreadId(11) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}:relay{odata=-WVC16_h3f ->Destination @7-376539; len=1}: prepending Destination client update at height 7-376540
      2022-01-19T15:20:24.838707Z ERROR ThreadId(11) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: aborting due to expired or frozen client
      2022-01-19T15:20:24.838943Z ERROR ThreadId(11) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: aborting after encountering fatal error: Worker failed after 1 retries
      2022-01-19T15:20:24.839040Z INFO ThreadId(11) packet_cmd{src_chain=agoricdev-7 src_port=cosmos-transfer src_channel=channel-11 dst_chain=vega-testnet}: terminated

    • at this point, the packet worker is still running in the background
    • 2022-01-19T15:21:03.424959Z INFO ThreadId(21)
      2022-01-19T15:21:03.425219Z INFO ThreadId(21) * Chains: a, agoricdev-7, agorictest-18, vega-testnet
      2022-01-19T15:21:03.425356Z INFO ThreadId(21) * Packet workers:
      2022-01-19T15:21:03.425577Z INFO ThreadId(21) - packet::channel-11/cosmos-transfer:agoricdev-7->vega-testnet (id: 1)
      2022-01-19T15:21:03.425707Z INFO ThreadId(21)

It's unclear how (and if) we should fix (2).


romac commented Jan 20, 2022

Added some metrics for queries, here are some preliminary results:

Result of a full scan

  • Chain: ibc0
    • Client: 07-tendermint-0
      • Connection: connection-0
        • Channel: channel-0
        • Channel: channel-1
    • Client: 07-tendermint-1
      • Connection: connection-1
        • Channel: channel-2
        • Channel: channel-3
        • Channel: channel-4
    • Client: 07-tendermint-2
      • Connection: connection-2
        • Channel: channel-5
        • Channel: channel-6
        • Channel: channel-7
        • Channel: channel-8
  • Chain: ibc1
    • Client: 07-tendermint-0
      • Connection: connection-0
        • Channel: channel-0
        • Channel: channel-1
    • Client: 07-tendermint-1
      • Connection: connection-1
        • Channel: channel-2
        • Channel: channel-3
        • Channel: channel-4
    • Client: 07-tendermint-2
      • Connection: connection-2
        • Channel: channel-5
        • Channel: channel-6
        • Channel: channel-7
        • Channel: channel-8

With filter = false (no fast start)

# HELP queries Number of queries emitted by the relayer, per chain and query type
# TYPE queries counter
queries{chain="ibc0",query_type="query_channel"} 9
queries{chain="ibc0",query_type="query_client_connections"} 3
queries{chain="ibc0",query_type="query_clients"} 1
queries{chain="ibc0",query_type="query_connection"} 9
queries{chain="ibc0",query_type="query_connection_channels"} 3
queries{chain="ibc0",query_type="query_latest_height"} 1
queries{chain="ibc0",query_type="query_packet_acknowledgements"} 9
queries{chain="ibc0",query_type="query_packet_commitments"} 18
queries{chain="ibc0",query_type="query_unreceived_acknowledgements"} 9
queries{chain="ibc1",query_type="query_channel"} 9
queries{chain="ibc1",query_type="query_client_connections"} 3
queries{chain="ibc1",query_type="query_clients"} 1
queries{chain="ibc1",query_type="query_connection"} 9
queries{chain="ibc1",query_type="query_connection_channels"} 3
queries{chain="ibc1",query_type="query_latest_height"} 1
queries{chain="ibc1",query_type="query_packet_acknowledgements"} 9
queries{chain="ibc1",query_type="query_packet_commitments"} 18
queries{chain="ibc1",query_type="query_unreceived_acknowledgements"} 9

With filter = true and the following filters (fast start)

ibc0:

[chains.packet_filter]
policy = 'allow'
list = [
  ['transfer', 'channel-0'],
]

ibc1:

[chains.packet_filter]
policy = 'allow'
list = [
  ['transfer', 'channel-8'],
]
# HELP queries Number of queries emitted by the relayer, per chain and query type
# TYPE queries counter
queries{chain="ibc0",query_type="query_channel"} 2
queries{chain="ibc0",query_type="query_client_state"} 1
queries{chain="ibc0",query_type="query_connection"} 3
queries{chain="ibc0",query_type="query_latest_height"} 1
queries{chain="ibc0",query_type="query_packet_acknowledgements"} 1
queries{chain="ibc0",query_type="query_packet_commitments"} 2
queries{chain="ibc0",query_type="query_unreceived_acknowledgements"} 1
queries{chain="ibc1",query_type="query_channel"} 2
queries{chain="ibc1",query_type="query_client_state"} 1
queries{chain="ibc1",query_type="query_connection"} 3
queries{chain="ibc1",query_type="query_latest_height"} 1
queries{chain="ibc1",query_type="query_packet_acknowledgements"} 1
queries{chain="ibc1",query_type="query_packet_commitments"} 2
queries{chain="ibc1",query_type="query_unreceived_acknowledgements"} 1

Note: The results above use mode.clients.enabled = false


romac commented Jan 20, 2022

Would we want to set fast starting as the default Hermes start up behavior?

Yes, the relayer will use the fast start whenever possible, i.e. when filter = true and there is an allow list configured for that chain.


adizere commented Jan 20, 2022

The metrics are exactly what we need, amazing work there Romain!

On the Vega <> Agoric testnet.

With filter = false:

# HELP queries Number of queries emitted by the relayer, per chain and query type
# TYPE queries counter
queries{chain="a",query_type="query_channel"} 3
queries{chain="a",query_type="query_client_connections"} 3
queries{chain="a",query_type="query_clients"} 1
queries{chain="a",query_type="query_connection"} 9
queries{chain="a",query_type="query_connection_channels"} 3
queries{chain="a",query_type="query_latest_height"} 1
queries{chain="a",query_type="query_packet_commitments"} 3
queries{chain="a",query_type="query_unreceived_acknowledgements"} 3
queries{chain="agoricdev-7",query_type="query_channel"} 6
queries{chain="agoricdev-7",query_type="query_client_connections"} 3
queries{chain="agoricdev-7",query_type="query_clients"} 1
queries{chain="agoricdev-7",query_type="query_connection"} 12
queries{chain="agoricdev-7",query_type="query_connection_channels"} 4
queries{chain="agoricdev-7",query_type="query_latest_height"} 1
queries{chain="agoricdev-7",query_type="query_packet_acknowledgements"} 1
queries{chain="agoricdev-7",query_type="query_packet_commitments"} 8
queries{chain="agoricdev-7",query_type="query_unreceived_acknowledgements"} 6
queries{chain="agorictest-18",query_type="query_clients"} 1
queries{chain="agorictest-18",query_type="query_latest_height"} 1
queries{chain="vega-testnet",query_type="query_channel"} 9
queries{chain="vega-testnet",query_type="query_client_connections"} 9
queries{chain="vega-testnet",query_type="query_clients"} 1
queries{chain="vega-testnet",query_type="query_connection"} 15
queries{chain="vega-testnet",query_type="query_connection_channels"} 7
queries{chain="vega-testnet",query_type="query_latest_height"} 1
queries{chain="vega-testnet",query_type="query_packet_acknowledgements"} 9
queries{chain="vega-testnet",query_type="query_packet_commitments"} 10
queries{chain="vega-testnet",query_type="query_unreceived_acknowledgements"} 1

With filter = true, restricting packet relaying to five channels across three chains:

# HELP queries Number of queries emitted by the relayer, per chain and query type
# TYPE queries counter
queries{chain="a",query_type="query_channel"} 1
queries{chain="a",query_type="query_client_state"} 1
queries{chain="a",query_type="query_connection"} 1
queries{chain="a",query_type="query_latest_height"} 1
queries{chain="a",query_type="query_packet_acknowledgements"} 1
queries{chain="a",query_type="query_packet_commitments"} 1
queries{chain="agoricdev-7",query_type="query_channel"} 4
queries{chain="agoricdev-7",query_type="query_client_state"} 1
queries{chain="agoricdev-7",query_type="query_connection"} 7
queries{chain="agoricdev-7",query_type="query_latest_height"} 1
queries{chain="agoricdev-7",query_type="query_packet_acknowledgements"} 1
queries{chain="agoricdev-7",query_type="query_packet_commitments"} 4
queries{chain="agoricdev-7",query_type="query_unreceived_acknowledgements"} 3
queries{chain="agorictest-18",query_type="query_clients"} 1
queries{chain="agorictest-18",query_type="query_latest_height"} 1
queries{chain="vega-testnet",query_type="query_channel"} 5
queries{chain="vega-testnet",query_type="query_client_state"} 3
queries{chain="vega-testnet",query_type="query_connection"} 7
queries{chain="vega-testnet",query_type="query_latest_height"} 1
queries{chain="vega-testnet",query_type="query_packet_acknowledgements"} 3
queries{chain="vega-testnet",query_type="query_packet_commitments"} 5
queries{chain="vega-testnet",query_type="query_unreceived_acknowledgements"} 2

Nit: redundancy in the channel & connection worker spans.

Check this out:

2022-01-20T12:48:33.000222Z  INFO ThreadId(01) channel{channel=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: spawning
2022-01-20T12:48:33.000754Z  INFO ThreadId(01) channel{channel=channel::channel-217/transfer:vega-testnet -> agoricdev-7}: spawning
2022-01-20T12:48:47.018574Z  INFO ThreadId(01) spawning
2022-01-20T12:48:47.018986Z  INFO ThreadId(01) cmd: spawning
2022-01-20T12:48:47.019382Z  INFO ThreadId(01) Hermes has started
2022-01-20T12:49:04.632104Z  INFO ThreadId(14) send_tx{id=agoricdev-7}: refresh: retrieved account sequence=28 number=97
2022-01-20T12:49:05.879965Z  INFO ThreadId(14) wait_for_block_commits: waiting for commit of tx hashes(s) DA98AC2FB36A61A6B4184AB06FCE5F3599DA9D6CFE9A17AB7B61D1FD0635337E id=agoricdev-7
2022-01-20T12:49:08.966554Z  INFO ThreadId(17) channel{channel=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: channel handshake step completed with events: OpenTryChannel(

Notice that the span mentions the word channel 4 times "channel{channel=channel::channel-213/transfer:vega-testnet -> agoricdev-7}".

Maybe we can make that less verbose?

The same applies for the connection worker.

2022-01-20T12:46:58.642088Z  INFO ThreadId(01) connection{connection=connection::connection-370:vega-testnet -> agoricdev-7}: spawning
2022-01-20T12:47:13.535306Z  INFO ThreadId(01) spawning
2022-01-20T12:47:13.535640Z  INFO ThreadId(01) cmd: spawning
2022-01-20T12:47:13.536045Z  INFO ThreadId(01) Hermes has started
2022-01-20T12:47:29.456739Z  INFO ThreadId(14) send_tx{id=agoricdev-7}: refresh: retrieved account sequence=27 number=97
2022-01-20T12:47:29.882795Z  INFO ThreadId(14) wait_for_block_commits: waiting for commit of tx hashes(s) 19096BD2AB7A47B2CD1BB41C43B3D476C5C7F6DE9348AC2C199432B81EB7ABA6 id=agoricdev-7
2022-01-20T12:47:34.753965Z  INFO ThreadId(17) connection{connection=connection::connection-370:vega-testnet -> agoricdev-7}: connection handshake step completed with events: OpenConfirmConnection(
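For reference, the four occurrences in the channel case come from four independent sources: the span name, the field key, the object's `channel::` prefix, and the channel identifier itself. A toy reconstruction of how such a span header is assembled (assumed for illustration, not the relayer's actual tracing setup):

```rust
// Toy reconstruction (not the relayer's actual code) of how a span header
// like `channel{channel=channel::channel-213/...}` is assembled: the span
// name, the field key, the object prefix, and the object identifier each
// contribute one occurrence of the word "channel".
fn render_span(span_name: &str, field_key: &str, object: &str) -> String {
    format!("{}{{{}={}}}", span_name, field_key, object)
}
```

This makes the trade-off concrete: renaming the span or the field key removes one occurrence each, but the remaining two live in the object's own textual representation.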


romac commented Jan 20, 2022

How would you suggest we clean them up? Perhaps from

channel{channel=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: spawning

to

channel_worker{object=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: spawning worker


romac commented Jan 20, 2022

Well, channel-213 is the identifier of the channel, and the channel:: prefix is there to ensure that we can tell the object is a channel when we print it. By the way, I am not sure I see how changing channel to object makes filtering harder, given that we can still filter on the channel:: prefix or the channel_worker span name. Can you explain what you mean?

@romac romac marked this pull request as ready for review January 20, 2022 15:27

romac commented Jan 20, 2022

@adizere Re:

I think there is a situation where we're aborting a packet worker, but the corresponding packet_cmd worker lingers on.

Not sure if this has anything to do with this PR, maybe we can track this in an issue and investigate at a later time?

@mzabaluev

we can still filter on the channel:: prefix

Probably, or maybe not; I'm thinking of journalctl match syntax and other similar backend facilities that don't necessarily allow you to efficiently match by pattern. This may be a small concern, though, just my intuition is not to commingle things that can be easily distinguished.


adizere commented Jan 20, 2022

Well, channel-213 is the identifier of the channel and the channel::prefix is there to ensure that we can tell that this object is a channel when we print it. By the way I am not sure I get how changing channel to object makes filtering less easy given that we can still filter on the channel:: prefix or the channel_worker span name, can you explain what you mean?

channel_worker{object=channel::channel-213/transfer:vega-testnet -> agoricdev-7}: spawning worker

object as the logging key provides less information for filters and such.
Is it possible to make the channel IDs etc. less verbose? Two of those channel repetitions seem to be in the ID.

I see. I guess we can ignore this idea then. Let's keep the spans as they are. Redundancy is not always bad.


@adizere adizere left a comment


Just had a relayer operator report on the outcome of using this PR. The Hermes start time went down from ~40 mins to ~10 mins. This is anecdotal, however, since I think the 40 min figure included misbehavior detection. Then the Osmosis epoch time dropped and it messed up our debugging session...

Will report again tomorrow.

For completeness, these are the query metrics.

Metrics
# TYPE queries counter
queries{chain="axelar-dojo-1",query_type="query_channel"} 2
queries{chain="axelar-dojo-1",query_type="query_client_state"} 201
queries{chain="axelar-dojo-1",query_type="query_connection"} 3
queries{chain="axelar-dojo-1",query_type="query_consensus_state"} 200
queries{chain="axelar-dojo-1",query_type="query_latest_height"} 1
queries{chain="axelar-dojo-1",query_type="query_packet_acknowledgements"} 1
queries{chain="axelar-dojo-1",query_type="query_packet_commitments"} 2
queries{chain="axelar-dojo-1",query_type="query_unreceived_acknowledgements"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_channel"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_client_state"} 143
queries{chain="axelar-testnet-lisbon-2",query_type="query_connection"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_consensus_state"} 142
queries{chain="axelar-testnet-lisbon-2",query_type="query_latest_height"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_packet_acknowledgements"} 1
queries{chain="axelar-testnet-lisbon-2",query_type="query_packet_commitments"} 1
queries{chain="bombay-12",query_type="query_channel"} 2
queries{chain="bombay-12",query_type="query_client_state"} 1
queries{chain="bombay-12",query_type="query_connection"} 3
queries{chain="bombay-12",query_type="query_latest_height"} 1
queries{chain="bombay-12",query_type="query_packet_commitments"} 1
queries{chain="bombay-12",query_type="query_unreceived_acknowledgements"} 1
queries{chain="columbus-5",query_type="query_channel"} 21
queries{chain="columbus-5",query_type="query_client_state"} 227
queries{chain="columbus-5",query_type="query_connection"} 15
queries{chain="columbus-5",query_type="query_consensus_state"} 209
queries{chain="columbus-5",query_type="query_latest_height"} 1
queries{chain="columbus-5",query_type="query_packet_acknowledgements"} 2
queries{chain="columbus-5",query_type="query_packet_commitments"} 5
queries{chain="columbus-5",query_type="query_status"} 9
queries{chain="columbus-5",query_type="query_unreceived_acknowledgements"} 9
queries{chain="columbus-5",query_type="query_unreceived_packets"} 6
queries{chain="comdex-1",query_type="query_channel"} 2
queries{chain="comdex-1",query_type="query_client_state"} 197
queries{chain="comdex-1",query_type="query_connection"} 3
queries{chain="comdex-1",query_type="query_consensus_state"} 194
queries{chain="comdex-1",query_type="query_latest_height"} 1
queries{chain="comdex-1",query_type="query_packet_acknowledgements"} 1
queries{chain="comdex-1",query_type="query_packet_commitments"} 2
queries{chain="comdex-1",query_type="query_unreceived_acknowledgements"} 1
queries{chain="core-1",query_type="query_channel"} 6
queries{chain="core-1",query_type="query_client_state"} 365
queries{chain="core-1",query_type="query_connection"} 6
queries{chain="core-1",query_type="query_consensus_state"} 360
queries{chain="core-1",query_type="query_latest_height"} 1
queries{chain="core-1",query_type="query_packet_acknowledgements"} 2
queries{chain="core-1",query_type="query_packet_commitments"} 4
queries{chain="core-1",query_type="query_status"} 1
queries{chain="core-1",query_type="query_unreceived_acknowledgements"} 2
queries{chain="core-1",query_type="query_unreceived_packets"} 1
queries{chain="cosmoshub-4",query_type="query_account"} 1
queries{chain="cosmoshub-4",query_type="query_channel"} 67
queries{chain="cosmoshub-4",query_type="query_client_state"} 602
queries{chain="cosmoshub-4",query_type="query_connection"} 29
queries{chain="cosmoshub-4",query_type="query_consensus_state"} 570
queries{chain="cosmoshub-4",query_type="query_latest_height"} 1
queries{chain="cosmoshub-4",query_type="query_packet_acknowledgements"} 5
queries{chain="cosmoshub-4",query_type="query_packet_commitments"} 8
queries{chain="cosmoshub-4",query_type="query_status"} 28
queries{chain="cosmoshub-4",query_type="query_unreceived_acknowledgements"} 44
queries{chain="cosmoshub-4",query_type="query_unreceived_packets"} 22
queries{chain="emoney-3",query_type="query_channel"} 2
queries{chain="emoney-3",query_type="query_client_state"} 373
queries{chain="emoney-3",query_type="query_connection"} 3
queries{chain="emoney-3",query_type="query_consensus_state"} 372
queries{chain="emoney-3",query_type="query_latest_height"} 1
queries{chain="emoney-3",query_type="query_packet_acknowledgements"} 1
queries{chain="emoney-3",query_type="query_packet_commitments"} 2
queries{chain="emoney-3",query_type="query_unreceived_acknowledgements"} 1
queries{chain="impacthub-3",query_type="query_channel"} 4
queries{chain="impacthub-3",query_type="query_client_state"} 406
queries{chain="impacthub-3",query_type="query_connection"} 6
queries{chain="impacthub-3",query_type="query_consensus_state"} 404
queries{chain="impacthub-3",query_type="query_latest_height"} 1
queries{chain="impacthub-3",query_type="query_packet_acknowledgements"} 2
queries{chain="impacthub-3",query_type="query_packet_commitments"} 4
queries{chain="impacthub-3",query_type="query_unreceived_acknowledgements"} 2
queries{chain="juno-1",query_type="query_channel"} 15
queries{chain="juno-1",query_type="query_client_state"} 374
queries{chain="juno-1",query_type="query_connection"} 14
queries{chain="juno-1",query_type="query_consensus_state"} 347
queries{chain="juno-1",query_type="query_latest_height"} 1
queries{chain="juno-1",query_type="query_packet_acknowledgements"} 4
queries{chain="juno-1",query_type="query_packet_commitments"} 7
queries{chain="juno-1",query_type="query_status"} 6
queries{chain="juno-1",query_type="query_txs"} 1
queries{chain="juno-1",query_type="query_unreceived_acknowledgements"} 5
queries{chain="juno-1",query_type="query_unreceived_packets"} 6
queries{chain="kichain-2",query_type="query_channel"} 10
queries{chain="kichain-2",query_type="query_client_state"} 411
queries{chain="kichain-2",query_type="query_connection"} 9
queries{chain="kichain-2",query_type="query_consensus_state"} 404
queries{chain="kichain-2",query_type="query_latest_height"} 1
queries{chain="kichain-2",query_type="query_packet_acknowledgements"} 3
queries{chain="kichain-2",query_type="query_packet_commitments"} 7
queries{chain="kichain-2",query_type="query_status"} 2
queries{chain="kichain-2",query_type="query_unreceived_acknowledgements"} 5
queries{chain="kichain-2",query_type="query_unreceived_packets"} 1
queries{chain="lum-network-1",query_type="query_channel"} 6
queries{chain="lum-network-1",query_type="query_client_state"} 198
queries{chain="lum-network-1",query_type="query_connection"} 6
queries{chain="lum-network-1",query_type="query_consensus_state"} 191
queries{chain="lum-network-1",query_type="query_latest_height"} 1
queries{chain="lum-network-1",query_type="query_packet_acknowledgements"} 2
queries{chain="lum-network-1",query_type="query_packet_commitments"} 4
queries{chain="lum-network-1",query_type="query_status"} 1
queries{chain="lum-network-1",query_type="query_unreceived_acknowledgements"} 2
queries{chain="lum-network-1",query_type="query_unreceived_packets"} 1
queries{chain="osmosis-1",query_type="query_account"} 1
queries{chain="osmosis-1",query_type="query_blocks"} 2
queries{chain="osmosis-1",query_type="query_channel"} 149
queries{chain="osmosis-1",query_type="query_client_state"} 6430
queries{chain="osmosis-1",query_type="query_connection"} 137
queries{chain="osmosis-1",query_type="query_consensus_state"} 6196
queries{chain="osmosis-1",query_type="query_latest_height"} 1
queries{chain="osmosis-1",query_type="query_packet_acknowledgements"} 19
queries{chain="osmosis-1",query_type="query_packet_commitments"} 41
queries{chain="osmosis-1",query_type="query_status"} 35
queries{chain="osmosis-1",query_type="query_txs"} 12
queries{chain="osmosis-1",query_type="query_unreceived_acknowledgements"} 38
queries{chain="osmosis-1",query_type="query_unreceived_packets"} 26
queries{chain="panacea-3",query_type="query_account"} 1
queries{chain="panacea-3",query_type="query_channel"} 132
queries{chain="panacea-3",query_type="query_client_state"} 206
queries{chain="panacea-3",query_type="query_connection"} 4
queries{chain="panacea-3",query_type="query_consensus_state"} 205
queries{chain="panacea-3",query_type="query_latest_height"} 1
queries{chain="panacea-3",query_type="query_packet_acknowledgements"} 2
queries{chain="panacea-3",query_type="query_packet_commitments"} 4
queries{chain="panacea-3",query_type="query_status"} 4
queries{chain="panacea-3",query_type="query_txs"} 1
queries{chain="panacea-3",query_type="query_unreceived_acknowledgements"} 258
queries{chain="panacea-3",query_type="query_unreceived_packets"} 2
queries{chain="sentinelhub-2",query_type="query_channel"} 2
queries{chain="sentinelhub-2",query_type="query_client_state"} 432
queries{chain="sentinelhub-2",query_type="query_connection"} 3
queries{chain="sentinelhub-2",query_type="query_consensus_state"} 428
queries{chain="sentinelhub-2",query_type="query_latest_height"} 1
queries{chain="sentinelhub-2",query_type="query_packet_acknowledgements"} 1
queries{chain="sentinelhub-2",query_type="query_packet_commitments"} 2
queries{chain="sentinelhub-2",query_type="query_unreceived_acknowledgements"} 1
queries{chain="stargaze-1",query_type="query_channel"} 16
queries{chain="stargaze-1",query_type="query_client_state"} 392
queries{chain="stargaze-1",query_type="query_connection"} 14
queries{chain="stargaze-1",query_type="query_consensus_state"} 372
queries{chain="stargaze-1",query_type="query_latest_height"} 1
queries{chain="stargaze-1",query_type="query_packet_acknowledgements"} 2
queries{chain="stargaze-1",query_type="query_packet_commitments"} 6
queries{chain="stargaze-1",query_type="query_status"} 3
queries{chain="stargaze-1",query_type="query_unreceived_acknowledgements"} 6
queries{chain="stargaze-1",query_type="query_unreceived_packets"} 3
queries{chain="vidulum-1",query_type="query_channel"} 2
queries{chain="vidulum-1",query_type="query_client_state"} 195
queries{chain="vidulum-1",query_type="query_connection"} 3
queries{chain="vidulum-1",query_type="query_consensus_state"} 194
queries{chain="vidulum-1",query_type="query_latest_height"} 1
queries{chain="vidulum-1",query_type="query_packet_acknowledgements"} 1
queries{chain="vidulum-1",query_type="query_packet_commitments"} 2
queries{chain="vidulum-1",query_type="query_unreceived_acknowledgements"} 1
# HELP workers Number of workers per object
# TYPE workers gauge
workers{type="client"} 35
workers{type="packet"} 17

A question

One thing that stands out is this:

queries{chain="osmosis-1",query_type="query_client_state"} 6430

6K queries might explain why Hermes needs ~10 minutes to start, but I'm not sure.

The question is: At startup time, assuming ~10 preconfigured channels for osmosis-1, why would Hermes need to query client_state 6K times? We should clarify this.

@@ -0,0 +1,4 @@
- Reduce the startup time of the relayer by querying the necessary


Do we need this?


adizere commented Jan 24, 2022

One valuable detail we found while parsing some logs with Romac is that Hermes' scanning phase can be blazing fast (under 10 seconds for a relayer operator with ~40 allowed channels on 17 chains), but the spawning phase is the one consuming most of the time. In an execution where the start time was 10 minutes, 6 of those minutes went to only 3 spawning tasks, as follows:

  • almost 2 minutes
  665 Jan 20 17:03:19 Imperator-Relayer hermes[1854166]: 2022-01-20T17:03:19.229792Z DEBUG ThreadId(01) spawned Client worker: client::juno-1->osmosis-1:07-tendermint-1457
   666 Jan 20 17:05:06 Imperator-Relayer hermes[1854166]: 2022-01-20T17:05:06.504609Z DEBUG ThreadId(01) done spawning workers for chain osmosis-1 and channel channel-42
  • almost 2 minutes
   735 Jan 20 17:06:12 Imperator-Relayer hermes[1854166]: 2022-01-20T17:06:12.744757Z DEBUG ThreadId(01) channel channel-141 on chain cosmoshub-4 is: OPEN; state on dest. chain (osmosis-1) is: OPEN
   736 Jan 20 17:08:00 Imperator-Relayer hermes[1854166]: 2022-01-20T17:08:00.022189Z DEBUG ThreadId(01) refresh{client=07-tendermint-259 src_chain=osmosis-1 dst_chain=cosmoshub-4}: spawning task
  • 2 minutes and 40s
   767 Jan 20 17:08:08 Imperator-Relayer hermes[1854166]: 2022-01-20T17:08:08.141602Z DEBUG ThreadId(01) spawned Client worker: client::osmosis-1->juno-1:07-tendermint-0
   768 Jan 20 17:10:55 Imperator-Relayer hermes[1854166]: 2022-01-20T17:10:55.065636Z DEBUG ThreadId(01) done spawning workers for chain juno-1 and channel channel-0

This is low-hanging fruit we should investigate. If we can eliminate this bottleneck, the start time would drop by more than half.

@seanchen1991

@adizere and I dug into this performance issue. Our strong suspicion is that the has_packets closure and has_acks closure are making many network calls and tanking the startup performance in certain cases.

Adi made the observation that these closures are fetching large vectors of data just to calculate a boolean value. Could we make use of query::pagination to fetch at most one acknowledgement when we're computing these closures?


romac commented Jan 24, 2022

Very insightful, thanks! Will look into it.

@adizere (Member) commented Jan 25, 2022:

> @adizere and I dug into this performance issue. Our strong suspicion is that the `has_packets` and `has_acks` closures are making many network calls and tanking the startup performance in certain cases.
>
> Adi made the observation that these closures are fetching large vectors of data just to calculate a boolean value. Could we make use of `query::pagination` to fetch at most one acknowledgement when we're computing these closures?

Building on this idea, if we choose to optimize the start time, here's a sketch of how we can exploit pagination to reduce pressure/bandwidth on the full node:

5be11ae#diff-5d10b574e97438514650fd6f44664299baf20536aab73554ca2754906d2a72d4R76

I'd suggest that the refactoring based on `query::pagination` be done separately, in a later PR in the next milestone. For releasing v0.11.0 we can go ahead with the current design.

@romac (Member, author) commented Jan 25, 2022:

@adizere Thanks for the sketch! I have an alternative solution here, where we only query a single commitment and check if the resulting vector is empty or not: 83fc0db

Edit: slightly simplified in 5261d65

@adizere adizere self-requested a review January 26, 2022 09:11
@ancazamfir (Collaborator) left a comment:

wow! amazing work here. In addition to the perf improvement, I think the scanning/spawning code looks so much cleaner and easier to understand. Thanks!!

@romac romac merged commit edc9a2c into master Jan 26, 2022
@romac romac deleted the romac/spawn-scanner branch January 26, 2022 12:33
michaelfig pushed a commit to agoric-labs/ibc-rs that referenced this pull request Feb 4, 2022
…1705)

- Split the worker spawning code into two parts:
  - a) scan the chains for the clients, connections and channels
  - b) spawn the appropriate workers based on the resulting scan
- When scanning a chain with filtering enabled and an allow list, skip scanning all the clients and query the allowed channels directly, resulting in much fewer queries and a faster start.
- Add a `--full-scan` option to `hermes start` to opt out of the fast start mechanism and do a full scan.
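The two-phase split above can be sketched roughly as follows (all names here — `ChainScan`, `scan_chain`, `spawn_workers` — are illustrative, not Hermes' actual types):

```rust
/// Result of phase (a): what was found on one chain.
pub struct ChainScan {
    pub chain_id: String,
    pub channels: Vec<(String, String)>, // (port, channel) pairs
}

/// Phase (a): with an allow list configured, take the listed channels
/// directly instead of walking every client on the chain.
pub fn scan_chain(chain_id: &str, allow_list: Option<&[(&str, &str)]>) -> ChainScan {
    let channels = match allow_list {
        Some(allowed) => allowed
            .iter()
            .map(|(port, chan)| (port.to_string(), chan.to_string()))
            .collect(),
        // No allow list (or `--full-scan`): fall back to the expensive
        // full walk of clients, connections and channels.
        None => full_scan(chain_id),
    };
    ChainScan { chain_id: chain_id.to_string(), channels }
}

/// Placeholder for the full client/connection/channel walk.
fn full_scan(_chain_id: &str) -> Vec<(String, String)> {
    Vec::new()
}

/// Phase (b): spawn one worker per scanned channel; returns how many.
pub fn spawn_workers(scan: &ChainScan) -> usize {
    scan.channels.len()
}
```

Keeping phase (b) a pure function of the scan result is what makes the `--full-scan` opt-out cheap: only phase (a) changes.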

---

* Introduce a `ChainScanner` to scan the chains for clients, connections and channels

* Use `ChainScanner` for spawning workers

* Formatting

* Add `--full-scan` option to `start` command to force a full scan even when some chains are using allow lists

* Remove debug statements and print scanned chains on startup

* Changelog entry

* Fix duplicate info message

* Quote identifiers in log messages

* Better error when port/channel does not exist

* Add metrics for queries

* Small log improvements

* Rename queries metric

* Use `chain` key for recording chain identifier in tracing logs

* Use more structured logging in chain scanner

* Fix changelog entry

* Improve logs when no workers were spawned

* Improve logs when spawning connection and channel workers

* Remove spaces in objects names

* Add changelog entry

* Revert part of logs changes

* Use INFO level for spawning logs

* Remove redundant changelog entry
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
…1705)
Successfully merging this pull request may close these issues:

- Reduce initialization time for Hermes `start` CLI (#1536)