-
Notifications
You must be signed in to change notification settings - Fork 329
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve metrics descriptions and naming #2409
Conversation
…nfiguration 'tx_confirmation' default to be false
| `send_packet_count` | Number of SendPacket events processed | `u64` Counter | | ||
| `acknowledgement_count` | Number of WriteAcknowledgement events processed | `u64` Counter | | ||
| `cleared_send_packet_count` | Number of SendPacket events processed during the initial and periodic clearing | `u64` Counter | | ||
| `cleared_acknowledgment_count` | Number of WriteAcknowledgement events processed during the initial and periodic clearing | `u64` Counter | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you try to get this counter in the example below? I tried and was not able to get it. Here is what I tried:
- send 20 packets to ibc-0
hermes tx ft-transfer --sender-chain ibc-0 --receiver-chain ibc-1 --sender-channel channel-0 --sender-port transfer --amount 90 --timeout-height-offset 3000 --number-msgs 20 --key-name user2
- check pending (shows 20 unreceived_packets)
hermes query packet pending --chain ibc-0 --channel channel-0 --port transfer
- receive the 20 packets on ibc-1
hermes tx packet-recv --receiver-chain ibc-1 --sender-chain ibc-0 --sender-port transfer --sender-channel channel-0
- check pending (shows 20 unreceived_acks)
hermes query packet pending --chain ibc-0 --channel channel-0 --port transfer
- send 10 more packets to ibc-0
hermes tx ft-transfer --sender-chain ibc-0 --receiver-chain ibc-1 --sender-channel channel-0 --sender-port transfer --amount 90 --timeout-height-offset 3000 --number-msgs 10 --key-name user2
- check pending (shows 10 unreceived_packets and 20 unreceived_acks)
hermes query packet pending --chain ibc-0 --channel channel-0 --port transfer
- start hermes
hermes start
- wait a bit for hermes to finish clearing and then check metrics. I get these (
cleared_acknowledgment_count
is missing):
# HELP acknowledgement_count Number of WriteAcknowledgement events processed
# TYPE acknowledgement_count counter
acknowledgement_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
# HELP cleared_send_packet_count Number of SendPacket events processed during the initial and periodic clearing
# TYPE cleared_send_packet_count counter
cleared_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
# HELP msg_num How many messages Hermes submitted to the chain, per chain
# TYPE msg_num counter
msg_num{chain="ibc-0"} 32
msg_num{chain="ibc-1"} 11
# HELP send_packet_count Number of SendPacket events processed
# TYPE send_packet_count counter
send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
For me also the latency metrics are missing for one chain (ibc-1
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed I am not getting the cleared_acknowledgment_count
as expected. I will investigate this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was an issue in where the cleared_acknowledgment_count
was counted. I pushed a fix and running the example you gave now correctly has the following metrics:
acknowledgement_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_acknowledgment_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 20
send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
There is indeed an issue recording latencies with the clearing:
- clearing unreceived_acks doesn't record any latency
- clearing unreceived_packets only record latencies for the sender chain
- sending packets when Hermes is running correctly record the latencies
I will check and see how this can be corrected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
excellent, thanks for looking into this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me also the latency metrics are missing for one chain (ibc-1)
The latency uses the tracking id in order to record the submit and confirmed time. The tracking id of packets cleared wasn’t being recorded, thus no latency was measured. In addition, we were using a static string for the tracking id when clearing packets.
In order to have the latencies even when clearing packets, I added an additional enum to the TrackingId, which still allows to differentiate cleared packets by the tracking id as they have a prefix "cleared/“ to a UUID.
By repeating the example completely I correctly have 2 latencies submitted and confirmed from ibc-0 to ibc-1, the acks and recv pending. And 1 latency submitted and confirmed from ibc-1 to ibc-0 which should be the ack from the pending recv.
And by repeating the first 4 steps, in order to have only pending acks, the metrics show 1 latency submitted and confirmed from ibc-0 to ibc-1.
What do you think about this solution ? The changes are in the commit : a75810d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
back to the counters, for the example I gave I get now:
acknowledgement_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_acknowledgment_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 20
send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
msg_num{chain="ibc-0"} 32
msg_num{chain="ibc-1"} 11
should acknowledgement_count
be 30 if it is the total? Meaning there were 20 pending acks plus the 10 that came in live from the processing of the 10 send_packet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea behind these four metrics is to give information on two separate interests:
- Has Hermes handled all the WriteAcknowledgment events related to the SendPacket events it handled. This is shown by
send_packet_count
andacknowledgment_count
. - Of all the SendPacket and WriteAcknowledgement events handled, how many were from packet clearing. This is shown by
cleared_send_packet_count
andcleared_acknowledgement_count
.
So in my opinion the acknowledgment_count
should always mirror the send_packet_count
if everything is working correctly.
If there were pending WriteAcknowledgment events, and acknowledgment_count
is not dissociated from cleared_acknowledgement_count
, then send_packet_count
will not mirror acknowledgment_count
, since the SendPacket event have been handled before Hermes started.
And for the cleared_send_packet_count
it is the opposite, if the pending SendPacket events are not included in the send_packet_count
it will differ from acknowledgment_count
since these SendPacket events will have a corresponding WriteAcknowledgment event.
I can see the potential confusion these metrics can cause, if this solution doesn't seem to fit the requirements I can change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me this is confusing. Also send_packet_count
will not mirror acknowledgment_count
if any packets have timed out. IMO we should be consistent for send and ack metrics. But if we keep things the way they are we need to explain very clearly what the metrics mean in the help strings. The name of the counters would also need to reflect the difference (send_packet_count
is a total while the acknowledgment_count
counts the live events only). Something like:
live_acknowledgement_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_acknowledgment_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 20
total_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
cleared_send_packet_count{chain="ibc-0",channel="channel-0",counterparty="ibc-1",port="transfer"} 10
We could also have live_send_packet_count
instead of total
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I hadn't taken into account the timeouts. The points raised regarding these metrics make me realise that these metrics seem more confusing than helpful.
The addition of these metrics was originally to spot slow/stalled traffic #2175, but in my opinion the oldest_sequence
and oldest_timestamp
should be enough to track this.
Furthermore these metrics, send_packet_count
, acknowledgment_count
, cleared_send_packet_count
and cleared_acknowledgement_count
seem to aim at corner cases rather than general cases, they work well if there are no timeouts, and/or pending packets.
The more points were raised the more I had trouble convincing myself of the usefulness of these metrics. I am wondering if it is a good idea to keep them. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep something, they are helpful as there is no other hint on possible packets relayed when tx_confirmation
is false. It's the best we can do to say we processed x, y type events.
What is not completely clear is how to reduce the confusion. As I said, if we keep what we have, we can clarify by having good help and metric names. Since these are already in the code base I would say let's do it the best we can in this PR. Once we experiment more we can change later along the other lines I mentioned or something else or even remove them.
…and updated telemetry output example in Hermes guide
I'm curious, what is the rationale for making this default false? Some operators are running with this enabled. Some are disabling it for optimization reasons. |
It involves more complex processing and logic. The config also says "experimental". And for some reason I thought most operators run with it disabled. Is there any way to gauge this? |
I'm not sure what you mean. Do you mean measure/analytics what % of operators have We discussed yesterday with operators and here are 3 takeaways, also based on my observations so far:
|
Yes. I agree with all of your other 3 points @adizere. I was also confused about the lack of "discovery". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great improvements in here! Did some tests and looks clearly better than what is currently in master
. Will pick up the work in a follow-up issue + PR.
* Improved naming and description of some telemetry metrics. Changed configuration 'tx_confirmation' default to be false * Added changelog entry * Updated guide with new telemetry changes * Renamed clear packet metrics * Updated some metric descriptions and conditional display * Fixed metric used to count cleared acknowledgments, updated latencies histogram buckets. * Clippy fix * Added new enum to TrackindId in order to track latency of packets cleared * Fixes in comments * Improved changelog entry, fixed description of wallet balance metric and updated telemetry output example in Hermes guide
Closes: #2408
Description
This PR applies the following improvements to some telemetry metrics:
cleared_count
tocleared_send_packet_count
cleared_acknowledgment_count
ibc_receive_packets
ibc_acknowledgment_packets
ibc_timeout_packets
send_packet_count
acknowledgement_count
cleared_send_packet_count
oldest_sequence
tx_latency_submitted
tx_latency_confirmed
tx_confirmation
tofalse
forconfig.toml
Improvements suggested by operators:
0
) value so that it no longer signals any oldest sn/timestampPR author checklist:
unclog
.docs/
).Reviewer checklist:
Files changed
in the GitHub PR explorer.