Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics for mempool transaction verification and state checks, including failures #2626

Closed
12 tasks done
Tracked by #2309
mpguerra opened this issue Aug 13, 2021 · 13 comments · Fixed by #2878
Closed
12 tasks done
Tracked by #2309
Assignees
Labels
C-enhancement Category: This is an improvement

Comments

@mpguerra
Copy link
Contributor

mpguerra commented Aug 13, 2021

Motivation

Specifications

Designs

See below

Related Work

Design (discussed below)

From zcashd (use same names)

Already existing (from tx verifier/downloader) that we should rename

  • mempool.downloaded.transactions.total [counter] (renamed from gossip.downloaded.transaction.count)
    • Number of transactions downloaded for the mempool
  • mempool.pushed.transactions.total [counter] (renamed from gossip.pushed.transaction.count)
    • Number of transactions pushed directly for verification
  • mempool.verified.transactions.total [counter] (renamed from gossip.verified.transaction.count)
    • Number of transactions verified successfully
  • mempool.cancelled.verify.tasks.total [counter] (renamed from gossip.cancelled.count)
    • Number of transactions downloads/verification cancelled
  • mempool.currently.queued.transactions.total [gauge] (renamed from gossip.queued.transaction.count)
    • Number of transactions currently queued for download/verification
    • (also update it after poll_next()!)

To be created

  • mempool.queued.transactions.total [counter]
    • Number of transactions ever queued for download/verification
  • mempool.rejected.transaction.ids.total [gauge]
    • Number of currently rejected transactions
  • mempool.rejected.transaction.ids.bytes [gauge]
    • The sum of serialized sizes of rejected IDs
    • TODO: we can later replace by memory size; should be pretty similar
  • mempool.failed.verify.tasks.total [counter]
    • Number of transactions that failed download or verification
  • mempool.gossiped.transactions.total [counter]
  • TODO: number of transactions gossiped to other nodes?

From design brainstorm document:

  • Tag each transaction with its version
  • Tag each transaction with its received timestamp/age, if that’s easy
@mpguerra mpguerra mentioned this issue Aug 13, 2021
59 tasks
@mpguerra mpguerra added C-enhancement Category: This is an improvement S-needs-triage Status: A bug report needs triage P-Low labels Aug 13, 2021
@mpguerra mpguerra added this to the 2021 Sprint 20 milestone Aug 26, 2021
@conradoplg
Copy link
Collaborator

I've used e.g. gossip.downloaded.transaction.count for mempool transactions, but looking back that doesn't make much sense (I based the name on the block metrics). Maybe we should use e.g. mempool.downloaded.transaction.count? Make sure to update transaction_verification.json to match.

@str4d
Copy link
Contributor

str4d commented Sep 3, 2021

Currently zcashd tracks the following mempool-related metrics:

  • zcash.mempool.size.transactions [gauge]
    • Number of transactions in the mempool.
  • zcash.mempool.size.bytes [gauge]
    • The sum of the serialized transaction sizes, i.e. how much block space they require.
  • zcash.mempool.usage.bytes [gauge]
    • The dynamic memory usage of the mempool, i.e. how much it is costing the node to track them.

@teor2345
Copy link
Contributor

teor2345 commented Sep 3, 2021

In Rust, the dynamic size can be calculated using std::mem::size_of, but it needs to be done recursively for pointers, and iteratively for collections.

So maybe we can split that into a separate ticket.

@teor2345
Copy link
Contributor

teor2345 commented Sep 3, 2021

zcash.mempool.size.bytes [gauge]

  • The sum of the serialized transaction sizes, i.e. how much block space they require.

We could add this size to UnminedTx and populate it as we deserialize the transaction.

@str4d
Copy link
Contributor

str4d commented Sep 3, 2021

In Rust, the dynamic size can be calculated using std::mem::size_of, but it needs to be done recursively for pointers, and iteratively for collections.

Yep; for instance, we've added the orchard::Bundle<Authorized, _>::dynamic_usage method to expose this to zcashd.

@teor2345
Copy link
Contributor

teor2345 commented Sep 3, 2021

In Rust, the dynamic size can be calculated using std::mem::size_of, but it needs to be done recursively for pointers, and iteratively for collections.

Yep; for instance, we've added the orchard::Bundle<Authorized, _>::dynamic_usage method to expose this to zcashd.

I expect we'll want a trait for this, and maybe a custom derive macro.
Then the compiler can check that we've implemented the trait for all the fields of each type.

We could also implement the trait using std::mem::size_of for all Copy types.

@str4d
Copy link
Contributor

str4d commented Sep 3, 2021

So, I've thrown this together: https://github.com/str4d/memuse

@str4d
Copy link
Contributor

str4d commented Sep 3, 2021

We could also implement the trait using std::mem::size_of for all Copy types.

We can't do this generically, because rustc will complain in ways we can't get around without being, say, the Rust standard library:

note: upstream crates may add a new impl of trait `std::marker::Copy` for type `std::string::String` in future versions

@str4d
Copy link
Contributor

str4d commented Sep 5, 2021

So, I've thrown this together: https://github.com/str4d/memuse

Now published as memuse 0.1.0 (no custom derive macro yet, but it has impls for almost all the stdlib types).

@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label Sep 7, 2021
@mpguerra mpguerra removed this from the 2021 Sprint 20 milestone Sep 27, 2021
@conradoplg
Copy link
Collaborator

Here's a proposal / suggestions / questions:

From zcashd (use same names?)

  • zcash.mempool.size.transactions [gauge]
    • Number of transactions in the mempool.
  • zcash.mempool.size.bytes [gauge]
    • The sum of the serialized transaction sizes, i.e. how much block space they require.
  • zcash.mempool.usage.bytes [gauge]
    • The dynamic memory usage of the mempool, i.e. how much it is costing the node to track them.

Already existing (from tx verifier/downloader) that we should rename

  • mempool.downloaded.count [counter] (renamed from gossip.downloaded.transaction.count)
    • Number of transactions downloaded for the mempool
  • mempool.pushed.count [counter] (renamed from gossip.pushed.transaction.count)
    • Number of transactions pushed directly for verification
  • mempool.verified.count [counter] (renamed from gossip.verified.transaction.count)
    • Number of transactions verified successfully
  • mempool.cancelled.count [counter] (renamed from gossip.cancelled.count)
    • Number of transactions downloads/verification cancelled
  • mempool.queued.count [gauge] (renamed from gossip.queued.transaction.count)
    • Number of transactions queued for download/verification
    • (TODO: also update it after poll_next())

To be created

  • mempool.rejected.count [gauge]
    • Number of rejected transactions
    • TODO: separate by type?
  • mempool.rejected.bytes [gauge]
    • The dynamic memory usage of the rejection list; or should we include this in zcash.mempool.usage.bytes?
  • TODO: number of transactions that failed verification?
  • TODO: number of transactions gossiped to other nodes?

TODO: should we use 'gauge' suffix for gauges? We don't do that anywhere else, but it may be interesting to have both the counter and gauge for e.g. rejected transactions, maybe?

TODO: from design brainstorm document:

  • Tag each transaction with its version
  • Tag each transaction with its received timestamp/age, if that’s easy

@conradoplg conradoplg added this to the 2021 Sprint 20 milestone Oct 11, 2021
@conradoplg conradoplg self-assigned this Oct 11, 2021
@teor2345
Copy link
Contributor

At the meeting today, we talked about triaging metrics based on two questions:

  • is it quick?
  • does it help us find bugs?

Too Much Work

Here are the metrics that aren't quick, so we should skip them:

  • zcash.mempool.usage.bytes [gauge]
    • The dynamic memory usage of the mempool, i.e. how much it is costing the node to track them.
  • mempool.rejected.bytes [gauge]
    • The dynamic memory usage of the rejection list; or should we include this in zcash.mempool.usage.bytes?

But we might want to do this alternative:

  • The sum of the serialized sizes of rejected IDs

Bug Finding

All these metrics will help us find bugs. But the higher-level metrics can help us discover issues to investigate. Then we can add more detailed metrics as needed.

So let's focus on the number of transactions and rejections first.

We might also want a transaction version dashboard, because zcashd had a mempool v5 transaction rejection bug.

Naming

(use same names?)

If the metric represents exactly the same thing, we should use the same names.
Particularly for zcash-prefixed metrics.

[metrics] should have a suffix describing the unit, in plural form. Note that an accumulating count has total as a suffix, in addition to the unit if applicable.

https://prometheus.io/docs/practices/naming/#metric-names

So I suggest these renames:

Counters:

  • gossip.downloaded.transaction.count -> mempool.downloaded.transactions.total
  • gossip.pushed.transaction.count -> mempool.pushed.transactions.total
  • gossip.verified.transaction.count -> mempool.verified.transactions.total
  • gossip.cancelled.count -> mempool.cancelled.verify.tasks.total (download.and.verify.tasks seems too long)
  • (new counter) -> gossip.queued.transactions.total

Gauges:

  • gossip.queued.transaction.count -> mempool.queued.transactions
  • mempool.rejected.count -> mempool.rejected.transaction.ids.total
  • mempool.rejected.bytes -> mempool.rejected.transaction.ids.bytes

See the Prometheus metric naming guide for more examples and suggestions.

@mpguerra
Copy link
Contributor Author

* (new counter) -> `gossip.queued.transactions.total`

Should this not be mempool.queued.transactions.total instead?

@conradoplg
Copy link
Collaborator

@teor2345 thanks for the feedback!
@mpguerra agreed, I'm considering that instead.

I've incorporated it and posted the result in the ticket description. I've added mempool.failed.count there too.

I'm not sure about using mempool.queued.transactions (gauge) and mempool.queued.transactions.total (counter), seems a bit weird. Maybe mempool.currently.queued.transactions.total (gauge) and mempool.queued.transactions.total (counter)?

Feedback is still welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: This is an improvement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants