[Telemetry] Design V1 Devnet Dashboard #195

Closed
8 tasks
Olshansk opened this issue Sep 7, 2022 · 3 comments
Assignees
Labels
telemetry (everything related to collection telemetry), tooling (tooling to support development, testing et al)

Comments

@Olshansk
Member

Olshansk commented Sep 7, 2022

Objective

Research and provide requirements for a public-facing V1 telemetry dashboard that can measure the performance of V1 Devnet components.

Origin Document

There are several existing examples of gathering and sharing telemetry data that are helpful references for designing our own dashboard:

  1. Rinkeby Testnet
  2. NEAR Testnet
  3. Aptos Testnet (information collected from users for Telemetry)
  4. V0
  5. C0d3r
  6. POKTscan

Goals / Deliverables

  • Create a 1-2 page requirements doc that:
    • Defines key metrics required for a telemetry dashboard (5-10)
    • Defines required visualization types per metric (tile vs. bar chart vs. stacked)
    • Defines interactivity boundaries (drill down? Explanatory, or exploratory?)
  • Identify data sources to guide infrastructure engineering requirements
    • What is logged vs. queried?
    • What is the size of the dataset?
    • Will additional data engineering be required?

Creator: @jessicadaugherty

@Olshansk Olshansk self-assigned this Sep 7, 2022
@Olshansk Olshansk assigned okdas and unassigned Olshansk Oct 7, 2022
@Olshansk Olshansk added telemetry everything related to collection telemetry tooling tooling to support development, testing et al labels Oct 7, 2022
@jessicadaugherty jessicadaugherty moved this from Backlog to In Research in V1 Dashboard Dec 19, 2022
@jessicadaugherty jessicadaugherty assigned okdas and unassigned okdas Dec 19, 2022
@jessicadaugherty jessicadaugherty moved this from In Research to In Progress in V1 Dashboard Jan 13, 2023
@jessicadaugherty jessicadaugherty moved this from In Progress to In Review in V1 Dashboard Jan 23, 2023
@Olshansk Olshansk linked a pull request Jan 24, 2023 that will close this issue
14 tasks
@jessicadaugherty
Contributor

Submitting recommendations for an MVP of a public-facing devnet telemetry dashboard.

All recommendations are documented here.

  • Recommended 10 primary metrics to track as part of a devnet dashboard
  • Outlined data points required with various visualization types
  • Provided information (that needs to be confirmed) about sources for data points
  • Provided mock ups for potential dashboards based on visualization types
  • MISSING
    • What is queried vs. what is logged
    • What is raw vs. what is analyzed
    • The size of the dataset

Requesting feedback from @okdas and @Gustavobelfort. Thank you!

@Olshansk
Member Author

Olshansk commented Feb 3, 2023

@jessicadaugherty Left a few minor comments on the Notion page. Overall, LGTM, and I believe it's a good starting point.

@okdas
Member

okdas commented Feb 10, 2023

@jessicadaugherty The research is very solid! Thank you!

TLDR: I would like us to start with a small dashboard that only utilizes the metrics we already expose. That way we can either gradually add more information to expose from our validators (metrics/logs) or build an app that aggregates blockchain data, and possibly provides a faucet for testnet later.

RE: Data Engineering. I think we can categorize the sources of data in three buckets:

  • Observability data, which only requires standard monitoring infrastructure in addition to the validators:
    • Metrics. There are important points to keep in mind:
      • Each validator has its own representation of network health from its PoV. This information might or might not represent the actual network health or status, so we need to take that into consideration when utilizing such metrics. For example, one validator's block height might be further ahead than another's, so we'd need to query for the maximum value across all validators to get the latest height.
      • If we were to utilize Prometheus or a similar solution, we need to take into account that Prometheus pulls the metrics periodically (e.g. every 15 or 30 seconds), so any changes between scrapes are not picked up. This is usually fine because many metrics can be interpolated, and the primary use of Prometheus is monitoring, where precision is not usually important.
    • Logs. We currently log events to stdout, which are then picked up by Loki and can be queried/aggregated. The query language is advanced, but it might not cover very complex queries the way SQL, for example, would.
  • Data that can be queried with RPC - anything that our actors expose via an RPC endpoint. A good example is v0's OpenAPI spec: https://github.com/pokt-network/pocket-core/blob/staging/doc/specs/rpc-spec.yaml - the data can be queried from an API endpoint. I know we talked about a similar implementation for v1, and this is coming later.
  • Full-blown data warehouse (DWH) that holds all transactions. This is what explorers, such as POKTscan, have to do to make all data readily available and indexed. Since we utilize Postgres in v1 for persistence, I suspect some of the data can be queried from that database!
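
To make the two metrics caveats above concrete, here is a minimal Python sketch. It is not taken from the v1 codebase: the `HeightSample` shape, the validator names, and the helper functions are all hypothetical. It shows (a) taking the maximum self-reported height across validators as the network's latest height, and (b) linearly interpolating a value between two periodic scrapes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeightSample:
    """One validator's self-reported block height (hypothetical shape)."""
    validator: str
    height: int


def network_height(samples: list[HeightSample]) -> int:
    """Latest height seen by any validator, i.e. the max across all views."""
    if not samples:
        raise ValueError("no validator samples available")
    return max(s.height for s in samples)


def lagging_validators(samples: list[HeightSample], tolerance: int = 1) -> list[str]:
    """Validators that are more than `tolerance` blocks behind the max height."""
    tip = network_height(samples)
    return [s.validator for s in samples if tip - s.height > tolerance]


def interpolate(t0: float, v0: float, t1: float, v1: float, t: float) -> float:
    """Linearly estimate a metric at time t between scrapes at t0 and t1."""
    if t1 <= t0:
        raise ValueError("t1 must be later than t0")
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Assumed example data: three validators with slightly different views.
samples = [
    HeightSample("validator-1", 120),
    HeightSample("validator-2", 118),
    HeightSample("validator-3", 120),
]
print(network_height(samples))
print(lagging_validators(samples))
# Height estimate 5 seconds into a 15-second scrape interval.
print(interpolate(0, 100, 15, 103, 5))
```

In a Prometheus-backed dashboard the same idea would likely be expressed with the `max()` aggregation operator over whatever height metric the validators expose, rather than in application code.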

You might notice that all dashboards used as a reference are actually web applications, not publicly shared Grafana or Datadog dashboards, and there is likely a good reason for that: most of the time, observability data is just not enough to build a thorough dashboard. I've only seen a few examples of great public dashboards, and I can only find one that is currently alive - https://monitoring.cardano-testnet.iohkdev.io/grafana/d/Oe0reiHef/cardano-application-metrics-v2?orgId=1&refresh=1m&from=now-2d&to=now

Looking at the suggested visualizations, I think we should do the following:

  • As @Olshansk suggested, let's start with the top-right dashboard; it would be a great starting point.
  • Let's not build an app yet, but try to use Grafana. We use hosted Grafana at PNI, which should allow us to make some dashboards publicly available.
  • I think we should be able to get most of the data necessary for that dashboard from observability metrics, plus we should be able to utilize the Postgres data source in Grafana to get a list of apps.
  • As we gain expertise and find gaps in data that we cannot expose via metrics/logs, we can think about adding a simple web app to show that dashboard.

I can take a first pass at this while I deploy the first devnet. We also need some sort of visualization for maintaining devnets internally, and this dashboard would be a good start.

Projects
Status: Done

3 participants