[Telemetry] Design V1 Devnet Dashboard #195

Olshansk · 2022-09-07T21:15:11Z

Objective

Research and provide requirements for a public-facing V1 telemetry dashboard that can measure the performance of V1 Devnet components.

Origin Document

There are several existing examples of gathering and sharing telemetry data that are helpful references for designing our own dashboard:

Rinkeby Testnet
NEAR Testnet
Aptos Testnet (information collected from users for Telemetry)
V0
C0d3r
POKTscan

Goals / Deliverables

Create a 1-2 page requirements doc that:
- Defines key metrics required for a telemetry dashboard (5-10)
- Defines required visualization types per metric (tile vs. bar chart vs. stacked)
- Defines interactivity boundaries (drill down? Explanatory, or exploratory?)
Identify data sources to guide infrastructure engineering requirements:
- What is logged vs. queried?
- What is the size of the dataset?
- Will additional data engineering be required?

Creator: @jessicadaugherty

jessicadaugherty · 2023-01-24T17:02:25Z

Submitting recommendations for an MVP of a public-facing devnet telemetry dashboard.

All recommendations are documented here.

Recommended 10 primary metrics to track as part of a devnet dashboard
Outlined data points required with various visualization types
Provided information (that needs to be confirmed) about sources for data points
Provided mockups for potential dashboards based on visualization types
MISSING
- What is queried vs. what is logged
- What is raw vs. what is analyzed
- The size of the dataset

Requesting feedback from @okdas and @Gustavobelfort. Thank you!

Olshansk · 2023-02-03T23:10:08Z

@jessicadaugherty Left a few minor comments in the Notion page. Overall, LGTM, and I believe it's a good starting point.

okdas · 2023-02-10T01:42:38Z

@jessicadaugherty The research is very solid! Thank you!

TLDR: I would like us to start with a small dashboard that only utilizes the metrics we already expose. That way we can either gradually add more information to expose from our validators (metrics/logs) or build an app that aggregates blockchain data, and possibly provides a faucet for testnet later.

RE: Data Engineering. I think we can categorize the sources of data into three buckets:

Observability data, which only requires standard monitoring infrastructure in addition to the validators:
- Metrics. There are important points to keep in mind:
  - Each validator has its own view of network health from its PoV, which may or may not reflect the actual state of the network, so we need to take that into consideration when using such metrics. For example, one validator's block height might be ahead of another's, so we’d need to query for the maximum value across all validators to get the latest height.
  - If we were to use Prometheus or a similar solution, we need to take into account that Prometheus pulls metrics periodically (e.g. every 15 or 30 seconds), so changes between scrapes are not captured. This is usually fine because many metrics can be interpolated, and Prometheus is primarily used for monitoring, where precision is usually not critical.
- Logs. We currently log events to stdout, which are then picked up by Loki and can be queried/aggregated. The query language is advanced, but might not cover every scenario for very complex queries the way SQL, for example, would.
Data that can be queried via RPC: anything that our actors expose through an RPC endpoint. A good example is v0’s OpenAPI spec: https://github.com/pokt-network/pocket-core/blob/staging/doc/specs/rpc-spec.yaml. I know we talked about a similar implementation for v1, and this is coming later.
A full-blown DWH that holds all transactions. This is what explorers, such as POKTscan, have to do to make all data readily available and indexed. Since we use Postgres in v1 for persistence, I suspect some of the data can be queried from that database!
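
To make the two metric caveats above concrete, here is a minimal Python sketch. It is only an illustration: the helper names `latest_height` and `interpolate` are hypothetical and not part of any v1 code, and it assumes the per-validator heights and the scraped (timestamp, value) samples have already been fetched from the monitoring stack.

```python
from typing import Mapping, Sequence, Tuple


def latest_height(heights_by_validator: Mapping[str, int]) -> int:
    """Return the network's latest height as the maximum any validator reports.

    Each validator only knows its own view, so a lagging node must not
    drag the dashboard value down.
    """
    if not heights_by_validator:
        raise ValueError("no validator heights available")
    return max(heights_by_validator.values())


def interpolate(samples: Sequence[Tuple[float, float]], t: float) -> float:
    """Linearly interpolate a metric at time t between scraped samples.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by
    timestamp, e.g. one pair per 15-second scrape. Times outside the
    sampled range are clamped to the nearest sample.
    """
    if not samples:
        raise ValueError("no samples available")
    if t <= samples[0][0]:
        return samples[0][1]
    if t >= samples[-1][0]:
        return samples[-1][1]
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t0 <= t <= t1:
            if t1 == t0:  # duplicate timestamps: nothing to interpolate
                return v1
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("samples must be sorted by timestamp")
```

In a Prometheus-backed dashboard the first helper would likely collapse into a `max(...)` aggregation over whichever height gauge the validators export; the metric name is not specified in this thread.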

You might notice that all the dashboards used as references are actually web applications, not publicly shared Grafana or Datadog dashboards, and there is likely a good reason for that: most of the time, observability data alone is not enough to build a thorough dashboard. I’ve only seen a few examples of great public dashboards, and I can only find one that is currently still live - https://monitoring.cardano-testnet.iohkdev.io/grafana/d/Oe0reiHef/cardano-application-metrics-v2?orgId=1&refresh=1m&from=now-2d&to=now

Looking at the suggested visualizations, I think we should do the following:

As @Olshansk suggested, let’s start with the top-right dashboard; it would be a great starting point.
Let’s not build an app yet, but try to use Grafana. We use hosted Grafana at PNI, which should allow us to make some dashboards publicly available.
I think we should be able to get most of the data necessary for that dashboard from observability metrics, plus we should be able to use the Postgres data source in Grafana to get a list of apps.
As we gain expertise and find gaps in data that we cannot expose via metrics/logs, we will think about adding a simple web app to show that dashboard.

I can take a first pass at this while I deploy the first devnet. We need some sort of visualization for maintaining devnets internally, too, and this dashboard would be a good start.

Olshansk self-assigned this Sep 7, 2022

Olshansk added this to the M2: Pocket DoS (Devnet of Servicers) milestone Sep 28, 2022

Olshansk assigned okdas and unassigned Olshansk Oct 7, 2022

Olshansk added the telemetry (everything related to collection telemetry) and tooling (tooling to support development, testing et al) labels Oct 7, 2022

Olshansk assigned jessicadaugherty Dec 15, 2022

jessicadaugherty moved this from Backlog to In Research in V1 Dashboard Dec 19, 2022

jessicadaugherty assigned okdas and unassigned okdas Dec 19, 2022

jessicadaugherty moved this from In Research to In Progress in V1 Dashboard Jan 13, 2023

jessicadaugherty mentioned this issue Jan 23, 2023

[Telemetry] Design V1 Devnet Dashboard - Issue #195 #460

Closed

14 tasks

jessicadaugherty moved this from In Progress to In Review in V1 Dashboard Jan 23, 2023

Olshansk linked a pull request Jan 24, 2023 that will close this issue

[Telemetry] Design V1 Devnet Dashboard - Issue #195 #460

Closed

14 tasks

jessicadaugherty mentioned this issue Feb 13, 2023

[Infrastructure][LocalNet] Deploy 4 node localnet in remote environment #307

Closed

10 tasks

jessicadaugherty moved this from In Review to Done in V1 Dashboard Feb 14, 2023

Olshansk closed this as completed May 17, 2023
