
feat(metrics): Expose metrics friendly for dashboard #2804

Merged: 1 commit merged into master on Jun 11, 2020

Conversation

@mfornet (Member) commented Jun 6, 2020

Expose useful metrics in a friendly way so we can have a useful dashboard for on-fire situations.
As part of this effort I also worked on the dev-ops side to deploy the dashboard: https://github.com/nearprotocol/near-ops/pull/53

There is currently a node on betanet running with these changes applied on top of the beta branch:
http://34.94.189.12:3030/status
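For context, a minimal sketch (with illustrative metric names, not necessarily the ones added in this PR) of how counters and gauges are typically registered with the prometheus crate and rendered in the text format that a dashboard's Prometheus server scrapes:

```rust
use once_cell::sync::Lazy;
use prometheus::{
    register_int_counter, register_int_gauge, Encoder, IntCounter, IntGauge, TextEncoder,
};

// Illustrative metrics; the real metric names live in the crate's metrics modules.
static BLOCKS_RECEIVED: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!("near_blocks_received_total", "Blocks received by this node").unwrap()
});
static PEER_CONNECTIONS: Lazy<IntGauge> = Lazy::new(|| {
    register_int_gauge!("near_peer_connections", "Number of currently connected peers").unwrap()
});

/// Render every registered metric in the Prometheus text exposition format.
fn export_metrics() -> String {
    let mut buffer = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buffer)
        .expect("encoding metrics should not fail");
    String::from_utf8(buffer).expect("text exposition format is valid UTF-8")
}

fn main() {
    BLOCKS_RECEIVED.inc();
    PEER_CONNECTIONS.set(12);
    // This string is what an HTTP metrics endpoint would return for scraping.
    println!("{}", export_metrics());
}
```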

Test plan

Check that the dashboard is working properly.

DISCLAIMER: While the node is syncing, some graphs are not displayed properly (since some information is not recorded).

Once every node runs this code, we will be able to select and explore each node individually using the dropdown in the upper-left corner. Nodes will be added automatically as they join the network.

[Screenshot: dashboard while the node is syncing, 2020-06-06 2:03 AM]

UPDATE: This is the graph after the node finished syncing. More metrics can be displayed on demand:

[Screenshot: dashboard after the node finished syncing, 2020-06-06 2:16 PM]


@bowenwang1996 (Collaborator) left a comment

Some thoughts:

  • I suggest that we put all the prometheus stuff behind some flag.
  • It feels like we are reinventing the wheel here. @frol, are there existing solutions for what's done in named_enum_derive?

@@ -605,7 +610,13 @@ impl StreamHandler<Vec<u8>> for Peer {
self.peer_manager_addr.do_send(metadata);
}

peer_msg.record(msg.len());
self.network_metrics
Collaborator:
Should we put this behind "metric_recorder" or some other flag?

Member Author (@mfornet):
I'm OK putting it behind a flag, as stated below, but since I think it should be enabled by default, metric_recorder is not a good flag: we don't want to enable metric_recorder by default because it consumes more resources.

Collaborator:
What I am reading here confuses me. It sounds like metric_recorder is not a very descriptive name if we don't want to include all the recorded metrics under it.

Member Author (@mfornet):
What happens in practice is that metric_recorder stores too much information with little aggregation. It has been useful to track down some issues, but we don't really want to put all metrics there, since some of them should be exposed anyway. I can change the name to extra_metrics.

Collaborator:
extra_metrics and slow_metrics sound good to me.
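To make the distinction concrete, here is a minimal sketch (hypothetical function and feature names, not the PR's actual recorder API) of keeping a cheap aggregate metric always on while the detailed, low-aggregation recording is gated behind an opt-in extra_metrics feature:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Cheap aggregate metric: total bytes received from peers, always recorded.
static RECEIVED_BYTES: AtomicU64 = AtomicU64::new(0);

#[allow(unused_variables)] // msg_type is only used when the feature is enabled
pub fn record_peer_message(msg_type: &str, len: usize) {
    RECEIVED_BYTES.fetch_add(len as u64, Ordering::Relaxed);

    // Expensive per-message recording only compiles in when the opt-in
    // `extra_metrics` Cargo feature is enabled.
    #[cfg(feature = "extra_metrics")]
    {
        eprintln!("extra_metrics: {} bytes for message type {}", len, msg_type);
    }
}
```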

@@ -260,6 +260,8 @@ pub struct StatusResponse {
pub validators: Vec<ValidatorInfo>,
/// Sync status of the node.
pub sync_info: StatusSyncInfo,
/// Validator id of the node
pub validator_id: Option<AccountId>,
Collaborator:
validator_account_id is probably a better name
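For illustration, a sketch of the struct fragment with the suggested rename applied (the AccountId alias is a placeholder so the snippet stands alone; the real type is defined elsewhere in the codebase):

```rust
// Placeholder so this sketch compiles on its own.
type AccountId = String;

pub struct StatusResponse {
    /// Account id the node validates with, if it runs as a validator.
    pub validator_account_id: Option<AccountId>,
    // ... remaining StatusResponse fields unchanged ...
}
```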

@mfornet (Member Author) commented Jun 7, 2020

  • I suggest that we put all the prometheus stuff behind some flag.

Prometheus metrics are very cheap, and the idea was to expose these metrics by default so we can explore this data. I think we should do this at least while we are not on phase 2, so we get a better understanding of the current implementation.

For now I can put it behind a feature flag and have it enabled by default.
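A sketch of what that could look like from the code side, assuming a hypothetical default-enabled prometheus_metrics feature (declared in the crate's [features] table with default = ["prometheus_metrics"]; opting out would then be a --no-default-features build):

```rust
// Increment a counter; compiled to a real call only when the default-on
// `prometheus_metrics` feature is enabled. The prometheus crate stays a
// plain dependency in this sketch to keep both signatures identical.
#[cfg(feature = "prometheus_metrics")]
pub fn inc_counter(counter: &prometheus::IntCounter) {
    counter.inc();
}

// No-op fallback, so call sites stay unchanged when metrics are compiled out.
#[cfg(not(feature = "prometheus_metrics"))]
pub fn inc_counter(_counter: &prometheus::IntCounter) {}
```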

Resolved review threads (outdated): chain/network/src/peer_manager.rs, tools/named_enum/named_enum/Cargo.toml
mfornet force-pushed the prometheus_metrics branch from c406443 to c797a81 on June 9, 2020 19:05
mfornet requested a review from bowenwang1996 on June 9, 2020 19:18
mfornet force-pushed the prometheus_metrics branch from c797a81 to e0b3822 on June 9, 2020 19:20
Resolved review threads (outdated): chain/network/Cargo.toml, chain/network/src/lib.rs
mfornet force-pushed the prometheus_metrics branch from e0b3822 to 83c606c on June 10, 2020 18:40
mfornet force-pushed the prometheus_metrics branch from 83c606c to de99e19 on June 10, 2020 18:56

mfornet force-pushed the prometheus_metrics branch from e232152 to 690bc5d on June 11, 2020 03:18
Expose validator_id in rpc

Use strum (instead of named_enum)
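(For context on the strum switch mentioned above: a minimal sketch of how strum, with its derive feature enabled, can provide variant names usable as per-message-type metric labels. The enum and variants below are illustrative, not the actual PeerMessage definition.)

```rust
use strum::AsRefStr; // requires strum's "derive" feature

#[derive(AsRefStr)]
enum PeerMessage {
    Block,
    Transaction,
    RoutedMessage,
}

fn main() {
    // The variant name as a &str, handy as a metric label without a hand-written derive.
    assert_eq!(PeerMessage::Block.as_ref(), "Block");
    assert_eq!(PeerMessage::RoutedMessage.as_ref(), "RoutedMessage");
}
```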
mfornet force-pushed the prometheus_metrics branch from 690bc5d to b0a183f on June 11, 2020 03:30
nearprotocol-bulldozer bot merged commit 953a4de into master on Jun 11, 2020
nearprotocol-bulldozer bot deleted the prometheus_metrics branch on June 11, 2020 03:38