
Consolidate performance related logging #4040

Closed
bowenwang1996 opened this issue Mar 5, 2021 · 11 comments
Labels: C-enhancement (Category: An issue proposing an enhancement or a PR with one), Groomed, Node (Node team), P-low (Priority: low), T-node (Team: issues relevant to the node experience team)

Comments

@bowenwang1996
Collaborator

Right now we use several different ways to log performance-related issues. We have delay_detector, performance_stats, and, inside the contract runtime, some new logging using tracing introduced by @matklad. We should consolidate those into a unified framework so that people don't get confused by all these different tools.

@matklad
Contributor

matklad commented Mar 5, 2021

Note that we also already use both log and tracing for logging. The subtasks here are:

  • replace log with tracing (mostly s/log/tracing, as they are API-compatible)
  • replace delay detector with tracing. It's possible to ask tracing to log all spans longer than a specified length (see the sketch below for one way to do this)
  • replace stats and metric with logging. I am not entirely sure if that's possible, we might want something more custom for this use case
  • make sure that we have tracing::span! calls where appropriate. Hierarchical logging is massively helpful when debugging issues and perf regressions.
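For the delay-detector item, here is a minimal sketch of what "log all spans longer than a specified length" could look like as a tracing-subscriber layer. This is not nearcore's actual implementation; SlowSpanLayer and the threshold are made-up names for illustration, and it assumes the tracing and tracing-subscriber crates:

use std::time::{Duration, Instant};
use tracing::span;
use tracing_subscriber::layer::{Context, Layer};
use tracing_subscriber::registry::LookupSpan;

/// Hypothetical layer: warn about spans that stay open longer than `threshold`.
struct SlowSpanLayer {
    threshold: Duration,
}

impl<S> Layer<S> for SlowSpanLayer
where
    S: tracing::Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_new_span(&self, _attrs: &span::Attributes<'_>, id: &span::Id, ctx: Context<'_, S>) {
        // Stash the creation time in the span's extensions; this measures
        // creation-to-close, which is close enough for a sketch.
        if let Some(span) = ctx.span(id) {
            span.extensions_mut().insert(Instant::now());
        }
    }

    fn on_close(&self, id: span::Id, ctx: Context<'_, S>) {
        if let Some(span) = ctx.span(&id) {
            if let Some(start) = span.extensions().get::<Instant>() {
                let elapsed = start.elapsed();
                if elapsed > self.threshold {
                    // A real implementation would emit a structured event or bump a metric here.
                    eprintln!("slow span `{}` took {:?}", span.name(), elapsed);
                }
            }
        }
    }
}

With tracing_subscriber::prelude::* in scope, it composes with the normal stderr logger, e.g. tracing_subscriber::registry().with(tracing_subscriber::fmt::layer()).with(SlowSpanLayer { threshold: Duration::from_millis(100) }).init().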

@matklad matklad self-assigned this Mar 10, 2021
@stale stale bot added the S-stale label Jul 1, 2021
@bowenwang1996 bowenwang1996 added the C-enhancement and T-node labels and removed the S-stale label Jul 1, 2021
@near near deleted a comment from stale bot Jul 8, 2021
@matklad matklad removed their assignment Jul 8, 2021
@matklad
Contributor

matklad commented Jul 8, 2021

Unassigning myself -- clearly, I am not going to get to this anytime soon. Additional check-box:

  • document nearcore logging best practices here

@stale

stale bot commented Oct 6, 2021

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@stale stale bot added the S-stale label Oct 6, 2021
@stale

stale bot commented Jan 6, 2022

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@bowenwang1996
Collaborator Author

bowenwang1996 commented Jan 24, 2022

A more pressing issue: debug log should be enabled without affecting the performance of a node #6072

@mina86
Contributor

mina86 commented Jan 25, 2022

replace stats and metric with logging. I am not entirely sure if that's possible, we might want something more custom for this use case

? That sounds exactly backwards to me. If something can be a metric, it's better as a metric, which can be collected and aggregated for monitoring.

@matklad
Contributor

matklad commented Jan 25, 2022

Yeah, that was poor wording. What I meant to say is that tracing is basically a generic front-end for emitting events, and the sink can be whatever.

Basically, I want us to write something like the following (pseudo-runtime)

fn process_receipt(receipt: Receipt) {
    // Entering the span makes it the parent of everything emitted below it.
    let _span = tracing::info_span!("process_receipt", hash = ?receipt.hash).entered();
    if receipt.is_action_receipt() {
        tracing::info!(n_actions = receipt.actions.len(), "action_receipt");
        for action in receipt.actions {
            // `deadline_ms` is just a span field; a subscriber can warn when the span outlives it.
            let _span = tracing::info_span!("process_action", deadline_ms = 100).entered();
        }
    }
}

and get, by using appropriate subscribers, all of the following:

  • stderr logs for important events
  • a stream of events sent to some kind of structured event aggregator, which shows graphs for receipt processing times and lets us quickly answer questions like "what are the hashes of the receipts that took longest to process in the last 24 hours"
  • stderr warnings for spans which took longer than we'd expect them to.

The benefits I see here:

  • unified interface for emitting all kinds of events
  • hierarchical info attached to all events: tracing keeps track of which spans were active when this or that event occurred, and that should allow us to trace outliers we see in metrics back to the specific inputs that caused them

But, again, for metrics in particular, I am not entirely sure this will work (my main concern being performance -- metrics in general support pre-aggregation in process, while tracing is built around shipping the full stream of events and leaving aggregation to the consumer. But then again, a subscriber can pre-aggregate in process and ship compressed data externally).
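To make the "a subscriber can pre-aggregate in process" idea concrete, here is a minimal sketch of a layer that folds the event stream into a prometheus counter, so only aggregated numbers leave the node instead of every event. The metric name and EventCounterLayer are hypothetical, and it assumes the prometheus and once_cell crates on top of tracing-subscriber:

use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};
use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

// Hypothetical metric; nearcore's real metrics live in its own metrics modules.
static EVENTS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!("tracing_events_total", "Events seen, by target", &["target"])
        .unwrap()
});

/// Counts events in process; the full stream never has to be shipped anywhere.
struct EventCounterLayer;

impl<S: Subscriber> Layer<S> for EventCounterLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        EVENTS_TOTAL
            .with_label_values(&[event.metadata().target()])
            .inc();
    }
}

The same registry can then stack all three sinks from the list above: a fmt layer for stderr logs, an exporter layer (for example an OpenTelemetry one) for the structured event aggregator, and something like the SlowSpanLayer sketched earlier for slow-span warnings.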

@nagisa
Collaborator

nagisa commented Mar 7, 2022

I have used tracing + an event/span ingest, as well as manually instrumented prometheus metrics, in the past. Events/spans can produce awesome and very useful results. It becomes possible to figure out issues at the granularity of a single unit of work, see at what exact point the issues manifested themselves, etc. They also:

  • require planning fleet-wide guidelines for event structure¹;
  • require enforcing those guidelines somehow; and
  • are prone to producing amounts of data that quickly spiral out of control (What a nice 10G uplink. Would be a shame to not stuff it full of events and spans.)

On the one hand, the TBs of daily ingest never really presented much of a roadblock for the application's ability to perform; tracing and opentelemetry are just that good. On the other hand, this amount of events does imply pouring money into a provider of a distributed ingest service by the bathtubful, or making it somebody's full-time job to maintain the ingest infrastructure.

If we're looking to have a fully featured event/span ingest, we should have a plan for how we'll manage the infrastructure (and so somebody dealing with infrastructure should be part of these discussions). We'd also need to document this for other people running their validators. Otherwise, all the work going into instrumenting nearcore with these events and spans will only serve to produce nicer logs.

In contrast, simple prometheus metrics never really presented any problems related to the amount of data, even when not doing much to keep the cardinality under control. But when I did hit a problem, it was more likely that I wouldn't have sufficient data around the problem area to point me at a cause any more precisely than a broad guess.
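For contrast, "simple prometheus metrics" in Rust look roughly like the sketch below; the metric name is made up and not one of nearcore's real metrics. However often the hot path runs, what leaves the process stays a fixed, small set of pre-aggregated time series, which is why the data volume never became a problem:

use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};

// Hypothetical metric: a pre-aggregated histogram, scraped periodically.
static APPLY_CHUNK_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!("apply_chunk_seconds", "Time spent applying a chunk").unwrap()
});

fn apply_chunk() {
    // The timer observes the elapsed time into the histogram when it is dropped.
    let _timer = APPLY_CHUNK_SECONDS.start_timer();
    // ... actual work ...
}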


As a side note, #6072 was closed, but I believe the issue is still relevant, so another checklist item:

  • Improve our subscriber setup to reduce the performance impact of logging and otherwise sending events/spans (a rough sketch follows the footnotes below).

Footnotes

  1. so that the different events/spans can be easily associated with each other.
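For the subscriber-setup checkbox above, a rough sketch of the kind of setup that keeps disabled instrumentation cheap: filtering happens up front, so debug!/trace! call sites that don't pass the filter cost close to nothing at the default level. This is only an illustration of the tracing-subscriber API, not nearcore's actual initialization code:

use tracing_subscriber::{filter::EnvFilter, prelude::*};

fn init_logging() {
    // Respect RUST_LOG if set, otherwise default to info-level output.
    let filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"));

    tracing_subscriber::registry()
        .with(filter)
        .with(tracing_subscriber::fmt::layer())
        .init();
}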

@bowenwang1996
Collaborator Author

@nagisa I reopened #6072

@nikurt nikurt assigned nikurt and unassigned marcelo-gonzalez May 2, 2022
@exalate-issue-sync exalate-issue-sync bot added the T-nodeX label and removed the T-node label Jun 28, 2022
@matklad matklad added the T-node label and removed the T-nodeX label Aug 4, 2022
@exalate-issue-sync exalate-issue-sync bot added the P-low (Priority: low) label Sep 6, 2022
@nikurt
Contributor

nikurt commented Dec 8, 2022

This item is important. If we can get rid of delay detector, that will be a win:

replace delay detector with tracing. It's possible to ask tracing to log all spans longer than the specified length

Created #8186

@nikurt
Contributor

nikurt commented Sep 20, 2023

All outstanding sub-issues were fixed.
