
Consolidate performance related logging #4040

Closed
bowenwang1996 opened this issue Mar 5, 2021 · 11 comments
Labels: C-enhancement (Category: An issue proposing an enhancement or a PR with one), Groomed, Node (Node team), P-low (Priority: low), T-node (Team: issues relevant to the node experience team)

Comments

@bowenwang1996
Collaborator

Right now we use several different ways to log performance-related issues. We have delay_detector, performance_stats, and, inside the contract runtime, some new logging using tracing introduced by @matklad. We should consolidate those into a unified framework so that people don't get confused by all these different tools.

@matklad
Contributor

matklad commented Mar 5, 2021

Note that we also already use both log and tracing for logging. The subtasks here are:

  • replace log with tracing (mostly s/log/tracing, as they are API-compatible)
  • replace delay detector with tracing. It's possible to ask tracing to log all spans longer than a specified length (see the sketch below for one way to do this)
  • replace stats and metric with logging. I am not entirely sure if that's possible, we might want something more custom for this use case
  • make sure that we have tracing::span! calls where appropriate. Hierarchical logging is massively helpful when debugging issues and perf regressions.
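For the delay-detector item, here is a minimal sketch of what "log all spans longer than a specified length" could look like as a tracing-subscriber layer. This is not nearcore's actual implementation; SlowSpanLayer and the threshold are made-up names for illustration, and it assumes the tracing and tracing-subscriber crates:

use std::time::{Duration, Instant};
use tracing::span;
use tracing_subscriber::layer::{Context, Layer};
use tracing_subscriber::registry::LookupSpan;

/// Hypothetical layer: warn about spans that stay open longer than `threshold`.
struct SlowSpanLayer {
    threshold: Duration,
}

impl<S> Layer<S> for SlowSpanLayer
where
    S: tracing::Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_new_span(&self, _attrs: &span::Attributes<'_>, id: &span::Id, ctx: Context<'_, S>) {
        // Stash the creation time in the span's extensions; this measures
        // creation-to-close, which is close enough for a sketch.
        if let Some(span) = ctx.span(id) {
            span.extensions_mut().insert(Instant::now());
        }
    }

    fn on_close(&self, id: span::Id, ctx: Context<'_, S>) {
        if let Some(span) = ctx.span(&id) {
            if let Some(start) = span.extensions().get::<Instant>() {
                let elapsed = start.elapsed();
                if elapsed > self.threshold {
                    // A real implementation would emit a structured event or bump a metric here.
                    eprintln!("slow span `{}` took {:?}", span.name(), elapsed);
                }
            }
        }
    }
}

With tracing_subscriber::prelude::* in scope, it composes with the normal stderr logger, e.g. tracing_subscriber::registry().with(tracing_subscriber::fmt::layer()).with(SlowSpanLayer { threshold: Duration::from_millis(100) }).init().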

@matklad matklad self-assigned this Mar 10, 2021
@stale stale bot added the S-stale label Jul 1, 2021
@bowenwang1996 bowenwang1996 added the C-enhancement and T-node labels and removed the S-stale label Jul 1, 2021
@near near deleted a comment from stale bot Jul 8, 2021
@matklad matklad removed their assignment Jul 8, 2021
@matklad
Contributor

matklad commented Jul 8, 2021

Unassigning myself -- clearly, I am not going to get to this anytime soon. Additional check-box:

  • document nearcore logging best practices here

@stale

stale bot commented Oct 6, 2021

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@stale stale bot added the S-stale label Oct 6, 2021
@stale

stale bot commented Jan 6, 2022

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months.
It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.

@bowenwang1996
Collaborator Author

bowenwang1996 commented Jan 24, 2022

A more pressing issue: debug log should be enabled without affecting the performance of a node #6072

@mina86
Contributor

mina86 commented Jan 25, 2022

replace stats and metric with logging. I am not entirely sure if that's possible, we might want something more custom for this use case

? That sounds exactly backwards to me. If something can be a metric, it's better as a metric, which can be collected and aggregated for monitoring.

@matklad
Contributor

matklad commented Jan 25, 2022

Yeah, that was poor wording. What I meant to say is that tracing is basically a generic front-end for emitting events, and the sink can be whatever.

Basically, I want us to write something like the following (pseudo-runtime)

fn process_receipt(receipt: Receipt) {
    // Entering the span makes it the parent of everything emitted below it.
    let _span = tracing::info_span!("process_receipt", hash = ?receipt.hash).entered();
    if receipt.is_action_receipt() {
        tracing::info!(n_actions = receipt.actions.len(), "action_receipt");
        for action in receipt.actions {
            // `deadline_ms` is just a span field; a subscriber can warn when the span outlives it.
            let _span = tracing::info_span!("process_action", deadline_ms = 100).entered();
        }
    }
}

and get, by using appropriate subscribers, all of the following:

  • stderr logs for important events
  • a stream of events sent to some kind of structured event aggregator, which shows graphs for receipt processing times and lets us quickly answer questions like "what are the hashes of the receipts that took longest to process in the last 24 hours"
  • stderr warnings for spans which took longer than we'd expect them to.

The benefits I see here:

  • unified interface for emitting all kinds of events
  • hierarchical info attached to all events: tracing keeps track of which spans were active when this or that event occurred, and that should allow us to trace outliers we see in metrics back to the specific inputs that caused them

But, again, for metrics in particular, I am not entirely sure this will work (my main concern being performance -- metrics in general support pre-aggregation in process, while tracing is built around shipping the full stream of events and leaving aggregation to the consumer. But then again, a subscriber can pre-aggregate in process and ship compressed data externally).
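To make the "a subscriber can pre-aggregate in process" idea concrete, here is a minimal sketch of a layer that folds the event stream into a prometheus counter, so only aggregated numbers leave the node instead of every event. The metric name and EventCounterLayer are hypothetical, and it assumes the prometheus and once_cell crates on top of tracing-subscriber:

use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};
use tracing::{Event, Subscriber};
use tracing_subscriber::layer::{Context, Layer};

// Hypothetical metric; nearcore's real metrics live in its own metrics modules.
static EVENTS_TOTAL: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!("tracing_events_total", "Events seen, by target", &["target"])
        .unwrap()
});

/// Counts events in process; the full stream never has to be shipped anywhere.
struct EventCounterLayer;

impl<S: Subscriber> Layer<S> for EventCounterLayer {
    fn on_event(&self, event: &Event<'_>, _ctx: Context<'_, S>) {
        EVENTS_TOTAL
            .with_label_values(&[event.metadata().target()])
            .inc();
    }
}

The same registry can then stack all three sinks from the list above: a fmt layer for stderr logs, an exporter layer (for example an OpenTelemetry one) for the structured event aggregator, and something like the SlowSpanLayer sketched earlier for slow-span warnings.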

@nagisa
Collaborator

nagisa commented Mar 7, 2022

I have used tracing + an event/span ingest, as well as manually instrumented prometheus metrics, in the past. Events/spans can produce awesome and very useful results. It becomes possible to figure out issues at the granularity of a single unit of work, see at what exact point the issues manifested themselves, etc. They also:

  • require planning fleet-wide guidelines for event structure¹;
  • require enforcing those guidelines somehow; and
  • are prone to producing amounts of data that quickly spiral out of control (What a nice 10G uplink. Would be a shame to not stuff it full of events and spans.)

On the one hand, the TBs of daily ingest never really presented much of a roadblock for the application's ability to perform; tracing and opentelemetry are just that good. On the other hand, this amount of events does imply pouring money into a provider of a distributed ingest service by the bathtubful, or making it somebody's full-time job to maintain the ingest infrastructure.

If we're looking to have a fully featured event/span ingest, we should have a plan for how we'll manage the infrastructure (and so somebody dealing with infrastructure should be part of these discussions). We'd also need to document this for other people running their validators. Otherwise, all the work going into instrumenting nearcore with these events and spans will only serve to produce nicer logs.

In contrast, simple prometheus metrics never really presented any problems related to the amount of data, even when not doing much to keep the cardinality under control. But when I did hit a problem, it was more likely that I wouldn't have sufficient data around the problem area to point me at a cause any more precisely than a broad guess.
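For contrast, "simple prometheus metrics" in Rust look roughly like the sketch below; the metric name is made up and not one of nearcore's real metrics. However often the hot path runs, what leaves the process stays a fixed, small set of pre-aggregated time series, which is why the data volume never became a problem:

use once_cell::sync::Lazy;
use prometheus::{register_histogram, Histogram};

// Hypothetical metric: a pre-aggregated histogram, scraped periodically.
static APPLY_CHUNK_SECONDS: Lazy<Histogram> = Lazy::new(|| {
    register_histogram!("apply_chunk_seconds", "Time spent applying a chunk").unwrap()
});

fn apply_chunk() {
    // The timer observes the elapsed time into the histogram when it is dropped.
    let _timer = APPLY_CHUNK_SECONDS.start_timer();
    // ... actual work ...
}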


As a side note, #6072 was closed, but I believe the issue is still relevant, so another checklist item:

  • Improve our subscriber setup to reduce the performance impact of logging and otherwise sending events/spans (a rough sketch follows the footnotes below).

Footnotes

  1. so that the different events/spans can be easily associated with each other.
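For the subscriber-setup checkbox above, a rough sketch of the kind of setup that keeps disabled instrumentation cheap: filtering happens up front, so debug!/trace! call sites that don't pass the filter cost close to nothing at the default level. This is only an illustration of the tracing-subscriber API, not nearcore's actual initialization code:

use tracing_subscriber::{filter::EnvFilter, prelude::*};

fn init_logging() {
    // Respect RUST_LOG if set, otherwise default to info-level output.
    let filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"));

    tracing_subscriber::registry()
        .with(filter)
        .with(tracing_subscriber::fmt::layer())
        .init();
}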

@bowenwang1996
Collaborator Author

@nagisa I reopened #6072

@nikurt nikurt assigned nikurt and unassigned marcelo-gonzalez May 2, 2022
@exalate-issue-sync exalate-issue-sync bot added the T-nodeX label and removed the T-node label Jun 28, 2022
@matklad matklad added the T-node label and removed the T-nodeX label Aug 4, 2022
@exalate-issue-sync exalate-issue-sync bot added the P-low (Priority: low) label Sep 6, 2022
@nikurt
Contributor

nikurt commented Dec 8, 2022

This item is important. If we can get rid of delay detector, that will be a win:

replace delay detector with tracing. It's possible to ask tracing to log all spans longer than the specified length

Created #8186

@nikurt
Contributor

nikurt commented Sep 20, 2023

All outstanding sub-issues were fixed.
