Consolidate performance related logging #4040
Note that we also already use both log and tracing for logging. The subtasks here are:
|
Unassigning myself -- clearly, I am not going to get to this anytime soon. Additional check-box:
|
This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. |
A more pressing issue: debug log should be enabled without affecting the performance of a node #6072 |
That sounds exactly backwards to me. If something can be a metric, it's better as a metric, which can be collected and aggregated for monitoring. |
Yeah, that was poor wording. What I meant to say is that, basically, I want us to write something like the following (pseudo-runtime):

fn process_receipt(receipt: Receipt) {
    // One span covering the whole receipt, tagged with its hash.
    let _span = tracing::info_span!("process_receipt", hash = ?receipt.hash).entered();
    if receipt.is_action_receipt() {
        tracing::info!(n_actions = receipt.actions.len(), "action_receipt");
        for action in receipt.actions {
            // Per-action span; a subscriber could warn when the deadline is exceeded.
            let _span = tracing::info_span!("process_action", deadline_ms = 100).entered();
        }
    }
}

and get, by using appropriate subscribers, all of the following:
The benefits I see here:
But, again, for metrics in particular, I am not entirely sure this will work (my main concern being performance -- metrics in general support pre-aggregation in process, while tracing is built around shipping the full stream of events and leaving aggregation to the consumer. But then again, a subscriber can pre-aggregate in process and ship compressed data externally). |
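To make that last point concrete, here is a minimal sketch of an aggregating layer, assuming the tracing-subscriber Layer API; TimingLayer and its count/total-duration aggregation are purely illustrative, not anything that exists in nearcore:

use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

use tracing::span::{Attributes, Id};
use tracing_subscriber::layer::{Context, Layer};
use tracing_subscriber::registry::LookupSpan;

// Illustrative layer: records when each span starts and, when it closes,
// folds the elapsed time into an in-process aggregate instead of shipping
// every event to an external consumer.
#[derive(Default)]
struct TimingLayer {
    // span name -> (call count, total time); a real implementation would
    // likely use histograms and export them periodically.
    totals: Mutex<HashMap<&'static str, (u64, Duration)>>,
}

struct StartTime(Instant);

impl<S> Layer<S> for TimingLayer
where
    S: tracing::Subscriber + for<'a> LookupSpan<'a>,
{
    fn on_new_span(&self, _attrs: &Attributes<'_>, id: &Id, ctx: Context<'_, S>) {
        // Stash the creation time in the span's extensions.
        if let Some(span) = ctx.span(id) {
            span.extensions_mut().insert(StartTime(Instant::now()));
        }
    }

    fn on_close(&self, id: Id, ctx: Context<'_, S>) {
        // Fold the span's duration into the in-process aggregate.
        if let Some(span) = ctx.span(&id) {
            let elapsed = span.extensions().get::<StartTime>().map(|s| s.0.elapsed());
            if let Some(elapsed) = elapsed {
                let mut totals = self.totals.lock().unwrap();
                let entry = totals.entry(span.name()).or_insert((0, Duration::ZERO));
                entry.0 += 1;
                entry.1 += elapsed;
            }
        }
    }
}

Installed with something like tracing_subscriber::registry().with(TimingLayer::default()) (via the SubscriberExt prelude), this keeps the per-span cost down to a clock read and a map update, and the aggregate can be exported on whatever schedule the operator wants.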
I have used tracing + an event/span ingest, as well as manually instrumented prometheus metrics, in the past. Having events/spans can produce very awesome and useful results. It becomes possible to figure out issues at the granularity of a single unit of work, see at what exact point the issues manifested themselves, etc. They also:
On one hand, the TBs of daily ingest never really presented much of a roadblock for the application's ability to perform. If we're looking to have a fully featured event/span ingest, we should have a plan for how we'll manage the infrastructure (and so somebody dealing with infrastructure should be part of these discussions). We'd also need to document this for other people running their validators. Otherwise, all the work going into instrumenting nearcore with these events and spans will only serve to produce more of the nice logs.

In contrast, simple prometheus metrics never really presented any problems related to the amount of data, even without doing much to keep the cardinality under control. But when I did hit a problem, it was more likely that I wouldn't have sufficient data around the problem area to point me at a cause more precisely than a broad guess.

As a side note, #6072 was closed, but I believe the issue is still relevant, so another checklist item:
|
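For contrast with the event/span approach, here is a minimal sketch of the kind of low-cardinality prometheus metric the previous comment describes; it assumes the prometheus and once_cell crates, and the metric name and function are made up for illustration:

use once_cell::sync::Lazy;
use prometheus::{register_histogram_vec, HistogramVec};

// A single histogram labelled by a small, fixed set of stage names, so
// cardinality stays bounded no matter how much work the node processes.
static PROCESSING_TIME: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "example_receipt_processing_seconds",
        "Time spent processing receipts, by stage",
        &["stage"]
    )
    .unwrap()
});

fn process_receipt_instrumented() {
    // The timer records the elapsed time into the histogram when stopped.
    let timer = PROCESSING_TIME.with_label_values(&["process_receipt"]).start_timer();
    // ... actual receipt processing ...
    timer.observe_duration();
}

This illustrates the trade-off described above: such a metric is cheap to collect and aggregate, but when something goes wrong it only says that a stage got slow, not which receipt or code path caused it.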
This item is important. If we can get rid of delay detector, that will be a win:
Created #8186 |
All outstanding sub-issues were fixed. |
Right now we use several different ways to log performance related issues. We have delay_detector, performance_stats, and, inside contract runtime, some new logging using tracing introduced by @matklad. We should consolidate those into a unified framework so that people don't get confused by all those different tools.