feat(o11y): Export basic node config values as static opentelemetry attributes #7711
Conversation
…he_ and near_shard_cache_

Use a faster way of converting `shard_id` to a string. Crate `itoa` provides the fastest way of converting ints to strings.

Snapshot of metrics from a mainnet node. Note that some metrics are not incremented by 1 (`inc()`), but rather by a larger number using `inc_by()`.

```
1948843343 near_chunk_cache_hits # Not hot
399446179 near_chunk_cache_misses # Not hot
376782133 near_shard_cache_hits # Not hot
276622570 near_peer_manager_messages_time_count
239378614 near_applied_trie_deletions # Optimized
206898552 near_database_op_latency_by_op_and_column_count
184492507 near_peer_message_sent_by_type_total
179740890 near_peer_message_received_total
179344494 near_peer_message_received_by_type_total
168098152 near_applied_trie_insertions # Optimized
160684011 near_shard_cache_pop_hits # Not hot
100536077 near_client_triggers_time_count
100064991 near_requests_count_by_type_total
98868135 near_client_messages_processing_time_count
98868135 near_client_messages_count
97408226 near_peer_client_message_received_by_type_total
94021301 near_network_routed_msg_latency_count
80880283 partial_encoded_chunk_request_delay_count
78394032 near_shard_cache_gc_pop_misses
51277767 partial_encoded_chunk_response_delay_count
43248133 near_partial_encoded_chunk_request_processing_time_count
22664046 near_shard_cache_misses # Not hot
```
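For illustration, here is a minimal sketch of that kind of conversion, assuming the `itoa` crate and a plain `prometheus::IntCounterVec`; the function and counter wiring are hypothetical, not the code in this PR:

```
// Hypothetical helper: bump a per-shard counter without allocating a String
// for the label on every call.
fn bump_shard_cache_hits(counter: &prometheus::IntCounterVec, shard_id: u64) {
    // itoa formats the integer into a small stack buffer and returns a &str view.
    let mut buf = itoa::Buffer::new();
    let label: &str = buf.format(shard_id);
    counter.with_label_values(&[label]).inc();
}
```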
```
let _subscriber_guard = default_subscriber(
    make_env_filter(neard_cmd.opts.verbose_target())?,
    &neard_cmd.opts.o11y,
)
.local();
```
An approach I had used in the past was instead:
- Save all the events and spans into some sort of buffer;
- Once the configuration is loaded, initialize the logging system;
- Replay all of the saved events and spans, now that the subscriber has been initialized.

While the current approach works for now, it is probably going to be quite inconvenient if there's a need to track down some consequential information being emitted during the prep stage.
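A minimal sketch of that buffer-and-replay idea, assuming the `tracing` crate; `BufferedRecord` and `replay` are hypothetical names used only for illustration, not code from this PR:

```
use tracing::Level;

// Hypothetical buffer entry captured before the real subscriber exists.
struct BufferedRecord {
    level: Level,
    message: String,
}

// Re-emit the buffered records once the configured subscriber is installed,
// i.e. after `tracing::subscriber::set_global_default` has been called.
fn replay(records: Vec<BufferedRecord>) {
    for r in records {
        if r.level == Level::ERROR {
            tracing::error!(target: "neard", "{}", r.message);
        } else if r.level == Level::WARN {
            tracing::warn!(target: "neard", "{}", r.message);
        } else {
            tracing::info!(target: "neard", "{}", r.message);
        }
    }
}
```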
I actually like the current approach, because:
- It lets us easily and reliably configure logging for simple use cases, such as tools/storage-usage-delta-calculator/src/main.rs. Previously those use cases created a tokio runtime and then initialized logging. I remember seeing that one of the tools was not able to log anything at all because of mis-managing that tokio runtime, but I didn't have time to fix it.
- Logging from the `main()` thread directly is good enough, and it is unlikely that we need non-blocking logging from that thread. Ideally a node should be up and running after a restart as soon as possible, which means it is supposed to do as little work as possible in the `main()` thread.
- Maybe I don't understand how buffering is supposed to work, but here is my understanding: the `main()` thread buffers the spans and logs until NearConfig is available, which means anything logged before loading NearConfig will only be shown to the user after NearConfig is loaded. For example, if an error is encountered while reading NearConfig, that error will never be shown to the user.
I think that’s fine, yeah. The only thing I’m not seeing is this subscriber guard being dropped sometime around the point in time where the global subscriber is initialized.
This, I believe, would mean that the main thread uses the simplified logger indefinitely, which doesn’t seem quite right to me.
> uses the simplified logger indefinitely

Which is good, isn't it? It lets us log messages after actix is shut down. For example, messages such as the RocksDB shutdown message:
```
info!(target: "neard", "Waiting for RocksDB to gracefully shutdown");
```
Well, once there is a problem, being able to figure out what version a broken node was running is useful, isn't it?

Though, again, this is kinda back to the discussion about whether we consider needing to hunt down and look at the local logs of the particular broken instance. I personally don't think that's super convenient, and would rather have everything eventually end up in the same place (for regular neard validator deployments, anyway).

It isn't necessarily the case that this needs to be moved per se; I could see this being solved by duplication (i.e. log the version again after initializing the otel logger).

The caveat is that I'm not operating a validator, so all I'm saying is basically just me armchairing.
Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node.

Tested that all combinations of log verbosity and opentelemetry verbosity work well together. Tested by reconfiguring logging as follows:
```
echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard
```
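A rough sketch of the reload mechanism this relies on, under the assumption that the node listens for SIGHUP and re-reads the file; `apply_log_config` and `watch_log_config` are hypothetical stand-ins for the real reload path in neard:

```
use tokio::signal::unix::{signal, SignalKind};

async fn watch_log_config(path: std::path::PathBuf) -> std::io::Result<()> {
    let mut hup = signal(SignalKind::hangup())?;
    loop {
        // Wake up each time the process receives SIGHUP (e.g. `killall -SIGHUP neard`).
        hup.recv().await;
        match std::fs::read_to_string(&path) {
            Ok(contents) => apply_log_config(&contents),
            Err(err) => eprintln!("failed to read {}: {}", path.display(), err),
        }
    }
}

// Hypothetical helper: parse the JSON and swap the active verbosity filters,
// e.g. via a reloadable tracing-subscriber layer.
fn apply_log_config(_json: &str) {}
```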
We create FlatStorageState on Chain creation, but it currently happens for any RuntimeAdapter impl. Because KeyValueRuntime doesn't and shouldn't support flat storage, this leads to test failures. Here we move FSS creation inside the RuntimeAdapter impl.

## Testing

https://buildkite.com/nearprotocol/nearcore-flat-state/builds/60#018383c5-7838-4e5b-86f1-da2baa381789 - the number of failing tests drops from 77 to 4, which are failing for other reasons.
* Use the underlying reproducible random impls.
* Improve testing
* fix deny
* fmt

Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
Overall LGTM, but needs to be fixed to pass the CI.
```
@@ -58,7 +58,7 @@ impl NeardCmd {
    }
}
```
Should we move these, or at least the version log, to after the global subscriber has been initialized?
Why do you want the version log to be in the global subscriber? Restarts are relatively rare, and I don't expect to find this specific log line in Grafana Loki or Tempo.
Overall, I'd still want to keep this version message logged to the simple thread-default subscriber:
- only the `run` command initializes the global subscriber.
- this info can and should be printed as soon as the binary starts. I don't see a point in delaying this until after the `NearConfig` is available.
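For context, a minimal sketch of the two-stage setup being discussed, using plain `tracing_subscriber::fmt` as a stand-in for neard's own `default_subscriber`; the log message and levels are illustrative only:

```
use tracing::info;

fn main() {
    // Stage 1: a simple logging-only subscriber, installed as the *thread
    // default* for the main() thread. The guard keeps it active for the
    // lifetime of main(), so this thread keeps using it even later.
    let early = tracing_subscriber::fmt().finish();
    let _guard = tracing::subscriber::set_default(early);
    info!(target: "neard", "version/build info can be logged immediately");

    // Stage 2 (only reached by `neard run`, after NearConfig is loaded):
    // a richer subscriber becomes the *global default* for all other threads.
    let full = tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .finish();
    tracing::subscriber::set_global_default(full).expect("failed to set global subscriber");
}
```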
…ttributes (#7711)

This PR lets us make sure that `chain_id`, `node_id` and `account_id` of the node are attached as attributes to every root span. That lets us easily distinguish traces from different nodes when multiple nodes export traces to the same backend.

To make it possible, the PR refactors creation of the layers. The new setup:
* Initialize a logging-only subscriber at the very beginning of `neard` starting. Make it a **thread default** subscriber.
* After NearConfig is read, initialize a different subscriber, which has logging, tracing and IO tracing enabled, and make that subscriber the **global default**.

Note that with this PR only the `neard run` command can enable tracing. We definitely don't need tracing for the `init`, `localnet` and `recompress-storage` commands. However, `view-state` may benefit from tracing in some cases.
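As a rough illustration of attaching such static attributes, here is a sketch assuming an `opentelemetry` crate version where `sdk::Resource` is available; the helper name and the exact wiring into near-o11y are hypothetical:

```
use opentelemetry::sdk::Resource;
use opentelemetry::KeyValue;

// Hypothetical helper: in neard these values would come from NearConfig.
fn node_resource(chain_id: &str, node_id: &str, account_id: Option<&str>) -> Resource {
    let mut attrs = vec![
        KeyValue::new("chain_id", chain_id.to_string()),
        KeyValue::new("node_id", node_id.to_string()),
    ];
    if let Some(account_id) = account_id {
        attrs.push(KeyValue::new("account_id", account_id.to_string()));
    }
    // Resource attributes are exported with every span from this tracer provider,
    // so traces from different nodes stay distinguishable in the backend.
    Resource::new(attrs)
}
```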
I think IO traces were not working in the state viewer since #7711 changed how initialization works. Using state viewer to generate these traces is one of the main use cases of the IO traces, so we want this to work. This PR adds initialization to the default logger, which is used in the state viewer.