Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(o11y): Export basic node config values as static opentelemetry attributes #7711

Merged
merged 29 commits into from
Oct 3, 2022

Conversation

nikurt
Copy link
Contributor

@nikurt nikurt commented Sep 28, 2022

This PR lets us make sure that chain_id, node_id and account_id of the node are attached as attributes to every root span. That lets us easily distinguish traces from different nodes when multiple nodes export traces to the same backend.

To make it possible, the PR refactors creation of the layers. The new setup:

  • Initialize a logging-only subscriber at the very beginning of neard starting. Make it a thread default subscriber.
  • After NearConfig is read, initialize a different subscriber, which has logging, tracing and iotracing enabled, and make that subscriber the global default.

Note that with this PR only neard run command can enable tracing. We definitely don't need tracing for init, localnet and recompress-storage commands. However, view-state may benefit from tracing in some cases.

nikurt and others added 9 commits September 28, 2022 13:09
…he_ and near_shard_cache_

Crate `itoa` provides the fastest way of converting ints as strings.
Use a faster way of converting `shard_id` to a string.
Crate `itoa` provides the fastest way of converting ints to strings.

Snapshot of metrics from a mainnet node. Note that some metrics are not incremented by 1 (`inc()`), but rather by a larger number using `inc_by()`.
```
1948843343 near_chunk_cache_hits # Not hot
399446179 near_chunk_cache_misses # Not hot
376782133 near_shard_cache_hits # Not hot
276622570 near_peer_manager_messages_time_count
239378614 near_applied_trie_deletions    # Optimized
206898552 near_database_op_latency_by_op_and_column_count
184492507 near_peer_message_sent_by_type_total
179740890 near_peer_message_received_total
179344494 near_peer_message_received_by_type_total
168098152 near_applied_trie_insertions    # Optimized
160684011 near_shard_cache_pop_hits # Not hot
100536077 near_client_triggers_time_count
100064991 near_requests_count_by_type_total
98868135 near_client_messages_processing_time_count
98868135 near_client_messages_count
97408226 near_peer_client_message_received_by_type_total
94021301 near_network_routed_msg_latency_count
80880283 partial_encoded_chunk_request_delay_count
78394032 near_shard_cache_gc_pop_misses
51277767 partial_encoded_chunk_response_delay_count
43248133 near_partial_encoded_chunk_request_processing_time_count
22664046 near_shard_cache_misses # Not hot
```
@nikurt nikurt requested a review from a team as a code owner September 28, 2022 12:02
@nikurt nikurt requested review from mm-near and nagisa September 28, 2022 12:02
Comment on lines +38 to +42
let _subscriber_guard = default_subscriber(
make_env_filter(neard_cmd.opts.verbose_target())?,
&neard_cmd.opts.o11y,
)
.local();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An approach I had used in the past was instead:

  1. Save all the events, spans into a some sort of buffer;
  2. Once the configuration is loaded, initialize the logging system;
  3. Replay all of the saved events, spans again, now that the subscriber has been initialized.

While the current approach works for now, it is probably going to be quite inconvenient if there's a need to track down some consequential information being emit during the prep stage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually like the current approach, because it:

  1. lets us easily and reliably configure logging for simple uses cases, such as tools/storage-usage-delta-calculator/src/main.rs . Previously those use cases created tokio runtime and then initialized logging. I remember seeing that one of the tools was not able to log anything at all because of mis-managing that tokio runtime, but I didn't have time to fix it.
  2. Logging from the main()-thread directly is good enough and unlikely that we need non-blocking logging from that thread. Ideally a node should be up and running after a restart as soon as possible, which means it is supposed to do as little work as possible in the main()-thread.
  3. Maybe I don't understand how buffering is supposed to work, but here is my understanding. The main()-thread buffers the spans and logs until nearconfig is available, which means anything logged before loading nearconfig will only be shown to the user after nearconfig is loaded. For example, if an error is encountered when reading nearconfig, that error will not be able to be shown to the user.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that’s fine, yeah. The only thing I’m not seeing is this subscriber guard being dropped sometime around the point in time where the global subscriber is initialized.

This, I believe, would mean that the main thread uses the simplified logger indefinitely, which doesn’t seem quite right to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uses the simplified logger indefinitely

Which is good, isn't it? It lets us log messages after actix is shutdown. For example, such messages as the RocksDB shutdown message:

info!(target: "neard", "Waiting for RocksDB to gracefully shutdown");

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, once there is a problem, being able to figure out what version a broken node was running is useful, isn’t it?

Though, again, this is kinda back to the discussion about whether we consider needing to hunt down and look at the local logs for the particular broken instance. I personally don’t think that’s super convenient, and would rather if everything eventually made into the same place (for regular neard validator deployments, anyway.)

It isn’t necessarily the case that this needs to be moved per se, I could see this being solved by duplication (i.e. log the version again after initializing the otel logger.)

The caveat is that I’m not an operating a validator, so all I’m saying is basically just me armchairing.

nikurt and others added 14 commits September 28, 2022 16:58
Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node.

Tested that all combinations of logs verbosity and opentelemetry verbosity work well together.
Tested by reconfiguring logging as follows:

```
echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard
```
We create FlatStorageState on Chain creation, but it currently happens for any RuntimeAdapter impl. Because KeyValueRuntime doesn't and shouldn't support flat storage, it leads to tests failures.

Here we move FSS creation inside RuntimeAdapter impl.

## Testing

https://buildkite.com/nearprotocol/nearcore-flat-state/builds/60#018383c5-7838-4e5b-86f1-da2baa381789 - number of failing tests drops from 77 to 4 - which are failing due to other reasons
* Use the underlying reproducible random impls.

* Improve testing

* fix deny

* fmt

Co-authored-by: Michal Nazarewicz <mina86@mina86.com>
@nikurt nikurt requested a review from nagisa September 29, 2022 09:03
Copy link
Collaborator

@nagisa nagisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, but needs to be fixed to pass the CI.

@@ -58,7 +58,7 @@ impl NeardCmd {
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move these, or at least the version log, to after the global subscriber has been initialized?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want the version log to be in the global subscriber? Restarts are relatively rare, and I don't expect to find this specific log line in Grafana Loki or Tempo.

Overall, I'd still want to keep this version message logged to the simple thread-default subscrtiber:

  1. only the run command initializes the global subscriber.
  2. this info can and should be printed as soon as the binary starts. I don't see a point in delaying this until after the NearConfig is available.

Comment on lines +38 to +42
let _subscriber_guard = default_subscriber(
make_env_filter(neard_cmd.opts.verbose_target())?,
&neard_cmd.opts.o11y,
)
.local();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that’s fine, yeah. The only thing I’m not seeing is this subscriber guard being dropped sometime around the point in time where the global subscriber is initialized.

This, I believe, would mean that the main thread uses the simplified logger indefinitely, which doesn’t seem quite right to me.

@near-bulldozer near-bulldozer bot merged commit bacef84 into master Oct 3, 2022
@near-bulldozer near-bulldozer bot deleted the nikurt-nodekey-delay branch October 3, 2022 12:06
nikurt added a commit that referenced this pull request Nov 9, 2022
…ttributes (#7711)

This PR lets us make sure that `chain_id`, `node_id` and `account_id` of the node are attached as attributes to every root span. That lets us easily distinguish traces from different nodes when multiple nodes export traces to the same backend.

To make it possible, the PR refactors creation of the layers. The new setup:
* Initialize a logging-only subscriber at the very beginning of `neard` starting. Make it a **thread default** subscriber.
* After NearConfig is read, initialize a different subscriber, which has logging, tracing and iotracing enabled, and make that subscriber the **global default**.

Note that with this PR only `neard run` command can enable tracing. We definitely don't need tracing for `init`, `localnet` and `recompress-storage` commands. However, `view-state` may benefit from tracing in some cases.
near-bulldozer bot pushed a commit that referenced this pull request Feb 22, 2023
I think IO traces were not working in the state viewer since #7711
changed how initialization works.

Using state viewer to generate these traces is one of the main use
cases of the IO traces, so we want this to work.

This PR adds initialization to the default logger, which is used in
the state viewer.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants