Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(o11y): Configure opentelemetry filter at runtime #7701

Merged
merged 47 commits into from
Sep 28, 2022

Conversation

nikurt
Copy link
Contributor

@nikurt nikurt commented Sep 27, 2022

Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node.

Tested that all combinations of logs verbosity and opentelemetry verbosity work well together.
Tested by reconfiguring logging as follows:

echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard

nikurt and others added 30 commits September 21, 2022 13:29
…he_ and near_shard_cache_

Crate `itoa` provides the fastest way of converting ints as strings.
Turns out `CryptoHash::hash_borsh(&[foo])` doesn’t quite do what one
might expect.  Rather than hashing one-element slice (which is
serialised as length of one plus serialisation of the element) it
hashes one-element array (which is serialised as just the element).

Introduce `CryptoHash::hash_borsh_slice` to provide less error-prone
method and fix tests which fell victim of the confusion.  The tests
were broken by commit d5f650a: ‘Prefer CryptoHash::hash_borsh to
try_to_vec+hash combo’.
Part of #6910

The grand plan is to separate EpochManager out of NightshadeRuntime. As
a first step, I want to separate the *types* of runtime and epoch
manager, while keeping them in the same object. I want to do this via
the following series of steps:

```
trait RuntimeAdapter {}
```

```
trait EpochManagerAdapter {}

trait RuntimeAdapter: EpochManagerAdapter {}
```

```
trait EpochManagerAdapter {}

trait RuntimeAdapter {}
```

This PR introduces the EpochManagerAdapter infra.

You can see the rough idea of the plan in #7606
Part of #6910, depends on #7652

Must be reviewed per commit.
What this test is intended to do is to spawn a local neard node and to run near-api-js tests against that.

The purpose is to detect unintentional changes to near public API.

We have disabled actually running the test a while ago, and today the build broke b/c api-js moved from yarn to pnpm.

I think at this it's safe to kill this infra completely, as we weren't running it for a while.

It would of course be nice to have some in-repo tests covering public JSON API, but I think we are currently server well-enough by betanet and testnet anyway.

Testing public APIs via running tests of downstream projects upstream is clearly a wrong way to do that!
Use a faster way of converting `shard_id` to a string.
Crate `itoa` provides the fastest way of converting ints to strings.

Snapshot of metrics from a mainnet node. Note that some metrics are not incremented by 1 (`inc()`), but rather by a larger number using `inc_by()`.
```
1948843343 near_chunk_cache_hits # Not hot
399446179 near_chunk_cache_misses # Not hot
376782133 near_shard_cache_hits # Not hot
276622570 near_peer_manager_messages_time_count
239378614 near_applied_trie_deletions    # Optimized
206898552 near_database_op_latency_by_op_and_column_count
184492507 near_peer_message_sent_by_type_total
179740890 near_peer_message_received_total
179344494 near_peer_message_received_by_type_total
168098152 near_applied_trie_insertions    # Optimized
160684011 near_shard_cache_pop_hits # Not hot
100536077 near_client_triggers_time_count
100064991 near_requests_count_by_type_total
98868135 near_client_messages_processing_time_count
98868135 near_client_messages_count
97408226 near_peer_client_message_received_by_type_total
94021301 near_network_routed_msg_latency_count
80880283 partial_encoded_chunk_request_delay_count
78394032 near_shard_cache_gc_pop_misses
51277767 partial_encoded_chunk_response_delay_count
43248133 near_partial_encoded_chunk_request_processing_time_count
22664046 near_shard_cache_misses # Not hot
```
Adding an explicit "test fixture" struct makes call-sites more readable (named args!) and makes making further changes easier.

No particular grand plan here, just figured this small change to be a marginal improvement!
Part of #6910, depends on #7656

This is slighly less trivial than pure move -- the API is adjusted to
remove direct dependency on doomslug which lives in a different crate
Part of #6910, depends on #7657

Must be reviewed per commit.
Part of #6910, depends on #7660

Must be reviewed per commit.
Part of #6910, depends on #7665

Previously, this was only available on NightshadeRuntime, but seems fine
to expose everywhere.
https://blog.rust-lang.org/2022/09/22/Rust-1.64.0.html

Notable changes:

* rust-analyzer is now available via rustup :o)
* cargo supports "workspace inheritance". We can use this to cut some
  repeated deps and avoid repating rust-version everywhere. I'll handle
  that in a separate PR
* poll_fn and ready! are stabilized, we might use them for some of our
  async stuff
* add Nintendo Switch as tier 3 target so I guess we can run neard on
  switch now?
* monitor_peers_max_period - controls how ofen we think about connecting to a new peer (default is 60 seconds)
* peer_expiration_duration - controls when we remove the peer from candidate set (default - 7 days since we heard from it)

Also hooked up peer_stats_period to use the value from config if set.
Prefetch receipt meta data (account and access keys) ahead of time.
This recent performance optimization has been disabled by default.

In lab settings, performance improvement is confirmed. Using the
estimator to measure the time it takes to process empty receipts,
on a DB with 2 million accounts, on a local SSD, with enabled shard
caches. The result is as follows.

- sender = receiver:  737us -> 386 us
- sender != receiver: 1014us -> 644us
- overhead per block: 6.9us -> 7.4us

Note that this is with 100 empty receipts in the same block, with all
different accounts. In real traffic it usually does not happen that so
many different accounts are accessed in the same block. But it is
allowed and we must be able process this case in reasonable time.
So even if it might not help in the average case, it makes sense to
activate this feature to speed up the worst-case.

Currently we use 8 IO threads per shard. Repeated experiments with
more threads showed no difference.
Decreasing it to 4 threads is about equal to 8 threads. Going lower is
significantly worse. Thus, overall, 8 threads seems reasonable here.

Canary nodes in testnet and mainnet with the feature enabled show that
the feature also works as expected on real traffic. The memory impact is
minimal, usually less than 40MB of reserved capacity, which is less than
8MB actual memory because 8 threads reserve 4MB each ahead of actually
fetching the data.
* doc: wordsmithing

* Update book/src/README.md

Co-authored-by: Akhilesh Singhania <akhi@near.org>

Co-authored-by: Akhilesh Singhania <akhi@near.org>
This PR bulk-moves the content from docs to book, and reformats it according to our code style (mostly wrapping).

I very specifically don't want to proof-read the contents, fix broken links, etc. I want to do that initially, but I'd love to start with just moving the stuff over.
docs feels like a much more self-explanatory name than book. I
originally put the stuff into book folder to not mess up existing docs,
but now that those are moved into the book, we can set up proper name.

This was the plan all along!

This also finalizes, for the time being

* the name "Guide to Nearcore Development" (unispiringly stolen from
  https://github.com/rust-lang/rustc-dev-guide/blob/master/book.toml)
* the url https://near.github.io/nearcore/. We can add some fancier DNS
  here, but I don't see a big need here. I also hope that most links
  would be *internal* to the guide, so I wouldnt' worry about URL
  stability too much. This is by design not the reference doc!
It has been a while since the last time we cut the crates releases. Let's do that with 1.29.0 release of nearcore.

Tagged: https://github.com/near/nearcore/releases/tag/crates-0.15.0

Notable changes:

  - #6924
  - #6941
  - (BREAKING CHANGE) #7308

Check here for a full changelog: crates-0.14.0...crates-0.15.0
Make it clear what kind of thing is missing from the DB. Ideally, we'd
add a db_col field to the error, but that's going to be a much larger
change.
jakmeier and others added 10 commits September 27, 2022 13:56
resolves #7004, see that issue for reasoning behind the change
This PR implements FlatStorageState, the struct that manages deltas in flat storage. After this PR, the main implementation of flat storage is mostly ready. The next step is to add more tests.
* Upgrade tracing and opentelemetry crates.

* Avoid touching Cargo.toml's for minior version upgrade of `tracing`.

* HTTP exporter is not the best but it:
a) links
b) doesn't require protoc

* Undo opentelemtry upgrade, because it introduces an external dependency on protoc or link-time errors.

Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
@nikurt nikurt requested a review from a team as a code owner September 27, 2022 11:58
@nikurt nikurt requested a review from mm-near September 27, 2022 11:58
@nikurt nikurt requested a review from nagisa September 27, 2022 13:13
Comment on lines 49 to 59
Handle<
Filtered<
tracing_subscriber::fmt::Layer<Registry, DefaultFields, Format, NonBlocking>,
EnvFilter,
LevelFilter,
Layered<
Filtered<
tracing_subscriber::fmt::Layer<Registry, DefaultFields, Format, NonBlocking>,
reload::Layer<EnvFilter, Registry>,
Registry,
>,
Registry,
>,
Registry,
>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes. I wish we were able to use TAIT.

core/o11y/src/lib.rs Outdated Show resolved Hide resolved
core/o11y/src/lib.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@nagisa nagisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall seems good.

neard/src/log_config_watcher.rs Outdated Show resolved Hide resolved
core/o11y/src/lib.rs Outdated Show resolved Hide resolved
core/o11y/src/lib.rs Outdated Show resolved Hide resolved
core/o11y/src/lib.rs Outdated Show resolved Hide resolved
@near-bulldozer near-bulldozer bot merged commit c6f0059 into master Sep 28, 2022
@near-bulldozer near-bulldozer bot deleted the nikurt-nodekey branch September 28, 2022 18:13
nikurt added a commit that referenced this pull request Sep 29, 2022
Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node.

Tested that all combinations of logs verbosity and opentelemetry verbosity work well together.
Tested by reconfiguring logging as follows:

```
echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard
```
nikurt added a commit that referenced this pull request Nov 9, 2022
Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node.

Tested that all combinations of logs verbosity and opentelemetry verbosity work well together.
Tested by reconfiguring logging as follows:

```
echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants