-
Notifications
You must be signed in to change notification settings - Fork 663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(o11y): Configure opentelemetry filter at runtime #7701
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…he_ and near_shard_cache_ Crate `itoa` provides the fastest way of converting ints as strings.
Turns out `CryptoHash::hash_borsh(&[foo])` doesn’t quite do what one might expect. Rather than hashing one-element slice (which is serialised as length of one plus serialisation of the element) it hashes one-element array (which is serialised as just the element). Introduce `CryptoHash::hash_borsh_slice` to provide less error-prone method and fix tests which fell victim of the confusion. The tests were broken by commit d5f650a: ‘Prefer CryptoHash::hash_borsh to try_to_vec+hash combo’.
Part of #6910 The grand plan is to separate EpochManager out of NightshadeRuntime. As a first step, I want to separate the *types* of runtime and epoch manager, while keeping them in the same object. I want to do this via the following series of steps: ``` trait RuntimeAdapter {} ``` ``` trait EpochManagerAdapter {} trait RuntimeAdapter: EpochManagerAdapter {} ``` ``` trait EpochManagerAdapter {} trait RuntimeAdapter {} ``` This PR introduces the EpochManagerAdapter infra. You can see the rough idea of the plan in #7606
What this test is intended to do is to spawn a local neard node and to run near-api-js tests against that. The purpose is to detect unintentional changes to near public API. We have disabled actually running the test a while ago, and today the build broke b/c api-js moved from yarn to pnpm. I think at this it's safe to kill this infra completely, as we weren't running it for a while. It would of course be nice to have some in-repo tests covering public JSON API, but I think we are currently server well-enough by betanet and testnet anyway. Testing public APIs via running tests of downstream projects upstream is clearly a wrong way to do that!
Use a faster way of converting `shard_id` to a string. Crate `itoa` provides the fastest way of converting ints to strings. Snapshot of metrics from a mainnet node. Note that some metrics are not incremented by 1 (`inc()`), but rather by a larger number using `inc_by()`. ``` 1948843343 near_chunk_cache_hits # Not hot 399446179 near_chunk_cache_misses # Not hot 376782133 near_shard_cache_hits # Not hot 276622570 near_peer_manager_messages_time_count 239378614 near_applied_trie_deletions # Optimized 206898552 near_database_op_latency_by_op_and_column_count 184492507 near_peer_message_sent_by_type_total 179740890 near_peer_message_received_total 179344494 near_peer_message_received_by_type_total 168098152 near_applied_trie_insertions # Optimized 160684011 near_shard_cache_pop_hits # Not hot 100536077 near_client_triggers_time_count 100064991 near_requests_count_by_type_total 98868135 near_client_messages_processing_time_count 98868135 near_client_messages_count 97408226 near_peer_client_message_received_by_type_total 94021301 near_network_routed_msg_latency_count 80880283 partial_encoded_chunk_request_delay_count 78394032 near_shard_cache_gc_pop_misses 51277767 partial_encoded_chunk_response_delay_count 43248133 near_partial_encoded_chunk_request_processing_time_count 22664046 near_shard_cache_misses # Not hot ```
Adding an explicit "test fixture" struct makes call-sites more readable (named args!) and makes making further changes easier. No particular grand plan here, just figured this small change to be a marginal improvement!
https://blog.rust-lang.org/2022/09/22/Rust-1.64.0.html Notable changes: * rust-analyzer is now available via rustup :o) * cargo supports "workspace inheritance". We can use this to cut some repeated deps and avoid repating rust-version everywhere. I'll handle that in a separate PR * poll_fn and ready! are stabilized, we might use them for some of our async stuff * add Nintendo Switch as tier 3 target so I guess we can run neard on switch now?
* monitor_peers_max_period - controls how ofen we think about connecting to a new peer (default is 60 seconds) * peer_expiration_duration - controls when we remove the peer from candidate set (default - 7 days since we heard from it) Also hooked up peer_stats_period to use the value from config if set.
Prefetch receipt meta data (account and access keys) ahead of time. This recent performance optimization has been disabled by default. In lab settings, performance improvement is confirmed. Using the estimator to measure the time it takes to process empty receipts, on a DB with 2 million accounts, on a local SSD, with enabled shard caches. The result is as follows. - sender = receiver: 737us -> 386 us - sender != receiver: 1014us -> 644us - overhead per block: 6.9us -> 7.4us Note that this is with 100 empty receipts in the same block, with all different accounts. In real traffic it usually does not happen that so many different accounts are accessed in the same block. But it is allowed and we must be able process this case in reasonable time. So even if it might not help in the average case, it makes sense to activate this feature to speed up the worst-case. Currently we use 8 IO threads per shard. Repeated experiments with more threads showed no difference. Decreasing it to 4 threads is about equal to 8 threads. Going lower is significantly worse. Thus, overall, 8 threads seems reasonable here. Canary nodes in testnet and mainnet with the feature enabled show that the feature also works as expected on real traffic. The memory impact is minimal, usually less than 40MB of reserved capacity, which is less than 8MB actual memory because 8 threads reserve 4MB each ahead of actually fetching the data.
* doc: wordsmithing * Update book/src/README.md Co-authored-by: Akhilesh Singhania <akhi@near.org> Co-authored-by: Akhilesh Singhania <akhi@near.org>
This PR bulk-moves the content from docs to book, and reformats it according to our code style (mostly wrapping). I very specifically don't want to proof-read the contents, fix broken links, etc. I want to do that initially, but I'd love to start with just moving the stuff over.
docs feels like a much more self-explanatory name than book. I originally put the stuff into book folder to not mess up existing docs, but now that those are moved into the book, we can set up proper name. This was the plan all along! This also finalizes, for the time being * the name "Guide to Nearcore Development" (unispiringly stolen from https://github.com/rust-lang/rustc-dev-guide/blob/master/book.toml) * the url https://near.github.io/nearcore/. We can add some fancier DNS here, but I don't see a big need here. I also hope that most links would be *internal* to the guide, so I wouldnt' worry about URL stability too much. This is by design not the reference doc!
It has been a while since the last time we cut the crates releases. Let's do that with 1.29.0 release of nearcore. Tagged: https://github.com/near/nearcore/releases/tag/crates-0.15.0 Notable changes: - #6924 - #6941 - (BREAKING CHANGE) #7308 Check here for a full changelog: crates-0.14.0...crates-0.15.0
Make it clear what kind of thing is missing from the DB. Ideally, we'd add a db_col field to the error, but that's going to be a much larger change.
resolves #7004, see that issue for reasoning behind the change
See #7674 for more details.
This PR implements FlatStorageState, the struct that manages deltas in flat storage. After this PR, the main implementation of flat storage is mostly ready. The next step is to add more tests.
* Upgrade tracing and opentelemetry crates. * Avoid touching Cargo.toml's for minior version upgrade of `tracing`. * HTTP exporter is not the best but it: a) links b) doesn't require protoc * Undo opentelemtry upgrade, because it introduces an external dependency on protoc or link-time errors. Co-authored-by: near-bulldozer[bot] <73298989+near-bulldozer[bot]@users.noreply.github.com>
nagisa
reviewed
Sep 27, 2022
core/o11y/src/lib.rs
Outdated
Comment on lines
49
to
59
Handle< | ||
Filtered< | ||
tracing_subscriber::fmt::Layer<Registry, DefaultFields, Format, NonBlocking>, | ||
EnvFilter, | ||
LevelFilter, | ||
Layered< | ||
Filtered< | ||
tracing_subscriber::fmt::Layer<Registry, DefaultFields, Format, NonBlocking>, | ||
reload::Layer<EnvFilter, Registry>, | ||
Registry, | ||
>, | ||
Registry, | ||
>, | ||
Registry, | ||
>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes. I wish we were able to use TAIT.
nagisa
reviewed
Sep 27, 2022
nagisa
reviewed
Sep 27, 2022
nagisa
reviewed
Sep 27, 2022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall seems good.
nagisa
reviewed
Sep 28, 2022
nagisa
approved these changes
Sep 28, 2022
nikurt
added a commit
that referenced
this pull request
Sep 29, 2022
Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node. Tested that all combinations of logs verbosity and opentelemetry verbosity work well together. Tested by reconfiguring logging as follows: ``` echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard ```
nikurt
added a commit
that referenced
this pull request
Nov 9, 2022
Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node. Tested that all combinations of logs verbosity and opentelemetry verbosity work well together. Tested by reconfiguring logging as follows: ``` echo '{"verbose_module":"", "opentelemetry_level": "DEBUG"}' > ~/.near/betanet/log_config.json ; killall -SIGHUP neard ```
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Enabling or disabling tracing at runtime can be useful to debug issues with nodes running too slow. This allows us to avoid restarting a node, which is known to have a rather large impact on the node.
Tested that all combinations of logs verbosity and opentelemetry verbosity work well together.
Tested by reconfiguring logging as follows: