Add new hardware and software metrics #11062
Conversation
I understand you did not want to depend on an external crate to gather these data. But are you afraid if you extracted your benchmark code into a separate crate, other people would start asking new features into it unrelated to what Substrate needs?
client/service/src/builder.rs
info!("💻 Operating system: {}", TARGET_OS);
info!("💻 CPU architecture: {}", TARGET_ARCH);
if !TARGET_ENV.is_empty() {
    info!("💻 Target environment: {}", TARGET_ENV);
}
Just looking at these log lines, I would get the impression that the properties of the running system are listed, not those of the build target. I know it is really an edge-case, but foreign ELF formats can be loaded and run emulated on a different system. Are we okay with ignoring those fringe usages?
Hmm... well, that is a good point; I don't think we have to care about this in general though since those should mostly be really fringe cases, and detecting this will most likely not be easy. (That said, if anyone has any counterpoints here or any good ideas how to handle this in a reasonable way I'm all ears.)
I guess the most likely cases here would be either someone running the Linux binary on a BSD system, or someone running an amd64 binary on an M1 Mac (but we don't provide binaries for macOS, so they'd have to compile it themselves, and if they're compiling it themselves then why not compile a native aarch64 binary in the first place and run that?).
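To make the build-target point concrete, here's a minimal sketch using the standard library's `std::env::consts` (assumed here as a stand-in for the `TARGET_OS`/`TARGET_ARCH` constants discussed above): these values are baked in at compile time, so an emulated foreign binary still reports its build target, not the host it happens to run on.

```rust
// These constants describe the *build target*, fixed at compile time.
// An amd64 binary running emulated on an aarch64 host still reports
// "x86_64" here; detecting the real host would need a runtime check.
fn target_info() -> (&'static str, &'static str) {
    (std::env::consts::OS, std::env::consts::ARCH)
}

fn main() {
    let (os, arch) = target_info();
    println!("Operating system: {}", os);
    println!("CPU architecture: {}", arch);
}
```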
Well, I guess we could chuck it into a separate crate, but I'm not entirely convinced that it'd be worth it. And, yes, once you get any actual external users they do tend to start asking for new features. (: A major point of this implementation is that it's small, simple and narrow in scope. There's a gazillion of other things a general-purpose sysinfo and/or benchmarking crate would have to support besides what we support here. (Just compare our ~100 lines of code which gather all the Linux sysinfo we need with the …)
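For a flavour of how little code is involved, here's a hedged sketch (not the actual `sc-sysinfo` implementation) of gathering one piece of Linux sysinfo by parsing `/proc`:

```rust
use std::fs;

// Illustrative only: read total system memory (in kB) from /proc/meminfo.
// Returns None on non-Linux systems or if the file can't be parsed.
fn mem_total_kb() -> Option<u64> {
    let meminfo = fs::read_to_string("/proc/meminfo").ok()?;
    meminfo
        .lines()
        .find(|line| line.starts_with("MemTotal:"))? // e.g. "MemTotal: 16384 kB"
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    match mem_total_kb() {
        Some(kb) => println!("Memory: {} kB", kb),
        None => println!("Memory: unknown"),
    }
}
```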
Apart from the nitpick about grouping the tests in their own submodule, LGTM. It is a very cool and useful feature to get some insights about the network nodes.
We have a lot of tests that just start a …
Considering those benchmarks take less than 1s it shouldn't be too big of a deal in practice, but good point. I've added an extra flag and suppressed them in those tests. (I'll do …)
Looks like a companion is necessary now anyway since I've added the new CLI argument.
Mainly some nitpicks, otherwise it looks good. However, I also didn't check every benchmark in detail.
positions.shuffle(&mut rng());
Don't you want to use some fixed seed here to always have this reproducible?
It probably doesn't make such a big difference.
Unless I'm missing something I am using a fixed seed? (:
fn rng() -> rand_pcg::Pcg64 {
rand_pcg::Pcg64::new(0xcafef00dd15ea5e5, 0xa02bdbf7bb3c0a7ac28fa16a64abf96)
}
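To illustrate why the fixed seed settles the reproducibility question, here's a toy sketch (using a hand-rolled xorshift generator rather than the actual `rand_pcg::Pcg64`): a Fisher-Yates shuffle driven by a deterministically seeded RNG produces the same permutation on every run.

```rust
// Toy deterministic RNG; stands in for rand_pcg::Pcg64 for illustration only.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

// Fisher-Yates shuffle: deterministic whenever the RNG is.
fn shuffle<T>(items: &mut [T], rng: &mut XorShift64) {
    for i in (1..items.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

fn main() {
    let mut a: Vec<u32> = (0..8).collect();
    let mut b = a.clone();
    shuffle(&mut a, &mut XorShift64(0xcafef00d));
    shuffle(&mut b, &mut XorShift64(0xcafef00d));
    assert_eq!(a, b); // same seed, same order, every run
    println!("{:?}", a);
}
```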
Ohh fuck :D I did not realize that `rng` was defined locally. I did not say anything :P
Minor nit: for the CLI we follow the GNU convention of using …
I've randomly recompiled and retested this code and... the hwbench telemetry stopped working. The message was not being sent for some reason, even though I swear it used to work. It took a while to debug, but it turns out the connection notification stream in telemetry was totally broken: it was using …
This should be good to go, I think; I'll let it marinate until tomorrow and if there aren't any more comments I'll merge it in.
We have the … But I don't really care either way; I can change it.
Okay, that is really bad, and your solution is also not really great. I can already see people complaining about the warning appearing very often. As to why it worked before: …
(If you convert your …
Wait, is it? From what I can see …
Unless I'm missing something here, AFAIK it shouldn't appear at all in normal circumstances. The …
We're probably not going to be disconnected from the telemetry very often (and that prints a warning on its own anyway), and on top of that, for this warning to trigger the receiver either isn't handling the notifications at all, or is handling them really slowly (slower than the time it takes to get disconnected from the telemetry and reconnected again), so I'd argue the …
Also, here's a quick test program to make sure this works as I think it works:

use futures::StreamExt;

#[tokio::main]
async fn main() {
    let (mut tx, mut rx) = futures::channel::mpsc::channel(0);
    tokio::task::spawn_blocking(move || {
        loop {
            tx.try_send(()).unwrap();
            std::thread::sleep(std::time::Duration::from_millis(1));
        }
    });
    loop {
        rx.next().await.unwrap();
        println!("Got message");
    }
}

This prints out:
So …
No, you are right. I didn't check the code yesterday and was under the impression that the model in my brain was right. Sorry! No, after thinking about it again, this doesn't make any sense, because futures need to be polled to do something...
The point here is you don't know what is on the other side and why it is maybe not processing messages fast enough. I would argue that if you, for example, reconnected very fast and the other side still hasn't processed the first reconnect, there is no harm in just dropping any further reconnect message, because they will process the waiting message at some point. We also can't consume too much memory, because there is only room for one element in the channel. TL;DR: Either remove the log completely or turn it into a debug, please.
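The drop-on-full behaviour described above can be sketched with the standard library's bounded channel (assumed here as a stand-in for the `futures` mpsc channel the PR actually uses): when the buffer is full, `try_send` fails and the extra message can simply be discarded, so memory use stays bounded.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

fn main() {
    // One-slot bounded channel, analogous to the telemetry notification channel.
    let (tx, _rx) = sync_channel::<&str>(1);

    // The first notification occupies the single slot.
    assert!(tx.try_send("reconnected").is_ok());

    // A second notification before the receiver catches up is rejected;
    // dropping it is harmless since the pending one will be processed anyway.
    match tx.try_send("reconnected again") {
        Err(TrySendError::Full(_)) => { /* drop it, maybe log at debug level */ }
        _ => unreachable!("slot was already occupied"),
    }
}
```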
Okay, I've changed it to a …
bot merge
Error: pr-custom-review is not passing for paritytech/polkadot#5206
Merged it in manually (with the merge button on GH) since the bot was still somehow watching the companion PR (which shouldn't have to be a companion now and should be mergeable independently, so no point in making it harder than it needs to be).
* Add new hardware and software metrics
* Move sysinfo tests into `mod tests`
* Correct a typo in a comment
* Remove unnecessary `nix` dependency
* Fix the version tests
* Add a `--disable-hardware-benchmarks` CLI argument
* Disable hardware benchmarks in the integration tests
* Remove unused import
* Fix benchmarks compilation
* Move code to a new `sc-sysinfo` crate
* Correct `impl_version` comment
* Move `--disable-hardware-benchmarks` to the chain-specific bin crate
* Move printing out of hardware bench results to `sc-sysinfo`
* Move hardware benchmarks to separate messages; trigger them manually
* Rename some of the fields in the `HwBench` struct
* Revert changes to the telemetry crate; manually send hwbench messages
* Move sysinfo logs into the sysinfo crate
* Move the `TARGET_OS_*` constants into the sysinfo crate
* Minor cleanups
* Move the `HwBench` struct to the sysinfo crate
* Derive `Clone` for `HwBench`
* Fix broken telemetry connection notification stream
* Prevent the telemetry connection notifiers from leaking if they're disconnected
* Turn the telemetry notification failure log into a debug log
* Rename `--disable-hardware-benchmarks` to `--no-hardware-benchmarks`
This PR adds new hardware/software telemetry to Substrate.
The following extra information about the system is gathered (Linux-only):
The following benchmarks are run on startup (all OSes):
* memcpy

The benchmarks are run on every startup and in total should take less than ~1s. They are deliberately kept very simple so as not to become a maintenance burden.
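As a rough idea of what such a startup benchmark can look like, here's a hedged memcpy-style sketch (the function name, buffer size and iteration count are illustrative, not the actual `sc-sysinfo` code):

```rust
use std::time::Instant;

// Copy a buffer repeatedly and report throughput in MB/s.
fn bench_memcpy(size: usize, iters: usize) -> f64 {
    let src = vec![1u8; size];
    let mut dst = vec![0u8; size];
    let start = Instant::now();
    for _ in 0..iters {
        dst.copy_from_slice(&src);
    }
    let elapsed = start.elapsed().as_secs_f64();
    // Keep `dst` observable so the copies aren't optimized away.
    assert_eq!(dst[0], 1);
    (size as f64 * iters as f64) / elapsed / (1024.0 * 1024.0)
}

fn main() {
    let mb_per_s = bench_memcpy(8 * 1024 * 1024, 16);
    println!("Memory bandwidth: ~{:.0} MB/s", mb_per_s);
}
```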
I've also changed how the node reports its `version`; previously it appended the current CPU ISA + the OS + the environment to the version and sent it as one field (e.g. `0.9.17-75dd6c7d0-x86_64-linux-gnu`), while now that field contains only the version (e.g. `0.9.17-75dd6c7d0`) and the rest of the information is transmitted in separate fields.

Fixes (partially) #8944
Polkadot PR (should be mergeable independently now, so not marking as companion): paritytech/polkadot#5206
Cumulus PR (should be mergeable independently now, so not marking as companion): paritytech/cumulus#1113
`substrate-telemetry` PR: paritytech/substrate-telemetry#464

cc @emostov @jsdw
Questions you might have
Why is the system information Linux-only?
The majority of nodes are running on Linux, so it makes sense to start there. Besides, gathering this information on Linux is pretty trivial.
Why not use an external crate to get the system information?
Apparently (or so I've heard) we used such crates in the past and had problems with them; since gathering this information is pretty simple anyway, I see no point in adding new, potentially janky dependencies.
Are these benchmarks reliable?
In general, from what I can see, yes; however, the numbers they produce obviously do vary from run to run and can change depending on whether something else was also running in the background on that machine. But on average, across the whole network, this shouldn't matter. We could periodically rerun them in the future, but for now I propose that we just add them as-is and see whether any potential noise will actually be a problem or not.
Is this going to be useful?
I think it will. Although it's hard to tell without actually, you know, adding those metrics in and seeing the results.
How will those be displayed?
These metrics are printed out in the console (screenshot from our benchmarking machine):
And also in `substrate-telemetry` (work in progress; incomplete; also, I had only one node connected since I'm only testing it locally, so this is not completely representative of how it will actually look):

As you can see, there's a new tab/category in the upper right corner on which you can click, which will bring you to a screen with a bunch of tables of aggregate statistics for a given chain, showing the most common values for each category. (Somewhat inspired by the Steam Hardware Survey.) This will display all of the information gathered here along with the relative benchmark results as compared to our benchmarking machine. (So we'll see what fraction of the network is running faster/slower hardware.)