Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Add new hardware and software metrics #11062

Merged
merged 26 commits into from
Apr 11, 2022
Merged

Conversation

koute
Copy link
Contributor

@koute koute commented Mar 18, 2022

This PR adds new hardware/software telemetry to Substrate.

The following extra information about the system is gathered (Linux-only):

  • CPU name
  • CPU core count
  • RAM size
  • Linux kernel version
  • Linux distribution
  • Is the node running on a VM?

The following benchmarks are ran on startup (all OSes):

  • CPU speed (hashrate of BLAKE2b-256 in MB/s)
  • Memory speed (how many MB/s it can memcpy)
  • Sequential disk write speed
  • Random disk write speed

The benchmarks are ran on every startup and in total should take less than ~1s. They are deliberately kept very simple as to not become a maintenance burden.

I've also changed how the node reports its version; previously it appended the current CPU ISA + the OS + the environment to the version and sent it as one field (e.g. 0.9.17-75dd6c7d0-x86_64-linux-gnu) while now that field contains only the version (e.g. 0.9.17-75dd6c7d0) and the rest of the information is transmitted as their own fields.

Fixes (partially) #8944

Polkadot PR (should me mergeable independently now, so not marking as companion): paritytech/polkadot#5206

Cumulus PR (should me mergeable independently now, so not marking as companion): paritytech/cumulus#1113

substrate-telemetry PR: paritytech/substrate-telemetry#464

cc @emostov @jsdw

Questions you might have

Why is the system information Linux-only?

The majority of nodes are running on Linux, so it makes sense to start there. Besides, gathering this information on Linux is pretty trivial.

Why not use an external crate to get the system information?

Apparently (or so I've heard) we used such crates in the past and we had problems with them; since gathering this is pretty simple anyway I see no point it adding new, potentially janky dependencies.

Are these benchmarks reliable?

In general from what I can see - yes, however the numbers they produce obviously do vary from run-to-run and can change depending on whether something else also ran in the background on that machine. But on average across the whole network this shouldn't matter. We could periodically rerun them in the future, but for now I propose that we should just add them as-is and see whether any potential noise will actually be a problem or not.

Is this going to be useful?

I think it will. Although it's hard to tell without actually, you know, adding those metrics in and seeing the results.

How will those be displayed?

These metrics are printed out in the console (screenshot from our benchmarking machine):

console

And also in substrate-telemetry (work in progress; incomplete; also, I had only one node connected since I'm only testing it locally, so this is not completely representative of how it will actually look):

telemetry

As you can see there's a new tab/category in the upper right corner on which you can click, and which will bring you to this screen where you'll see a bunch of tables with aggregate statistics for a given chain, showing the most common values for each category. (Somewhat inspired by the Steam Hardware Survey.) This will display all of the information gathered here along with the relative benchmark results as compared to our bechmarking machine. (So we'll see what fraction of the network is running faster/slower hardware.)

@koute koute added A0-please_review Pull request needs code review. J0-enhancement An additional feature request. B5-clientnoteworthy C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Mar 18, 2022
@koute koute requested a review from a team March 18, 2022 14:18
Copy link
Contributor

@wigy-opensource-developer wigy-opensource-developer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you did not want to depend on an external crate to gather these data. But are you afraid if you extracted your benchmark code into a separate crate, other people would start asking new features into it unrelated to what Substrate needs?

Comment on lines 509 to 513
info!("💻 Operating system: {}", TARGET_OS);
info!("💻 CPU architecture: {}", TARGET_ARCH);
if !TARGET_ENV.is_empty() {
info!("💻 Target environment: {}", TARGET_ENV);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking at these log lines, I would get the impression that the properties of the running system are listed, not those of the build target. I know it is really an edge-case, but foreign ELF formats can be loaded and run emulated on a different system. Are we okay with ignoring those fringe usages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... well, that is a good point; I don't think we have to care about this in general though since those should mostly be really fringe cases, and detecting this will most likely not be easy. (That said, if anyone has any counterpoints here or any good ideas how to handle this in a reasonable way I'm all ears.)

I guess the most likely cases here would be either someone running the Linux binary on a BSD system, or someone running an amd64 binary on an M1 Mac (but we don't provide binaries for macOS, so they'd have to compile it themselves, and if they're compiling it themselves then why not compile a native aarch64 binary in the first place and run that?).

@koute
Copy link
Contributor Author

koute commented Mar 18, 2022

I understand you did not want to depend on an external crate to gather these data. But are you afraid if you extracted your benchmark code into a separate crate, other people would start asking new features into it unrelated to what Substrate needs?

Well, I guess we could chuck it into a separate crate, but I'm not entirely convinced that it'd be worth it. And, yes, once you get any actual external users they do tend to start asking for new features. (: A major point of this implementation is that it's small, simple and narrow in scope. There's a gazillion of other things a general-purpose sysinfo and/or benchmarking crate would have to support besides what we support here. (Just compare our ~100 lines of code which gather all the Linux sysinfo we need with the sysinfo crate which is 10k lines of code.)

@davxy
Copy link
Member

davxy commented Mar 18, 2022

Excluding the grouping tests in their submodule nitpick, LGTM and it is a very cool and useful feature to get some insights about the network nodes

client/service/Cargo.toml Outdated Show resolved Hide resolved
@ggwpez
Copy link
Member

ggwpez commented Mar 21, 2022

We have a lot of tests that just start a --dev node for testing, also in Polkadot.
See bin/node/cli/test in Substrate. Maybe you can add a suppression flag to them as they would otherwise run that over and over.

@koute
Copy link
Contributor Author

koute commented Mar 25, 2022

We have a lot of tests that just start a --dev node for testing, also in Polkadot. See bin/node/cli/test in Substrate. Maybe you can add a suppression flag to them as they would otherwise run that over and over.

Considering those benchmarks take less that 1s it shouldn't be too big of a deal in practice, but good point. I've added an extra flag and suppressed those in those tests. (I'll do polkadot in a separate PR later since this is not critical and if I don't have to put up a companion I'd rather not.)

@koute
Copy link
Contributor Author

koute commented Mar 25, 2022

Looks like a companion is necessary now anyway since I've added the new CLI argument.

@koute koute added A0-please_review Pull request needs code review. and removed A0-please_review Pull request needs code review. labels Mar 25, 2022
@davxy davxy self-requested a review March 25, 2022 14:19
client/telemetry/src/lib.rs Outdated Show resolved Hide resolved
client/service/src/builder.rs Outdated Show resolved Hide resolved
Copy link
Member

@bkchr bkchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly some nitpicks, otherwise it looks good. However, I also don't checked every benchmark into each detail.

client/sysinfo/src/sysinfo.rs Outdated Show resolved Hide resolved
}
}

positions.shuffle(&mut rng());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you want to use some fixed seed here to always have this reproducible?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably doesn't make such a big difference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something I am using a fixed seed? (:

fn rng() -> rand_pcg::Pcg64 {
	rand_pcg::Pcg64::new(0xcafef00dd15ea5e5, 0xa02bdbf7bb3c0a7ac28fa16a64abf96)
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh fuck :D I did not realize that rng was defined locally. I did not say anything :P

client/sysinfo/src/sysinfo.rs Outdated Show resolved Hide resolved
client/sysinfo/src/sysinfo.rs Outdated Show resolved Hide resolved
client/sysinfo/src/lib.rs Outdated Show resolved Hide resolved
client/telemetry/src/lib.rs Outdated Show resolved Hide resolved
@arkpar
Copy link
Member

arkpar commented Apr 5, 2022

Minor nit: for the CLI we follow the GNU convention of using --no-whatever rather than--disable-whatever

@koute
Copy link
Contributor Author

koute commented Apr 5, 2022

I've randomly recompiled and retested this code and.... the hwbench telemetry stopped working. The message was not being sent for some reason, even though I swear it used to work.

It took a while to debug, but it turns out the connection notification stream in telemetry was totally broken - it was using send which is async and returns a future which needs to be awaited on, but the code was deliberately ignoring it with a let _ =. I have no idea why it worked before. Anyway, I've fixed it to use the synchronous try_send instead, and made it so that it won't leak disconnected notifiers while I'm at it. (Alas, retain_mut's still unstable, so it's a little ugly.)

This should be good to go I think; I'll let it marinate until tomorrow and if there won't be any more comments I'll merge it in.

@koute
Copy link
Contributor Author

koute commented Apr 5, 2022

Minor nit: for the CLI we follow the GNU convention of using --no-whatever rather than--disable-whatever

We have the --disable-log-color argument, so not entirely it seems... (:

But I don't really care either way; I can change it.

@bkchr
Copy link
Member

bkchr commented Apr 5, 2022

it was using send which is async and returns a future which needs to be awaited on, but the code was deliberately ignoring it with a let _ =. I have no idea why it worked before. Anyway, I've fixed it to use the synchronous try_send instead, and made it so that it won't leak disconnected notifiers while I'm at it. (Alas, retain_mut's still unstable, so it's a little ugly.)

Okay that is really bad and your solution is also not really great. I already see people complaining about the warning appearing very often. To the point why it worked before, send is internally first taking the same path as try_send and if that fails it returns the future to try again. The correct solution here would be to collect these futures and to poll them until they are finished.

@bkchr
Copy link
Member

bkchr commented Apr 5, 2022

(If you convert your warn into a debug you can merge as is and open an issue to implement this properly)

@koute
Copy link
Contributor Author

koute commented Apr 6, 2022

To the point why it worked before, send is internally first taking the same path as try_send and if that fails it returns the future to try again. The correct solution here would be to collect these futures and to poll them until they are finished.

Wait, is it? From what I can see send doesn't actually do anything besides creating a new struct. ([1], [2], [3]) I was convinced it used to work, but this definitely should have not worked. Maybe I'm just going crazy. (Or I'm looking at the wrong thing?)

and your solution is also not really great. I already see people complaining about the warning appearing very often. [..] The correct solution here would be to collect these futures and to poll them until they are finished.

Unless I'm missing something here AFAIK it shouldn't appear at all in normal circumstances. The try_send should always succeed as long as there's space in the channel (that is: as long as the receiver is not stuck; if I'm reading this code right) and this is triggered only when we succeeded (re)connecting to the telemetry. So for this warning to trigger this needs to happen:

  1. We get disconnected from the telemetry.
  2. We get reconnected to the telemetry.
  3. The connection notifier receiver must have not yet processed a notification from before the (1) happened.

We're probably probably not going to be disconnected from the telemetry very often (and that prints a warning on its own anyway), and on top of that for this warning to trigger the receiver either isn't handling the notifications at all, or is handling them really slow (slower than the time it takes to get disconnected from the telemetry and reconnected again), so I'd argue the warn! here is warranted since if it's printed out there's probably a bug somewhere in the code (because it hasn't yet processed the notification from before when we got disconnected).

Also, here's a quick test program to make sure this works as I think it works:

use futures::StreamExt;

#[tokio::main]
async fn main() {
    let (mut tx, mut rx) = futures::channel::mpsc::channel(0);
    tokio::task::spawn_blocking(move || {
        loop {
            tx.try_send(()).unwrap();
            std::thread::sleep(std::time::Duration::from_millis(1));
        }
    });

    loop {
        rx.next().await.unwrap();
        println!("Got message");
    }
}

This prints out:

Got message
Got message
Got message
Got message
...

So try_send always succeeds as long as the receiver is processing the messages fast enough, so I think in practice this warning should never actually be triggered. (And if it does I'd like to know, since either something's seriously broken, or this whole thing works entirely different than how I think it works.)

@bkchr
Copy link
Member

bkchr commented Apr 6, 2022

Wait, is it? From what I can see send doesn't actually do anything besides creating a new struct. ([1], [2], [3]) I was convinced it used to work, but this definitely should have not worked. Maybe I'm just going crazy. (Or I'm looking at the wrong thing?)

No you are right. I didn't checked the code yesterday and was under the impression that the model was right in my brain. Sorry! No after thinking again about it this doesn't make any sense, because futures need to be polled to do something...

So try_send always succeeds as long as the receiver is processing the messages fast enough, so I think in practice this warning should never actually be triggered. (And if it does I'd like to know, since either something's seriously broken, or this whole thing works entirely different than how I think it works.)

The point here is you don't know what is on the other side and why it is maybe not processing messages fast enough. I would argue that if you for example reconnected very fast and the other side still doesn't has processed the first reconnect, there is no harm in just dropping any further reconnect message, because they will process the waiting message at some point. We can also not consume too much memory, because there is only room for one element in the channel.

TDLR: Either remove the log completely or turn it into a debug please.

@koute
Copy link
Contributor Author

koute commented Apr 6, 2022

Okay, I've changed it to a debug log and also renamed the command-line argument as @arkpar requested. Should be good now; I'll also update the companions.

@koute
Copy link
Contributor Author

koute commented Apr 11, 2022

bot merge

@paritytech-processbot
Copy link

Error: pr-custom-review is not passing for paritytech/polkadot#5206

@koute koute merged commit 5597a93 into paritytech:master Apr 11, 2022
@koute
Copy link
Contributor Author

koute commented Apr 11, 2022

Merged in manually (with the merge button on GH) since the bot was still somehow watching the companion PR (which shouldn't have to be a companion now and should be mergable independently, so no point in making it harder than it needs to be).

DaviRain-Su pushed a commit to octopus-network/substrate that referenced this pull request Aug 23, 2022
* Add new hardware and software metrics

* Move sysinfo tests into `mod tests`

* Correct a typo in a comment

* Remove unnecessary `nix` dependency

* Fix the version tests

* Add a `--disable-hardware-benchmarks` CLI argument

* Disable hardware benchmarks in the integration tests

* Remove unused import

* Fix benchmarks compilation

* Move code to a new `sc-sysinfo` crate

* Correct `impl_version` comment

* Move `--disable-hardware-benchmarks` to the chain-specific bin crate

* Move printing out of hardware bench results to `sc-sysinfo`

* Move hardware benchmarks to a separate messages; trigger them manually

* Rename some of the fields in the `HwBench` struct

* Revert changes to the telemetry crate; manually send hwbench messages

* Move sysinfo logs into the sysinfo crate

* Move the `TARGET_OS_*` constants into the sysinfo crate

* Minor cleanups

* Move the `HwBench` struct to the sysinfo crate

* Derive `Clone` for `HwBench`

* Fix broken telemetry connection notification stream

* Prevent the telemetry connection notifiers from leaking if they're disconnected

* Turn the telemetry notification failure log into a debug log

* Rename `--disable-hardware-benchmarks` to `--no-hardware-benchmarks`
ark0f pushed a commit to gear-tech/substrate that referenced this pull request Feb 27, 2023
* Add new hardware and software metrics

* Move sysinfo tests into `mod tests`

* Correct a typo in a comment

* Remove unnecessary `nix` dependency

* Fix the version tests

* Add a `--disable-hardware-benchmarks` CLI argument

* Disable hardware benchmarks in the integration tests

* Remove unused import

* Fix benchmarks compilation

* Move code to a new `sc-sysinfo` crate

* Correct `impl_version` comment

* Move `--disable-hardware-benchmarks` to the chain-specific bin crate

* Move printing out of hardware bench results to `sc-sysinfo`

* Move hardware benchmarks to a separate messages; trigger them manually

* Rename some of the fields in the `HwBench` struct

* Revert changes to the telemetry crate; manually send hwbench messages

* Move sysinfo logs into the sysinfo crate

* Move the `TARGET_OS_*` constants into the sysinfo crate

* Minor cleanups

* Move the `HwBench` struct to the sysinfo crate

* Derive `Clone` for `HwBench`

* Fix broken telemetry connection notification stream

* Prevent the telemetry connection notifiers from leaking if they're disconnected

* Turn the telemetry notification failure log into a debug log

* Rename `--disable-hardware-benchmarks` to `--no-hardware-benchmarks`
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit J0-enhancement An additional feature request.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants