metrics: add monitoring support for disk and database counters #1346

karalabe · 2015-06-27T15:22:41Z

This PR is meant to add a few general metrics that should be useful for debugging various issues.

- Disk IO usage metrics for Linux (Windows needs cgo to access performance counters, which should be done by someone working on that platform; OSX needs some root-privileged syscalls, so that's an interesting question). Usage: geth monitor system/disk
- LevelDB compaction counters to measure the total disk input, output and time spent on compacting the databases. Usage: geth monitor eth/db/block,state/compact
- Added a --metrics flag that's required to collect anything; otherwise all the metrics calls are no-ops. Because of this, all meter/timer creations need to be done through our own metrics wrapper (i.e. metrics.NewMeter("my/fancy/meter")).
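
To illustrate the last point, here is a minimal sketch of how such a wrapper could gate meter creation on a single switch. It is not the code from this PR: the Meter interface, nopMeter, countingMeter and the Enabled variable are hypothetical names chosen for the example, and only metrics.NewMeter is mentioned in the description above.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Enabled stands in for the effect of the --metrics flag (hypothetical name).
// It is off by default, so no collection happens unless explicitly requested.
var Enabled = false

// Meter is a hypothetical minimal interface for this sketch.
type Meter interface {
	Mark(n int64)
	Count() int64
}

// nopMeter discards every event, so disabled metrics cost almost nothing.
type nopMeter struct{}

func (nopMeter) Mark(int64)   {}
func (nopMeter) Count() int64 { return 0 }

// countingMeter is a stand-in for a real meter implementation.
type countingMeter struct{ total int64 }

func (m *countingMeter) Mark(n int64) { atomic.AddInt64(&m.total, n) }
func (m *countingMeter) Count() int64 { return atomic.LoadInt64(&m.total) }

// NewMeter returns a no-op meter unless metrics are enabled. The name is
// ignored here; a real wrapper would register the meter under it.
func NewMeter(name string) Meter {
	if !Enabled {
		return nopMeter{}
	}
	return &countingMeter{}
}

func main() {
	off := NewMeter("my/fancy/meter")
	off.Mark(5)

	Enabled = true
	on := NewMeter("my/fancy/meter")
	on.Mark(5)

	fmt.Println("disabled count:", off.Count())
	fmt.Println("enabled count:", on.Count())
}
```

Call sites stay identical whether or not collection is on, which is why every meter has to be created through the wrapper rather than through the underlying library directly.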

fjl · 2015-06-28T08:03:00Z

I don't see the point of adding monitoring for system-level stats. There are existing tools for each platform (e.g. iotop, Activity Monitor) that make these visible. Geth monitoring should be restricted to stuff that can only be measured inside of geth.

karalabe · 2015-06-29T07:57:40Z

I added the disk IO monitoring before the LevelDB one because I didn't know LevelDB had a way to retrieve it. I still don't see a way to retrieve everything from LevelDB (we have the compaction stats from this PR, plus very high-level, pre-compression stats from the GET/PUT requests), so there's still a big information gap.

The reason I think this stat might still be useful is because you can compare it to other metrics on the same chart vs. having to open up a different tool and cross reference. However, if you guys feel it's redundant, I see no problem in dropping it.
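
For reference, a rough sketch of how compaction totals could be pulled out of a LevelDB stats dump. It assumes a pipe-separated table with Level, Tables, Size(MB), Time(sec), Read(MB) and Write(MB) columns, similar to what goleveldb reports for its "leveldb.stats" property. The exact layout is an assumption and may differ between versions; parseCompactions, compactionTotals and the sample data are made up for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compactionTotals aggregates the per-level compaction columns (hypothetical type).
type compactionTotals struct {
	TimeSec float64
	ReadMB  float64
	WriteMB float64
}

// parseCompactions sums the Time, Read and Write columns over all levels.
// It assumes six pipe-separated columns per row, following a dashed separator.
func parseCompactions(stats string) (compactionTotals, error) {
	var t compactionTotals
	inTable := false
	for _, line := range strings.Split(stats, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), "---") {
			inTable = true // rows start after the dashed separator
			continue
		}
		if !inTable {
			continue
		}
		fields := strings.Split(line, "|")
		if len(fields) != 6 {
			break // first non-row line ends the table
		}
		var vals [3]float64
		for i := range vals {
			v, err := strconv.ParseFloat(strings.TrimSpace(fields[3+i]), 64)
			if err != nil {
				return t, fmt.Errorf("bad compaction value %q: %v", fields[3+i], err)
			}
			vals[i] = v
		}
		t.TimeSec += vals[0]
		t.ReadMB += vals[1]
		t.WriteMB += vals[2]
	}
	return t, nil
}

func main() {
	// Sample data invented for illustration, not real measurements.
	sample := `Compactions
 Level |   Tables   |    Size(MB)   |    Time(sec)  |    Read(MB)   |   Write(MB)
-------+------------+---------------+---------------+---------------+---------------
   0   |          2 |       1.50000 |       0.25000 |       4.00000 |       3.00000
   1   |          5 |       9.00000 |       0.75000 |       6.00000 |       5.00000
`
	totals, err := parseCompactions(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", totals)
}
```

Polling this periodically and feeding the deltas into meters would give the input, output and time counters on the same chart as the other metrics.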

fjl · 2015-06-29T08:07:27Z

It depends on the amount of code that needs to be added, I guess.

tgerring · 2015-06-29T08:09:59Z

Seeing as the monitoring tool might be indispensable for tracking down performance issues (like our pesky disk I/O situation), I'm happy to see system-level stats as long as it's not too cumbersome to support.

karalabe · 2015-06-29T08:25:47Z

Linux usually exposes a huge amount of information in /proc and /proc/<pid>, so it's very easy to access and monitor. Windows makes this harder, as the information is hidden in performance counters, which aren't that obvious (some are surfaced through the Windows API, some through registry keys), while OSX (as far as I know) needs system calls and usually root privileges for many of them.

I think this project kind of sums up the complexity: https://support.hyperic.com along with the really really dire state of any ps-utils packages implemented in Go. So, to sum it up, writing cross platform system level monitoring is probably not something we want to do.

My take on the matter is that for a select few metrics that geth seems sensitive to (like disk IO), we could consider adding limited monitoring support. Otherwise I concur with @fjl that the "costs" can, and probably do, outweigh the benefits.

obscuren · 2015-06-29T11:22:26Z

While I believe this type of code is useful for debugging, it isn't so useful when running in production mode (e.g., users running the node). While the time to process is probably only a few ms or maybe even µs, it does add up as we add more and more of these timers.

This code can make it in as long as there's an option to disable it (and it's disabled by default). For example, the vm package has a Debug variable you can set that will enable VM logging.

karalabe · 2015-06-29T11:30:07Z

Would it make sense to have fine grained control over the metrics (i.e. being able to enable/disable meters specifically), or just one big on/off switch?

obscuren · 2015-06-29T11:30:53Z

Big on/off. We can improve this later if required.

karalabe · 2015-06-29T11:32:12Z

K, that should be simple enough.

obscuren added the in progress label Jun 27, 2015

karalabe added review and removed in progress labels Jun 29, 2015

Merge pull request ethereum#1353 from karalabe/fix-double-fetch

5e7db8f

eth/fetcher: don't double filter/fetch the same block

obscuren modified the milestone: 0.9.34 Jun 29, 2015

karalabe added 3 commits June 29, 2015 15:18

cmd/geth, metrics: separate process metric collection, add disk

53bdacb

cmd/geth, eth, ethdb: monitor database compactions

1be62c3

cmd, core, eth, metrics, p2p: require enabling metrics

199c44b

karalabe force-pushed the advanced-metrics branch from 5e370e8 to 199c44b Compare June 29, 2015 13:12

cmd/geth: decent error message if metrics are disabled

99879ea

obscuren force-pushed the develop branch from a22fb03 to b0a5be4 Compare June 29, 2015 15:46

obscuren mentioned this pull request Jun 29, 2015

Rebased peter's PR #1360

Merged

obscuren closed this Jun 29, 2015

obscuren removed the review label Jun 29, 2015

obscuren modified the milestone: 0.9.34 Jun 30, 2015

nonsense pushed a commit to nonsense/go-ethereum that referenced this pull request Apr 26, 2019

p2p/protocols, swarm/network: fix resource leak with p2p teardown (et…

1502800

…hereum#1346)
