
metrics: add monitoring support for disk and database counters #1346

Closed
wants to merge 5 commits into from

Conversation

karalabe
Member

This PR is meant to add a few general metrics that should be useful for debugging various issues.

  • Disk IO usage metrics for Linux (Windows needs cgo to access its performance counters, so that should be done by someone working on that platform; OSX needs root-privileged syscalls, so that's an interesting question). Exposed via `geth monitor system/disk`.
  • LevelDB compaction counters measuring the total disk input, output and time spent compacting the databases. Exposed via `geth monitor eth/db/block,state/compact`.
  • Added a --metrics flag that's required to collect anything; otherwise all the metrics calls are no-ops. Because of this, all meter/timer creations need to be done through our own metrics wrapper (i.e. metrics.NewMeter("my/fancy/meter")); a sketch of that wrapper pattern follows below.
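
A minimal sketch of that wrapper, assuming the rcrowley/go-metrics library geth already uses; the Enabled switch and its wiring are illustrative here, not the exact code in this PR:

```go
// Package metrics wraps go-metrics so collection can be switched off
// globally: when Enabled is false every constructor hands back a no-op
// instance, and call sites stay exactly the same.
package metrics

import "github.com/rcrowley/go-metrics"

// Enabled is set to true during startup when the --metrics flag is
// passed on the command line (illustrative name).
var Enabled = false

// NewMeter returns a meter registered under the given name, or a no-op
// meter if metrics collection is disabled.
func NewMeter(name string) metrics.Meter {
	if !Enabled {
		return metrics.NilMeter{}
	}
	return metrics.NewRegisteredMeter(name, metrics.DefaultRegistry)
}

// NewTimer returns a timer registered under the given name, or a no-op
// timer if metrics collection is disabled.
func NewTimer(name string) metrics.Timer {
	if !Enabled {
		return metrics.NilTimer{}
	}
	return metrics.NewRegisteredTimer(name, metrics.DefaultRegistry)
}
```

Call sites then simply do metrics.NewMeter("my/fancy/meter") and never need to know whether collection is switched on.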

@fjl
Contributor

fjl commented Jun 28, 2015

I don't see the point of adding monitoring for system-level stats. There are existing tools for each platform (e.g. iotop, Activity Monitor) that make these visible. Geth monitoring should be restricted to stuff that can only be measured inside of geth.

@karalabe
Member Author

I added the disk IO monitoring before the LevelDB one because I didn't know LevelDB had a way to retrieve it. I still don't see a way to fully retrieve everything from LevelDB (we have the compaction stats from this PR plus some very high-level, pre-compression stats from the GET/PUT requests), so there's still a big information gap.
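
For reference, a rough sketch of how those compaction stats can be read out of goleveldb (assuming the syndtr/goleveldb driver; the database path is illustrative and the wiring into meters is left out):

```go
package main

import (
	"fmt"
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	// Open (or create) a LevelDB database; the path is illustrative.
	db, err := leveldb.OpenFile("/tmp/example-chaindata", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// goleveldb exposes its internal compaction table (per-level sizes,
	// time spent compacting, bytes read and written) as a text property.
	stats, err := db.GetProperty("leveldb.stats")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(stats)
}
```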

I think this stat might still be useful because you can compare it against other metrics on the same chart instead of having to open a separate tool and cross-reference. However, if you guys feel it's redundant, I see no problem in dropping it.

@fjl
Contributor

fjl commented Jun 29, 2015

It depends on the amount of code that needs to be added, I guess.

@tgerring
Contributor

Seeing as the monitoring tool might be indispensable for tracking down performance issues (like our pesky disk I/O situation), I'm happy to see system-level stats as long as they're not too cumbersome to support.

@karalabe
Member Author

Linux provides a huge amount of information in /proc and /proc/<pid>, so it's very easy to access and monitor those. Windows makes this harder, as the information is hidden in performance counters, which aren't that obvious (some are surfaced through the Win32 API, some through registry keys), while OSX (as far as I know) needs system calls, and usually root privileges, for many of them.
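
To illustrate the Linux case, a small sketch that reads the per-process disk counters from /proc (the field names are the ones the kernel exposes in /proc/<pid>/io; the helper itself is illustrative, not this PR's code):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
)

// readDiskIO parses /proc/self/io and returns the bytes actually read
// from and written to the storage layer by the current process.
func readDiskIO() (readBytes, writeBytes int64, err error) {
	f, err := os.Open("/proc/self/io")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Lines look like "read_bytes: 123456".
		parts := strings.SplitN(scanner.Text(), ":", 2)
		if len(parts) != 2 {
			continue
		}
		value, _ := strconv.ParseInt(strings.TrimSpace(parts[1]), 10, 64)
		switch parts[0] {
		case "read_bytes":
			readBytes = value
		case "write_bytes":
			writeBytes = value
		}
	}
	return readBytes, writeBytes, scanner.Err()
}

func main() {
	r, w, err := readDiskIO()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("disk read: %d bytes, disk write: %d bytes\n", r, w)
}
```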

I think this project sums up the complexity pretty well: https://support.hyperic.com, along with the really dire state of the existing ps-utils packages implemented in Go. So, to sum it up, writing cross-platform, system-level monitoring is probably not something we want to do.

My take on the matter is that for the select few metrics geth seems sensitive to (like disk IO), we could consider adding limited monitoring support. Otherwise I concur with @fjl that the "costs" probably outweigh the benefits.

@obscuren
Contributor

While I believe this type of code is useful for debugging, it isn't so useful when running in production (e.g., users running the node). While the processing time is probably only a few ms, or maybe even µs, it does add up as we add more and more of these timers.

This code can make it in as long as there's an option to disable it (and it's disabled by default). For example, the vm package has a Debug variable you can set that enables VM logging.

@karalabe
Member Author

Would it make sense to have fine-grained control over the metrics (i.e. being able to enable/disable individual meters), or just one big on/off switch?

@obscuren
Contributor

Big on/off. We can improve this later if required.

eth/fetcher: don't double filter/fetch the same block
@karalabe
Member Author

K, that should be simple enough.

@obscuren modified the milestone: 0.9.34 Jun 29, 2015
@obscuren mentioned this pull request Jun 29, 2015
@obscuren closed this Jun 29, 2015
@obscuren removed the review label Jun 29, 2015
@obscuren modified the milestone: 0.9.34 Jun 30, 2015
nonsense pushed a commit to nonsense/go-ethereum that referenced this pull request Apr 26, 2019