-
Notifications
You must be signed in to change notification settings - Fork 20.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: add monitoring support for disk and database counters #1346
Conversation
I don't see the point of adding monitoring for system-level stats. There are existing tools for each platform (e.g. iotop, Activity Monitor) that make these visible. Geth monitoring should be restricted to stuff that can only be measured inside of geth. |
I added the disk io monitoring before the leveldb one because I didn't know leveldb had a way to retrieve it. I still don't really see a way to fully retrieve everything from leveldb (we have the compaction stats from this PR + a very high level - pre-compression - stats from the GET/PUT requests), but there's still a big information gap. The reason I think this stat might still be useful is because you can compare it to other metrics on the same chart vs. having to open up a different tool and cross reference. However, if you guys feel it's redundant, I see no problem in dropping it. |
It depends on the amount of code that needs to be added, I guess. |
Seeing as the monitoring tool might be indispensable for tracking down performance issues (like our pesky disk I/O situation), I'm happy to see system-level stats as long as it's not too cumbersome to support. |
Usually Linux provides a huge set of infos in /proc and /proc/< pid > so it's very easy to access and monitor those. Windows makes this harder as the infos are hidden in performance counters, which aren't that obvious (some are surfaced through WIN API, some through registry keys), while OSX (as far as I know) needs system calls and usually root privileges to many. I think this project kind of sums up the complexity: https://support.hyperic.com along with the really really dire state of any ps-utils packages implemented in Go. So, to sum it up, writing cross platform system level monitoring is probably not something we want to do. My take on the matter is that for a select few metrics that geth seems sensitive to (like disk io), we could consider adding a limited monitoring support. Otherwise I concur with @fjl, that the "costs" can/probably outweigh the benefits. |
While I believe this type of code is useful for debugging it isn't so useful when running it in production mode (e.g., users running the node). While the time to process is probably only a few ms or maybe even us it does add up as add more and more of these timers. This code can make it in as long as there's an option to disable it (and disables it by default). For example the |
Would it make sense to have fine grained control over the metrics (i.e. being able to enable/disable meters specifically), or just one big on/off switch? |
Big on/off. We can improve this later if required. |
eth/fetcher: don't double filter/fetch the same block
K, that should be simple enough. |
5e370e8
to
199c44b
Compare
This PR is meant to add a few general metrics that should be useful for debugging various issues.
geth monitor system/disk
geth monitor eth/db/block,state/compact
--metrics
flag that's required to collect anything, otherwise all the metrics calls are nops. Also because of this all meter/timer creations need to be done through our own metrics wrapper (i.e.metrics.NewMeter("my/fancy/meter")
.