Add metrics to capture beacon node and validator db size #5087
Conversation
3dda18a to 6263e5e
```diff
@@ -20,6 +20,9 @@ export type LevelDbControllerModules = {

 const BUCKET_ID_UNKNOWN = "unknown";
+
+/** Time between capturing metric for db size, every few minutes is sufficient */
+const DB_SIZE_METRIC_INTERVAL_MS = 5 * 60 * 1000;
```
Every 5 minutes seems to be sufficient as we don't really need real-time data for this; it is just an approximation, after all.
`approximateSize` takes around 0.05 - 0.5ms and it is executed in a different thread, so we could definitely also increase this interval if necessary.
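For illustration, the sampling loop discussed here could be sketched as follows. This is a hypothetical helper, not the PR's actual implementation; `collect` stands in for the db-size measurement:

```typescript
// Hypothetical sketch of a periodic db-size sampler; names are illustrative,
// not the PR's actual code.
const DB_SIZE_METRIC_INTERVAL_MS = 5 * 60 * 1000; // sample every 5 minutes

function startDbSizeMetricLoop(collect: () => Promise<void>): ReturnType<typeof setInterval> {
  // Capture once at startup so the metric is populated immediately,
  // then refresh on a fixed interval.
  void collect();
  return setInterval(() => void collect(), DB_SIZE_METRIC_INTERVAL_MS);
}
```

Since the measurement itself is cheap and off-thread, lengthening the interval only changes the resolution of the metric, not the node's performance.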
```typescript
const minKey = Buffer.from([0x00]);
const maxKey = Buffer.from([0xff]);

this.approximateSize(minKey, maxKey)
```
I did some tests back in the day and this was both extremely expensive and inaccurate. Please test in depth before considering merging.
Accuracy looks pretty good.

In terms of performance, I am just testing this on a Goerli node at the moment and the DB is not that big. On the 2GB db, `approximateSize` takes around 0.3ms. This will probably go up a bit if the DB gets bigger, but we only execute this once at startup and then every 5 minutes.

I think it makes sense to test this on a mainnet archive node, is there any way we could do this?
> I think it makes sense to test this on a mainnet archive node, is there any way we could do this?

Yes sure, provision a node and sync from genesis.
Did some testing on a mainnet node with a 30GB db, `approximateSize` still executes mostly under 1ms:

```
mainnet-consensus-1 | approximateSize: 0.182ms
mainnet-consensus-1 | approximateSize: 0.596ms
mainnet-consensus-1 | approximateSize: 0.592ms
mainnet-consensus-1 | approximateSize: 0.211ms
mainnet-consensus-1 | approximateSize: 0.923ms
mainnet-consensus-1 | approximateSize: 1.36ms
mainnet-consensus-1 | approximateSize: 1.185ms
mainnet-consensus-1 | approximateSize: 0.961ms
mainnet-consensus-1 | approximateSize: 0.213ms
```
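Timings like the ones above can be reproduced with a small harness. This is a generic sketch; the `fn` argument stands in for the real `approximateSize` call, which is not included here:

```typescript
import {performance} from "node:perf_hooks";

// Generic async timing helper; `fn` stands in for the real db call,
// e.g. wrapping this.approximateSize(minKey, maxKey).
async function timeCall<T>(fn: () => Promise<T>): Promise<{result: T; ms: number}> {
  const start = performance.now();
  const result = await fn();
  return {result, ms: performance.now() - start};
}

// Usage: time a trivial async operation and log in the same format as above.
void (async () => {
  const {ms} = await timeCall(async () => 42);
  console.log(`approximateSize: ${ms.toFixed(3)}ms`);
})();
```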
See above.

@nflaig from PR description:

> Can you check how they collect that metric specifically?
Prysm

Prysm uses the stat function from the Go os package in getCurrentDbBytes; this seems to be executed as a collect function, i.e. every time metrics are scraped:

```go
func (bc *bcnodeCollector) getCurrentDbBytes() (float64, error) {
	fs, err := os.Stat(bc.dbPath)
	if err != nil {
		return 0, fmt.Errorf("could not collect database file size for prometheus, path=%s, err=%s", bc.dbPath, err)
	}
	return float64(fs.Size()), nil
}
```

Lighthouse

Lighthouse also uses native fs to get the size (size_of_dir); looks like they also execute this every time metrics are scraped:

```rust
/// Get the approximate size of a directory and its contents.
///
/// Will skip unreadable files, and files. Not 100% accurate if files are being created and deleted
/// while this function is running.
pub fn size_of_dir(path: &Path) -> u64 {
    if let Ok(iter) = fs::read_dir(path) {
        iter.filter_map(std::result::Result::ok)
            .map(size_of_dir_entry)
            .sum()
    } else {
        0
    }
}

fn size_of_dir_entry(dir: fs::DirEntry) -> u64 {
    dir.metadata().map(|m| m.len()).unwrap_or(0)
}
```

They both seem to collect on every scrape but as mentioned above I think this is not necessary at all, as a data point every 5 minutes is enough. I would rather reduce the interval further if we want more "real-time" stats about the size.

Lodestar alternative

Based on how Prysm and Lighthouse measure the size of the db we could follow a similar approach:

```typescript
/** Get approximate size of db folder */
async function getDbFolderSize(dbPath: string): Promise<number> {
  const files = await fs.readdir(dbPath);
  const statsPromises = files.map((file) => {
    const filePath = path.join(dbPath, file);
    return fs.stat(filePath);
  });
  const stats = await Promise.all(statsPromises);
  return stats.reduce((totalSize, stat) => totalSize + stat.size, 0);
}
```

The result of

Lodestar alternative 2

Use the get-folder-size package, but this seems to just use

PS:
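The `fs.stat`-based approach sketched above can be exercised end-to-end. Here is a self-contained variant that re-declares the helper and measures a temp directory containing one file of known size; note that the top-level stat is not recursive, so nested directories only contribute their directory-entry size:

```typescript
import fs from "node:fs/promises";
import path from "node:path";
import os from "node:os";

/** Approximate size of a db folder: sum of the sizes of its top-level entries. */
async function getDbFolderSize(dbPath: string): Promise<number> {
  const files = await fs.readdir(dbPath);
  const stats = await Promise.all(files.map((file) => fs.stat(path.join(dbPath, file))));
  return stats.reduce((totalSize, stat) => totalSize + stat.size, 0);
}

void (async () => {
  // Create a throwaway dir with a single 1024-byte file to sanity-check the helper.
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), "dbsize-"));
  await fs.writeFile(path.join(dir, "data.ldb"), Buffer.alloc(1024));
  console.log(await getDbFolderSize(dir)); // 1024
  await fs.rm(dir, {recursive: true});
})();
```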
LGTM
Initial reports of performance look good, all metrics are in place to monitor the situation.
🎉 This PR is included in v1.6.0 🎉
Motivation
At the moment, as an operator you need to look at the OS level to find out how big the beacon node and validator databases are. This is not ideal, and there is no metric to see the db growth over time. The db size is also a value we need for the client monitoring implementation (#5037). Other CLs such as Lighthouse and Prysm also expose this as a metric.
Description
- Adds support for async collect functions to support calling `db.approximateSize`; prom-client itself also supports this, see `CollectFunction` and `get()`
- Adds `dbSizeTotal` to validator and lodestar metrics
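The async-collect pattern described in the description can be illustrated with a minimal, self-contained registry. This is a sketch of the pattern only — not prom-client's or Lodestar's actual API — and the `approximateSize` stub stands in for the real db call:

```typescript
// Minimal sketch of an async-collect metrics registry (illustrative only).
type CollectFn = () => void | Promise<void>;

class SketchRegistry {
  private collectFns: CollectFn[] = [];
  private values = new Map<string, number>();

  registerCollect(fn: CollectFn): void {
    this.collectFns.push(fn);
  }

  setGauge(name: string, value: number): void {
    this.values.set(name, value);
  }

  /** Awaits every collect callback before rendering, so async sources like db.approximateSize work. */
  async scrape(): Promise<string> {
    await Promise.all(this.collectFns.map((fn) => fn()));
    return Array.from(this.values.entries())
      .map(([name, v]) => `${name} ${v}`)
      .join("\n");
  }
}

void (async () => {
  const register = new SketchRegistry();
  // Stand-in for `await db.approximateSize(minKey, maxKey)`.
  const approximateSize = async (): Promise<number> => 2048;
  register.registerCollect(async () => register.setGauge("dbSizeTotal", await approximateSize()));
  console.log(await register.scrape()); // dbSizeTotal 2048
})();
```

The key point is that the scrape awaits the collect callbacks before serializing, which is what makes an asynchronous size measurement usable as a gauge source.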