-
Notifications
You must be signed in to change notification settings - Fork 86
fix: add future/fdb metrics #2377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: add future/fdb metrics #2377
Conversation
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
How to use the Graphite Merge QueueAdd the label merge-queue to this PR to add it to the merge queue. You must have a Graphite account in order to use the merge queue. Sign up using this link. An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue. This stack of pull requests is managed by Graphite. Learn more about stacking. |
Deploying rivet with
|
Latest commit: |
2ffbbd6
|
Status: | ✅ Deploy successful! |
Preview URL: | https://1790567b.rivet.pages.dev |
Branch Preview URL: | https://04-23-fix-add-future-fdb-met.rivet.pages.dev |
9c38e32
to
19edb7d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
This PR adds comprehensive metrics and tracing capabilities across the Rivet codebase, focusing on FoundationDB operations and future durations.
- Added new Grafana dashboard
futures.json
for monitoring instrumented future durations with heatmap visualization - Introduced
CustomInstrumentExt
trait infuture.rs
for tracking future durations with Prometheus metrics, though has potential safety issues with unsafe code - Added process-exporter installation script and configuration in
/packages/core/services/cluster/src/workflows/server/install/install_scripts/
but lacks security measures like checksum verification - Replaced
foundationdb::tuple::Subspace
withfdb_util::Subspace
across multiple files for better metrics tracking and consistency - Added custom instrumentation spans to various FDB transactions throughout the codebase for improved tracing and monitoring
28 file(s) reviewed, 10 comment(s)
Edit PR Review Bot Settings | Greptile
"filterValues": { | ||
"le": 1e-9 | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: filterValues.le is set to 1e-9 which is an extremely small value that may filter out relevant data points. Consider adjusting or removing this filter.
"yAxis": { | ||
"axisPlacement": "left", | ||
"max": "60", | ||
"min": 0, | ||
"reverse": false, | ||
"unit": "s" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: yAxis max value is hardcoded to 60s which may not be suitable for all future durations. Consider making this dynamic or configurable via variables.
res = &mut gc_handle => { | ||
tracing::error!(?res, "metrics task unexpectedly stopped"); | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Error message incorrectly states 'metrics task unexpectedly stopped' when it's actually the GC task that stopped
res = &mut gc_handle => { | |
tracing::error!(?res, "metrics task unexpectedly stopped"); | |
break; | |
res = &mut gc_handle => { | |
tracing::error!(?res, "gc task unexpectedly stopped"); | |
break; |
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> { | ||
let this = unsafe { self.get_unchecked_mut() }; | ||
let inner = unsafe { Pin::new_unchecked(&mut this.inner) }; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Unsafe code can be replaced with safe alternatives. Use Pin::as_mut().get_mut()
and Pin::as_mut()
instead of get_unchecked_mut
and new_unchecked
.
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> { | |
let this = unsafe { self.get_unchecked_mut() }; | |
let inner = unsafe { Pin::new_unchecked(&mut this.inner) }; | |
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> { | |
let this = self.as_mut().get_mut(); | |
let inner = Pin::new(&mut this.inner); |
let this = unsafe { self.get_unchecked_mut() }; | ||
let inner = unsafe { Pin::new_unchecked(&mut this.inner) }; | ||
|
||
let metadata = inner.span().metadata().clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Cloning metadata on every poll is inefficient. Consider storing the formatted location string in the struct during construction.
User=process-exporter | ||
Group=process-exporter | ||
Type=simple | ||
ExecStart=/usr/bin/process-exporter --config.path /etc/process-exporter/config.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Missing security-related systemd directives like ProtectSystem=strict and NoNewPrivileges=true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The entire file has been deleted. This file contains critical routing logic for actors and should not be removed. Please restore this file and add the metrics changes separately.
@@ -44,6 +44,7 @@ pub async fn prewarm_image( | |||
Ok(None) | |||
} | |||
}) | |||
.custom_instrument(tracing::info_span!("prewarm_fetch_tx")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The tracing span should be placed before the transaction starts (line 25) rather than after it completes to properly capture the full transaction duration
.custom_instrument(tracing::info_span!("prewarm_fetch_tx")) | |
.custom_instrument(tracing::info_span!("prewarm_fetch_tx")) | |
.run(|tx, _mc| async move { |
@@ -110,6 +110,7 @@ pub async fn pegboard_actor_get(ctx: &OperationCtx, input: &Input) -> GlobalResu | |||
.try_collect::<Vec<_>>() | |||
.await | |||
}) | |||
.custom_instrument(tracing::info_span!("actor_list_wf_tx")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Consider moving the custom_instrument call before the run() to capture the entire FDB operation including setup
@@ -161,6 +161,7 @@ pub async fn update_fdb(ctx: &ActivityCtx, input: &UpdateFdbInput) -> GlobalResu | |||
.await | |||
} | |||
}) | |||
.custom_instrument(tracing::info_span!("actor_destroy_tx")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: The span should be added before line 131 where the transaction begins, not after the transaction completes. This ensures the entire transaction is properly instrumented.
19edb7d
to
2ffbbd6
Compare
Deploying rivet-studio with
|
Latest commit: |
2ffbbd6
|
Status: | ✅ Deploy successful! |
Preview URL: | https://306c3dae.rivet-studio.pages.dev |
Branch Preview URL: | https://04-23-fix-add-future-fdb-met.rivet-studio.pages.dev |
2ffbbd6
to
8cd9d0a
Compare
Merge activity
|
<!-- Please make sure there is an issue that this PR is correlated to. --> ## Changes <!-- If there are frontend changes, please include screenshots. -->
Changes