fix: add future/fdb metrics #2377

MasterPtato

Changes

MasterPtato commented Apr 24, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

cloudflare-workers-and-pages bot commented Apr 24, 2025

Deploying rivet with Cloudflare Pages

Latest commit: 2ffbbd6
Status: ✅  Deploy successful!
Preview URL: https://1790567b.rivet.pages.dev
Branch Preview URL: https://04-23-fix-add-future-fdb-met.rivet.pages.dev

@MasterPtato MasterPtato force-pushed the 04-23-fix_add_future_fdb_metrics branch 2 times, most recently from 9c38e32 to 19edb7d Compare April 24, 2025 22:42
@MasterPtato MasterPtato marked this pull request as ready for review April 24, 2025 22:42

@greptile-apps greptile-apps bot left a comment

PR Summary

This PR adds comprehensive metrics and tracing capabilities across the Rivet codebase, focusing on FoundationDB operations and future durations.

  • Added new Grafana dashboard futures.json for monitoring instrumented future durations with heatmap visualization
  • Introduced CustomInstrumentExt trait in future.rs for tracking future durations with Prometheus metrics, though it relies on unsafe code that may pose safety issues
  • Added process-exporter installation script and configuration in /packages/core/services/cluster/src/workflows/server/install/install_scripts/, but it lacks security measures such as checksum verification
  • Replaced foundationdb::tuple::Subspace with fdb_util::Subspace across multiple files for better metrics tracking and consistency
  • Added custom instrumentation spans to various FDB transactions throughout the codebase for improved tracing and monitoring

28 file(s) reviewed, 10 comment(s)

Comment on lines +71 to +73
"filterValues": {
  "le": 1e-9
},

logic: filterValues.le is set to 1e-9, an extremely small value that may filter out relevant data points. Consider adjusting or removing this filter.

Comment on lines +85 to +91
"yAxis": {
  "axisPlacement": "left",
  "max": "60",
  "min": 0,
  "reverse": false,
  "unit": "s"
}

style: The yAxis max value is hardcoded to 60s, which may not suit all future durations. Consider making it dynamic or configurable via variables.

Comment on lines +86 to +88
res = &mut gc_handle => {
tracing::error!(?res, "metrics task unexpectedly stopped");
break;

logic: The error message incorrectly states 'metrics task unexpectedly stopped' when it is actually the GC task that stopped.

Suggested change
- res = &mut gc_handle => {
- tracing::error!(?res, "metrics task unexpectedly stopped");
- break;
+ res = &mut gc_handle => {
+ tracing::error!(?res, "gc task unexpectedly stopped");
+ break;

Comment on lines +48 to +51
fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
let this = unsafe { self.get_unchecked_mut() };
let inner = unsafe { Pin::new_unchecked(&mut this.inner) };

logic: This unsafe code can be replaced with safe alternatives, provided the wrapper and its inner future are Unpin. Use Pin::get_mut() and Pin::new() instead of get_unchecked_mut and new_unchecked.

Suggested change
- fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
- let this = unsafe { self.get_unchecked_mut() };
- let inner = unsafe { Pin::new_unchecked(&mut this.inner) };
+ fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
+ // Compiles only if Self: Unpin and the inner future is Unpin.
+ let this = self.get_mut();
+ let inner = Pin::new(&mut this.inner);
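
A minimal std-only sketch of one way to drop the unsafe blocks, assuming it is acceptable to box the inner future. `Timed`, `NoopWake`, and `block_on` are hypothetical names for illustration and are not the PR's types; the real wrapper records Prometheus metrics, whereas this sketch only returns the elapsed time.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Hypothetical wrapper (not the PR's type) that times an inner future.
/// Boxing the inner future keeps the wrapper `Unpin`, so `poll` needs no `unsafe`.
pub struct Timed<F> {
    inner: Pin<Box<F>>,
    started: Option<Instant>,
}

impl<F: Future> Timed<F> {
    pub fn new(inner: F) -> Self {
        Timed { inner: Box::pin(inner), started: None }
    }
}

impl<F: Future> Future for Timed<F> {
    type Output = (F::Output, Duration);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Safe: `Timed<F>` is `Unpin` because `Pin<Box<F>>` is always `Unpin`.
        let this = self.get_mut();
        // Start the clock on the first poll only.
        let started = *this.started.get_or_insert_with(Instant::now);
        match this.inner.as_mut().poll(cx) {
            Poll::Ready(out) => Poll::Ready((out, started.elapsed())),
            Poll::Pending => Poll::Pending,
        }
    }
}

/// Waker that does nothing; enough to drive futures that never park.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future in a loop until it completes (busy-waits; demo only).
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Calling `block_on(Timed::new(async { 41 + 1 }))` yields the inner value together with the measured duration. Boxing costs one allocation per wrapped future; a pin-projection crate such as pin-project is the usual alternative when that allocation matters.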

let this = unsafe { self.get_unchecked_mut() };
let inner = unsafe { Pin::new_unchecked(&mut this.inner) };

let metadata = inner.span().metadata().clone();

style: Cloning metadata on every poll is inefficient. Consider storing the formatted location string in the struct during construction.
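
A rough sketch of the suggestion, assuming the label can be derived once when the wrapper is built. `Labeled` and its `file` and `line` parameters are hypothetical stand-ins for the span metadata used in the PR; the point is only that `poll` borrows a precomputed string rather than rebuilding it.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Hypothetical instrumented wrapper; `file` and `line` stand in for span metadata.
pub struct Labeled<F> {
    inner: Pin<Box<F>>,
    /// Formatted once in `new`, then only borrowed afterwards.
    location: String,
}

impl<F: Future> Labeled<F> {
    pub fn new(inner: F, file: &str, line: u32) -> Self {
        Labeled {
            inner: Box::pin(inner),
            location: format!("{}:{}", file, line),
        }
    }

    /// Returns the cached label without allocating.
    pub fn location(&self) -> &str {
        &self.location
    }
}

impl<F: Future> Future for Labeled<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut();
        // A real implementation would record a metric here, labeled with
        // `this.location`; nothing is cloned or formatted on this path.
        this.inner.as_mut().poll(cx)
    }
}
```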

User=process-exporter
Group=process-exporter
Type=simple
ExecStart=/usr/bin/process-exporter --config.path /etc/process-exporter/config.yaml

style: The unit is missing security-related systemd directives such as ProtectSystem=strict and NoNewPrivileges=true.

logic: The entire file has been deleted. This file contains critical routing logic for actors and should not be removed. Please restore this file and add the metrics changes separately.

@@ -44,6 +44,7 @@ pub async fn prewarm_image(
Ok(None)
}
})
.custom_instrument(tracing::info_span!("prewarm_fetch_tx"))

logic: The tracing span should be placed before the transaction starts (line 25), rather than after it completes, so that it captures the full transaction duration.

Suggested change
- .custom_instrument(tracing::info_span!("prewarm_fetch_tx"))
+ .custom_instrument(tracing::info_span!("prewarm_fetch_tx"))
+ .run(|tx, _mc| async move {
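
Whether the placement matters depends on how custom_instrument is implemented, which this excerpt does not show in full. If it wraps the future returned by run, then because Rust futures are lazy the wrapper is entered on the first poll, before any of the transaction body executes. The std-only sketch below illustrates that ordering; `Traced`, `run`, and `demo` are hypothetical stand-ins, not the PR's code or the FoundationDB API.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type Log = Rc<RefCell<Vec<&'static str>>>;

/// Hypothetical wrapper standing in for an instrumenting combinator.
struct Traced<F> {
    inner: Pin<Box<F>>,
    log: Log,
    entered: bool,
}

impl<F: Future> Future for Traced<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.get_mut();
        if !this.entered {
            this.entered = true;
            this.log.borrow_mut().push("span entered");
        }
        let out = this.inner.as_mut().poll(cx);
        if out.is_ready() {
            this.log.borrow_mut().push("span exited");
        }
        out
    }
}

/// Stand-in for a transaction `run` call: returns a future, does no work yet.
fn run(log: Log) -> impl Future<Output = u32> {
    async move {
        log.borrow_mut().push("transaction body");
        7
    }
}

/// Waker that does nothing; the demo future never parks.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Builds `run(..)` first, wraps it afterwards, then drives it to completion.
pub fn demo() -> Vec<&'static str> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    let fut = run(log.clone()); // nothing is logged yet: the future is lazy
    let mut traced = Traced { inner: Box::pin(fut), log: log.clone(), entered: false };
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while Pin::new(&mut traced).poll(&mut cx).is_pending() {}
    let events = log.borrow().clone();
    events
}
```

Under these assumptions `demo()` records "span entered", then "transaction body", then "span exited", even though the wrapper is applied after `run` is called. Any work that `run` does eagerly before returning its future would fall outside the span.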

@@ -110,6 +110,7 @@ pub async fn pegboard_actor_get(ctx: &OperationCtx, input: &Input) -> GlobalResu
.try_collect::<Vec<_>>()
.await
})
.custom_instrument(tracing::info_span!("actor_list_wf_tx"))

style: Consider moving the custom_instrument call before the run() to capture the entire FDB operation including setup

@@ -161,6 +161,7 @@ pub async fn update_fdb(ctx: &ActivityCtx, input: &UpdateFdbInput) -> GlobalResu
.await
}
})
.custom_instrument(tracing::info_span!("actor_destroy_tx"))

logic: The span should be added before line 131 where the transaction begins, not after the transaction completes. This ensures the entire transaction is properly instrumented.

@MasterPtato MasterPtato force-pushed the 04-23-fix_add_future_fdb_metrics branch from 19edb7d to 2ffbbd6 Compare April 29, 2025 22:25

cloudflare-workers-and-pages bot commented Apr 29, 2025

Deploying rivet-studio with Cloudflare Pages

Latest commit: 2ffbbd6
Status: ✅  Deploy successful!
Preview URL: https://306c3dae.rivet-studio.pages.dev
Branch Preview URL: https://04-23-fix-add-future-fdb-met.rivet-studio.pages.dev

Deploying rivet-hub with Cloudflare Pages

Latest commit: 2ffbbd6
Status: 🚫 Build failed.

@MasterPtato MasterPtato changed the base branch from main to graphite-base/2377 May 1, 2025 20:12
@MasterPtato MasterPtato force-pushed the 04-23-fix_add_future_fdb_metrics branch from 2ffbbd6 to 8cd9d0a Compare May 1, 2025 20:12
@MasterPtato MasterPtato changed the base branch from graphite-base/2377 to 04-30-fix_allow_custom_project_for_status_monitor May 1, 2025 20:12

graphite-app bot commented May 2, 2025

Merge activity

  • May 1, 9:57 PM EDT: MasterPtato added this pull request to the Graphite merge queue.
  • May 1, 9:58 PM EDT: CI is running for this pull request on a draft pull request (#2422) due to your merge queue CI optimization settings.
  • May 1, 9:59 PM EDT: Merged by the Graphite merge queue via draft PR: #2422.

graphite-app bot pushed a commit that referenced this pull request May 2, 2025
## Changes
@graphite-app graphite-app bot closed this May 2, 2025
@graphite-app graphite-app bot deleted the 04-23-fix_add_future_fdb_metrics branch May 2, 2025 01:59