
Conversation


@d4mr d4mr commented Aug 8, 2025

  • Introduced prometheus and lazy_static dependencies to facilitate metrics collection.
  • Added metrics recording for transaction durations in various components, including EIP-7702 and external bundler.
  • Updated the metrics module to include functions for tracking transaction states and processing times.
  • Enhanced the server to initialize and expose metrics via a dedicated endpoint.

These changes improve observability and performance monitoring across the execution framework.

Summary by CodeRabbit

  • New Features

    • Introduced Prometheus-based metrics collection and a new /metrics endpoint for real-time monitoring.
    • Added a Grafana dashboard for visualizing executor metrics, including transaction durations and throughput.
    • Provided Docker Compose configuration for easy setup of Prometheus and Grafana monitoring stack.
  • Enhancements

    • Integrated detailed timing metrics throughout transaction processing and job execution flows for improved observability.
  • Chores

    • Updated dependencies to support Prometheus integration.
    • Added configuration files for Prometheus and Grafana provisioning.


coderabbitai bot commented Aug 8, 2025

Walkthrough

This change set introduces Prometheus-based metrics instrumentation across executor components, adds a metrics HTTP endpoint to the server, and provides a full monitoring stack with Docker Compose, Prometheus, and Grafana configuration and dashboards. Executor logic is augmented to record transaction and job processing durations, and metrics are exposed via a new /metrics endpoint. Supporting configuration and dashboards are added for observability.

Changes

Cohort / File(s) Change Summary
Monitoring Stack Configuration
docker-compose.monitoring.yml, monitoring/prometheus.yml, monitoring/grafana/provisioning/datasources/prometheus.yml, monitoring/grafana/provisioning/dashboards/dashboard.yml, monitoring/grafana/dashboards/thirdweb-engine.json
Adds Docker Compose setup for Prometheus and Grafana, Prometheus scrape config, Grafana datasource and dashboard provisioning, and a custom dashboard for executor metrics.
Executor Metrics Instrumentation
executors/src/metrics.rs, executors/Cargo.toml, executors/src/lib.rs
Introduces a new Prometheus-based metrics module, adds dependencies, and exposes metrics recording and export functions.
EOA Executor Metrics Integration
executors/src/eoa/store/borrowed.rs, executors/src/eoa/store/submitted.rs, executors/src/eoa/worker/mod.rs, executors/src/eoa/worker/confirm.rs
Adds metric recording for transaction queued-to-sent, queued-to-confirmed, and job processing durations in EOA executor flows. Removes an unused import.
EIP-7702 Executor Metrics Integration
executors/src/eip7702_executor/send.rs, executors/src/eip7702_executor/confirm.rs
Adds metric recording for transaction queued-to-sent and queued-to-confirmed durations, propagates original queued timestamps through job data.
ERC-4337 External Bundler Metrics Integration
executors/src/external_bundler/send.rs, executors/src/external_bundler/confirm.rs
Adds metric recording for user operation queued-to-sent and queued-to-confirmed durations, propagates original queued timestamps.
Server Metrics Endpoint Integration
server/Cargo.toml, server/src/http/routes/admin/metrics.rs, server/src/http/routes/admin/mod.rs, server/src/http/server.rs, server/src/main.rs
Adds Prometheus and error handling dependencies, introduces /metrics HTTP endpoint, updates server state and router to serve metrics, and initializes metrics registry at startup.

Sequence Diagram(s)

Metrics Recording and Exposure Flow

sequenceDiagram
    participant Client
    participant Executor
    participant MetricsModule
    participant Prometheus
    participant Server

    Executor->>MetricsModule: record_transaction_queued_to_sent/confirmed(...)
    Executor->>MetricsModule: record_eoa_job_processing_time(...)
    Prometheus->>Server: HTTP GET /metrics
    Server->>MetricsModule: export_default_metrics()
    MetricsModule-->>Server: Prometheus metrics text
    Server-->>Prometheus: HTTP 200 with metrics
    Prometheus-->>Grafana: Scrapes metrics for dashboard
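For orientation, here is a minimal sketch of the export half of this flow using axum and the prometheus crate's text encoder. It assumes the registry is shared as Arc<prometheus::Registry>; the handler and router names are illustrative and the PR's export_default_metrics wrapper is not reproduced here.

// Sketch only: gather registered metric families and serve them in the
// Prometheus text exposition format. Assumes an Arc<prometheus::Registry>
// shared as axum state; everything else is illustrative.
use std::sync::Arc;

use axum::{extract::State, http::StatusCode, response::IntoResponse, routing::get, Router};
use prometheus::{Encoder, Registry, TextEncoder};

async fn get_metrics(State(registry): State<Arc<Registry>>) -> impl IntoResponse {
    let mut buffer = Vec::new();
    // Encode every metric family registered so far.
    if let Err(err) = TextEncoder::new().encode(&registry.gather(), &mut buffer) {
        return (StatusCode::INTERNAL_SERVER_ERROR, err.to_string()).into_response();
    }
    (
        StatusCode::OK,
        [("content-type", "text/plain; version=0.0.4; charset=utf-8")],
        buffer,
    )
        .into_response()
}

fn metrics_router(registry: Arc<Registry>) -> Router {
    Router::new().route("/metrics", get(get_metrics)).with_state(registry)
}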

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Note

🔌 MCP (Model Context Protocol) integration is now available in Early Access!

Pro users can now connect to remote MCP servers under the Integrations page to get reviews and chat conversations that understand additional development context.

Warning

Review ran into problems

🔥 Problems

Errors were encountered while retrieving linked issues.

Errors (1)
  • EIP-7702: Entity not found: Issue - Could not find referenced Issue.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

d4mr added 2 commits August 8, 2025 15:22
- Added a new helper function `calculate_duration_seconds_from_twmq` to compute duration in seconds using timestamps in TWMQ format.
- Updated various components (EIP-7702 executor, external bundler, and EOA worker) to utilize the new TWMQ duration calculation for recording transaction metrics.
- Enhanced test coverage for the new duration calculation function.

These changes improve the accuracy of metrics tracking related to transaction durations.
@d4mr d4mr merged commit 9788924 into main Aug 8, 2025
2 checks passed
@d4mr d4mr deleted the pb/executor-metrics branch August 8, 2025 10:02
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 17

🔭 Outside diff range comments (1)
executors/src/eoa/store/submitted.rs (1)

307-310: Potential bug: passing tuple to ZREM

pipeline.zrem(key, member) is typically called with the member string, but here to_legacy_redis_string_with_nonce() returns (String, u64) and is passed without destructuring. This likely fails or removes the wrong member.

Apply:

-            pipeline.zrem(
-                self.keys.submitted_transactions_zset_name(),
-                tx.to_legacy_redis_string_with_nonce(),
-            );
+            let (legacy_member, _legacy_nonce) = tx.to_legacy_redis_string_with_nonce();
+            pipeline.zrem(
+                self.keys.submitted_transactions_zset_name(),
+                legacy_member,
+            );
🧹 Nitpick comments (11)
monitoring/grafana/provisioning/dashboards/dashboard.yml (1)

6-6: Optional: Put dashboards under a named Grafana folder

Improves organization in Grafana UI.

-    folder: ''
+    folder: 'Thirdweb Engine'
monitoring/prometheus.yml (1)

16-18: Trim trailing whitespace.

YAML-lint reports trailing spaces at Line 18. Cleaning this avoids noisy CI lint failures.

executors/src/eip7702_executor/confirm.rs (1)

269-275: LGTM on conditional metric recording with presence check

The optional timestamp guard is correct. After fixing units as above, this block is good.

Minor: consider reusing a single confirmed_timestamp for logging/metrics if you log confirmation timing elsewhere to avoid tiny drift across calls.

docker-compose.monitoring.yml (1)

10-10: Mount configs read-only to reduce tampering risk

Config mounts can be read-only for safety.

-      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
+      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
-      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
-      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
+      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards:ro
+      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning:ro

Also applies to: 28-29

executors/src/eip7702_executor/send.rs (1)

285-287: Clarify units in local variable name

sent_timestamp is in milliseconds; rename to sent_timestamp_ms to avoid unit confusion when reviewing/maintaining.

-    let sent_timestamp = current_timestamp_ms();
-    let queued_to_sent_duration = calculate_duration_seconds_from_twmq(job.job.created_at, sent_timestamp);
+    let sent_timestamp_ms = current_timestamp_ms();
+    let queued_to_sent_duration =
+        calculate_duration_seconds_from_twmq(job.job.created_at, sent_timestamp_ms);
executors/src/external_bundler/send.rs (2)

508-510: Clarify units in local variable name

Align with ms-based helper to avoid confusion.

-    let sent_timestamp = current_timestamp_ms();
-    let queued_to_sent_duration = calculate_duration_seconds_from_twmq(job.job.created_at, sent_timestamp);
+    let sent_timestamp_ms = current_timestamp_ms();
+    let queued_to_sent_duration =
+        calculate_duration_seconds_from_twmq(job.job.created_at, sent_timestamp_ms);

510-510: Consider consolidating executor type labels

"erc4337-external" is a sensible label to distinguish paths, but it's easy to drift. Consider centralizing executor type label strings (e.g., associated consts on handlers or a shared constants module) to keep labels consistent across modules and dashboards.

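One possible shape for the shared label constants suggested above, as a sketch (constant names, values, and the call shown are illustrative rather than the crate's current API):

// Sketch: a single source of truth for executor-type label values, so the
// send/confirm paths and the Grafana dashboards cannot drift apart.
pub mod executor_labels {
    pub const EOA: &str = "eoa";
    pub const EIP7702: &str = "eip7702";
    pub const ERC4337_EXTERNAL: &str = "erc4337-external";
}

// Call sites then reference the constant instead of repeating a string
// literal, e.g. (argument order illustrative):
// record_transaction_queued_to_sent(executor_labels::ERC4337_EXTERNAL, chain_id, queued_to_sent_duration);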
server/src/http/routes/admin/metrics.rs (1)

24-29: Add Cache-Control to avoid intermediary caching of metrics

Metrics should generally not be cached by proxies.

-    Ok((
-        StatusCode::OK,
-        [("content-type", "text/plain; version=0.0.4; charset=utf-8")],
-        metrics_output,
-    ).into_response())
+    Ok((
+        StatusCode::OK,
+        [
+            ("content-type", "text/plain; version=0.0.4; charset=utf-8"),
+            ("cache-control", "no-cache, no-store, must-revalidate"),
+        ],
+        metrics_output,
+    ).into_response())
monitoring/grafana/dashboards/thirdweb-engine.json (1)

326-331: Optional: stabilize throughput rates by aggregating

If you want one time series per executor_type and chain_id (as the legend implies), explicitly sum by those labels. It avoids duplicate overlapping series per ephemeral label combinations.

-          "expr": "rate(tw_engine_executor_transaction_queued_to_sent_duration_seconds_count[5m])",
+          "expr": "sum by (executor_type, chain_id) (rate(tw_engine_executor_transaction_queued_to_sent_duration_seconds_count[5m]))",
...
-          "expr": "rate(tw_engine_executor_transaction_queued_to_confirmed_duration_seconds_count[5m])",
+          "expr": "sum by (executor_type, chain_id) (rate(tw_engine_executor_transaction_queued_to_confirmed_duration_seconds_count[5m]))",

Also applies to: 338-344

executors/src/metrics.rs (2)

64-70: Make initialize_metrics idempotent to avoid accidental double init

Double initialization can cause split metrics across registries or confusion. Consider ignoring subsequent inits with a warning.

 pub fn initialize_metrics(metrics: ExecutorMetrics) {
     let mut instance = EXECUTOR_METRICS_INSTANCE.write().unwrap();
-    *instance = Some(Arc::new(metrics));
+    if instance.is_some() {
+        // Avoid re-initialization; keep the first registry/instance
+        #[allow(unused_must_use)]
+        { tracing::warn!("Executor metrics already initialized; ignoring subsequent initialization"); }
+        return;
+    }
+    *instance = Some(Arc::new(metrics));
 }

Note: this uses tracing; if not available here, you can replace with eprintln! or remove the log.


161-198: Tests mutate global state; can flake under parallel test execution

initialize_metrics sets a global used by other tests. Consider serializing these tests or resetting the global between tests to avoid interference. Using serial_test or running these tests in a single-threaded group will avoid flakiness.
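For example, with serial_test added as a dev-dependency (a sketch; the test bodies are placeholders):

// Sketch: serialize the tests that touch the global metrics instance so they
// cannot interleave under cargo's default parallel test runner.
#[cfg(test)]
mod metrics_tests {
    use serial_test::serial;

    #[test]
    #[serial]
    fn records_queued_to_sent_duration() {
        // initialize_metrics(...) and assert against the registry here.
    }

    #[test]
    #[serial]
    fn records_job_processing_time() {
        // Runs after the test above, never concurrently with it.
    }
}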

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cabfbb9 and 2df0e38.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (21)
  • docker-compose.monitoring.yml (1 hunks)
  • executors/Cargo.toml (1 hunks)
  • executors/src/eip7702_executor/confirm.rs (3 hunks)
  • executors/src/eip7702_executor/send.rs (3 hunks)
  • executors/src/eoa/store/borrowed.rs (2 hunks)
  • executors/src/eoa/store/submitted.rs (2 hunks)
  • executors/src/eoa/worker/confirm.rs (0 hunks)
  • executors/src/eoa/worker/mod.rs (2 hunks)
  • executors/src/external_bundler/confirm.rs (3 hunks)
  • executors/src/external_bundler/send.rs (3 hunks)
  • executors/src/lib.rs (1 hunks)
  • executors/src/metrics.rs (1 hunks)
  • monitoring/grafana/dashboards/thirdweb-engine.json (1 hunks)
  • monitoring/grafana/provisioning/dashboards/dashboard.yml (1 hunks)
  • monitoring/grafana/provisioning/datasources/prometheus.yml (1 hunks)
  • monitoring/prometheus.yml (1 hunks)
  • server/Cargo.toml (1 hunks)
  • server/src/http/routes/admin/metrics.rs (1 hunks)
  • server/src/http/routes/admin/mod.rs (1 hunks)
  • server/src/http/server.rs (2 hunks)
  • server/src/main.rs (3 hunks)
💤 Files with no reviewable changes (1)
  • executors/src/eoa/worker/confirm.rs
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-06T15:44:13.701Z
Learnt from: d4mr
PR: thirdweb-dev/engine-core#5
File: executors/src/eoa/worker.rs:173-176
Timestamp: 2025-07-06T15:44:13.701Z
Learning: The EOA executor store uses comprehensive WATCH-based coordination where every Redis state mutation watches the lock key and validates ownership before proceeding. If the lock is lost during any operation, the transaction fails with LockLost error. This makes aggressive lock acquisition safe because only the actual lock holder can successfully perform state mutations, regardless of who claims the lock.

Applied to files:

  • executors/src/eoa/worker/mod.rs
🧬 Code Graph Analysis (7)
executors/src/eoa/worker/mod.rs (1)
executors/src/metrics.rs (4)
  • calculate_duration_seconds (120-122)
  • calculate_duration_seconds_from_twmq (125-128)
  • current_timestamp_ms (131-133)
  • record_eoa_job_processing_time (112-117)
executors/src/external_bundler/confirm.rs (1)
executors/src/metrics.rs (3)
  • record_transaction_queued_to_confirmed (104-109)
  • current_timestamp_ms (131-133)
  • calculate_duration_seconds_from_twmq (125-128)
executors/src/eoa/store/borrowed.rs (1)
executors/src/metrics.rs (3)
  • record_transaction_queued_to_sent (97-102)
  • current_timestamp_ms (131-133)
  • calculate_duration_seconds (120-122)
executors/src/eip7702_executor/send.rs (1)
executors/src/metrics.rs (3)
  • record_transaction_queued_to_sent (97-102)
  • current_timestamp_ms (131-133)
  • calculate_duration_seconds_from_twmq (125-128)
executors/src/eip7702_executor/confirm.rs (1)
executors/src/metrics.rs (3)
  • record_transaction_queued_to_confirmed (104-109)
  • current_timestamp_ms (131-133)
  • calculate_duration_seconds_from_twmq (125-128)
executors/src/external_bundler/send.rs (1)
executors/src/metrics.rs (3)
  • record_transaction_queued_to_sent (97-102)
  • current_timestamp_ms (131-133)
  • calculate_duration_seconds_from_twmq (125-128)
executors/src/eoa/store/submitted.rs (1)
executors/src/metrics.rs (3)
  • record_transaction_queued_to_confirmed (104-109)
  • current_timestamp_ms (131-133)
  • calculate_duration_seconds (120-122)
🪛 YAMLlint (1.37.1)
monitoring/prometheus.yml

[error] 18-18: trailing spaces

(trailing-spaces)

🔇 Additional comments (19)
executors/src/lib.rs (1)

4-4: LGTM on adding metrics module

Public exposure aligns with the PR’s goal of cross-crate instrumentation. Keep as-is if the server depends on these types.

server/Cargo.toml (1)

46-46: LGTM on adding engine-eip7702-core

Keeps server aligned with executor capabilities.

server/src/http/routes/admin/mod.rs (1)

1-3: LGTM on adding the metrics admin module

Module wiring looks correct.

monitoring/grafana/provisioning/dashboards/dashboard.yml (2)

1-12: LGTM on provisioning setup

Provider configuration and polling interval look sane.


11-12: Writable Dashboards Mount Verified

We’ve confirmed that in docker-compose.monitoring.yml the dashboards directory is bind-mounted without a :ro flag, so /var/lib/grafana/dashboards is writable by default. If you prefer explicitness, you can append :rw:

services:
  grafana:
    volumes:
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards:rw

Optional improvement: to keep things organized, consider grouping your dashboard JSON files into subfolders under monitoring/grafana/dashboards (e.g. dashboards/app1/, dashboards/app2/) and update the options.path in monitoring/grafana/provisioning/dashboards/dashboard.yml accordingly.

monitoring/grafana/provisioning/datasources/prometheus.yml (1)

1-9: No issues found – looks good.

executors/src/eoa/worker/mod.rs (1)

26-26: Potentially unused import.

calculate_duration_seconds_from_twmq is brought into scope but not referenced in this module; if none of the child sub-modules use it via super::, the build will warn. Please verify.

server/src/main.rs (1)

102-111: Metrics bootstrap LGTM.

Registry creation and global initialization are straightforward and safe.

executors/src/eoa/store/borrowed.rs (1)

15-15: LGTM on metrics import

Import set matches usage. Keep executor type label stable (“eoa”) across the codebase.

If not already done, verify label consistency across executors to avoid fragmented panels. See script in a later comment.

executors/src/eoa/store/submitted.rs (1)

19-19: LGTM on metrics import

Import aligns with usage and metrics module API.

executors/src/external_bundler/confirm.rs (2)

218-224: Verify executor label naming consistency

Label used is "erc4337-external". Ensure this aligns with other ERC-4337 instrumentation (e.g., "erc4337" vs "erc4337-external") to avoid split dashboards and cardinality bloat.

I can run a repo-wide check for label values to confirm consistency (see script below).


18-18: LGTM on metrics import

Imports map to the used helpers and keep the handler lean.

executors/src/eip7702_executor/confirm.rs (1)

21-21: LGTM on metrics import

Imports align with usage; executor label “eip7702” looks consistent with the module.

executors/src/eip7702_executor/send.rs (2)

284-288: Good addition: queued→sent duration recorded using consistent helpers

The placement after successful bundler send is correct, using TWMQ seconds start and ms end via the provided helpers.


336-336: Propagating original queued timestamp looks right

Passing original_queued_timestamp from TWMQ (seconds) enables queued→confirmed duration downstream. Ensure the confirmation handler treats it as seconds, consistent with calculate_duration_seconds_from_twmq.

executors/src/external_bundler/send.rs (2)

507-511: Good instrumentation: queued→sent duration for external bundler

Correct usage of helpers and consistent labeling. This will populate tw_engine_executor_transaction_queued_to_sent_duration_seconds for "erc4337-external".


546-547: Nice: carry original queued timestamp to confirm stage

This keeps queued→confirmed metrics accurate across the pipeline. Confirm the confirm handler interprets this as seconds (TWMQ) and not ms.

server/src/http/routes/admin/metrics.rs (2)

24-29: Correct Prometheus exposition response

TextEncoder + proper content-type is correct for the exposition format. Looks good.


40-51: Error handling is fine; logs are useful

IntoResponse maps to 500 with clear message; good operational behavior.
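For reference, this is roughly what such a thiserror + axum mapping looks like; the MetricsError name and variants below are hypothetical stand-ins, and only the 500-with-message behavior mirrors the handler described here.

// Sketch: map an encoding failure to a 500 response and log it.
// The error type and variants are hypothetical.
use axum::{
    http::StatusCode,
    response::{IntoResponse, Response},
};

#[derive(Debug, thiserror::Error)]
pub enum MetricsError {
    #[error("failed to encode metrics: {0}")]
    Encode(#[from] prometheus::Error),
}

impl IntoResponse for MetricsError {
    fn into_response(self) -> Response {
        tracing::error!(error = %self, "metrics export failed");
        (StatusCode::INTERNAL_SERVER_ERROR, self.to_string()).into_response()
    }
}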


services:
  prometheus:
    image: prom/prometheus:latest

🛠️ Refactor suggestion

Pin images to specific versions instead of latest

Using :latest hurts reproducibility and can break environments unexpectedly.

Pin to known-good tags and document update cadence, e.g.:

-    image: prom/prometheus:latest
+    image: prom/prometheus:v2.53.0
-    image: grafana/grafana:latest
+    image: grafana/grafana:10.4.3

Also applies to: 22-22

🤖 Prompt for AI Agents
In docker-compose.monitoring.yml at lines 5 and 22, the image tags use
`:latest`, which reduces reproducibility and can cause unexpected breaks.
Replace `:latest` with specific, known-good version tags for the prom/prometheus
image and any other images using `:latest`. Additionally, add comments to
document the update cadence for these images to maintain clarity on when and how
to update them.

Comment on lines +21 to +34
  grafana:
    image: grafana/grafana:latest
    container_name: tw-engine-grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped

🛠️ Refactor suggestion

⚠️ Potential issue

Do not ship default Grafana admin credentials

Hardcoding GF_SECURITY_ADMIN_USER=admin and GF_SECURITY_ADMIN_PASSWORD=admin is unsafe even for demos (accidental exposure is common).

  • Move credentials to an .env file (gitignored) or use secrets.
  • Alternatively, require interactive setup or use GF_SECURITY_ADMIN_PASSWORD__FILE.

Example:

-    environment:
-      - GF_SECURITY_ADMIN_USER=admin
-      - GF_SECURITY_ADMIN_PASSWORD=admin
-      - GF_USERS_ALLOW_SIGN_UP=false
+    env_file:
+      - .env.monitoring
+    environment:
+      - GF_USERS_ALLOW_SIGN_UP=false

.env.monitoring (not committed):

GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=strong-unique-password
🤖 Prompt for AI Agents
In docker-compose.monitoring.yml around lines 21 to 34, the Grafana service uses
hardcoded default admin credentials which is insecure. Remove the hardcoded
GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD environment variables and
instead load these values from an external .env file or use Docker secrets.
Update the environment section to reference these variables without exposing
them in the compose file, ensuring the .env file is gitignored to prevent
accidental credential leaks.

Comment on lines +28 to +29
prometheus = "0.13.4"
lazy_static = "1.5.0"

💡 Verification agent

🧩 Analysis chain

Unify Prometheus across the workspace to avoid type duplication and registry fragmentation

Pin Prometheus at the workspace level and consume it via { workspace = true } in member crates. This prevents multiple versions in the graph and ensures metrics registered in executors are visible to the server.

Apply this diff in this Cargo.toml (requires adding the deps to [workspace.dependencies] in the root):

-prometheus = "0.13.4"
-lazy_static = "1.5.0"
+prometheus = { workspace = true }
+lazy_static = { workspace = true }

Then, in the root Cargo.toml:

[workspace.dependencies]
prometheus = "0.13.4"
lazy_static = "1.5.0"

Verify all crates use the same Prometheus version:


🏁 Script executed:

#!/bin/bash
# List all prometheus and lazy_static specs to ensure single sourced versions
rg -n --glob '**/Cargo.toml' $'prometheus\\s*=\\s*' -A 1
rg -n --glob '**/Cargo.toml' $'lazy_static\\s*=\\s*' -A 1

Length of output: 338


Workspace-level dependency pinning and lazy_static modernization

We’ve verified that both executors and server currently depend on prometheus = "0.13.4", so there’s no drift today. To guard against future mismatches and ensure a single Prometheus registry, pin it at the workspace level:

– In your root Cargo.toml:

[workspace.dependencies]
prometheus = "0.13.4"

– In each member Cargo.toml:

-[dependencies]
-prometheus = "0.13.4"
+prometheus = { workspace = true }

Since lazy_static only appears in executors, workspace-pinning it is optional. However, std::sync::LazyLock is now stable in the standard library, so consider migrating to it or the lighter once_cell::sync::Lazy:

-use lazy_static::lazy_static;
-
-lazy_static! {
-    static ref METRIC: Metric = build_metric();
-}
+use std::sync::LazyLock; // or once_cell::sync::Lazy
+
+static METRIC: LazyLock<Metric> = LazyLock::new(build_metric);

Next steps:

  • Add prometheus to [workspace.dependencies] in the root.
  • Update each crate’s prometheus entry to { workspace = true }.
  • Replace lazy_static usages with std::sync::LazyLock or once_cell::sync::Lazy.
🤖 Prompt for AI Agents
In executors/Cargo.toml around lines 28 to 29, you need to remove the direct
version pinning of prometheus and instead reference it from the workspace by
setting prometheus = { workspace = true }. Then, in the root Cargo.toml, add
prometheus = "0.13.4" under [workspace.dependencies] to centralize the version.
For lazy_static, replace its usage with std::sync::LazyLock or
once_cell::sync::Lazy to modernize the code for Rust 2021, and remove
lazy_static from dependencies if no longer used.

futures = "0.3.31"
moka = { version = "0.12.10", features = ["future"] }
prometheus = "0.13.4"
lazy_static = "1.5.0"

💡 Verification agent

🧩 Analysis chain

Prefer std::sync::LazyLock or once_cell over lazy_static

std::sync::LazyLock has been stable in the standard library since Rust 1.80, so you can avoid the macro dependency. Use std::sync::LazyLock (or OnceLock) or the once_cell crate for statics.

If you switch, remove the dependency:

-lazy_static = { workspace = true }

And update metrics statics, e.g.:

use std::sync::LazyLock;

pub static HISTO: LazyLock<prometheus::Histogram> = LazyLock::new(|| {
    prometheus::Histogram::with_opts(
        prometheus::HistogramOpts::new("tx_duration_seconds", "…")
    ).unwrap()
});

Find current usages that would need refactoring:


🏁 Script executed:

#!/bin/bash
rg -n $'lazy_static!\\(' -A 2

Length of output: 29


To catch the usual brace‐style invocations, let’s re‐run the grep with a more appropriate pattern:


🏁 Script executed:

#!/bin/bash
# Find macro invocations of lazy_static (brace style)
rg -n 'lazy_static!' -A 2

# Find any explicit extern crate imports
rg -n '^extern crate lazy_static' -A 2

Length of output: 336


Replace lazy_static with std::sync::LazyLock for your metrics registry

We detected one lazy_static! invocation in executors/src/metrics.rs around line 50. To drop the lazy_static macro in favor of the standard library:

• In executors/Cargo.toml, remove the lazy_static dependency:

--- a/executors/Cargo.toml
@@ -29,7 +29,6 @@
-lazy_static = "1.5.0"

• In executors/src/metrics.rs, replace the macro block with a LazyLock static:

--- a/executors/src/metrics.rs
@@ -49,7 +49,9 @@
-lazy_static! {
-    /// Default metrics registry for executors (fallback if no external registry provided)
-    static ref DEFAULT_EXECUTOR_METRICS_REGISTRY: Registry = Registry::new();
-}
+use std::sync::LazyLock;
+
+/// Default metrics registry for executors (fallback if no external registry provided)
+static DEFAULT_EXECUTOR_METRICS_REGISTRY: LazyLock<Registry> =
+    LazyLock::new(|| Registry::new());

Adjust the static visibility (pub) if it was previously public. This removes the extra macro dependency and leverages the standard library’s lazy initializer.

🤖 Prompt for AI Agents
In executors/Cargo.toml at line 29, remove the lazy_static dependency entirely.
Then in executors/src/metrics.rs around line 50, replace the lazy_static! macro
usage with a static variable using std::sync::LazyLock for lazy initialization.
Adjust the static's visibility to match the previous one if it was public. This
eliminates the need for the lazy_static crate and uses the Rust 2024 standard
library feature instead.

Comment on lines +41 to 44
/// Original timestamp when the transaction was first queued (unix timestamp in milliseconds)
#[serde(default)]
pub original_queued_timestamp: Option<u64>,
}

🛠️ Refactor suggestion

⚠️ Potential issue

Timestamp unit mismatch between docstring and usage

Doc says milliseconds, but calculate_duration_seconds_from_twmq expects seconds for the start timestamp.

Pick one and align:

  • If the field is milliseconds, switch to calculate_duration_seconds in the metrics block.
  • If the field is seconds, correct the docstring to “…seconds”.
    Include a unit test to prevent future regressions.
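A sketch of such a test, assuming calculate_duration_seconds_from_twmq takes (start in TWMQ seconds, end in milliseconds) and calculate_duration_seconds takes milliseconds for both ends, each returning fractional seconds; adjust to the real signatures before committing.

// Sketch of a regression test pinning the timestamp units; the helpers'
// signatures and return types are assumptions based on how they are used
// in this PR.
#[cfg(test)]
mod duration_unit_tests {
    use crate::metrics::{calculate_duration_seconds, calculate_duration_seconds_from_twmq};

    #[test]
    fn twmq_start_is_seconds_and_end_is_milliseconds() {
        // 1_000 s -> 1_005_000 ms should be a 5 second span.
        let d = calculate_duration_seconds_from_twmq(1_000, 1_005_000);
        assert!((d - 5.0).abs() < 1e-9);
    }

    #[test]
    fn plain_helper_takes_milliseconds_for_both_ends() {
        // 1_000_000 ms -> 1_005_000 ms should also be a 5 second span.
        let d = calculate_duration_seconds(1_000_000, 1_005_000);
        assert!((d - 5.0).abs() < 1e-9);
    }
}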
🤖 Prompt for AI Agents
In executors/src/eip7702_executor/confirm.rs around lines 41 to 44, the
docstring states the timestamp is in milliseconds but the function
calculate_duration_seconds_from_twmq expects seconds, causing a unit mismatch.
Decide whether to keep the field as milliseconds or seconds; if milliseconds,
update the metrics code to use calculate_duration_seconds instead of
calculate_duration_seconds_from_twmq; if seconds, correct the docstring to
reflect seconds. Additionally, add a unit test verifying the timestamp unit and
duration calculation to prevent future regressions.

Comment on lines +218 to +223
"expr": "histogram_quantile(0.95, rate(tw_engine_executor_transaction_queued_to_confirmed_duration_seconds_bucket[5m]))",
"instant": false,
"legendFormat": "p95 Queued to Confirmed - {{executor_type}}",
"range": true,
"refId": "A"
},

⚠️ Potential issue

Same issue for queued→confirmed

Aggregate by le (and executor_type) before histogram_quantile.

-          "expr": "histogram_quantile(0.95, rate(tw_engine_executor_transaction_queued_to_confirmed_duration_seconds_bucket[5m]))",
+          "expr": "histogram_quantile(0.95, sum by (le, executor_type) (rate(tw_engine_executor_transaction_queued_to_confirmed_duration_seconds_bucket[5m])))",
...
-          "expr": "histogram_quantile(0.50, rate(tw_engine_executor_transaction_queued_to_confirmed_duration_seconds_bucket[5m]))",
+          "expr": "histogram_quantile(0.50, sum by (le, executor_type) (rate(tw_engine_executor_transaction_queued_to_confirmed_duration_seconds_bucket[5m])))",

Also applies to: 230-236

🤖 Prompt for AI Agents
In monitoring/grafana/dashboards/thirdweb-engine.json around lines 218 to 223
and 230 to 236, the Prometheus query for histogram_quantile does not aggregate
by the 'le' label and 'executor_type' before applying histogram_quantile. Modify
the query to include a sum by 'le' and 'executor_type' over the rate of the
histogram buckets before calling histogram_quantile, ensuring correct
aggregation for accurate quantile calculation.

Comment on lines +434 to +439
"expr": "histogram_quantile(0.95, rate(tw_engine_eoa_executor_job_processing_duration_seconds_bucket[5m]))",
"instant": false,
"legendFormat": "p95 EOA Job Processing - Chain {{chain_id}}",
"range": true,
"refId": "A"
},

⚠️ Potential issue

EOA job processing histograms: add aggregation by le and chain_id

Same histogram rule applies.

-          "expr": "histogram_quantile(0.95, rate(tw_engine_eoa_executor_job_processing_duration_seconds_bucket[5m]))",
+          "expr": "histogram_quantile(0.95, sum by (le, chain_id) (rate(tw_engine_eoa_executor_job_processing_duration_seconds_bucket[5m])))",
...
-          "expr": "histogram_quantile(0.50, rate(tw_engine_eoa_executor_job_processing_duration_seconds_bucket[5m]))",
+          "expr": "histogram_quantile(0.50, sum by (le, chain_id) (rate(tw_engine_eoa_executor_job_processing_duration_seconds_bucket[5m])))",

Also applies to: 446-452

🤖 Prompt for AI Agents
In monitoring/grafana/dashboards/thirdweb-engine.json around lines 434 to 439
and 446 to 452, the histogram_quantile expression lacks aggregation by le and
chain_id labels. Modify the Prometheus query to include sum by (le, chain_id)
over the rate of the histogram buckets to correctly aggregate the histogram data
per chain_id and bucket boundary. This ensures accurate percentile calculation
grouped by chain_id.

Comment on lines +47 to +48
prometheus = "0.13.4"
thiserror = "2.0.12"

💡 Verification agent

🧩 Analysis chain

Single-source prometheus and thiserror via workspace to avoid version drift

Mirror the executors crate and consume from [workspace.dependencies] for consistency.

-prometheus = "0.13.4"
-thiserror = "2.0.12"
+prometheus = { workspace = true }
+thiserror = { workspace = true }

Add to root Cargo.toml if not present:

[workspace.dependencies]
prometheus = "0.13.4"
thiserror = "2.0.12"

Confirm there’s only one version in the tree and that thiserror is used:


🏁 Script executed:

#!/bin/bash
rg -n --glob '**/Cargo.toml' $'prometheus\\s*=\\s*' -A 1
rg -n --glob '**/Cargo.toml' $'thiserror\\s*=\\s*' -A 1

# Check usage of thiserror
rg -n $'derive\\(.*Error.*\\)' server -A 2
rg -n $'thiserror::Error' server -A 2

Length of output: 1434


Workspace-pin prometheus and thiserror to avoid version drift

Move the version declarations into your root Cargo.toml under [workspace.dependencies] and switch all per-crate entries to workspace = true so there’s a single source of truth.

  1. Add to root Cargo.toml (if not already present):

    [workspace.dependencies]
    prometheus = "0.13.4"
    thiserror   = "2.0.12"
  2. In each crate’s Cargo.toml, replace the hard-pinned versions with workspace references:

    • executors/Cargo.toml:

      - prometheus = "0.13.4"
      + prometheus = { workspace = true }
      - thiserror = "2.0.12"
      + thiserror   = { workspace = true }
    • server/Cargo.toml:

      - prometheus = "0.13.4"
      + prometheus = { workspace = true }
      - thiserror = "2.0.12"
      + thiserror   = { workspace = true }
    • eip7702-core/Cargo.toml (line 14):

      - thiserror = "2.0"
      + thiserror   = { workspace = true }
    • thirdweb-core/Cargo.toml (line 20):

      - thiserror = "2.0.12"
      + thiserror   = { workspace = true }
    • twmq/Cargo.toml (line 13):

      - thiserror = "2.0.12"
      + thiserror   = { workspace = true }
    • core/Cargo.toml (line 12):

      - thiserror = "2.0.12"
      + thiserror   = { workspace = true }
    • aa-types/Cargo.toml (line 10):

      - thiserror = "2.0.12"
      + thiserror   = { workspace = true }
  3. Confirmed usage of thiserror in server/src/http/routes/admin/metrics.rs.

This ensures a single version of both crates across your entire workspace.

🤖 Prompt for AI Agents
In server/Cargo.toml at lines 47-48, prometheus and thiserror versions are
hardcoded, causing potential version drift. Move these version declarations to
the root Cargo.toml under [workspace.dependencies] with prometheus = "0.13.4"
and thiserror = "2.0.12". Then, in each crate's Cargo.toml (including
server/Cargo.toml), replace the version numbers with workspace = true to
reference the workspace versions, ensuring a single source of truth for these
dependencies across the workspace.

pub mod queue;
pub mod eoa_diagnostics;
pub mod metrics;

💡 Verification agent

🧩 Analysis chain

Gate the metrics admin module behind a feature flag if exposure must be optional

This keeps /metrics opt-in for environments where it shouldn’t be exposed.

-pub mod metrics;
+#[cfg(feature = "metrics")]
+pub mod metrics;

Ensure the router adds the route conditionally as well:


🏁 Script executed:

#!/bin/bash
rg -n $'/metrics' -A 3 server/src

Length of output: 367


Add cfg(feature = "metrics") to the metrics admin module and its route registration

Make the /metrics endpoint fully opt-in by guarding both the module and the router calls behind the metrics feature flag.

Locations to update:

  • server/src/http/routes/admin/mod.rs – gate the metrics module declaration
  • server/src/http/server.rs – conditionally register the /metrics route and state

Proposed diffs:

--- a/server/src/http/routes/admin/mod.rs
+++ b/server/src/http/routes/admin/mod.rs
@@ Lines 1-3
- pub mod metrics;
+ #[cfg(feature = "metrics")]
+ pub mod metrics;
--- a/server/src/http/server.rs
+++ b/server/src/http/server.rs
@@ Around line 82
-    .route("/metrics", get(crate::http::routes::admin::metrics::get_metrics))
-    .with_state(state.metrics_registry.clone());
+    #[cfg(feature = "metrics")]
+    .route("/metrics", get(crate::http::routes::admin::metrics::get_metrics))
+    .with_state(state.metrics_registry.clone());

With these changes, environments without --features metrics won’t expose or compile the /metrics endpoint. LGTM once gated appropriately.
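One caveat on the server.rs diff above: a #[cfg] attribute cannot be attached to a single call inside a method chain, but it can gate whole statements. A sketch reusing the names from the snippet in this comment:

// Sketch: gate whole statements rather than a single chained call. Without
// `--features metrics`, neither the metrics router nor the merge is compiled.
#[cfg(feature = "metrics")]
let metrics_router = Router::new()
    .route("/metrics", get(crate::http::routes::admin::metrics::get_metrics))
    .with_state(state.metrics_registry.clone());

let router = router.merge(eoa_diagnostics_router);

#[cfg(feature = "metrics")]
let router = router.merge(metrics_router);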

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In server/src/http/routes/admin/mod.rs at line 2, the metrics module is declared
without a feature gate. Add a #[cfg(feature = "metrics")] attribute above the
pub mod metrics; line to conditionally compile this module only when the
"metrics" feature is enabled. Also, in server/src/http/server.rs, wrap the
registration of the /metrics route and any related state initialization with the
same #[cfg(feature = "metrics")] attribute to ensure the endpoint is only
included when the feature is enabled.

Comment on lines +81 to +92
let metrics_router = Router::new()
    .route("/metrics", get(crate::http::routes::admin::metrics::get_metrics))
    .with_state(state.metrics_registry.clone());

let (router, api) = OpenApiRouter::with_openapi(ApiDoc::openapi())
    .nest("/v1", v1_router)
    .split_for_parts();

// Merge the hidden diagnostic routes after OpenAPI split
let router = router.merge(eoa_diagnostics_router);
let router = router
    .merge(eoa_diagnostics_router)
    .merge(metrics_router);

⚠️ Potential issue

Compile-time break: router state mismatch.

metrics_router is defined with state Arc<prometheus::Registry> but is merged into a router whose state is EngineServerState.
axum::Router::merge requires both routers to share the same state type, so this will fail to compile.

Unify the state, e.g. expose the registry via EngineServerState and keep all routers on that type:

-        let metrics_router = Router::new()
-            .route("/metrics", get(crate::http::routes::admin::metrics::get_metrics))
-            .with_state(state.metrics_registry.clone());
+        let metrics_router = Router::new()
+            .route("/metrics", get(crate::http::routes::admin::metrics::get_metrics))
+            .with_state(state.clone()); // handler should extract registry from state

Update get_metrics to accept State<EngineServerState> and read state.metrics_registry.
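A sketch of that handler change; EngineServerState is stubbed here so the example stands alone, the real struct (and its other fields) comes from this PR, and error handling is simplified for illustration.

// Sketch: extract the shared server state and read the registry from it.
use std::sync::Arc;

use axum::{extract::State, http::StatusCode};
use prometheus::{Encoder, Registry, TextEncoder};

#[derive(Clone)]
struct EngineServerState {
    metrics_registry: Arc<Registry>,
    // ...the rest of the server state...
}

async fn get_metrics(
    State(state): State<EngineServerState>,
) -> Result<String, (StatusCode, String)> {
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&state.metrics_registry.gather(), &mut buf)
        .map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))?;
    String::from_utf8(buf).map_err(|e| (StatusCode::INTERNAL_SERVER_ERROR, e.to_string()))
}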

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let metrics_router = Router::new()
.route("/metrics", get(crate::http::routes::admin::metrics::get_metrics))
.with_state(state.metrics_registry.clone());
let (router, api) = OpenApiRouter::with_openapi(ApiDoc::openapi())
.nest("/v1", v1_router)
.split_for_parts();
// Merge the hidden diagnostic routes after OpenAPI split
let router = router.merge(eoa_diagnostics_router);
let router = router
.merge(eoa_diagnostics_router)
.merge(metrics_router);
let metrics_router = Router::new()
.route("/metrics", get(crate::http::routes::admin::metrics::get_metrics))
.with_state(state.clone()); // handler should extract registry from state
let (router, api) = OpenApiRouter::with_openapi(ApiDoc::openapi())
.nest("/v1", v1_router)
.split_for_parts();
// Merge the hidden diagnostic routes after OpenAPI split
let router = router
.merge(eoa_diagnostics_router)
.merge(metrics_router);
🤖 Prompt for AI Agents
In server/src/http/server.rs around lines 81 to 92, the merge of metrics_router
fails because metrics_router uses state Arc<prometheus::Registry> while the main
router uses EngineServerState. To fix this, unify the state by modifying
metrics_router to use EngineServerState as its state type. Update the
get_metrics handler to accept State<EngineServerState> and access the
metrics_registry from that state. This ensures all routers share the same state
type and compiles correctly.
