rpc-alt: metrics #20728

amnn · 2024-12-24T16:09:55Z

Description

Add metrics tracking to the RPC service (tracking latencies as well as success/failure/total counts for requests, broken down by method, and DB queries). This required some additional/surrounding changes:

Upgrading to the latest version of jsonrpsee (just for sui-indexer-alt-jsonrpc), which has better support for adding middleware that is aware of a JSON-RPC request's structure. This is used to add a middleware to track request latencies.
Factoring out the MetricsService from the indexing framework, to be used in the reader as well. This was mostly motivated by the fact that in a test cluster or local network, we would need to serve metrics for both the reader and the indexer from the same endpoint -- so we will want to run one instance of this service and register metrics for both the reader and the indexer (this change also simplifies factoring out the ingestion service because we can now factor out its metrics).
Similarly, factoring out the database stats collector, so we can collect stats about the connection pool on both the read and the write side.
Now that the indexer does not control the lifecycle of its metrics service, the graceful shutdown logic has been tweaked: Previously, the indexer would trigger a cancellation as part of it shutdown, which might cause other unrelated services to wind down as well. Now, the indexer is responsible for just its own shutdown and we move the responsibility to coordinate shutdowns between unrelated services to the code that starts the services together. This is achieved using "child" cancellation tokens, which will trigger if their parents trigger, but not vice versa.

The changes to introduce RPC discovery were also coupled with this change -- the rpc.discover endpoint was added using the existing sui-open-rpc crate -- the coupling happened because it was useful for the metrics service to know which methods were supported to avoid polluting metrics with unrecognised methods.

Test plan

Unit tests were added for rpc discovery, graceful shutdown and metrics:

sui$ cargo nextest run -p sui-indexer-alt-jsonrpc

Stack

indexer-alt: configure consistency retention in one place #20889

Release notes

Check each box that your changes affect. If none of the boxes relate to your changes, release notes aren't required.

For each box you select, include information after the relevant heading that describes the impact of your changes that a user might notice and any actions they must take to implement updates.

vercel · 2024-12-24T16:09:59Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
sui-docs	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Jan 15, 2025 6:43pm

3 Skipped Deployments

Name	Status	Preview	Updated (UTC)
multisig-toolkit	⬜️ Ignored (Inspect)	Visit Preview	Jan 15, 2025 6:43pm
sui-kiosk	⬜️ Ignored (Inspect)	Visit Preview	Jan 15, 2025 6:43pm
sui-typescript-docs	⬜️ Ignored (Inspect)		Jan 15, 2025 6:43pm

lxfind · 2024-12-24T18:08:10Z

Curious why do we have to move the metrics to a separate crate? It's still part of the framework, right?

amnn · 2024-12-24T23:11:45Z

Curious why do we have to move the metrics to a separate crate? It's still part of the framework, right?

In my mind the indexing framework is about writing to the database, so it didn't make sense that the read side would depend on the indexing framework, especially not to use its metrics service, which is a shared component to both (put another way, I wouldn't think to look inside the indexing framework for a metrics service).

This PR does not get us there, but I think ideally (when prepared for external folks to use) the indexing framework would just deal with the prometheus Registry type (and even then, gated behind a feature), and exposing those metrics would be left as an implementation detail. As it stands, I think the current metrics service is quite specific to how we report metrics from our services (I was surprised that we didn't already have a shared crate for this already).

bmwill · 2024-12-26T14:32:13Z

@amnn would it be possible to update jsonrpsee for the whole project vs using a separate version just for the indexer? I understand there may be additional work to do that but it would be nice if we limited the duplicate deps that we bring into the repository, as well as any metric improvements you're making here may also be beneficial to the existing jsonrpc on-node service.

bmwill · 2024-12-26T14:33:38Z

Also fwiw, as a part of the new gRPC service i'm trying to take a ground up approach to how we capture metrics for http services to be able to handle some of the edge cases that we've been missing. may be worth syncing here to see if there's some overlap we could share here

amnn · 2024-12-26T19:27:10Z

Re: metrics, that would be great -- I would love to have a plug in metrics service that we could use in any of the services we build. That should generally raise the bar for our observability and while I originally put the metrics service in the framework my feeling is now that this should be a separate concern. Let's discuss more in the new year.

Re: upgrading jsonrpsee everywhere -- I did anticipate this question and went back and forth on it. Ultimately for me it was a backlog item for three reasons:

This new crate is primarily a binary crate and I'm actively trying to avoid it taking dependencies on the old JSONRPC stack (to make the latter easier to delete) so even with the two jsonrpsee versions in play, it's unlikely that they will show up in the same build, slowing it down (maybe I've miscalculated here though -- if we do end up having to take a dependency on both versions, then it makes sense to bump the priority of unifying versions -- this may end up happening because I don't plan on reimplementing all the base serde types).
I don't believe the newer versions of jsonrpsee make it possible to do anything we don't already do (e.g. in the case of metrics) they just allow us to get the same result with less boilerplate, so if we're not planning on building on top of the existing RPC implementation much, I don't think there's a big payout from upgrading, but the switching cost will be high. It requires understanding and untangling all that boilerplate and checking for backwards compatibility with not great test coverage in some cases.
The on node impl has to deal with some complexities that we don't need to deal with in the standalone RPC (serving multiple HTTP API routes, conditionally serving routes based on what indices are available, various traffic control things). I don't have a good picture of whether after factoring all those things we would gain anything from the upgrade or whether we would still end up having to roll our own versions of everything.

For me it's really a toss up between there coming a time when the benefits outweigh the costs to upgrading jsonrpsee for our existing crates and there coming a time where we can just delete the on-node JSONRPC. Even though deletion is realistically a while away, I'm trying hard not to get bogged down on these additional complexities, because it removes advantages of starting from a fresh page.

emmazzz · 2025-01-07T22:36:02Z

crates/sui-indexer-alt-jsonrpc/src/main.rs

+    let registry = Registry::new_custom(Some("jsonrpc_alt".into()), None)
+        .context("Failed to create Prometheus registry.")?;
+
+    let metrics = MetricsService::new(metrics_args, registry, cancel.child_token());


The PR description mentioned that MetricsService is factored out so that in a cluster metrics can be served from the same endpoint for both reader and writer. Does that mean, when we stand up this reader and writer we will want to have only one instance of MetricsService and have both reader and writer use metrics.registry(), like this?

let cancel = CancellationToken::new(); let registry = Registry::new_custom(Some("reader_and_writer".into()), None) .context("Failed to create Prometheus registry.")?; let metrics = MetricsService::new(metrics_args, registry, cancel.child_token()); let h_rpc = start_rpc(db_args, rpc_args, metrics.registry(), cancel.child_token()).await?; let h_indexer = start_indexer(metrics.registry(), cancel.child_token(), other_params).awaits?; let h_metrics = metrics.run().await?; let _ = h_indexer.await; let _ = h_rpc.await; cancel.cancel(); let _ = h_metrics.await;

Yep, that's right!

emmazzz · 2025-01-08T20:49:06Z

crates/sui-indexer-alt-jsonrpc/src/api/rpc_module.rs

+// SPDX-License-Identifier: Apache-2.0
+
+/// A JSONRPC module implementation coupled with a description of its schema.
+pub trait RpcModule: Sized {


Looks this is a wrapper of jsonrpsee's RpcModule, with the only difference being the added schema method. And right now this schema method is used to add a module's schema to RpcService's schema. How is this schema going to be used? For generating documentation and/or showing users the available schema?

This trait does two things:

It creates a standard interface for going from a T to an RpcModule<T>. jsonrpsee generates an into_rpc function, but it's not abstracted behind a single, common interface, which we need for the reader to have an interface where you can register any module.

The added schema, as you mentioned. This is used to implement the rpc.discover method.

emmazzz · 2025-01-08T21:17:14Z

crates/sui-indexer-alt-jsonrpc/src/metrics/mod.rs

+impl RpcMetrics {
+    pub(crate) fn new(registry: &Registry) -> Arc<Self> {
+        Arc::new(Self {
+            db_latency: register_histogram_with_registry!(


I wonder if it's possible/makes sense to have these db metrics labeled by methods as well.

I would be hesitant to go down this route because imposing that constraint may make the implementation less efficient -- think of things like the DataLoader pattern where we group together multiple disparate method requests into one DB request and shoot that off to the DB.

In cases where you can associate a DB request with a specific method, you should get a pretty good picture of the cost specific to that method using the request_latency metric. This can fall apart when the method implementation is sufficiently complicated that it's making multiple DB requests in sequence. The incidence of this should be rare for JSON-RPC, but we might have some examples (e.g. fetch ObjectIDs and then fetch their contents), but those are also the cases where we might employ things like the DataLoader pattern.

amnn · 2025-01-13T14:09:20Z

crates/sui-indexer-alt-jsonrpc/src/api/rpc_module.rs

+// SPDX-License-Identifier: Apache-2.0
+
+/// A JSONRPC module implementation coupled with a description of its schema.
+pub trait RpcModule: Sized {


This trait does two things:

It creates a standard interface for going from a T to an RpcModule<T>. jsonrpsee generates an into_rpc function, but it's not abstracted behind a single, common interface, which we need for the reader to have an interface where you can register any module.

The added schema, as you mentioned. This is used to implement the rpc.discover method.

amnn · 2025-01-13T14:10:48Z

crates/sui-indexer-alt-jsonrpc/src/lib.rs

+        // Add a method to serve the schema to clients.
+        modules
+            .register_method("rpc.discover", move |_, _, _| json!(schema.clone()))
+            .context("Failed to add schema discovery method")?;


@emmazzz, here we are adding the rpc.discover method that uses the results from calling RpcModule::schema.

amnn · 2025-01-13T14:18:51Z

crates/sui-indexer-alt-jsonrpc/src/metrics/mod.rs

+impl RpcMetrics {
+    pub(crate) fn new(registry: &Registry) -> Arc<Self> {
+        Arc::new(Self {
+            db_latency: register_histogram_with_registry!(


I would be hesitant to go down this route because imposing that constraint may make the implementation less efficient -- think of things like the DataLoader pattern where we group together multiple disparate method requests into one DB request and shoot that off to the DB.

In cases where you can associate a DB request with a specific method, you should get a pretty good picture of the cost specific to that method using the request_latency metric. This can fall apart when the method implementation is sufficiently complicated that it's making multiple DB requests in sequence. The incidence of this should be rare for JSON-RPC, but we might have some examples (e.g. fetch ObjectIDs and then fetch their contents), but those are also the cases where we might employ things like the DataLoader pattern.

## Description Add back a dedicated top-level configuration for consistency, to control the retention of the consistent range. Now that this range is controlled by the pruner, it can be configured by a `PrunerConfig` as well, rather than a dedicated `ConsistencyConfig`. The motivation for this is that the E2E testing setup needs to configure the consistent range, and it would be better to avoid having to enumerate all the consistent pipelines in the test harness because it would be easy to forget to update that list, whereas with this approach, we take advantage of the fact that there is an `add_consistent` macro to register the pipeline, and it will do the right thing. ## Test plan ``` sui$ cargo check sui$ cargo nextest run -p sui-indexer-alt sui$ cargo run -p sui-indexer-alt -- generate-config ```

## Description Track request latencies, success and failure rates, report them through a dedicated metrics endpoint, and report them through tracing as well. ## Test plan New unit tests for request counts and graceful shutdown, and manual testing for tracing: ``` sui$ RUST_LOG=DEBUG cargo run -p -- sui-indexer-alt-jsonrpc 2024-12-21T00:02:46.720775Z INFO sui_indexer_alt_jsonrpc: Starting JSON-RPC service on 0.0.0.0:6000 2024-12-21T00:02:46.720775Z INFO sui_indexer_alt_jsonrpc::metrics: Starting metrics service on 0.0.0.0:9184 2024-12-21T00:02:49.676408Z DEBUG jsonrpsee-server: Accepting new connection 1/100 2024-12-21T00:02:49.691414Z DEBUG sui_indexer_alt_jsonrpc::api: SELECT "kv_epoch_starts"."reference_gas_price" FROM "kv_epoch_starts" ORDER BY "kv_epoch_starts"."epoch" DESC LIMIT $1 -- binds: [1] 2024-12-21T00:02:49.694647Z INFO sui_indexer_alt_jsonrpc::metrics::middleware: Request succeeded method="suix_getReferenceGasPrice" elapsed_ms=1.7868458000000002e-5 2024-12-21T00:03:02.630356Z DEBUG jsonrpsee-server: Accepting new connection 1/100 2024-12-21T00:03:02.630585Z INFO sui_indexer_alt_jsonrpc::metrics::middleware: Request failed method="suix_getReferenceGasPrice_non_existent" code=-32601 elapsed_ms=3.9333e-8 ```

## Description Request metrics are labelled with their method, which is a user input. Users can pollute our metrics by sending methods that the RPC does not support. This change mitigates that by detecting unrecognized methods and normalizing them to an "<UNKNOWN>" label. ## Test plan Updated unit tests: ``` sui$ cargo nextest run -p sui-indexer-alt-jsonrpc ```

## Description Track how long each Database request is taking, how many requests we've issued, and how many have succeeded/failed. ## Test plan Run server, make request, check metrics.

## Description Add an automatic function to advertise the RPC's schema (version, methods, types) to clients. ## Test plan New unit tests: ``` sui$ cargo nextest run -p sui-indexer-alt-jsonrpc ```

## Description The RPC and Indexer implementations use essentially the same metrics service, and when they are combined in a single process they need to use the same instance of that metrics service as well. This change starts that process, by pulling out the implementation from `sui-indexer-alt`. In a later change, the metrics service in the RPC implementation will switch over to this common crate as well. This unblocks running both services together in one process, which is something we will need to do as part of E2E tests. ## Test plan Run the indexer and make sure it can be cancelled (with Ctrl-C), and that it also shuts down cleanly when run to a specific last checkpoint.

## Description ...and use it in the RPC implementation as well. A parameter has been added to add a prefix to all metric names. This ensures that if an Indexer and RPC service are sharing a metrics service, they won't stomp on each others metrics. ## Test plan Run the RPC service and check its metrics.

amnn requested review from lxfind, bmwill, emmazzz, gegaowp and wlmyng December 24, 2024 16:09

amnn self-assigned this Dec 24, 2024

amnn mentioned this pull request Dec 24, 2024

rpc-alt: e2e tests #20729

Open

7 tasks

emmazzz reviewed Jan 9, 2025

View reviewed changes

amnn commented Jan 13, 2025

View reviewed changes

amnn force-pushed the amnn/rpc-metrics branch from a7b71c6 to 636037b Compare January 14, 2025 15:11

amnn temporarily deployed to sui-typescript-aws-kms-test-env January 14, 2025 15:11 — with GitHub Actions Inactive

vercel bot deployed to Preview – sui-docs January 14, 2025 15:16 View deployment

amnn added 4 commits January 15, 2025 11:13

rpc-alt: track database metrics

e5af194

## Description Track how long each Database request is taking, how many requests we've issued, and how many have succeeded/failed. ## Test plan Run server, make request, check metrics.

amnn force-pushed the amnn/rpc-metrics branch from 636037b to 42b9904 Compare January 15, 2025 14:31

amnn temporarily deployed to sui-typescript-aws-kms-test-env January 15, 2025 14:31 — with GitHub Actions Inactive

vercel bot deployed to Preview – sui-docs January 15, 2025 14:33 View deployment

amnn added 4 commits January 15, 2025 18:39

rpc-alt: rpc.discover method

ce7c2bb

## Description Add an automatic function to advertise the RPC's schema (version, methods, types) to clients. ## Test plan New unit tests: ``` sui$ cargo nextest run -p sui-indexer-alt-jsonrpc ```

rpc-alt: Use common MetricsService

43618c0

amnn force-pushed the amnn/rpc-metrics branch from 42b9904 to a44d664 Compare January 15, 2025 18:42

amnn temporarily deployed to sui-typescript-aws-kms-test-env January 15, 2025 18:42 — with GitHub Actions Inactive

vercel bot deployed to Preview – sui-docs January 15, 2025 18:43 View deployment

amnn requested a review from emmazzz January 16, 2025 10:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rpc-alt: metrics #20728

rpc-alt: metrics #20728

amnn commented Dec 24, 2024 •

edited

Loading

vercel bot commented Dec 24, 2024 •

edited

Loading

lxfind commented Dec 24, 2024

amnn commented Dec 24, 2024

bmwill commented Dec 26, 2024

bmwill commented Dec 26, 2024

amnn commented Dec 26, 2024

emmazzz Jan 7, 2025

amnn Jan 13, 2025

emmazzz Jan 8, 2025

amnn Jan 13, 2025

emmazzz Jan 8, 2025

amnn Jan 13, 2025

amnn Jan 13, 2025

amnn Jan 13, 2025

amnn Jan 13, 2025

rpc-alt: metrics #20728

Are you sure you want to change the base?

rpc-alt: metrics #20728

Conversation

amnn commented Dec 24, 2024 • edited Loading

Description

Test plan

Stack

Release notes

vercel bot commented Dec 24, 2024 • edited Loading

lxfind commented Dec 24, 2024

amnn commented Dec 24, 2024

bmwill commented Dec 26, 2024

bmwill commented Dec 26, 2024

amnn commented Dec 26, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amnn commented Dec 24, 2024 •

edited

Loading

vercel bot commented Dec 24, 2024 •

edited

Loading