Future Design of Metrics #1469

smklein · 2022-07-20T13:45:12Z

#1348 provides an initial implementation of metrics, but there are a couple of areas we'd like to improve in future iterations. This issue documents those improvements.

Although the current design is resource-centric (to query for metrics on a disk, an endpoint filtering by org/project/disk_name is used), it may make sense to migrate to a metric-centric approach where filters can be applied. Prior art.

Route

Concretely, where the current route is:

/organizations/{organization_name}/projects/{project_name}/disks/{disk_name}/metrics/{metric_name}

We should consider a route like the following:

/organizations/{organization_name}/projects/{project_name}/metrics/{metric_name}

Where filters like instance_id and/or disk_id may be supplied as query parameters.
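As a rough sketch of what a client-side call against the proposed route could look like, the snippet below builds the metric-centric path and appends any supplied filters as query parameters. The helper name `metric_url` and the metric name `read_bytes` are hypothetical, chosen only for illustration; only the path shape and the `instance_id` / `disk_id` filter names come from the proposal above.

```python
from urllib.parse import quote, urlencode


def metric_url(organization_name, project_name, metric_name, **filters):
    """Build the proposed metric-centric path (hypothetical helper).

    Filters such as instance_id or disk_id are passed as keyword
    arguments and end up as query parameters.
    """
    path = "/organizations/{}/projects/{}/metrics/{}".format(
        quote(organization_name, safe=""),
        quote(project_name, safe=""),
        quote(metric_name, safe=""),
    )
    # Skip filters that were not supplied; sort for a stable ordering.
    supplied = sorted((k, v) for k, v in filters.items() if v is not None)
    query = urlencode(supplied)
    return f"{path}?{query}" if query else path


# "read_bytes" is an assumed metric name, used only as an example.
print(metric_url("my-org", "my-project", "read_bytes", disk_id="d1"))
```

With no filters the helper returns the bare path, which would correspond to querying a metric across the whole project.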

An important use case is an "instance-centric flow", where a user can query for information about their particular instance. Filtering directly on instance_id makes this possible. It is not feasible today without oxidecomputer/crucible#375, but it is a worthwhile goal.

Org/Project Scoping

Additionally, there's some consideration of whether we'd like to add an endpoint to view metrics "globally", i.e., outside the context of an organization / project. This view may be useful for operators who wish to analyze performance across a sled / rack / AZ, as opposed to a user aiming for a more instance-centric flow.

Lifetimes

It's worth considering how we'd like to enable users to query for metrics of objects that have been deleted. Use cases like a "short-lived instance" are still valid, and their measurements remain stored in ClickHouse after the instance is gone.

If we enable "query-by-name", this is more complicated, as names may be re-used after deletion of resources. However, if we provide "query-by-ID", this seems like less of an issue.
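A small sketch of why name reuse is the awkward case: below, two different disks share the name "data" at different times. The `Sample` record and its field names are assumptions made for illustration, not the actual stored schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    disk_id: str     # stable identifier, assumed never to be reused
    disk_name: str   # user-visible name, may be reused after deletion
    timestamp: int
    value: float


samples = [
    Sample("id-a", "data", 1, 10.0),
    Sample("id-a", "data", 2, 12.0),
    # Disk "id-a" is deleted; a new disk later reuses the name "data".
    Sample("id-b", "data", 5, 99.0),
]


def by_name(samples, name):
    """Query-by-name: mixes samples from every disk that ever had this name."""
    return [s for s in samples if s.disk_name == name]


def by_id(samples, disk_id):
    """Query-by-ID: returns samples for exactly one disk, deleted or not."""
    return [s for s in samples if s.disk_id == disk_id]


print(len(by_name(samples, "data")))  # samples from both disks
print(len(by_id(samples, "id-a")))    # samples from the deleted disk only
```

A query-by-name endpoint would need some extra disambiguation (for example a time range) to separate the two disks, whereas the ID alone is enough in the second query.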

leftwo · 2022-07-20T15:23:45Z

If we want to analyze performance of a specific sled for example, would we need to record as part of the metrics which sled a disk was in (and which slot)?

rmustacc · 2022-07-20T15:26:27Z

> If we want to analyze performance of a specific sled for example, would we need to record as part of the metrics which sled a disk was in (and which slot)?

Yes, if you want to look at stats for physical hardware. Unless you want to translate the physical disk to a server dynamically via (potentially historical) records in the database, the UUID of the server the disk was in at the time would need to be recorded, the same way the instance is recorded for a crucible volume.
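To illustrate the trade-off, here is a minimal sketch in which each sample carries the sled it was recorded on, so a per-sled rollup needs no lookup against historical placement records. The field names (`disk_id`, `sled_id`, `value`) and the data are hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: sled_id is captured at the time of measurement.
samples = [
    {"disk_id": "disk-1", "sled_id": "sled-A", "value": 100},
    {"disk_id": "disk-1", "sled_id": "sled-B", "value": 40},  # disk later in another sled
    {"disk_id": "disk-2", "sled_id": "sled-A", "value": 25},
]


def total_by_sled(samples):
    """Sum sample values per sled using only fields stored on the sample."""
    totals = defaultdict(int)
    for sample in samples:
        totals[sample["sled_id"]] += sample["value"]
    return dict(totals)


print(total_by_sled(samples))
```

Without the recorded `sled_id`, the same rollup would require joining each sample's timestamp against a history of where the disk was located.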

rmustacc · 2022-07-20T15:28:04Z

There are a bunch of things in here that I appreciate us calling out. I think we'll really want a full, proper RFD on this before we redesign in earnest.

smklein · 2022-07-22T14:55:28Z

Starting on an RFD to gather some of these considerations now. I'll link when it's ready to discuss.

smklein · 2022-08-12T16:15:22Z

My first pass at this RFD exists here: https://rfd.shared.oxide.computer/rfd/0304 - feedback is welcome

smklein mentioned this issue Jul 20, 2022

add disk metrics endpoint #1348

Merged

3 tasks

smklein added the Metrics label Nov 16, 2022
