Future Design of Metrics #1469

Open
smklein opened this issue Jul 20, 2022 · 5 comments

smklein commented Jul 20, 2022

#1348 provides an initial implementation of metrics, but there are a couple of areas we'd like to improve in future iterations. This issue documents those improvements.

Although the current design is resource-centric (to query for metrics on a disk, an endpoint filtering by org/project/disk_name is used), it may make sense to migrate to a metric-centric approach where filters can be applied. Prior art.

Route

Concretely, where the current route is:

/organizations/{organization_name}/projects/{project_name}/disks/{disk_name}/metrics/{metric_name}

We should consider a route like the following:

/organizations/{organization_name}/projects/{project_name}/metrics/{metric_name}

Where filters like instance_id and/or disk_id may be supplied as query parameters.
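As a rough illustration of the proposed shape (not an existing API), the sketch below builds the metric-centric path and appends the filters as query parameters. The `MetricFilters` type and `metric_url` function are hypothetical names introduced here, and IDs are assumed to be UUID-like strings that need no percent-encoding.

```rust
/// Hypothetical filters for a metric-centric query. The field names mirror
/// the `instance_id` / `disk_id` query parameters discussed above.
#[derive(Default)]
pub struct MetricFilters {
    pub instance_id: Option<String>,
    pub disk_id: Option<String>,
}

/// Builds the proposed project-scoped metric path and appends any supplied
/// filters as query parameters. Assumes the IDs are UUID-like strings, so
/// no percent-encoding is performed.
pub fn metric_url(org: &str, project: &str, metric: &str, filters: &MetricFilters) -> String {
    let mut url = format!(
        "/organizations/{}/projects/{}/metrics/{}",
        org, project, metric
    );

    // Collect only the filters that were actually provided.
    let mut params: Vec<String> = Vec::new();
    if let Some(id) = &filters.instance_id {
        params.push(format!("instance_id={}", id));
    }
    if let Some(id) = &filters.disk_id {
        params.push(format!("disk_id={}", id));
    }

    if !params.is_empty() {
        url.push('?');
        url.push_str(&params.join("&"));
    }
    url
}
```

With no filters set, the function returns the bare project-scoped path; setting `instance_id` narrows the same metric to a single instance, which is the "instance-centric flow" described below.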

An important use case is an "instance-centric flow", where a user can query for information about their particular instance. Filtering directly on instance_id makes this possible. It is not feasible today without oxidecomputer/crucible#375, but is a worthwhile goal.

Org/Project Scoping

Additionally, we should consider whether to add an endpoint to view metrics "globally", i.e., outside the context of an organization / project. This view may be useful for operators who wish to analyze performance across a sled / rack / AZ, as opposed to a user aiming for a more instance-centric flow.

Lifetimes

It's worth considering how we'd like to enable users to query for metrics of objects that have been deleted. Use cases like a "short-lived instance" are still valid, and their measurements remain stored in ClickHouse.

If we enable "query-by-name", this is more complicated, as names may be re-used after deletion of resources. However, if we provide "query-by-ID", this seems like less of an issue.
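To make the name-reuse concern concrete, here is a minimal in-memory sketch. The `Sample` type, the integer IDs (real IDs would presumably be UUIDs), and the example data are all assumptions for illustration; this is not how measurements are actually stored.

```rust
/// A hypothetical stored measurement, tagged with both the name and the ID
/// of the disk it came from. Integer IDs stand in for UUIDs.
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub disk_id: u32,
    pub disk_name: String,
    pub value: u64,
}

/// Query-by-name: matches every disk that has ever carried this name.
pub fn query_by_name<'a>(samples: &'a [Sample], name: &str) -> Vec<&'a Sample> {
    samples.iter().filter(|s| s.disk_name == name).collect()
}

/// Query-by-ID: matches exactly one disk, even after it has been deleted.
pub fn query_by_id(samples: &[Sample], id: u32) -> Vec<&Sample> {
    samples.iter().filter(|s| s.disk_id == id).collect()
}

/// Example data: disk 1 was named "data" and later deleted, then a new
/// disk 2 was created that reuses the name "data".
pub fn example_samples() -> Vec<Sample> {
    vec![
        Sample { disk_id: 1, disk_name: "data".to_string(), value: 10 },
        Sample { disk_id: 1, disk_name: "data".to_string(), value: 12 },
        Sample { disk_id: 2, disk_name: "data".to_string(), value: 99 },
    ]
}
```

On this data, `query_by_name(&samples, "data")` mixes measurements from both disks, while `query_by_id(&samples, 1)` returns only the deleted disk's history.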


leftwo commented Jul 20, 2022

If we want to analyze performance of a specific sled for example, would we need to record as part of the metrics which sled a disk was in (and which slot)?

@rmustacc

> If we want to analyze performance of a specific sled for example, would we need to record as part of the metrics which sled a disk was in (and which slot)?

Yes. If you want to look at stats for physical hardware, and you don't want to translate the physical disk to a server dynamically via (potentially historical) records in the database, then the UUID of the server the disk was in at the time would need to be recorded, the same way the instance is for a crucible volume.
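A minimal sketch of that idea, assuming each sample carries the sled it was recorded on. The `DiskSample` type, its field names, and the integer IDs (standing in for UUIDs) are hypothetical and only illustrate why recording the sled at measurement time avoids a historical lookup.

```rust
use std::collections::BTreeMap;

/// A hypothetical disk measurement that records which sled the disk was in
/// at the time the sample was taken. Integer IDs stand in for UUIDs.
pub struct DiskSample {
    pub disk_id: u32,
    pub sled_id: u32,
    pub bytes_read: u64,
}

/// Sums bytes read per sled using only the sled ID stored on each sample,
/// so no lookup of historical disk placement is needed.
pub fn bytes_read_by_sled(samples: &[DiskSample]) -> BTreeMap<u32, u64> {
    let mut totals: BTreeMap<u32, u64> = BTreeMap::new();
    for s in samples {
        *totals.entry(s.sled_id).or_insert(0) += s.bytes_read;
    }
    totals
}
```

If a disk moves between sleds, its earlier samples stay attributed to the sled it was in when they were taken.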

@rmustacc

There are a bunch of things in here that I appreciate us calling out. I think we'll really want a full, proper RFD on this before we redesign in earnest.


smklein commented Jul 22, 2022

Starting on an RFD to gather some of these considerations now. I'll link when it's ready to discuss.

@smklein
Copy link
Collaborator Author

smklein commented Aug 12, 2022

My first pass at this RFD exists here: https://rfd.shared.oxide.computer/rfd/0304 - feedback is welcome
