Pull metrics out of Clickhouse, expose 'em through Nexus' API #1131

Open
smklein opened this issue May 27, 2022 · 8 comments
Labels: api (Related to the API), enhancement (New feature or request), nexus (Related to nexus), Remote Access Preview

@smklein
Collaborator

smklein commented May 27, 2022

Here's the end-user flow we'd like:

  • Through the console (or perhaps the CLI?) a user can view metrics for some category of information. For example: "show me the metrics for HTTP endpoint latency", "show me metrics for disk/network usage", etc.
    • Point to consider: "operator" usage vs "end-user" usage -- each may see different metrics. We will want a different set of ACLs, at bare minimum, even if the Nexus implementation is mechanically similar.
    • Open question: how many endpoints? What query parameters are exposed? What would be useful for the console?
  • This should trigger a request to the external Nexus API, which itself should be able to make requests to Clickhouse.
    • Presumably, Nexus will act as an ACL validator + proxy to Clickhouse. Hopefully not too much post-processing of data is necessary.

What already exists:

  • There's machinery around oximeter to collect metrics from services, and store such information within Clickhouse itself. Although we should definitely add more metrics here (see: Upstairs disk stats -> Oximeter crucible#341 as an example), this half of the problem space is considered out-of-scope for this issue.
  • Since we already have HTTP endpoint latency wired up and dumped into Clickhouse, this may be an easy "first target". For utility, however, user-visible metrics (instance stats, disk/networking metrics, etc) will be high-value targets.
@smklein smklein added enhancement New feature or request. api Related to the API. nexus Related to nexus labels May 27, 2022
@rmustacc

rmustacc commented May 27, 2022

Just to expand on a few things here when it comes to the API and related:

  • There's a question of metric discovery, as you list. This ties into how we actually want to phrase things here, how different breakdowns get expressed, etc.
  • The other major consumer besides the console is the API, as this is how we expect customers to pull stats out into the broader system in the short to medium term. In particular, we need to think about how folks efficiently grab ranges of data, how to tell what's missing versus what's not there, etc. This has a lot of implications for the query parameters, and I'm calling it out because, for many customers, if they can't scrape things conveniently and plug them into their long-term store, it will quickly become something they don't use.

@bnaecker
Collaborator

Thanks for putting this up @smklein. I'll try to add the writing I've done on this in the past as I find it, to at least start collecting some thoughts and ideas. A few things come to mind right now.

@rmustacc As far as discovery, there's an existing endpoint in Nexus to list the timeseries schema. These are not the individual timeseries, such as the number of bytes sent out of a particular guest NIC, but the general schemas they conform to. Though it doesn't yet exist, an endpoint for listing the actual timeseries, possibly restricted to a particular schema, could be both useful and relatively straightforward.

It's when we get to running general queries against the actual timeseries data that I'm less confident of the interface. I should say that I've already written a tool that uses a prototype interface for selecting and filtering data. That's implemented here, but in general, it:

  • Starts with a timeseries schema
  • Adds zero or more filtering conditions on the fields (of either the target or metric), e.g., id=d54a2952-2367-40e5-9640-c472f58b3f41
  • Adds zero or one filtering condition on the time range (either endpoint may be unbounded)

This spits out a couple of SQL queries that are run against the tables in ClickHouse. Out pops the data, which may correspond to zero or more timeseries.
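As a rough illustration of that flow, here is a minimal, self-contained sketch in Rust; the type, table, and column names are hypothetical, not the actual prototype or ClickHouse schema:

```rust
// Minimal sketch of the select/filter flow described above. `TimeseriesQuery`,
// the `measurements` table, and the `field_*` columns are hypothetical.
struct TimeseriesQuery {
    // Timeseries schema to select from, e.g. "http_service:request_latency_histogram".
    timeseries_name: String,
    // Zero or more equality filters on target/metric fields.
    field_filters: Vec<(String, String)>,
    // Zero or one time-range filter; either endpoint may be unbounded.
    start_time: Option<String>,
    end_time: Option<String>,
}

impl TimeseriesQuery {
    // Render the filters into a single SQL string.
    fn to_sql(&self) -> String {
        let mut conditions = vec![format!("timeseries_name = '{}'", self.timeseries_name)];
        for (field, value) in &self.field_filters {
            conditions.push(format!("field_{} = '{}'", field, value));
        }
        if let Some(start) = &self.start_time {
            conditions.push(format!("timestamp >= '{}'", start));
        }
        if let Some(end) = &self.end_time {
            conditions.push(format!("timestamp < '{}'", end));
        }
        format!("SELECT * FROM measurements WHERE {}", conditions.join(" AND "))
    }
}

fn main() {
    let query = TimeseriesQuery {
        timeseries_name: "http_service:request_latency_histogram".to_string(),
        field_filters: vec![("id".to_string(), "d54a2952-2367-40e5-9640-c472f58b3f41".to_string())],
        start_time: Some("2022-05-27 00:00:00".to_string()),
        end_time: None,
    };
    println!("{}", query.to_sql());
}
```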

This all works "fine," in that you can get the correct data out of the database. For the API in Nexus itself, I'm not sure how to structure this. We could transliterate the existing query-builder tooling, which would mean a pretty generic endpoint like /timeseries/data, with a POST body that included all the filtering parameters.

This would also probably work just fine, and is likely the easiest way to meet the criteria of getting raw data out that consumers can use. My concern is that it's not very useful for anything else. I don't know how we do aggregations in the database, how to correlate different timeseries (or even align them), or really anything beyond a simple SELECT of the raw data. We might be able to defer that work, though. That would mean any aggregations or analysis would be done in the client. That sucks, but it's at least feasible in that many languages have software tools for operating on table-like objects (e.g., pandas in Python). But it would also be extremely useful for customers that wish to ingest this data into existing monitoring infrastructure.
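To make that shape concrete, here is a hedged sketch of what a `POST /timeseries/data` body could look like, assuming serde, serde_json, and chrono (with its serde feature), all of which omicron already uses; the field names are hypothetical, not a committed API:

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use std::collections::BTreeMap;

// Hypothetical request body for `POST /timeseries/data`; illustrative only.
#[derive(Debug, Serialize, Deserialize)]
struct TimeseriesDataRequest {
    // Which timeseries schema to select from.
    timeseries_name: String,
    // Equality filters on target/metric fields.
    #[serde(default)]
    field_filters: BTreeMap<String, String>,
    // Optional time bounds; either may be omitted for an unbounded range.
    start_time: Option<DateTime<Utc>>,
    end_time: Option<DateTime<Utc>>,
}

fn main() {
    let req = TimeseriesDataRequest {
        timeseries_name: "http_service:request_latency_histogram".to_string(),
        field_filters: BTreeMap::from([("name".to_string(), "nexus".to_string())]),
        start_time: None,
        end_time: None,
    };
    println!("{}", serde_json::to_string_pretty(&req).unwrap());
}
```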

As I mentioned, I'll try to collect more of my thoughts and writings, and either include them here or start an RFD.

@david-crespo
Contributor

I would be fine with the limited, simple API and doing a bit of processing on the client for now. Do you think we'd be able to specify a granularity or maximum number of data points for a given range or something? That would make things easier on us and limit the response size.
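For illustration only (hypothetical parameter names, not a proposed API), a max_points parameter could be translated server-side into a downsampling bucket width, with ClickHouse returning one aggregate per bucket:

```rust
use std::time::Duration;

// Sketch: turn a requested time range plus a client-supplied cap on the number
// of data points into a downsampling bucket width. The server could then ask
// ClickHouse for one aggregate (e.g. avg or max) per bucket rather than
// returning every raw sample. Names and behavior here are illustrative only.
fn bucket_width(range: Duration, max_points: u32) -> Duration {
    // Round up so we never exceed the requested number of points.
    let secs = (range.as_secs() + u64::from(max_points) - 1) / u64::from(max_points);
    Duration::from_secs(secs.max(1))
}

fn main() {
    // A one-week range capped at 1,000 points works out to ~10-minute buckets.
    let width = bucket_width(Duration::from_secs(7 * 24 * 3600), 1000);
    println!("{} seconds per bucket", width.as_secs());
}
```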

@iliana iliana self-assigned this Jun 14, 2022
@smklein
Collaborator Author

smklein commented Jun 23, 2022

As demo'd by @leftwo on 6/23's hypervisor sync, I think we will very soon have metrics from Crucible volumes and Propolis instances too. Both will be sitting in Clickhouse for now.

@bnaecker
Collaborator

I wanted to drop some thoughts that I've not yet had time to write up formally.

I was initially leaning towards a "query-first" API, where clients can basically select ranges of raw data from timeseries and process it however they want. That's flexible, and makes sense when it's not clear how most folks will actually use the data. They get to decide that. On the other hand, the API is harder to build and requires work for the clients we do have, such as graphing data in the console.

Talking with others and thinking more about it, a "resource-first" approach may be better. That is, we have endpoints for collecting metrics about a specific resource, such as a VM instance. That would just send back an object that has the latest sample value for some set of metrics. Those metrics may or may not be the same as the metrics stored in the database itself (e.g., this could include a median response latency for Nexus's HTTP server, rather than the histogram we store in ClickHouse). That means, each endpoint would basically boil down to one or more queries to ClickHouse to get the fields it needs and stuff them into the blob returned to the client.

This has the drawback of a lot of endpoints, and thus a lot of types. But it's also nice that it decouples the database representation and the actual metrics we're exporting in the API.
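As one hedged illustration of that shape (hypothetical endpoint and field names, assuming serde; the point is only that the response is decoupled from the ClickHouse representation):

```rust
use chrono::{DateTime, Utc};
use serde::Serialize;

/// Hypothetical response for something like `GET /instances/{id}/metrics`.
/// The fields are derived from one or more ClickHouse queries, and need not
/// mirror what's stored there (e.g. a median computed from a histogram).
#[derive(Debug, Serialize)]
struct InstanceMetrics {
    /// When the underlying samples were last collected.
    time_collected: DateTime<Utc>,
    /// CPU utilization over the most recent sample interval, 0.0-1.0.
    cpu_utilization: f64,
    /// Bytes read from / written to the instance's disks in that interval.
    disk_bytes_read: u64,
    disk_bytes_written: u64,
    /// Bytes sent/received across the instance's NICs in that interval.
    network_bytes_sent: u64,
    network_bytes_received: u64,
}
```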

@andrewjstone
Contributor

(Quoted @bnaecker's comment above comparing the "query-first" and "resource-first" approaches.)

I think I'm fine with either approach, but I also want us to consider the internal usage of ClickHouse data for things like failure detection and placement decisions. The internal API will clearly be a different endpoint, but I'm not sure whether it should use a similar mechanism. There may be computation that needs to be done across some large chunks of data, but it's also possible we can write specific queries for this type of data and return Rust types here as well. A placement engine can only use certain data to make decisions, so having a single optimized query to get "placement input" could work.

I haven't thought a whole lot about this yet, but I just wanted to make sure the use case was visible.
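As a hedged sketch of that idea (entirely hypothetical names; assumes the uuid crate, which omicron already uses), an internal "placement input" might boil down to a small Rust type populated by one optimized query rather than raw timeseries data:

```rust
use uuid::Uuid;

// Hypothetical, internal-only summary a placement engine might want per sled.
#[derive(Debug)]
struct SledPlacementInput {
    sled_id: Uuid,
    // Recent utilization aggregates, however the query chooses to window them.
    cpu_utilization: f64,
    physical_disk_bytes_free: u64,
    // Simple liveness signal, e.g. seconds since the last oximeter sample.
    seconds_since_last_sample: u64,
}
```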

@bnaecker
Collaborator

I think that would be supported by the "resource-first" approach. That's maybe a poor term. All I meant was that individual endpoints export some data, which they derive from whatever database queries they want. That query could in theory be anything; the main point is that the client doesn't necessarily get raw data from the database, at least not in an obvious way. They make a GET request to some endpoint, and that returns some chunk of data. The relationship between that response and the raw data in the database is hidden, so that we (1) aren't necessarily required to build a full, generic query language, and (2) aren't foisting all the work of generating useful information onto the client.

@david-crespo
Contributor

As a client dev, I don't really know what I would do with "raw" data anyway. For example, if I asked for a big date range, I would want to be able to ensure I wasn't getting a billion data points.

@iliana iliana mentioned this issue Jul 2, 2022
leftwo pushed a commit that referenced this issue Feb 9, 2024
Crucible changes:
Remove unused fields in IOop (#1149)
New downstairs clone subcommand. (#1129)
Simplify the do_work_task loop (#1150)
Move `Guest` stuff into a module (#1125)
Bump nix to 0.27.1 and use new safer Fd APIs (#1110)
Move `FramedWrite` work to a separate task (#1145)
Use fewer borrows in ExtentInner API (#1147)
Update Rust crate reedline to 0.28.0 (#1141)
Update Rust crate tokio to 1.36 (#1143)
Update Rust crate slog-bunyan to 2.5.0 (#1139)
Update Rust crate rayon to 1.8.1 (#1138)
Update Rust crate itertools to 0.12.1 (#1137)
Update Rust crate byte-unit to 5.1.4 (#1136)
Update Rust crate base64 to 0.21.7 (#1135)
Update Rust crate async-trait to 0.1.77 (#1134)
Discard deferred msgs (#1131)
Minor Downstairs cleanup (#1127)
Update test_fail_live_repair to support pstop (#1128)
Ignore client messages after stopping the IO task (#1126)
Move client IO task into a struct (#1124)
Bump Rust to 1.75 and fix new Clippy lints (#1123)

Propolis changes:
PHD: convert to async (#633)
PHD: assume specialized Windows images (#636)
propolis-standalone-config needn't be a crate
standalone: Use tar for snapshot/restore
phd: use latest "lab-2.0-opte" target, not a specific version (#637)
PHD: add tests for migration of running processes (#623)
PHD: fix `cargo xtask phd` tidy not doing anything (#630)
PHD: add documentation for `cargo xtask phd` (#629)
standalone: improve virtual device creation errors (#632)
phd: add Windows Server 2019 guest adapter (#627)
PHD: add `cargo xtask phd` to make using PHD nicer (#619)
leftwo added a commit that referenced this issue Feb 9, 2024
(Same Crucible and Propolis change list as the commit above.)

Co-authored-by: Alan Hanson <alan@oxide.computer>