Allow workers to report additional prometheus metrics #210

patricoferris · 2023-01-16T00:00:27Z

To support workers reporting energy metrics, we need to let them proxy more prometheus endpoints. I've added an additional_metric method to the worker interface for this. A worker can be given an arbitrary number of <name>:<uri> pairs to collect, similar to what is currently done for the node-exporter.

This would let us run clarke on some workers and have them proxy its prometheus metrics. We could run other prometheus reporters the same way.

I'll need to set up some things in order to test this properly but thought I would open the PR now to get some thoughts on the API.

tmcgilchrist

Comments / questions inline.
This looks good.

tmcgilchrist · 2023-02-01T04:01:33Z

bin/scheduler.ml

@@ -62,6 +74,7 @@ module Web = struct
      Server.respond_string ~status:`OK ~headers ~body ()
    | `GET, ["pool"; pool; "worker"; worker; "metrics"] -> get_metrics ~sched ~pool ~worker ~source:`Agent
    | `GET, ["pool"; pool; "worker"; worker; "host-metrics"] -> get_metrics ~sched ~pool ~worker ~source:`Host
+    | `GET, ["pool"; pool; "worker"; worker; extra] -> get_metrics ~sched ~pool ~worker ~source:(`Extra extra)


What does this API look like for extra requests? For example, if I want the clarke stats, would I request "/pool/linux-ppc64/worker/ppc-worker-1/clarke" and get back a collection of reported clarke stats?

patricoferris · 2023-02-07

Yep, that's the idea. I finally managed to get on a linux machine and take it for a spin, and it seems to be working. For example, /pool/linux-x86_64/worker/my-host/clarke would give the Clarke metrics. The main ones we want:

#HELP Clarke_meter_intensity Current carbon intensity in gCO2/kWh
#TYPE Clarke_meter_intensity gauge
Clarke_meter_intensity 0.000000
#HELP Clarke_meter_watts Current power usage measured in watts
#TYPE Clarke_meter_watts gauge
Clarke_meter_watts 100.000000
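The routing discussed in this thread can be sketched in isolation. This is only an illustration of how the three match arms from the diff pick a source; `segments` and `route` are hypothetical names, and the real handler calls get_metrics with a ~sched argument that is not modelled here.

```ocaml
(* Which metrics collector a request refers to. The `Extra case carries the
   name taken from the final path segment. *)
type source = [ `Agent | `Host | `Extra of string ]

(* Hypothetical helper: split "/pool/p/worker/w/clarke" into its segments,
   dropping the empty strings produced by leading or doubled slashes. *)
let segments path =
  String.split_on_char '/' path |> List.filter (fun s -> s <> "")

(* Same arm order as the diff: the fixed names are tried first, so
   "metrics" and "host-metrics" never fall through to `Extra. *)
let route path : (string * string * source) option =
  match segments path with
  | ["pool"; pool; "worker"; worker; "metrics"] -> Some (pool, worker, `Agent)
  | ["pool"; pool; "worker"; worker; "host-metrics"] -> Some (pool, worker, `Host)
  | ["pool"; pool; "worker"; worker; extra] -> Some (pool, worker, `Extra extra)
  | _ -> None
```

Because the catch-all arm comes last, any additional metric named "metrics" or "host-metrics" would be shadowed by the built-in routes.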

tmcgilchrist · 2023-02-01T04:32:16Z

bin/worker.ml

+  Arg.opt Arg.(list additional_metric_conv) [] @@
+  Arg.info
+    ~doc:"Additional prometheus endpoints to scrape in the form <name>:<uri> \
+    presented as a comma separated list."


So this would be used to point at other local metrics collectors and re-export their metrics over capnp. The name in name:uri would match the metrics name configured on the scheduler, and the URI is where on localhost to connect (though, being a URI, it could reach out elsewhere).

patricoferris · 2023-02-07

Exactly. We would run Clarke at localhost:9090/metrics and then pass --additional-metrics=clarke:http://localhost:9090/metrics to ocluster-worker.

Good point re: URI. If this isn't a useful feature, we could perhaps limit it to a <name>:<port> mapping and always combine that into http://localhost:<port>/metrics?
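For reference, splitting one <name>:<uri> entry might look like the sketch below. It is not the PR's additional_metric_conv, whose body is not shown in this thread; `parse_additional_metric` is a hypothetical name and the error messages are made up for illustration.

```ocaml
(* Split "<name>:<uri>" at the first ':' only, since the URI itself
   contains colons (after the scheme and before the port). *)
let parse_additional_metric s : (string * string, string) result =
  match String.index_opt s ':' with
  | None -> Error (Printf.sprintf "expected <name>:<uri>, got %S" s)
  | Some i ->
    let name = String.sub s 0 i in
    let uri = String.sub s (i + 1) (String.length s - i - 1) in
    if name = "" || uri = "" then
      Error (Printf.sprintf "empty name or URI in %S" s)
    else Ok (name, uri)
```

Splitting on the first colon means a name can never contain ':', which seems a reasonable restriction given that the name also ends up as a URL path segment on the scheduler.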

patricoferris · 2023-02-07T00:05:58Z

Managed to test this and seems to be working 👍

MisterDA · 2023-02-10T10:22:24Z

Thanks, merged as of 88a917c.

@MisterDA

- ocaml-dockerfile
  + Build and install opam master from source in Windows images. (@MisterDA ocurrent/ocaml-dockerfile#140 ocurrent/ocaml-dockerfile#142)
  + Include the ocaml-beta-repository in the images. (@kit-ty-kate ocurrent/ocaml-dockerfile#132, review by @MisterDA)
- ocluster
  + Custom healthcheck period. (@mtelvers ocurrent/ocluster#214)
  + Allow workers to report additional prometheus metrics. (@patricoferris ocurrent/ocluster#210)
  + other minor things.
- ocaml-version
  + Expose 4.08.1 and 4.14.1 (@MisterDA ocurrent/ocaml-version#60)

fixup

@MisterDA

…uster (0.2.1)

CHANGES:

- Expose the ocluster-worker library in the ocluster-worker package (@MisterDA @art-w, ocurrent/ocluster#219 ocurrent/ocluster#217 ocurrent/ocluster#151, reviewed by @tmcgilchrist)
- Remove corrupted repositories from the cache (@kit-ty-kate ocurrent/ocluster#216, reviewed by @talex5)
- Allow workers to report additional prometheus metrics (@patricoferris ocurrent/ocluster#210, reviewed by @tmcgilchrist, @MisterDA)
- Smother Cap'n Proto and TLS debug logs (@MisterDA ocurrent/ocluster#213, reviewed by @talex5)
- Added command line option to set obuilder health check period (@mtelvers ocurrent/ocluster#214, reviewed by @tmcgilchrist)
- Conditionally compile macos user_temp fetcher (@tmcgilchrist ocurrent/ocluster#209, reviewed by @MisterDA, @mtelvers)
- Make rsync-mode mandatory when using rsync store (@tmcgilchrist ocurrent/ocluster#202, reviewed by @MisterDA)
- Windows service bugfixes (@MisterDA ocurrent/ocluster#200, reviewed by @tmcgilchrist)
- Fix build and opam metadata (@MisterDA @tmcgilchrist ocurrent/ocluster#199 ocurrent/ocluster#203)

mtelvers · 2023-03-30T14:03:58Z

These are somewhat belated comments from me; sorry about that. Perhaps I'm missing something, but why is there a distinction between Host and Extra? Isn't Host just a specific type of Extra data? Not all workers provide Host data, and we mask that by not trying to scrape it. I would also support shortening the URI, as discussed above. A typical worker would then run with: --additional-metrics=host:9100,clarke:9090
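A minimal sketch of the shortened form proposed here, assuming every collector listens on localhost and serves /metrics as suggested earlier in the thread. `expand_entry` and `parse_short_form` are hypothetical names, not part of ocluster, and nothing like this was merged in this PR.

```ocaml
(* Expand one "<name>:<port>" entry into (name, uri), assuming the collector
   listens on localhost and serves /metrics. *)
let expand_entry entry : (string * string, string) result =
  match String.split_on_char ':' entry with
  | [name; port] when name <> "" ->
    (match int_of_string_opt port with
     | Some p when p > 0 && p < 65536 ->
       Ok (name, Printf.sprintf "http://localhost:%d/metrics" p)
     | _ -> Error (Printf.sprintf "invalid port in %S" entry))
  | _ -> Error (Printf.sprintf "expected <name>:<port>, got %S" entry)

(* Parse the whole comma-separated flag value, returning the first error
   encountered from the left, if any. *)
let parse_short_form value : ((string * string) list, string) result =
  let rec go = function
    | [] -> Ok []
    | entry :: rest ->
      Result.bind (expand_entry entry) (fun pair ->
        Result.map (fun pairs -> pair :: pairs) (go rest))
  in
  go (String.split_on_char ',' value)
```

With this shape, host metrics become just another entry in the list, which is the unification the comment above is asking about.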

Allow workers to report additional prometheus metrics

14e588e

patricoferris force-pushed the additional-metrics branch from 0c1ca83 to 14e588e Compare January 16, 2023 09:13

tmcgilchrist approved these changes Feb 1, 2023

MisterDA added a commit that referenced this pull request Feb 10, 2023

Merge pull request #210 from patricoferris/additional-metrics

d538800

MisterDA closed this Feb 10, 2023

MisterDA mentioned this pull request Mar 2, 2023

[new release] ocluster, ocluster-worker, ocluster-api and current_ocluster (0.2.1) ocaml/opam-repository#23443

Merged
