Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow workers to report additional prometheus metrics #210

Closed

Conversation

patricoferris
Copy link
Contributor

In order to support workers reporting energy metrics we need to allow them to proxy more prometheus endpoints. I've added an additional_metric method to the worker interface to let them do this where we can add an arbitrary number of <name>:<uri> pairs for the worker to collect similarly to what is currently done by the node-exporter.

This would allow us to then run clarke on some workers and have them proxy the prometheus metrics, in addition to that we could run other prometheus reporters too.

I'll need to set up some things in order to test this properly but thought I would open the PR now to get some thoughts on the API.

Copy link
Member

@tmcgilchrist tmcgilchrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments / questions inline.
This looks good.

@@ -62,6 +74,7 @@ module Web = struct
Server.respond_string ~status:`OK ~headers ~body ()
| `GET, ["pool"; pool; "worker"; worker; "metrics"] -> get_metrics ~sched ~pool ~worker ~source:`Agent
| `GET, ["pool"; pool; "worker"; worker; "host-metrics"] -> get_metrics ~sched ~pool ~worker ~source:`Host
| `GET, ["pool"; pool; "worker"; worker; extra] -> get_metrics ~sched ~pool ~worker ~source:(`Extra extra)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this API look like for extra requests? For example I want to request the clarke stats, would I make a request "/pool/linux-ppc64/worker/ppc-worker-1/clarke" and that gives me a collection of reported clarke stats?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that's the idea, finally managed to get on a linux machine and take it for a spin. Seems to be working, so for example /pool/linux-x86_64/worker/my-host/clarke would give Clarke metrics. The main ones we want:

#HELP Clarke_meter_intensity Current carbon intensity in gCO2/kWh
#TYPE Clarke_meter_intensity gauge
Clarke_meter_intensity 0.000000
#HELP Clarke_meter_watts Current power usage measured in watts
#TYPE Clarke_meter_watts gauge
Clarke_meter_watts 100.000000

Arg.opt Arg.(list additional_metric_conv) [] @@
Arg.info
~doc:"Additional prometheus endpoints to scrape in the form <name>:<uri> \
presented as a comma separated list."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this would be used to point to other local metrics collectors and re-export those metrics over capnp, and the name:uri would match the metrics name configured on the scheduler and the URI is where on the localhost to connect (It could reach out to elsewhere given it's a URI).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. We would run Clarke at localhost:9090/metrics and then add to ocluster-worker --additional-metrics=clarke:http://localhost:9090/metrics.

Good point re:URI, if this isn't a useful feature we could perhaps limit to a <name>:<port> mapping and just always combine that into http://localhost:<port>/metrics?

@patricoferris
Copy link
Contributor Author

Managed to test this and seems to be working 👍

@MisterDA
Copy link
Contributor

MisterDA commented Feb 10, 2023

Thanks, merged as of 88a917c.

@MisterDA MisterDA closed this Feb 10, 2023
MisterDA added a commit to ocurrent/docker-base-images that referenced this pull request Feb 10, 2023
- ocaml-dockerfile
  + Build and install opam master from source in Windows images.
    (@MisterDA ocurrent/ocaml-dockerfile#140)
  + Include the ocaml-beta-repository in the images.
    (@kit-ty-kate ocurrent/ocaml-dockerfile#132, review by @MisterDA)

- ocluster
  + Custom healthcheck period. (@mtelvers ocurrent/ocluster#214)
  + Allow workers to report additional prometheus metrics. (@patricoferris ocurrent/ocluster#210)
  + other minor things.
MisterDA added a commit to ocurrent/docker-base-images that referenced this pull request Feb 10, 2023
- ocaml-dockerfile
  + Build and install opam master from source in Windows images.
    (@MisterDA ocurrent/ocaml-dockerfile#140)
  + Include the ocaml-beta-repository in the images.
    (@kit-ty-kate ocurrent/ocaml-dockerfile#132, review by @MisterDA)

- ocluster
  + Custom healthcheck period. (@mtelvers ocurrent/ocluster#214)
  + Allow workers to report additional prometheus metrics. (@patricoferris ocurrent/ocluster#210)
  + other minor things.

- ocaml-version
  + Expose 4.08.1 and 4.14.1 (@MisterDA ocurrent/ocaml-version#60)
MisterDA added a commit to ocurrent/docker-base-images that referenced this pull request Feb 13, 2023
- ocaml-dockerfile
  + Build and install opam master from source in Windows images.
    (@MisterDA ocurrent/ocaml-dockerfile#140 ocurrent/ocaml-dockerfile#142)
  + Include the ocaml-beta-repository in the images.
    (@kit-ty-kate ocurrent/ocaml-dockerfile#132, review by @MisterDA)

- ocluster
  + Custom healthcheck period. (@mtelvers ocurrent/ocluster#214)
  + Allow workers to report additional prometheus metrics. (@patricoferris ocurrent/ocluster#210)
  + other minor things.

- ocaml-version
  + Expose 4.08.1 and 4.14.1 (@MisterDA ocurrent/ocaml-version#60)

fixup
MisterDA added a commit to MisterDA/opam-repository that referenced this pull request Mar 2, 2023
…uster (0.2.1)

CHANGES:

- Expose the ocluster-worker library in the ocluster-worker package
  (@MisterDA @art-w, ocurrent/ocluster#219 ocurrent/ocluster#217 ocurrent/ocluster#151, reviewed by @tmcgilchrist)
- Remove corrupted repositories from the cache (@kit-ty-kate ocurrent/ocluster#216, reviewed by @talex5)
- Allow workers to report additional prometheus metrics (@patricoferris ocurrent/ocluster#210, reviewed by @tmcgilchrist, @MisterDA)
- Smother Cap'n Proto and TLS debug logs (@MisterDA ocurrent/ocluster#213, reviewed by @talex5)
- Added command line option to set obuilder health check period (@mtelvers ocurrent/ocluster#214, reviewed by @tmcgilchrist)
- Conditionally compile macos user_temp fetcher (@tmcgilchrist ocurrent/ocluster#209, reviewed by @MisterDA, @mtelvers)
- Make rsync-mode mandatory when using rsync store (@tmcgilchrist ocurrent/ocluster#202, reviewed by @MisterDA)
- Windows service bugfixes (@MisterDA ocurrent/ocluster#200, reviewed by @tmcgilchrist)
- Fix build and opam metadata (@MisterDA @tmcgilchrist ocurrent/ocluster#199 ocurrent/ocluster#203)
MisterDA added a commit to MisterDA/opam-repository that referenced this pull request Mar 3, 2023
…uster (0.2.1)

CHANGES:

- Expose the ocluster-worker library in the ocluster-worker package
  (@MisterDA @art-w, ocurrent/ocluster#219 ocurrent/ocluster#217 ocurrent/ocluster#151, reviewed by @tmcgilchrist)
- Remove corrupted repositories from the cache (@kit-ty-kate ocurrent/ocluster#216, reviewed by @talex5)
- Allow workers to report additional prometheus metrics (@patricoferris ocurrent/ocluster#210, reviewed by @tmcgilchrist, @MisterDA)
- Smother Cap'n Proto and TLS debug logs (@MisterDA ocurrent/ocluster#213, reviewed by @talex5)
- Added command line option to set obuilder health check period (@mtelvers ocurrent/ocluster#214, reviewed by @tmcgilchrist)
- Conditionally compile macos user_temp fetcher (@tmcgilchrist ocurrent/ocluster#209, reviewed by @MisterDA, @mtelvers)
- Make rsync-mode mandatory when using rsync store (@tmcgilchrist ocurrent/ocluster#202, reviewed by @MisterDA)
- Windows service bugfixes (@MisterDA ocurrent/ocluster#200, reviewed by @tmcgilchrist)
- Fix build and opam metadata (@MisterDA @tmcgilchrist ocurrent/ocluster#199 ocurrent/ocluster#203)
@mtelvers
Copy link
Member

These are somewhat belated comments from me; sorry about that. Perhaps I'm missing something, but why is there a distinction between Host and Extra? Isn't Host just a specific type of Extra data? Not all workers provide Host data. We mask that by not trying to scrape it. I would also support the shortening of the URI, as discussed above. Thus having a typical worker running with: --additional-metrics=host:9100,clarke:9090

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants