Expose controller-runtime metrics #786
Conversation
I would personally not merge this until controller-runtime creates a new release, as this would mean all new operators pin against master instead of a release. But as we agreed, we should open this for review and pin against master for now.
Tested this locally; this is the example metrics output as served on 8080: …
LGTM other than 2 nits.
Force-pushed from 1e27f3f to 7c0f0ce
Overall LGTM, just a personal preference around constant files.
Force-pushed from 59682b1 to 9691354
As agreed, we will put this on hold and wait until controller-runtime issues a new release.
Force-pushed from 3eb575a to ef4e08c
Force-pushed from 8bc16b3 to 93be61a
Force-pushed from 51cb32f to f255a39
Adjusted per your suggestions. @joelanford @hasbro17, can you please have another look? Thanks!
pkg/metrics/metrics.go (outdated):
    kubeconfig, err := config.GetConfig()
    ...
    err = createService(mgr, s)
    ...
    return s, nil
If the service already exists, we return the one returned by initOperatorService, not the existing one. Just making sure that's what we want to happen.
Actually, there is a missing error check there in general 🤦‍♀️
Just making sure that's what we want to happen.
Yes, not sure there. Do you think it would be best to request the Service and return that, or to always error out and not return a service, thereby removing the exists check?
This doesn't have a straightforward answer, because an operator might be running in a deployment with leader election enabled. I think I've opened up a can of worms.
So there are two related concerns here:
- Which pods are selected by the Service's selector? Right now it looks like all pods in the deployment? I'm not totally familiar with how the metrics work, but I'd imagine that we want Prometheus to scrape only the leader's metrics, right?
- What's the lifecycle of the Service? Does it live for the duration of the leader pod or of the operator deployment?
I think we need to answer those questions to make sure we get the logic correct in ExposeMetricsPort.
Which pods are selected by the Service's selector? Right now it looks like all pods in the deployment?
Yes, since the service uses the same selector as the Deployment (name=<operator-name>), it will expose the ports for all Deployment pods.
And it's alright if we scrape the metrics for all non-leader pods as well. If a leader pod steps down for some reason, it can have different metrics from the new leader (e.g. number of reconciles), which are worth exporting.
What's the lifecycle of the Service? Does it live for the duration of the leader pod or of the operator deployment?
I think it should live for the duration of the deployment. Recreating the service every time there's a new leader doesn't make much sense if it's going to be selecting all replicas. So if the service does not exist, the leader pod should create it with the ownerref set to the Deployment that owns it.
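For illustration, a minimal sketch of such a Service (not the actual PR code; `operatorName`, `metricsPort`, and the Deployment are assumed inputs): it reuses the Deployment's `name=<operator-name>` selector and sets an owner reference so the Service is garbage collected with the Deployment.

```go
package metrics

import (
	appsv1 "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newMetricsService is a sketch: it builds a Service that selects every pod of the
// operator Deployment (same name=<operator-name> selector) and is owned by that
// Deployment, so the Service lives and dies with the Deployment, not with the leader pod.
func newMetricsService(operatorName string, metricsPort int32, dep *appsv1.Deployment) *v1.Service {
	return &v1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      operatorName,
			Namespace: dep.Namespace,
			OwnerReferences: []metav1.OwnerReference{
				// Tie the Service lifecycle to the Deployment that owns the pods.
				*metav1.NewControllerRef(dep, appsv1.SchemeGroupVersion.WithKind("Deployment")),
			},
		},
		Spec: v1.ServiceSpec{
			Selector: map[string]string{"name": operatorName},
			Ports: []v1.ServicePort{{
				Name:       "metrics",
				Port:       metricsPort,
				TargetPort: intstr.FromInt(int(metricsPort)),
			}},
		},
	}
}
```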
do you think requesting a Service would be best and return that, or just always error out and not return a service and by that removing the exist check?
I think ExposeMetricsPort() should return the Service that actually exists: either it gets created by the pod if it doesn't exist, or the pod gets and returns the Service created by a previous pod.
Although it's worth considering whether there are any drawbacks to tying the Service lifecycle to the operator Deployment; I can't think of any right now.
/cc @shawn-hurley
Yes, since the service uses the same selector as the Deployment (name=<operator-name>), it will expose the ports for all Deployment pods.
And it's alright if we scrape the metrics for all non-leader pods as well. If a leader pod steps down for some reason, it can have different metrics from the new leader (e.g. number of reconciles), which are worth exporting.
Will Prometheus scrape each service endpoint individually, or will it scrape the service and get round-robined among the endpoints? If the former, that sounds good. If the latter, won't that cause problems with Prometheus (e.g. counters will jump up and down depending on which pod serves the request)?
I agree with option 1.
Did some testing, and option 1 doesn't work because metrics get served only when Start is called. I would like to suggest upstream that metrics be served before that, as it should be independent of Start. I am assuming that we should hold the lock before that, right?
They do handle leader election and serve metrics if there is a lock, but we use our own leader logic so we can't rely on that IIUC.
I guess for now, so we have some metrics, I would suggest we go with option 2 and contribute upstream if they agree on serving metrics independently. SGTY @shawn-hurley @hasbro17 @joelanford?
Summary from the discussion with @shawn-hurley:
- Will look into disabling the serving of metrics in controller-runtime, so we can serve the metrics they expose in operator-sdk ourselves (rough sketch below).
- Open an issue to discuss serving metrics independently of Start() being called.
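For reference, a rough sketch of what serving those metrics ourselves could look like, assuming controller-runtime's global metrics.Registry is the registry in question; the address and function names are illustrative, not the actual SDK code:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// serveMetrics exposes controller-runtime's global Prometheus registry on addr,
// independently of the manager's Start() being called. Sketch only.
func serveMetrics(addr string) {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(crmetrics.Registry, promhttp.HandlerOpts{}))
	go func() {
		if err := http.ListenAndServe(addr, mux); err != nil {
			log.Printf("metrics server stopped: %v", err)
		}
	}()
}

func main() {
	serveMetrics(":8080")
	select {} // block forever; in a real operator the manager would run here
}
```

Serving straight from the global registry is what would let the metrics endpoint come up before the manager acquires the leader lock and calls Start.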
Agreed. Option 2 sounds fine until we can work upstream or find a workaround to expose the metrics before getting the lock and calling start.
Agreed. Option 2 sounds fine until we can work upstream or find a workaround to expose the metrics before getting the lock and calling start.
Okay, will open an issue to make it work for all pods, not just the leader, once this is merged 👍 Until then we will at least have some metrics.
@lilic Looks good, but just a few more nits. Can you also please update the CHANGELOG to mention that the SDK scaffold for …
Co-Authored-By: LiliC <cosiclili@gmail.com>
LGTM
Due to the cache not being started at the time we attempt to query for the Service, we instead create a new client in a similar way as we do for leader election.
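Roughly what that looks like, as a sketch rather than the exact diff; the helper name is made up, but config.GetConfig and client.New are the controller-runtime calls involved:

```go
package metrics

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

// newDirectClient builds a client straight from the kubeconfig / in-cluster config,
// bypassing the manager's cache-backed client, which only works once Start() has run.
func newDirectClient() (client.Client, error) {
	cfg, err := config.GetConfig()
	if err != nil {
		return nil, err
	}
	// client.New talks to the API server directly instead of reading from the cache.
	return client.New(cfg, client.Options{})
}
```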
Force-pushed from b57b6c2 to 84d2432
We couldn't get the … @hasbro17 @joelanford Tested locally and added to the …
    service, err := k8sutil.InitOperatorService()
    ...
    // ExposeMetricsPort creates a Kubernetes Service to expose the passed metrics port.
    func ExposeMetricsPort(ctx context.Context, port int32) (*v1.Service, error) {
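Picking up the earlier question about returning the Service that actually exists, here is one way the create path could be sketched; createOrGetService is a hypothetical helper, not necessarily what the PR does:

```go
package metrics

import (
	"context"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createOrGetService creates the metrics Service, or, if another replica (e.g. a
// previous leader) already created it, fetches and returns the existing one so the
// caller always gets the Service that actually exists in the cluster.
func createOrGetService(ctx context.Context, c client.Client, s *v1.Service) (*v1.Service, error) {
	if err := c.Create(ctx, s); err != nil {
		if !apierrors.IsAlreadyExists(err) {
			return nil, err
		}
		existing := &v1.Service{}
		key := types.NamespacedName{Namespace: s.Namespace, Name: s.Name}
		if err := c.Get(ctx, key, existing); err != nil {
			return nil, err
		}
		return existing, nil
	}
	return s, nil
}
```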
👍 On passing in the context.
Will we want to change this function signature at all if kubernetes-sigs/controller-runtime#273 gets merged? Would we go back to passing in the manager (or maybe the client directly)?
If so, I'm wondering if it would be worth anticipating that now to avoid an API change, or if we should just wait since we don't know exactly how it'll look.
Thoughts?
If so, I'm wondering if it would be worth anticipating that now to avoid an API change, or if we should just wait since we don't know exactly how it'll look.
Yes, I was thinking about that as well, but currently we have no idea if that will get merged or how it will look in the end, so I'm not sure we can fully predict it and design around not breaking the API. And we would have to change the Leader function's signature anyway if we use the approach from the above PR, so I'm not sure it makes a difference here. So yes, most likely if that gets merged we will break the API, or we can just decide to leave it as is; we always have that choice.
Good point about needing to change the leader election API as well.
In that case, I agree with waiting and breaking the API for both if necessary.
LGTM. Just one more question.
LGTM
Maybe this is the wrong place to ask, but how do I enable metrics? When I just start it in a container (built from the latest sources): …
There is nothing on port 8383: …
@stepin I believe that we, as the Ansible operator, need to turn on metrics in our main file.
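For anyone else wondering, a rough sketch of enabling this from an operator's main function, assuming the package lands under pkg/metrics in the SDK repo as shown in the review context above; the port value 8383 is just an example:

```go
package main

import (
	"context"
	"log"

	"github.com/operator-framework/operator-sdk/pkg/metrics"
)

func main() {
	ctx := context.TODO()

	// 8383 is only an example; pass whichever port the operator serves metrics on.
	service, err := metrics.ExposeMetricsPort(ctx, 8383)
	if err != nil {
		log.Fatalf("failed to expose metrics port: %v", err)
	}
	log.Printf("metrics Service %q created", service.Name)

	// ... set up and start the manager as usual ...
}
```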
Description of the change:
Bring in the controller-runtime metrics by exposing them and creating a Service object. By default this is on port 8080, the same as in controller-runtime; we just pass the default value, and in case the user changes it, that is reflected in the Service creation as well.
Motivation for the change:
As decided offline: since controller-runtime currently uses a global registry, we should use that instead of creating a new one and serving the metrics ourselves.
Closes #222