
WIP: Rework metrics #715

Closed
wants to merge 13 commits

Conversation

lilic
Member

@lilic lilic commented Nov 7, 2018

Description of the change:
This PR:

  • removes the old metrics, as due to changes in operator-sdk they were no longer used and will in the future come from controller-runtime.
  • adds new Go and process metrics about operator-sdk, and the helper functions to create them.
  • moves all the metrics functions under pkg/metrics/...

Note: NewOperatoSDKPrometheusRegistery and HandlerFuncs are exposed because users who already have a registry, or who want to register additional things, can use those functions instead of ExposeMetricsPort, which does it all. Will add documentation for that and the rest in a separate PR.

Also Note: The metrics have not yet landed in a controller-runtime release, so I will open a PR with the changes to add/expose the ports they use in a different PR once that release happens. I have tested it locally and it should work with only minor changes.
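For illustration only, a rough sketch of how a user with their own registry setup might use these helpers instead of ExposeMetricsPort. The import path, the exact return type of NewOperatoSDKPrometheusRegistery, and the example counter are assumptions; only the HandlerFuncs signature matches the diff quoted later in the review.

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"

    "github.com/operator-framework/operator-sdk/pkg/metrics"
)

func serveOperatorMetrics() error {
    // Assumption: NewOperatoSDKPrometheusRegistery returns a *prometheus.Registry
    // pre-populated with the Go and process collectors.
    reg := metrics.NewOperatoSDKPrometheusRegistery()

    // Register a hypothetical operator-specific metric on the same registry.
    reconciles := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "memcached_reconciles_total",
        Help: "Total number of reconcile loops.",
    })
    reg.MustRegister(reconciles)

    // HandlerFuncs wires /healthz and /metrics onto an existing mux.
    mux := http.NewServeMux()
    metrics.HandlerFuncs(mux, reg)

    // Serve on the SDK metrics port instead of calling ExposeMetricsPort.
    return http.ListenAndServe(":60000", mux)
}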

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 7, 2018
@lilic lilic force-pushed the lili/fix-metrics-port branch 5 times, most recently from 4770c26 to 1ba0565 Compare November 7, 2018 16:31
@estroz
Member

estroz commented Nov 7, 2018

I've made a few changes to metrics in #707, related to logging and returning errors, that this PR also touches. @lilic would you mind waiting until that gets merged before merging this?

@hasbro17
Contributor

hasbro17 commented Nov 7, 2018

@estroz We probably don't need to block this for #707 unless the overlapping logging changes are significant. From what I can see we just have a couple more logging statements in the templates. Either PR can easily be rebased with changes from the other.

Member

@estroz estroz left a comment


A few comments. Overall looks good.

commands/operator-sdk/cmd/new.go (outdated, resolved)
pkg/k8sutil/k8sutil.go (outdated, resolved)
pkg/metrics/metrics.go (outdated, resolved)
OperatorSDKPrometheusMetricsPort = 60000

// OperatorSDKPrometheusMetricsPortName define the port name used in kubernetes deployment and service
OperatorSDKPrometheusMetricsPortName = "sdk-metrics"
Contributor

Any reason to rename the port to sdk-metrics?
I'm thinking metrics is simple and clear enough. The sdk prefix doesn't add much.

Member Author

@lilic lilic Nov 8, 2018

Yes, the reason being that port names need to be unique within the deployment files, and metrics is a very generic name, so if a user defines a new port the metrics name would no longer make sense. Plus, in the next few PRs, once the controller-runtime metrics get into a release, we want to add a port for 8080 (where they expose their metrics) to the deployment and service resources, and we want to name that one differently. Hope that makes sense?

Note: This only changes the Kubernetes name of the port; the actual path is not called sdk-metrics.
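For illustration, a hypothetical Service port list showing the uniqueness constraint being described; the second entry and its port number are made up.

import (
    v1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// Port names must be unique within a Service, so the SDK port gets the more
// specific name "sdk-metrics" and "metrics" stays free for other ports.
var ports = []v1.ServicePort{
    {
        Name:       "sdk-metrics", // operator-sdk's own metrics port
        Port:       60000,
        TargetPort: intstr.FromInt(60000),
    },
    {
        Name:       "metrics", // hypothetical user-defined or controller-runtime port
        Port:       8080,
        TargetPort: intstr.FromInt(8080),
    },
}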

Member

So we are going to serve two metrics endpoints, one for the SDK and one for the controller-runtime metrics?

Could we just register the controller-runtime metrics as part of the SDK metrics and then serve them via a single endpoint?

Member Author

Yes, for now. Currently there are no metrics in a controller-runtime release, so there are no metrics there yet.
But see #715 (comment)

My plan is to adjust controller-runtime to not use a global registry; we can then do a follow-up PR to reuse that registry once that is merged.

Hope that clears things up?

@hasbro17
Contributor

hasbro17 commented Nov 8, 2018

I might be wrong but I think we don't need ExposeMetricsPort() anymore. Or rather we don't need ExposeMetricsPort() to create a new Prometheus Registry, and to ListenAndServe those metrics on the desired port.

The Manager already creates a Server for our desired port to serve the controller metrics from the runtime's metrics.Registry.

With the option MetricsBindAddress we can specify what port that should be.

opts := manager.Options{MetricsBindAddress: ":" + strconv.Itoa(OperatorSDKPrometheusMetricsPort)}
mgr, err := manager.New(cfg, opts)

All ExposeMetricsPort() should do is essentially just InitOperatorService(), i.e. create the Service to expose the desired port.

Of course this means users need to register their own custom operator metrics with metrics.Registry, meaning you can't have 2 registries.
Also, a second Registry would mean we need a different endpoint to serve those metrics, since the Manager's Server is already serving at the localhost:<port-num>/metrics endpoint.
I guess we could have a separate Registry for custom metrics and then serve that on a different endpoint, e.g. localhost:<port-num>/custom-metrics.

@lilic WDYT? And let me know if I'm off here since you've already tested this out locally.
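For illustration, a rough sketch of the separate-registry idea above, built on a standalone mux (separate from the Manager's server mentioned above); the customRegistry variable and the newMetricsMux helper are made-up names.

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    runtimemetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// customRegistry holds operator-specific metrics, kept separate from the
// controller-runtime registry.
var customRegistry = prometheus.NewRegistry()

func newMetricsMux() *http.ServeMux {
    mux := http.NewServeMux()
    // Controller-runtime metrics stay on the usual /metrics endpoint.
    mux.Handle("/metrics", promhttp.HandlerFor(runtimemetrics.Registry, promhttp.HandlerOpts{}))
    // Custom operator metrics are served on a second endpoint on the same port.
    mux.Handle("/custom-metrics", promhttp.HandlerFor(customRegistry, promhttp.HandlerOpts{}))
    return mux
}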

@lilic
Member Author

lilic commented Nov 8, 2018

@hasbro17 I agree that we should reuse the registry from controller-runtime, and my plan is to do just that, but currently we cannot, as it uses a global registry, which means we can't replace that registry with the prometheus.Registry or vice versa. My plan is to adjust controller-runtime to not use a global registry; we can then do a follow-up PR to reuse that registry once that is merged. But for now we would have metrics about Go and resources. The function signatures will not change when that happens; we will just skip creating a registry in the ExposeMetricsPort() and instead use the controller-runtime registry there.

The port name change does not affect where the metrics are exposed; they are still on ...:60000/metrics. It is the Kubernetes-specific name of the port that changed, not the port path (endpoint) itself. This just helps if the user wants to add different ports, and makes the naming clearer.

Commits pushed:

  • This way InitOperatorService can accept a custom port and port name.
  • Due to the changes to operator-sdk the old metrics were obsolete; instead we just want the Go and process specific collectors.
  • Due to possible conflicts, rename the port to make it clearer.
@lilic
Member Author

lilic commented Nov 8, 2018

@ecordell @hasbro17 I addressed the issues, please have another look, thanks!

@lilic lilic force-pushed the lili/fix-metrics-port branch 2 times, most recently from 47fcc7e to 613d23c Compare November 8, 2018 14:39
Member

@shawn-hurley shawn-hurley left a comment


Nits and a question. I like the addition of the metrics package to the top level.

)

// ExposeMetricsPort generates a Kubernetes Service to expose the metrics port
func ExposeMetricsPort() (*v1.Service, error) {
Member

This is such a nit, but what if we change this to ExposePort()
The call changes from metrics.ExposeMetricsPort to metrics.ExposePort which I think is just as clear and has less redundancy.

Member Author

Yes, I wasn't happy with that naming, and I'm not sure ExposePort is the way to go either. This function creates the service, exposes the port, and starts to listen and serve the Prometheus registry. So maybe metrics.SetupPrometheus() would make more sense? WDYT?

Member

@shawn-hurley shawn-hurley Nov 9, 2018

What about metrics.Start() or metrics.Expose() or metrics.Setup().

Member Author

metrics.Setup() sounds good to me, will do. I only wanted to include Prometheus in the name to make it clearer that it's specific to Prometheus.

pkg/metrics/metrics.go (outdated, resolved)

@lilic
Member Author

lilic commented Nov 9, 2018

@shawn-hurley @hasbro17 @estroz Tests pass and I addressed all the comments. Please have another look, thanks!

@hasbro17
Contributor

My plan is to adjust controller-runtime to not use global registry

@lilic I think we're going to receive some pushback on that.
kubernetes-sigs/controller-runtime#132 (comment)

I personally think the controller-runtime having a global pkg level registry is fine.

we will just skip creating a registry in the ExposeMetricsPort() and instead use the controller-runtime registry there

We can already use the controller-runtime's global registry to serve those metrics on whatever endpoint we want, e.g. in your PR that would just be:

import (
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)
...
mux.Handle(PrometheusMetricsPath, promhttp.HandlerFor(metrics.Registry, ...))

As it uses a global registry, which means we can't replace that registry with the prometheus.Registry or vice versa, because of the same problem

Any reason why we can't do the above? Or is there some other reason we need to change the controller-runtime to not have a global registry?

As for the question of how to serve both the controller-runtime metrics and custom operator metrics we have two options:

  1. Use a single endpoint /metrics for both. I think this should be the default. This means using the same registry, i.e. users have to register any custom metrics with metrics.Registry (see the sketch below).

  2. Use 2 endpoints: /metrics for controller-runtime metrics and a user-specified endpoint for custom metrics, e.g. /memcached-metrics. In this case users will create their own Registry (e.g. within their reconciler), register metrics with it, and specify the custom endpoint. We can provide helpers to do this.

However, in both cases I think we should keep the actual TCP port the same, localhost:<port-num>, and just have 2 endpoints if needed.
I don't think we'd want to have two separate ports for runtime and custom metrics by default. I'm not aware of separate ports being a common pattern.
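For illustration, a minimal sketch of option 1, assuming metrics.Registry is the controller-runtime global registry from the import above; the memcached counter is made up.

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// memcachedUpgrades is a hypothetical custom operator metric.
var memcachedUpgrades = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "memcached_upgrades_total",
    Help: "Number of memcached upgrades processed by the operator.",
})

func init() {
    // Option 1: register custom metrics with the controller-runtime global
    // registry so they are served from the existing /metrics endpoint.
    metrics.Registry.MustRegister(memcachedUpgrades)
}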

@lilic Unless I'm missing why we can't use the runtime's global registry, I think it might help if you could use the latest commit from the controller-runtime in this PR to show what you intended to do in your follow-up PR.
Because it seems your follow-up would undo some changes from your current PR, like using our own registry.
You can change version=v0.1.4 to revision=902ff11606ac95d149cd0618c5c66157e6ae4e70 in the following:
https://github.com/operator-framework/operator-sdk/blob/master/Gopkg.toml#L31
https://github.com/operator-framework/operator-sdk/blob/master/pkg/scaffold/gopkgtoml.go#L74

Also it seems like the default runtime registry will give us the process and Go metrics as well.
kubernetes-sigs/controller-runtime@40e1154


// HandlerFuncs registers the handlers for /healthz and /metrics, to be able to serve the prometheus registry.
func HandlerFuncs(mux *http.ServeMux, reg *prometheus.Registry) {
    mux.HandleFunc("/healthz", handlerHealthz)
Contributor

Why do we need to register the healthz endpoint? Right now handlerHealthz() would just report that the operator is always healthy, which is the default even if we don't set this.

Liveness(healthz) and Readiness(readyz) endpoints/probes should be based on the operator logic or possibly related to leader election.

Does prometheus need to query the /healthz endpoint before scraping the actual metrics?
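Purely for illustration, a minimal sketch of a logic-aware healthz handler, assuming the operator flips a hypothetical ready flag (for example after winning leader election or finishing an initial sync):

import (
    "net/http"
    "sync/atomic"
)

// ready is a hypothetical flag the operator stores true into once it is healthy.
var ready atomic.Value

func handlerHealthz(w http.ResponseWriter, r *http.Request) {
    // ok is false if the flag was never set or was set to false.
    if ok, _ := ready.Load().(bool); !ok {
        http.Error(w, "not ready", http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("ok"))
}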

Member Author

Liveness(healthz) and Readiness(readyz) endpoints/probes should be based on the operator logic or possibly related to leader election.

Good point, will open an issue about the healthz endpoint, if we don't have one already.


@lilic
Member Author

lilic commented Nov 12, 2018

@hasbro17

Use a single endpoint /metrics for both. I think this should be the default. This means using the same registry, i.e users have to register any custom metrics with metrics.Registry.

This is already done; all the metrics would be served on /metrics, just on different ports.

Use 2 endpoints. /metrics for controller-runtime metrics and a user specified endpoint for custom metrics e.g /memcached-metrics. In this case users will create their own Registry (e.g within their reconciler), register metrics with it, and specify the custom endpoint. We can provide helpers to do this.

Not sure I see why this is needed; can you maybe explain why a custom endpoint would be needed?

@lilic
Member Author

lilic commented Nov 12, 2018

Thanks @hasbro17, I completely missed that comment, kubernetes-sigs/controller-runtime#132 (comment)! As far as I understand, a global registry is something that should be avoided, and instead we should refactor so the controller-runtime registry is configurable. But I could be wrong, pinging @mxinden @brancz for a second opinion on the above.

If they agree, I will open an issue on controller-runtime, adjust the controller-runtime metrics registry, and get back to this PR after that.

@mxinden

mxinden commented Nov 12, 2018

@lilic yes, we should try to avoid global registries, more details in the upcoming Kubernetes enhancement proposal kubernetes/community#2909:

Kubernetes also makes extensive use of a global metrics registry to register metrics to be exposed. Aside from general shortcomings of global variables, Kubernetes is seeing actual effects of this, causing a number of components to use sync.Once or other mechanisms to ensure to not panic, when registering metrics. Instead a metrics registry should be passed to each component in order to explicitly register metrics instead of through init methods or other global, non-obvious executions.
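For illustration, a sketch of the sync.Once workaround pattern the proposal mentions, guarding registration on the client_golang default (global) registry; the metric name is made up.

import (
    "sync"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    registerOnce   sync.Once
    reconcileCount = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "reconciles_total",
        Help: "Hypothetical example metric.",
    })
)

// registerMetrics guards registration with sync.Once so that calling it from
// more than one init path does not panic with a duplicate-registration error.
func registerMetrics() {
    registerOnce.Do(func() {
        prometheus.MustRegister(reconcileCount)
    })
}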

@brancz

brancz commented Nov 13, 2018

For what it's worth, we regret having introduced the global registry as a pattern, and in the 0.10 release of the Golang client library there will be breaking changes in this regard (as we don't want to break the world forever there will still be a global registry, but in a different package). Overall, working with non-global registries unlocks a lot of great functionality that is simply not possible with global ones (such as prefixed registries and label-curried registries).
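To illustrate what that unlocks, a small sketch using client_golang's registerer wrappers on a non-global registry; the prefix and label values are made up.

import (
    "github.com/prometheus/client_golang/prometheus"
)

func newOperatorRegistry() *prometheus.Registry {
    reg := prometheus.NewRegistry()

    // Prefix every metric registered through this Registerer with "memcached_".
    prefixed := prometheus.WrapRegistererWithPrefix("memcached_", reg)

    // Curry a constant label onto everything registered through it.
    curried := prometheus.WrapRegistererWith(prometheus.Labels{"operator": "memcached"}, prefixed)

    curried.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
        Name: "reconciles_total",
        Help: "Hypothetical example metric.",
    }))

    return reg
}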

@lilic
Member Author

lilic commented Nov 13, 2018

Thanks!

I opened an issue to discuss this in controller-runtime: kubernetes-sigs/controller-runtime#210. Depending on that outcome, I will continue with this PR.

@lilic lilic changed the title Rework metrics WIP: Rework metrics Nov 16, 2018
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 16, 2018
Member

@joelanford joelanford left a comment


Just one nit. LGTM otherwise.


// HandlerFuncs registers the handlers for /healthz and /metrics, to be able to serve the prometheus registry.
func HandlerFuncs(mux *http.ServeMux, reg *prometheus.Registry) {
    mux.HandleFunc("/healthz", handlerHealthz)
Member

Should we rename this to RegisterHandlerFuncs for added clarity? HandlerFuncs made me think it would return handler functions.

@@ -45,7 +45,7 @@ func GetOperatorName() (string, error) {
 }
 
 // InitOperatorService return the static service which expose operator metrics
-func InitOperatorService() (*v1.Service, error) {
+func InitOperatorService(port int32, portName string) (*v1.Service, error) {
Member

Did you say something about adding an OwnerReference to the service? Is this the service or am I getting that confused with something else?
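For context, a rough usage sketch of the new signature quoted above; the import paths and constant locations are assumptions based on this PR's layout, and whether an OwnerReference gets set on the Service is exactly the open question here.

import (
    "github.com/operator-framework/operator-sdk/pkg/k8sutil"
    "github.com/operator-framework/operator-sdk/pkg/metrics"
)

func buildMetricsService() error {
    // Call shape only; the returned Service still has to be created or updated in-cluster.
    svc, err := k8sutil.InitOperatorService(
        metrics.OperatorSDKPrometheusMetricsPort,     // 60000
        metrics.OperatorSDKPrometheusMetricsPortName, // "sdk-metrics"
    )
    if err != nil {
        return err
    }
    _ = svc
    return nil
}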

@lilic
Member Author

lilic commented Nov 29, 2018

As this PR has comments that are no longer relevant since we went with using the global registry from controller-runtime, I decided to close it and open a new PR to avoid confusion. Closing in favor of #786

@lilic lilic closed this Nov 29, 2018