Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect metrics #505

Merged
merged 5 commits into from
Mar 7, 2021
Merged

Conversation

ingvagabund
Copy link
Contributor

@ingvagabund ingvagabund commented Feb 22, 2021

Fixes: #348

Have the descheduler collect metrics and serve them through /metrics endpoint on 10258 secure port.

How to test it? Have descheduler running with some descheduling internal (e.g. 1s) and have some pods evicted. Then just run curl https://localhost:10258/metrics -k.

Metrics bits mostly copy-pasted from kube-scheduler.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Feb 22, 2021
@ingvagabund
Copy link
Contributor Author

#348 (comment)

@seanmalloy I merged pods_evicted_success and pods_evicted_failed into pods_evicted as a labeled CounterVec. Though, not against having two metrics either.

@seanmalloy
Copy link
Member

/cc

@@ -97,19 +98,21 @@ func (pe *PodEvictor) TotalEvicted() int {
// EvictPod returns non-nil error only when evicting a pod on a node is not
// possible (due to maxPodsToEvictPerNode constraint). Success is true when the pod
// is evicted on the server side.
func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod, node *v1.Node, reasons ...string) (bool, error) {
func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod, node *v1.Node, strategy string, reasons ...string) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this new strategy param used anywhere? It's kind of confusing to have to update all the strategies with lines like podEvictor.EvictPod(ctx, pod, node, "NodeAffinity", "NodeAffinity")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like it either. We might drop reasons field since I don't see it used for anything but passing a strategy name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reasons is fine, but specifically the strategy param you added here... I don't see it used anywhere, and it seems like you're just using reasons anyway, right?

Copy link
Contributor Author

@ingvagabund ingvagabund Feb 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. My bad. metrics.PodsEvicted.With(map[string]string{"result": "maximum number reached", "strategy": reason, "namespace": pod.Namespace}).Inc() is supposed to have "strategy": strategy pair in it. Right now, reasons can be basically anything meaningful. Yet, we set reasons param only to individual strategy names.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, I agree then we could either drop the ...reasons param all together, or add this strategy param, and just use that in each strategy (since reasons is optional, the strategy can just pass its name which will be used for metrics).

I can see a possible use case where one strategy may have different reasons for eviction, so I'm not sure it's worth getting rid of entirely yet. But it doesn't make much sense to keep if we're not using it. Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's worth getting rid of entirely yet

Yeah, not completely sure either. I'll keep it there for the moment. Better than drop it and re-introduce it later.

@ingvagabund
Copy link
Contributor Author

$ curl -s https://localhost:10258/metrics -k | grep descheduler
# HELP descheduler_build_info [ALPHA] Build info about descheduler, including Go version, Descheduler version, Git SHA, Git branch
# TYPE descheduler_build_info gauge
descheduler_build_info{DeschedulerVersion="v20210301-v0.20.0-38-gf83615418",GitBranch="collect-metrics",GitSha1="f83615418e729595e6a39d8632304c05442bc67a",GoVersion="go1.15"} 0

@ingvagabund
Copy link
Contributor Author

After some time running over vanilla OpenShift 4.8 with the example policy config:

$ curl -s https://localhost:10258/metrics -k | grep descheduler
# HELP descheduler_build_info [ALPHA] Build info about descheduler, including Go version, Descheduler version, Git SHA, Git branch
# TYPE descheduler_build_info gauge
descheduler_build_info{DeschedulerVersion="v20210301-v0.20.0-38-gf83615418",GitBranch="collect-metrics",GitSha1="f83615418e729595e6a39d8632304c05442bc67a",GoVersion="go1.15"} 0
# HELP descheduler_pods_evicted [ALPHA] Number of successfully evicted pods, by the result, by the strategy, by the namespace. 'failed' result means a pod could not be evicted
# TYPE descheduler_pods_evicted counter
descheduler_pods_evicted{namespace="openshift-marketplace",result="success",strategy="RemoveDuplicatePods"} 1

@@ -97,19 +98,21 @@ func (pe *PodEvictor) TotalEvicted() int {
// EvictPod returns non-nil error only when evicting a pod on a node is not
// possible (due to maxPodsToEvictPerNode constraint). Success is true when the pod
// is evicted on the server side.
func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod, node *v1.Node, reasons ...string) (bool, error) {
func (pe *PodEvictor) EvictPod(ctx context.Context, pod *v1.Pod, node *v1.Node, strategy string, reasons ...string) (bool, error) {
var reason string
if len(reasons) > 0 {
reason = " (" + strings.Join(reasons, ", ") + ")"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Continuing from what we're talking about above (https://github.com/kubernetes-sigs/descheduler/pull/505/files#r584629100), maybe we can change this line to:

		reason = strategy + " (" + strings.Join(reasons, ", ") + ")"

then, all the calls to EvictPod can be shortened to just 1 parameter (so, no longer passing in the optional reasons list, but still keeping that available):

podEvictor.EvictPod(ctx, pod, nodeMap[nodeName], "RemoveDuplicatePods")

this way, we don't have to remove the reasons parameter but we clean up those weird-looking function calls. We can then start some work to actually use reasons for each strategy in a more informative way than how it is currently. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!!! Updated.

@ingvagabund ingvagabund changed the title wip: Collect metrics Collect metrics Mar 2, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 2, 2021
@ingvagabund
Copy link
Contributor Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 2, 2021
@ingvagabund
Copy link
Contributor Author

Not sure if 10258 port is allocated by a different component. The PR also pulls a lot of new dependencies (mainly apiserver bits) to provide secure serving of the metrics. Right now, the metrics are served by default and there's no way to disable it. We might provide additional option/configuration to disable the metrics explicitly, wdyt?

@ingvagabund
Copy link
Contributor Author

@damemi @seanmalloy @lixiang233 PTAL

"sigs.k8s.io/descheduler/pkg/apis/componentconfig"
"sigs.k8s.io/descheduler/pkg/apis/componentconfig/v1alpha1"
deschedulerscheme "sigs.k8s.io/descheduler/pkg/descheduler/scheme"
)

const (
DefaultDeschedulerPort = 10258
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention in somewhere that port 10258 is used for metrics by default and users can use command line flag --secure-port to custom port number? It's not easy to get this info from the help command.

Copy link
Contributor Author

@ingvagabund ingvagabund Mar 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments into --disable-metrics flag and the readme.

metrics/metrics.go Outdated Show resolved Hide resolved
@lixiang233
Copy link
Contributor

We might provide additional option/configuration to disable the metrics explicitly, wdyt?

+1 for providing a way to disable metrics.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 2, 2021
@ingvagabund
Copy link
Contributor Author

Let's merge #513 first

New metrics:
- build_info: Build info about descheduler, including Go version, Descheduler version, Git SHA, Git branch
- pods_evicted: Number of successfully evicted pods, by the result, by the strategy, by the namespace
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2021
@@ -521,6 +521,16 @@ Setting `--v=4` or greater on the Descheduler will log all reasons why any pod i
Pods subject to a Pod Disruption Budget(PDB) are not evicted if descheduling violates its PDB. The pods
are evicted by using the eviction subresource to handle PDB.

## Metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the table of contents too? (we should add a verify script for that...)
Other than that, lgtm
/approve

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: damemi, ingvagabund

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 5, 2021
@seanmalloy
Copy link
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 7, 2021
@seanmalloy
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 7, 2021
@k8s-ci-robot k8s-ci-robot merged commit 24a0651 into kubernetes-sigs:master Mar 7, 2021
@ingvagabund ingvagabund deleted the collect-metrics branch March 8, 2021 09:47
briend pushed a commit to briend/descheduler that referenced this pull request Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Support For Prometheus Metrics
5 participants