This repository has been archived by the owner on Dec 1, 2018. It is now read-only.

add status label to node metrics #1746

Conversation

andyxning
Contributor

@andyxning andyxning commented Aug 1, 2017

This PR adds ready and schedulable labels to node metrics, such as cpu/node_capacity or cpu/node_allocatable, with values like below:

label         value
ready         true/false/unknown
schedulable   true/false

When a node is unschedulable or in the NotReady condition, the usability label value will be not_usable. For nodes that are neither unschedulable nor NotReady, the usability label value will be usable.

The reasons for adding a usability label to node metrics are:

  • Currently, heapster automatically filters out NotReady nodes, so we cannot get metrics from a node once it is marked NotReady and the kubelet evicts the pods on it (IIRC, the eviction timeout defaults to 5 minutes). By adding a NotReady status, we can get all the metrics for NotReady nodes even after they have been marked NotReady. If we want to filter out NotReady nodes, we can query on usability = "not_usable".
  • As for unschedulable nodes, we should also filter them out when alerting on the percentage of remaining and available resources. Currently we cannot do this, because we cannot tell which nodes are unschedulable. (A minimal derivation sketch follows this list.)
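
For illustration, a minimal sketch (not heapster's actual code) of how the proposed usability value could be derived from a Kubernetes Node object. The deriveUsability helper is hypothetical; the types are from the k8s.io/api/core/v1 package:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// deriveUsability illustrates the proposal: a node is "not_usable"
// if it is cordoned (Spec.Unschedulable) or its Ready condition is
// not True; otherwise it is "usable".
func deriveUsability(node *v1.Node) string {
	if node.Spec.Unschedulable {
		return "not_usable"
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == v1.NodeReady && cond.Status != v1.ConditionTrue {
			return "not_usable"
		}
	}
	return "usable"
}

func main() {
	cordoned := &v1.Node{Spec: v1.NodeSpec{Unschedulable: true}}
	fmt.Println(deriveUsability(cordoned)) // prints: not_usable
}
```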

Backward Incompatibility:

  • As we have added ready and schedulable labels to node metrics, queries for all available resources should now filter with ready = true AND schedulable = true. This may be incompatible with existing queries for all available resources that were written against the old node metric schema without these filters (see the sketch below).
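
As a concrete illustration of the new filter, a sketch written as a Go predicate over a metric's label map; the usableNode helper is hypothetical, and only the label names come from this PR:

```go
// usableNode reports whether a node metric point counts toward "all
// available resources" under the new schema. Existing queries that
// omit this filter will also match NotReady and unschedulable nodes.
func usableNode(labels map[string]string) bool {
	return labels["ready"] == "true" && labels["schedulable"] == "true"
}
```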

/cc @DirectXMan12 @piosz

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 1, 2017
@k8s-reviewable

This change is Reviewable

@andyxning andyxning force-pushed the add_status_label_to_node_metrics branch 3 times, most recently from 664eff8 to b6017aa Compare August 1, 2017 15:19
@DirectXMan12
Contributor

I like the general concept of this, but I think maybe I'd prefer to separate not_usable into unschedulable and unready. If you want usable nodes, you can then say usability=usable.

Thoughts, @piosz?

@andyxning
Contributor Author

andyxning commented Aug 1, 2017

@DirectXMan12 I have thought about that. If we drain a node, the node will be in both unschedulable and NotReady status. I hesitated over which one to choose to represent the node status.

In fact, when we want to compute the remaining resources and total available resources, NotReady and unschedulable are the same: we cannot use a node in either status.

@piosz piosz self-assigned this Aug 4, 2017
@piosz
Contributor

piosz commented Aug 4, 2017

Instead of introducing a new concept of "usable" and "not usable" nodes, how about just re-using something from Kubernetes, maybe node conditions? (I don't have an answer for this; I just think that introducing something new might be confusing.)

Also, the question is whether you will be able to gather any metrics from a not-ready kubelet at all. In the current implementation, by design, we do not scrape not-ready nodes, because there is a big chance that we will fail anyway, and when something is going wrong on the node we do not want to add our $0.02 to the problem.

WDYT?

@andyxning
Contributor Author

andyxning commented Aug 4, 2017

Also, the question is whether you will be able to gather any metrics from a not-ready kubelet at all.

The reason for marking a node as NotReady may be a network problem or some other problem on the node; for example, a hung docker daemon will currently make a node NotReady. We may not be able to enumerate all the situations.

Instead of introducing a new concept of "usable" and "not usable" nodes, how about just re-using something from Kubernetes, maybe node conditions?

How about we introduce two labels, ready and schedulable, as they can co-exist? The possible combinations are like below (a derivation sketch follows the table):

label         value
ready         true/false/unknown
schedulable   true/false
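
A sketch of how the two labels could be derived together; nodeStatusLabels is a hypothetical helper, with ready mapped from the Ready condition's status (True/False/Unknown) and schedulable from Spec.Unschedulable:

```go
package sketch

import (
	"strconv"

	v1 "k8s.io/api/core/v1"
)

// nodeStatusLabels derives both proposed labels from a node. The
// "unknown" value is kept when the Ready condition is Unknown (the
// node controller cannot reach the kubelet) or not reported at all.
func nodeStatusLabels(node *v1.Node) map[string]string {
	ready := "unknown"
	for _, cond := range node.Status.Conditions {
		if cond.Type != v1.NodeReady {
			continue
		}
		switch cond.Status {
		case v1.ConditionTrue:
			ready = "true"
		case v1.ConditionFalse:
			ready = "false"
		}
	}
	return map[string]string{
		"ready":       ready,
		"schedulable": strconv.FormatBool(!node.Spec.Unschedulable),
	}
}
```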

@DirectXMan12 @piosz

@andyxning andyxning force-pushed the add_status_label_to_node_metrics branch from b6017aa to 58e8cf5 Compare August 4, 2017 13:32
@andyxning andyxning force-pushed the add_status_label_to_node_metrics branch from 58e8cf5 to 6c1daba Compare August 4, 2017 14:03
@DirectXMan12
Contributor

As I'm thinking about this a bit more, is there a reason not to just always report node metrics, and then correlate this after the fact? Adding an extra label like that does weird things to the metrics "model", since whether or not something's schedulable shouldn't affect its identity...

@andyxning
Contributor Author

andyxning commented Aug 4, 2017

Yep. One of the key points is that we need the node status, at least the ready and schedulable status, to calculate cluster resource usage and to alert on it.

Always reporting every node would solve the first problem described above. As for the change to the metrics model, it is indeed not backward compatible.

@andyxning
Contributor Author

Friendly ping @DirectXMan12 @piosz :)

@andyxning
Contributor Author

Ping again @piosz. PTAL.

@piosz
Contributor

piosz commented Aug 21, 2017

@andyxning ready & schedulable sounds better to me. While I'm perfectly fine with the schedulable bit, I'm not super-convinced that scraping metrics from not-ready nodes is the right approach.

Can you extract the logic which adds the schedulable label into a separate PR, so that we can merge it and focus the discussion here on scraping not-ready nodes?

Pinging @dashpole @timstclair @yujuhong from the node team for their opinion on that.

@andyxning
Contributor Author

andyxning commented Aug 22, 2017

@piosz @DirectXMan12 I have proposed a sub-PR, #1782, which adds the schedulable label to node metrics. PTAL.

@piosz
Contributor

piosz commented Aug 24, 2017

@andyxning the other PR is merged. Looking for feedback from the node team about scraping not-ready nodes.

@andyxning
Contributor Author

Friendly ping @yujuhong @Random-Liu @dchen1107.
Could you please take a look at this regarding scraping not-ready nodes?

@piosz
Contributor

piosz commented Aug 29, 2017

ping @kubernetes/sig-node-pr-reviews

@andyxning
Contributor Author

@piosz How about we use the ready label like this:

  • Add a ready label to node metrics. The available values for the ready label are true/false/unknown.
  • Stop scraping metrics for nodes whose ready label value is false/unknown, but still emit the node-level metrics. With this, heapster will not scrape not-ready nodes, but we still get node-level metrics indicating that a node is not ready. (A control-flow sketch follows this list.)
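
A control-flow sketch of that behavior, assuming hypothetical emitNodeMetrics and scrapeKubelet helpers standing in for heapster's real sink and scrape paths, and reusing the nodeStatusLabels helper sketched earlier:

```go
// emitNodeMetrics and scrapeKubelet are hypothetical stand-ins.
func emitNodeMetrics(node *v1.Node, labels map[string]string) { /* sink node-level metrics */ }
func scrapeKubelet(node *v1.Node)                             { /* scrape kubelet stats */ }

// scrapeNodes always emits node-level metrics carrying the ready
// label, but only scrapes kubelets on nodes whose ready is "true".
func scrapeNodes(nodes []*v1.Node) {
	for _, node := range nodes {
		labels := nodeStatusLabels(node)
		emitNodeMetrics(node, labels)
		if labels["ready"] != "true" {
			continue // skip scraping NotReady/Unknown kubelets
		}
		scrapeKubelet(node)
	}
}
```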

@yujuhong

yujuhong commented Aug 29, 2017

How are you going to interpret metrics from the not-ready nodes? I don't think they are particularly useful, especially because the kubelet may simply be unreachable.
Also, there are ongoing efforts to sort through node conditions (e.g., kubernetes/kubernetes#45717). I'd suggest not adding this until we are clear about what "not ready" means.

@andyxning
Contributor Author

@yujuhong You're right. After thinking about this more deeply: the not-ready condition is complicated, and we cannot simply infer a clear reason for it. So, agreed on not scraping not-ready nodes.

Closing this. @piosz @DirectXMan12

@andyxning andyxning closed this Sep 8, 2017