add status label to node metrics #1746
664eff8 to b6017aa
I like the general concept of this, but I think maybe I'd prefer to separate Thoughts @piosz?
@DirectXMan12 I have thought about that. If we drain a knode, the knode will both in In fact, when we want to compute the remaining resources and the total available resources,
Instead of introducing a new concept of "usable" and "not usable" nodes, how about just re-using something from Kubernetes? Maybe node conditions? (I don't have an answer for this. I just think that introducing something new might be confusing.) Also, the question is whether you will be able to gather any metrics from a not-ready kubelet. In the current implementation, by design, we do not scrape not-ready nodes, because there is a big chance that we will fail anyway, and when something is going wrong on the node we do not want to add our $0.02 to the problem. WDYT?
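A minimal sketch of the node-conditions idea raised above, assuming a node object shaped like the Kubernetes Node API JSON (`spec.unschedulable` and a `Ready` entry in `status.conditions`). The function name `node_labels` and the returned label names are hypothetical, chosen for illustration only; this is not Heapster's actual code.

```python
def node_labels(node):
    """Derive 'ready' and 'schedulable' label values from a Node-like dict."""
    conditions = node.get("status", {}).get("conditions", [])
    # A node counts as ready only if its Ready condition has status "True".
    ready = any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )
    # spec.unschedulable is set when a node is cordoned; absent means schedulable.
    schedulable = not node.get("spec", {}).get("unschedulable", False)
    return {"ready": str(ready).lower(), "schedulable": str(schedulable).lower()}


# Hypothetical example: a cordoned node whose kubelet is still healthy.
cordoned = {
    "spec": {"unschedulable": True},
    "status": {"conditions": [{"type": "Ready", "status": "True"}]},
}
print(node_labels(cordoned))  # {'ready': 'true', 'schedulable': 'false'}
```

Reusing existing node fields this way avoids introducing a separate "usable" concept; a node with no reported conditions is treated as not ready in this sketch, which is an assumption rather than established behavior.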
The reason for marking a knode as
How about we introduce two labels
b6017aa to 58e8cf5
58e8cf5 to 6c1daba
As I'm thinking about this a bit more, is there a reason not to just always report node metrics, and then correlate this after the fact? Adding an extra label like that does weird things to the metrics "model", since whether or not something's schedulable shouldn't affect its identity...
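The "report always, correlate afterwards" alternative could look roughly like the sketch below. The sample data, the `correlate` helper, and the choice to treat nodes with unknown status as neither ready nor schedulable are all assumptions made for illustration.

```python
def correlate(metrics, node_status):
    """Attach node status to each metric sample at query time, not as a metric label."""
    # Assumption: a node with no known status is treated as not usable.
    unknown = {"ready": False, "schedulable": False}
    return [{**sample, **node_status.get(sample["node"], unknown)} for sample in metrics]


# Hypothetical metric samples, reported for every node regardless of status.
metrics = [
    {"node": "node-a", "metric": "cpu/node_allocatable", "value": 4000},
    {"node": "node-b", "metric": "cpu/node_allocatable", "value": 2000},
]
# Hypothetical node status, obtained separately from the metrics.
node_status = {
    "node-a": {"ready": True, "schedulable": True},
    "node-b": {"ready": True, "schedulable": False},
}

usable = [s for s in correlate(metrics, node_status) if s["ready"] and s["schedulable"]]
print(sum(s["value"] for s in usable))  # 4000
```

With this approach the identity of a metric stays the same when a node is cordoned; only the query-time join changes.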
Yep. One of the key points is that we need the node status, at least the ready and schedulable status, to calculate cluster resource usage and to alarm. Always checking a node can solve the first problem described above. As for the change to the metrics model, it is indeed not backward compatible.
Friendly ping @DirectXMan12 @piosz :)
Ping again @piosz. PTAL.
@andyxning ready & schedulable sounds better to me. While I'm perfectly fine with Can you extract the logic which adds the schedulable label into a different PR, so that we can merge it and focus the discussion on scraping not-ready nodes? @dashpole @timstclair @yujuhong from the node team for their opinion on that.
@piosz @DirectXMan12 I have proposed a sub-PR about adding
@andyxning the other PR is merged. Looking for feedback from the node team about scraping not-ready nodes.
Friendly ping @yujuhong @Random-Liu @dchen1107.
ping @kubernetes/sig-node-pr-reviews
@piosz How about we use
How are you going to interpret metrics from the not-ready nodes? I don't think they are particularly useful, especially because the kubelet may simply be unreachable.
@yujuhong You're right. After thinking about this more deeply, closing this. @piosz @DirectXMan12
This PR adds `ready` and `schedulable` labels to node metrics, such as `cpu/node_capacity` or `cpu/node_allocatable`, with values like below:

When a node is in `unschedulable` status or `notready` condition, the `usability` label value will be `not_usable`. For nodes not in `unschedulable` or `notready` status, the `status` label value will be `usable`.

The reason for adding the `usability` label to node metrics is:

- `notready` nodes. We cannot get metrics from `notready` nodes once the node is marked as `notready` and kubelet evicts the pods on it. IIRC, this defaults to 5 minutes. By adding a `notready` status, we can get all the metrics for `notready` nodes even after they have been marked `notready`. If we want to filter out `notready` nodes, we can query based on `usability = "not_usable"`.
- `unschedulable` nodes, we should also filter them out when alerting on the percentage of remaining and available resources. Currently we cannot do this, as we cannot tell which nodes are in `unschedulable` status.

Backward Incompatibility:

- `ready` and `schedulable` labels are added to node metrics. Queries for all available resources should filter with `ready = true AND schedulable = true`. Existing queries for all available resources that use the old node metric schema, without `ready = true AND schedulable = true`, may be incompatible.

/cc @DirectXMan12 @piosz
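To illustrate the backward-incompatibility point, here is a small sketch comparing an unfiltered total with one filtered on `ready = true AND schedulable = true`. The sample values, node names, and the `total` helper are hypothetical and only mimic label-based filtering; they are not a real query API.

```python
# Hypothetical cpu/node_allocatable samples carrying the two proposed labels.
samples = [
    {"node": "node-a", "ready": "true", "schedulable": "true", "value": 4000},
    {"node": "node-b", "ready": "true", "schedulable": "false", "value": 4000},
    {"node": "node-c", "ready": "false", "schedulable": "true", "value": 2000},
]


def total(samples, **label_filter):
    """Sum sample values whose labels match every key/value in label_filter."""
    return sum(
        s["value"]
        for s in samples
        if all(s.get(k) == v for k, v in label_filter.items())
    )


# An old-style query with no label filter counts every node.
print(total(samples))  # 10000
# A query filtered on both labels counts only usable nodes.
print(total(samples, ready="true", schedulable="true"))  # 4000
```

The gap between the two totals is the reason existing queries would need the extra filter once unschedulable and not-ready nodes are reported.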