Add support for OVS flow operations metrics on node #866
Conversation
Thanks for your PR. The following commands are available:
These commands can only be run by members of the vmware-tanzu organization.
Thanks for the change, Yuki.
pkg/agent/metrics/prometheus.go
Outdated
@@ -67,6 +67,54 @@ var (
Help: "Flow count for each OVS flow table. The TableID is used as a label.",
StabilityLevel: metrics.STABLE,
}, []string{"table_id"})

OVSFlowAddErrorCount = metrics.NewCounter(
Do you think we can have one metric, OVSFlowOpsErrorCount, with add, modify and delete as labels?
Updating with the correct ref for labels in a summary (Go): https://github.com/kubernetes/component-base/blob/release-1.18/metrics/summary.go#L30. SummaryOpts has ConstLabels to consume.
+1
Sure, I'll take a look at it.
I did some research and think it's good to use NewCounterVec for ErrorCount and NewSummaryVec for Duration.
But please kindly let me know if SummaryOpts with ConstLabels is better than them.
I think vectors are the better fit because constLabels are static and applied to all measured metrics, while here the value varies per operation.
Hence I try to define the metrics using the below label.
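As a rough illustration of the label-vector approach discussed above, here is a toy stdlib sketch (the `opCounter` type and its `Inc` method are invented for this example; this is not the component-base or client_golang API): one metric keyed by an `operation` label keeps a separate series per label value, created on first use.

```go
package main

import "fmt"

// opCounter is a toy stand-in for a Prometheus CounterVec keyed by an
// "operation" label. The type and its method are invented for this
// sketch; they are not the component-base or client_golang API.
type opCounter map[string]int

// Inc mimics CounterVec.WithLabelValues(op).Inc(): the series for a
// given label value springs into existence on first use.
func (c opCounter) Inc(op string) { c[op]++ }

func main() {
	errCount := opCounter{}
	errCount.Inc("add")
	errCount.Inc("add")
	errCount.Inc("delete")
	// One metric, three possible series; "modify" has not been used yet.
	fmt.Println(errCount["add"], errCount["modify"], errCount["delete"]) // 2 0 1
}
```

ConstLabels, by contrast, are fixed at registration time and attached to every sample of the metric, so they cannot distinguish add/modify/delete within a single metric.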
Force-pushed 5ef4a78 to 14d4802
Choice of metrics lgtm
pkg/agent/metrics/prometheus.go
Outdated
if err := legacyregistry.Register(OVSFlowOpsErrorCount); err != nil {
klog.Error("Failed to register antrea_agent_ovs_flow_ops_error_count with Prometheus")
}
OVSFlowOpsErrorCount.WithLabelValues("add")
is this "initialization" needed?
From a program perspective, initialization is not required.
But I thought it would be good to initialize them so that Prometheus knows which metrics exist and that there have been no errors so far.
Without initialization, antrea_agent_ovs_flow_ops_error_count won't show up until you hit errors.
Please let me know your thoughts on this.
Sounds good to me, thanks for the explanation. I recommend adding a comment in the code with this explanation to avoid confusion in the future.
Sure, thank you for your suggestion.
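The visibility argument above can be shown with a small stdlib sketch (toy types, hypothetical names; not the real Prometheus client): a labeled series is only exported once its label value has been touched, so seeding each operation at zero makes the metric scrapeable before any error ever occurs.

```go
package main

import (
	"fmt"
	"sort"
)

// counterVec is a toy model of a labeled counter (not the real client):
// a series is only exported once its label value has been touched.
type counterVec map[string]float64

// WithLabelValues mimics the client behavior of materializing the
// series with value 0 on first access.
func (c counterVec) WithLabelValues(op string) {
	if _, ok := c[op]; !ok {
		c[op] = 0
	}
}

// expose renders the scrape output for this metric.
func (c counterVec) expose(name string) []string {
	var lines []string
	for op, v := range c {
		lines = append(lines, fmt.Sprintf("%s{operation=%q} %g", name, op, v))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	errCount := counterVec{}
	// Before initialization there is nothing to scrape: the metric is
	// invisible until the first error increments it.
	fmt.Println(len(errCount.expose("antrea_agent_ovs_flow_ops_error_count"))) // 0

	// Pre-initializing every operation makes all series visible at 0.
	for _, op := range []string{"add", "modify", "delete"} {
		errCount.WithLabelValues(op)
	}
	for _, line := range errCount.expose("antrea_agent_ovs_flow_ops_error_count") {
		fmt.Println(line)
	}
}
```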
pkg/agent/metrics/prometheus.go
Outdated
if err := legacyregistry.Register(OVSFlowOpsDuration); err != nil {
klog.Error("Failed to register antrea_agent_ovs_flow_ops_duration_milliseconds with Prometheus")
}
OVSFlowOpsDuration.WithLabelValues("add")
same question as above?
Commented on above. I'll change this in the same way.
pkg/agent/metrics/prometheus.go
Outdated
OVSFlowOpsDuration = metrics.NewSummaryVec(
&metrics.SummaryOpts{
Name: "antrea_agent_ovs_flow_ops_duration_milliseconds",
Help: "The duration of OVS flow operation",
- Help: "The duration of OVS flow operation",
+ Help: "The latency of OVS flow operations.",
Thank you for your comment, I'll fix it.
Vector metrics sound good.
pkg/agent/metrics/prometheus.go
Outdated
[]string{"operation"},
)

OVSFlowOpsDuration = metrics.NewSummaryVec(
Not your change, but there are other summary/summary-vector metrics in Antrea. As per the Kubernetes metrics overhaul, it is recommended to use histograms instead of summaries, and summary metrics are tagged for deprecation.
The main advantages of histograms are that they can be aggregated and are inexpensive. Any comments @ksamoray?
@antoninbas @tnqn any thoughts on above?
I am not an expert, but we have been using the STABLE stability level for all these metrics. According to that contract, the type of the metric will not be modified. So while I am fine with using histograms instead of summaries for new metrics, do we actually want to update existing metrics to use histograms? Or should we follow the guidelines and deprecate the old metrics while introducing new ones with the histogram type?
As for new metrics in this PR, I'll follow the recommendation to use histograms.
So while I am fine with using histograms instead of summaries for new metrics, do we actually want to update existing metrics to use histograms? Or should we follow the guidelines and deprecate the old metrics while introducing new ones with the histograms type?
Thanks for the response. Yes, deprecating the current metric and adding new metrics with the histogram type is what the guideline suggests; removing deprecated metrics after a couple of releases is also suggested. I am wondering whether this only makes sense if there are consumers (with dashboards) or third-party software depending on these metrics. If not, could we just update them to the histogram type?
Given that we probably have few users (or none) relying on these metrics, I don't think I would be opposed to just switching the metric type, provided we send the appropriate notices on Slack and the mailing list and wait a couple of days for feedback. In the future, it may be a good idea to not tag these metrics as STABLE right away...
Not defining them as STABLE sounds good. We should define new metrics as ALPHA and move them to STABLE after a release or two; or when confident that these were being used effectively.
Let me open an issue and post this in Slack and the mailing list.
Agreed to go with Alpha for the first few releases.
I'll make a change to this PR.
@srikartati @antoninbas But do you think we also need to make existing metrics Alpha?
I think it is better to make them alpha and turn them into STABLE after the next release.
Thank you for your comments.
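The aggregation advantage raised in the histogram-vs-summary thread above can be sketched in plain Go (made-up bucket bounds and counts, invented helper names; not Antrea or Prometheus code): per-bucket cumulative counts from two agents can simply be added, and a quantile can still be estimated from the merged result, which is impossible with a summary's precomputed per-instance quantiles.

```go
package main

import "fmt"

// histogram holds cumulative counts for upper bounds, the shape a
// Prometheus histogram exposes as *_bucket{le="..."} series.
// Bounds and counts below are made-up illustration data.
type histogram struct {
	bounds []float64 // upper bounds, in ms
	counts []uint64  // cumulative count of samples <= bound
}

// merge adds per-bucket counts: the aggregation that works for
// histograms but not for a summary's precomputed quantiles.
func merge(a, b histogram) histogram {
	out := histogram{bounds: a.bounds, counts: make([]uint64, len(a.counts))}
	for i := range a.counts {
		out.counts[i] = a.counts[i] + b.counts[i]
	}
	return out
}

// quantile returns a crude upper-bound estimate for quantile q: the
// first bucket bound whose cumulative count reaches the target rank.
func quantile(h histogram, q float64) float64 {
	total := h.counts[len(h.counts)-1]
	rank := uint64(q * float64(total))
	for i, c := range h.counts {
		if c >= rank {
			return h.bounds[i]
		}
	}
	return h.bounds[len(h.bounds)-1]
}

func main() {
	bounds := []float64{1, 5, 10, 50}
	agent1 := histogram{bounds, []uint64{10, 40, 45, 50}}
	agent2 := histogram{bounds, []uint64{0, 10, 40, 50}}
	all := merge(agent1, agent2)
	fmt.Println(all.counts)         // [10 50 85 100]
	fmt.Println(quantile(all, 0.9)) // 50
}
```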
pkg/agent/metrics/prometheus.go
Outdated
klog.Error("Failed to register antrea_agent_ovs_flow_ops_error_count with Prometheus")
}
OVSFlowOpsErrorCount.WithLabelValues("add")
OVSFlowOpsErrorCount.WithLabelValues("modify")
Same question as Antonin. Are these needed?
Please have a look at the above comment. I'd like to hear your opinion.
Hi Yuki, will a more descriptive 'help' mentioning the label values add, delete and modify be sufficient?
Thank you for your comment. I'll make help more descriptive and also add comments about why we have initialization for add, delete and modify.
Force-pushed 14d4802 to 14f47f6
- Number of OVS flow operations, partitioned by operations (add, modify and delete)
- Number of OVS flow operation errors, partitioned by operations (add, modify and delete)
- Latency of OVS flow operations, partitioned by operations (add, modify and delete)

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Force-pushed 14f47f6 to 63c68f7
Force-pushed 63c68f7 to df5e6fd
Force-pushed df5e6fd to 0efd77b
LGTM.
I took a crack at adding tests for some Prometheus metrics using existing integration tests.
#916
Do you think these metrics can be tested in a similar fashion?
Hi @srikartati
Sure, different PR for testing sounds good.
/test-all
Sure, @srikartati I'll work on a test case in a different PR.
/test-networkpolicy
/test-conformance
/test-all
LGTM
/test-e2e
one nit, otherwise LGTM
pkg/agent/metrics/prometheus.go
Outdated
OVSFlowOpsCount = metrics.NewCounterVec(
&metrics.CounterOpts{
Name: "antrea_agent_ovs_flow_ops_count",
Help: "Number of OVS flow operations, partitioned by operations(add, modify and delete).",
s/partitioned by operations(add, modify and delete)/partitioned by operation type (add, modify and delete)
same for the other places below
Thank you for your comment. Fixed.
Force-pushed 0efd77b to 250d29e
The metric names are different depending on the source: from Pods and from the Prometheus server.
I'm planning to have separate expected metrics for Pods and the Prometheus server.
Thanks for root-causing it. It is because of the histogram, as you mentioned. Can you elaborate a bit more on why this is happening? Just curious why the Prometheus server cannot treat it as one metric.
A follow-up question: is the Antrea agent metrics handler output for a histogram metric in the following format? If so, adding the bucket, count and sum metrics makes sense.
Hi @srikartati https://prometheus.io/docs/concepts/metric_types/#histogram Here is an example from an agent pod
So the behavior of exposing metrics is fine. Let me take antrea_agent_ovs_flow_ops_latency_milliseconds as an example. TextToMetricFamilies returns the basename of the metric. In testMetricsFromPrometheusServer in e2e, we get metrics from Prometheus and parse them using a JSON library. That is why one of the tests failed depending on the expected metrics. I think it would be better if we could use the basename as the expected metric in testMetricsFromPrometheusServer too.
I think expfmt gets the basename by cutting out the suffix, so I think we can cut it out as well.
It sounds good. Keeping the behavior consistent between the Antrea components and the Prometheus server in the test code would be good.
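A minimal stdlib sketch (the `family` helper is invented here, and this is simplified relative to what the expfmt parser actually does) of collapsing a histogram's `_bucket`/`_sum`/`_count` samples to one family basename, as discussed above:

```go
package main

import (
	"fmt"
	"strings"
)

// family reduces a sample name from the text exposition format to its
// metric family basename: a histogram's _bucket/_sum/_count samples all
// belong to one family. Simplified stand-in for the expfmt parser; the
// function name is invented for this sketch.
func family(sample string) string {
	name := sample
	if i := strings.IndexByte(name, '{'); i >= 0 {
		name = name[:i] // drop the label set
	}
	for _, suffix := range []string{"_bucket", "_sum", "_count"} {
		name = strings.TrimSuffix(name, suffix)
	}
	// Caveat: this naive trimming would also shorten a plain counter
	// whose own name ends in _count (e.g. antrea_agent_ovs_flow_ops_count);
	// the real parser works from the metric family metadata instead.
	return name
}

func main() {
	samples := []string{
		`antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{le="0.5",operation="add"}`,
		`antrea_agent_ovs_flow_ops_latency_milliseconds_sum{operation="add"}`,
		`antrea_agent_ovs_flow_ops_latency_milliseconds_count{operation="add"}`,
	}
	seen := map[string]bool{}
	for _, s := range samples {
		seen[family(s)] = true
	}
	fmt.Println(len(seen)) // 1
}
```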
- Number of OVS flow operations, partitioned by operations(add, modify and delete) - Number of OVS flow operation errors, partitioned by operations(add, modify and delete) - The latency of OVS flow operations, partitioned by operations(add, modify and delete) - Use prometheus v2.19.2 image to use API of querying target metadata in the e2e test Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
Force-pushed 6903963 to 4fbe993
On second thought, if we can use Prometheus v2.19.2, we can get the basename without cutting out the metric name. @antoninbas @ksamoray Do we have any specific reason to use Prometheus v2.2.1?
Is this the Prometheus server version or the Prometheus library version? Maybe it's better to do the version change in a separate PR and just cut out the metric names for now?
Thank you for your response.
Force-pushed 4fbe993 to 4708fed
Just wondering, did you run the Prometheus tests on the Vagrant setup using the following commands?
To load Antrea into the cluster with Prometheus enabled, use: ./infra/vagrant/push_antrea.sh --prometheus
To run the Prometheus tests within the e2e suite, use: go test -v github.com/vmware-tanzu/antrea/test/e2e --prometheus
test/e2e/prometheus_test.go
Outdated
@@ -276,7 +279,12 @@ func testMetricsFromPrometheusServer(t *testing.T, data *TestData, prometheusJob
// Create a map of all the metrics which were found on the server
testMap := make(map[string]bool)
for _, metric := range output.Data {
testMap[metric["__name__"]] = true
name := metric["__name__"]
switch { |
Isn't "if" sufficient here?
Sorry, yes, 'if' is enough. At first, I thought we had to consider summaries as well, but it's not required. I'll make the change as you suggested.
test/e2e/prometheus_test.go
Outdated
testMap[metric["__name__"]] = true
name := metric["__name__"]
switch {
case isBucket(name):
strings.Contains and strings.TrimSuffix could be used here to make this more readable.
That makes sense. I'll make a change as you suggested.
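Following the suggestion above, the switch can be reduced to a single `if` using strings.HasSuffix and strings.TrimSuffix. A simplified, hypothetical sketch of the map-building loop (the `normalize` name and the sample data are illustrative, not the actual test code):

```go
package main

import (
	"fmt"
	"strings"
)

// normalize sketches the reviewers' suggestion for the e2e test: a
// single "if" with strings.HasSuffix/strings.TrimSuffix instead of a
// switch. The function name and sample data are illustrative only.
func normalize(name string) string {
	if strings.HasSuffix(name, "_bucket") {
		return strings.TrimSuffix(name, "_bucket")
	}
	return name
}

func main() {
	testMap := make(map[string]bool)
	for _, name := range []string{
		"antrea_agent_ovs_flow_ops_latency_milliseconds_bucket",
		"antrea_agent_ovs_flow_ops_count",
	} {
		testMap[normalize(name)] = true
	}
	// Histogram bucket samples are recorded under the family basename.
	fmt.Println(testMap["antrea_agent_ovs_flow_ops_latency_milliseconds"]) // true
}
```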
Sorry, I wasn't aware that we can run the e2e tests on my testbed.
Force-pushed 4708fed to a0d6530
Small nit otherwise LGTM.
Force-pushed a0d6530 to 5dc364d
/test-all
/test-windows-conformance
/test-windows-conformance
@srikartati Thank you so much for your help
Thanks for working on the PR. Merging this.
Add support for OVS flow operations metrics on node
This PR is part of feature request #713.
Signed-off-by: Yuki Tsuboi ytsuboi@vmware.com