
Add support for OVS flow operations metrics on node #866

Merged

Conversation

@yktsubo (Contributor) commented Jun 24, 2020

Add support for OVS flow operations metrics on node

  • Number of OVS flow operations, partitioned by operations (add, modify and delete)
  • Number of OVS flow operation errors, partitioned by operations (add, modify and delete)
  • The latency of OVS flow operations, partitioned by operations (add, modify and delete)

This PR is part of feature request #713.

Signed-off-by: Yuki Tsuboi ytsuboi@vmware.com

@antrea-bot (Collaborator)

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on Linux.
  • /skip-whole-conformance: to skip all conformance tests on Linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

These commands can only be run by members of the vmware-tanzu organization.

@srikartati (Member) left a comment:

Thanks for the change, Yuki.

@@ -67,6 +67,54 @@ var (
Help: "Flow count for each OVS flow table. The TableID is used as a label.",
StabilityLevel: metrics.STABLE,
}, []string{"table_id"})

OVSFlowAddErrorCount = metrics.NewCounter(
@srikartati (Member) commented Jun 24, 2020:

Do you think we could have a single metric, OVSFlowOpsErrorCount, with add, modify and delete as labels?
Updating with the correct reference for labels in a summary (Go): https://github.com/kubernetes/component-base/blob/release-1.18/metrics/summary.go#L30 (SummaryOpts has constLabels to consume).

Contributor:

+1

@yktsubo (Contributor, Author):

Sure, I'll take a look at it.

@yktsubo (Contributor, Author):

I did some research and I think it's better to use NewCounterVec for the error count and NewSummaryVec for the duration, but please let me know if SummaryOpts with constLabels would be preferable. I think the Vec types fit better because constLabels are static and attached to every measured metric, so they cannot distinguish between operations.

Hence I plan to define the metrics with the label below (a sketch follows this list):

  • antrea_agent_ovs_flow_ops_error_count - differentiated by operation: operation="add|modify|delete"

  • antrea_agent_ovs_flow_ops_duration_milliseconds - differentiated by operation: operation="add|modify|delete"
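A minimal sketch of what these vector definitions could look like with k8s.io/component-base/metrics (the package name and exact Help strings are assumptions, and the duration metric is later switched to a histogram further down in this thread):

package metrics

import (
    "k8s.io/component-base/metrics"
)

var (
    // Error count, partitioned by the variable "operation" label
    // (add, modify, delete) instead of static constLabels.
    OVSFlowOpsErrorCount = metrics.NewCounterVec(
        &metrics.CounterOpts{
            Name: "antrea_agent_ovs_flow_ops_error_count",
            Help: "Number of OVS flow operation errors, partitioned by operation type (add, modify and delete).",
        },
        []string{"operation"},
    )

    // Duration of OVS flow operations, also partitioned by operation.
    OVSFlowOpsDuration = metrics.NewSummaryVec(
        &metrics.SummaryOpts{
            Name: "antrea_agent_ovs_flow_ops_duration_milliseconds",
            Help: "The latency of OVS flow operations.",
        },
        []string{"operation"},
    )
)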

@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from 5ef4a78 to 14d4802 on June 28, 2020 16:32
@antoninbas (Contributor) left a comment:

Choice of metrics lgtm

if err := legacyregistry.Register(OVSFlowOpsErrorCount); err != nil {
    klog.Error("Failed to register antrea_agent_ovs_flow_ops_error_count with Prometheus")
}
OVSFlowOpsErrorCount.WithLabelValues("add")
Contributor:

is this "initialization" needed?

@yktsubo (Contributor, Author):

From a program perspective, initialization is not required. But I thought it would be good to initialize them so that Prometheus knows which metrics exist and that no errors have occurred so far. Without initialization, antrea_agent_ovs_flow_ops_error_count won't be exposed until the first error occurs. Please let me know your thoughts on this (see the sketch after this comment).
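Continuing the sketch above, the pre-initialization being discussed could look roughly like this (the function name is hypothetical and import paths are assumed; the registration call mirrors the excerpt in this review):

import (
    "k8s.io/component-base/metrics/legacyregistry"
    "k8s.io/klog"
)

func registerOVSFlowOpsMetrics() {
    if err := legacyregistry.Register(OVSFlowOpsErrorCount); err != nil {
        klog.Error("Failed to register antrea_agent_ovs_flow_ops_error_count with Prometheus")
    }
    // Pre-create one child per operation label so Prometheus scrapes a
    // zero-valued antrea_agent_ovs_flow_ops_error_count series immediately,
    // instead of the metric only appearing after the first error.
    for _, op := range []string{"add", "modify", "delete"} {
        OVSFlowOpsErrorCount.WithLabelValues(op)
    }
}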

Contributor:

Sounds good to me, thanks for the explanation. I recommend adding a comment in the code with this explanation to avoid confusion in the future.

@yktsubo (Contributor, Author):

Sure, thank you for your suggestion.

if err := legacyregistry.Register(OVSFlowOpsDuration); err != nil {
    klog.Error("Failed to register antrea_agent_ovs_flow_ops_duration_milliseconds with Prometheus")
}
OVSFlowOpsDuration.WithLabelValues("add")
Contributor:

same question as above?

@yktsubo (Contributor, Author):

I commented above; I'll change this one in the same way.

pkg/agent/metrics/prometheus.go (resolved comment)
OVSFlowOpsDuration = metrics.NewSummaryVec(
    &metrics.SummaryOpts{
        Name: "antrea_agent_ovs_flow_ops_duration_milliseconds",
        Help: "The duration of OVS flow operation",
Contributor:

Suggested change:
- Help: "The duration of OVS flow operation",
+ Help: "The latency of OVS flow operations.",

@yktsubo (Contributor, Author):

Thank you for your comment, I'll fix it.

@srikartati (Member) left a comment:

Vector metrics sound good.

[]string{"operation"},
)

OVSFlowOpsDuration = metrics.NewSummaryVec(
Member:

Not your change, but there are other summary/summary-vector metrics in Antrea. As per the Kubernetes metrics overhaul, it is recommended to use histograms instead of summaries, and summary metrics are tagged for deprecation. The main advantages of histograms are that they can be aggregated and are inexpensive. Any comments @ksamoray?
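For reference, a histogram-vector variant of the duration metric might look like the sketch below, continuing the earlier sketch (the variable name is assumed; the metric name, help text and ALPHA stability level match what the thread converges on later, and the buckets are the default Prometheus buckets, which match the le values in the scrape output further down):

OVSFlowOpsLatency = metrics.NewHistogramVec(
    &metrics.HistogramOpts{
        Name:           "antrea_agent_ovs_flow_ops_latency_milliseconds",
        Help:           "The latency of OVS flow operations, partitioned by operation type (add, modify and delete).",
        StabilityLevel: metrics.ALPHA,
        // Default Prometheus buckets; histogram series are exposed as
        // _bucket/_sum/_count per operation label.
        Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10},
    },
    []string{"operation"},
)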

Member:

@antoninbas @tnqn any thoughts on above?

Contributor:

I am not an expert, but we have been using the STABLE stability level for all these metrics. According to that contract, the type of the metric will not be modified. So while I am fine with using histograms instead of summaries for new metrics, do we actually want to update existing metrics to use histograms? Or should we follow the guidelines and deprecate the old metrics while introducing new ones with the histograms type?

@yktsubo (Contributor, Author) commented Jul 1, 2020:

As for new metrics in this PR, I'll follow the recommendation to use histograms.

@srikartati (Member) commented Jul 2, 2020:

So while I am fine with using histograms instead of summaries for new metrics, do we actually want to update existing metrics to use histograms? Or should we follow the guidelines and deprecate the old metrics while introducing new ones with the histograms type?

Thanks for the response. Yes, the suggested approach is to deprecate the current metric and add new metrics of histogram type; the guideline is to remove deprecated metrics after a couple of releases. I am wondering whether this only makes sense if there are consumers (with dashboards) or third-party software depending on these metrics. If not, could we just update them to the histogram type?

Contributor:

Given that we probably have few users (or none) relying on these metrics, I don't think I would be opposed to just switching the metric type, provided we send the appropriate notices on Slack and the mailing list and wait a couple of days for feedback. In the future, it may be a good idea not to tag these metrics as STABLE right away...

Member:

Not defining them as STABLE sounds good. We should define new metrics as ALPHA and move them to STABLE after a release or two, or once we are confident that they are being used effectively.
Let me open an issue and post this on Slack and the mailing list.

@yktsubo (Contributor, Author):

Agreed to go with ALPHA for the first few releases; I'll make the change in this PR.
@srikartati @antoninbas, do you think we also need to make the existing metrics ALPHA?

Member:

I think it is better to make them ALPHA and turn them into STABLE after the next release.

@yktsubo (Contributor, Author):

Thank you for your comments.

klog.Error("Failed to register antrea_agent_ovs_flow_ops_error_count with Prometheus")
}
OVSFlowOpsErrorCount.WithLabelValues("add")
OVSFlowOpsErrorCount.WithLabelValues("modify")
Member:

Same question as Antonin. Are these needed?

@yktsubo (Contributor, Author):

Please have a look at the above comment. I'd like to hear your opinion.

Member:

Hi Yuki, would a more descriptive 'help' string that mentions the label values add, delete and modify be sufficient?

@yktsubo (Contributor, Author):

Thank you for your comment. I'll make the help string more descriptive and also add comments explaining why we initialize add, delete and modify.

@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from 14d4802 to 14f47f6 on July 3, 2020 15:44
@yktsubo changed the title from "Add metrics of error count and duration of OVS flow operation on node" to "Add support for OVS flow operations metrics on node" on Jul 3, 2020
yktsubo pushed a commit to yktsubo/antrea that referenced this pull request Jul 3, 2020
- Number of OVS flow operations, partitioned by operations (add, modify and delete)
- Number of OVS flow operation errors, partitioned by operations (add, modify and delete)
- Latency of OVS flow operations, partitioned by operations (add, modify and delete)

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from 14f47f6 to 63c68f7 on July 3, 2020 15:49
yktsubo pushed a commit to yktsubo/antrea that referenced this pull request Jul 3, 2020
@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from 63c68f7 to df5e6fd on July 3, 2020 15:53
yktsubo pushed a commit to yktsubo/antrea that referenced this pull request Jul 3, 2020
@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from df5e6fd to 0efd77b on July 3, 2020 15:55
@srikartati (Member) left a comment:

LGTM.
I took a crack at adding tests for some Prometheus metrics using existing integration tests.
#916
Do you think these metrics can be tested in a similar fashion?

@yktsubo (Contributor, Author) commented Jul 7, 2020

Hi @srikartati,
Thank you for your comment. I think we can test these metrics as well.
For latency, we cannot expect an exact value, but we can check that it's not zero.
Once your PR merges, I can add my test on top of it.

@srikartati (Member)

Hi @srikartati,
Thank you for your comment. I think we can test these metrics as well.
For latency, we cannot expect an exact value, but we can check that it's not zero.
Once your PR merges, I can add my test on top of it.

Sure, different PR for testing sounds good.

@srikartati (Member)

/test-all

@yktsubo (Contributor, Author) commented Jul 9, 2020

Sure, @srikartati, I'll work on a test case in a different PR.

@srikartati (Member)

/test-networkpolicy

@srikartati (Member)

/test-conformance

@srikartati (Member)

/test-all

@srikartati previously approved these changes on Jul 14, 2020
@srikartati (Member) left a comment:

LGTM

@srikartati (Member)

/test-e2e

@antoninbas (Contributor) left a comment:

one nit, otherwise LGTM

OVSFlowOpsCount = metrics.NewCounterVec(
    &metrics.CounterOpts{
        Name: "antrea_agent_ovs_flow_ops_count",
        Help: "Number of OVS flow operations, partitioned by operations(add, modify and delete).",
Contributor:

s/partitioned by operations(add, modify and delete)/partitioned by operation type (add, modify and delete)

same for the other places below

@yktsubo (Contributor, Author):

Thank you for your comment. Fixed.

yktsubo pushed a commit to yktsubo/antrea that referenced this pull request Jul 15, 2020
@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from 0efd77b to 250d29e on July 15, 2020 00:06
@yktsubo (Contributor, Author) commented Jul 18, 2020

The metric names are different depending on the source.

From the pods:
antrea_agent_ovs_flow_ops_latency_milliseconds

From the Prometheus server:
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket
antrea_agent_ovs_flow_ops_latency_milliseconds_count
antrea_agent_ovs_flow_ops_latency_milliseconds_sum

@yktsubo (Contributor, Author) commented Jul 18, 2020

I'm planning to have separate expected-metric lists for the pods and the Prometheus server.
Please let me know if that doesn't sound good.

@srikartati (Member)

The metric names are different depending on the source.

From the pods:
antrea_agent_ovs_flow_ops_latency_milliseconds

From the Prometheus server:
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket
antrea_agent_ovs_flow_ops_latency_milliseconds_count
antrea_agent_ovs_flow_ops_latency_milliseconds_sum

Thanks for root-causing it. It is because of the histogram, as you mentioned. Can you elaborate a bit more on why this is happening? Just curious why the Prometheus server cannot treat it as one metric.

@srikartati (Member)

The metric names are different depending on the source.
From the pods:
antrea_agent_ovs_flow_ops_latency_milliseconds
From the Prometheus server:
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket
antrea_agent_ovs_flow_ops_latency_milliseconds_count
antrea_agent_ovs_flow_ops_latency_milliseconds_sum

Thanks for root-causing it. It is because of the histogram, as you mentioned. Can you elaborate a bit more on why this is happening? Just curious why the Prometheus server cannot treat it as one metric.

A follow-up question: is the Antrea agent metrics handler output for the histogram metric in the following format? If so, adding the bucket, count and sum metrics makes sense.
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{le="0.005",} xx
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{le="0.01",} xx
....
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{le="+Inf",} xx
antrea_agent_ovs_flow_ops_latency_milliseconds_count xx
antrea_agent_ovs_flow_ops_latency_milliseconds_sum xxx

@yktsubo (Contributor, Author) commented Jul 20, 2020

Hi @srikartati,

https://prometheus.io/docs/concepts/metric_types/#histogram
As described on the official page, Antrea exposes the metrics as below:
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{le="0.005",} xx
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{le="0.01",} xx
....
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{le="+Inf",} xx
antrea_agent_ovs_flow_ops_latency_milliseconds_count xx
antrea_agent_ovs_flow_ops_latency_milliseconds_sum xxx

Here is an example from an agent pod

$ curl -k -H "Authorization: Bearer $token" https://10.16.181.21:10350/metrics 2> /dev/null | grep antrea_agent_ovs_flow_ops_latency_milliseconds
# HELP antrea_agent_ovs_flow_ops_latency_milliseconds [ALPHA] The latency of OVS flow operations, partitioned by operation type (add, modify and delete).
# TYPE antrea_agent_ovs_flow_ops_latency_milliseconds histogram
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="0.005"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="0.01"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="0.025"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="0.05"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="0.1"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="0.25"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="0.5"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="1"} 4
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="2.5"} 9
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="5"} 10
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="10"} 10
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="add",le="+Inf"} 10
antrea_agent_ovs_flow_ops_latency_milliseconds_sum{operation="add"} 17
antrea_agent_ovs_flow_ops_latency_milliseconds_count{operation="add"} 10
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="0.005"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="0.01"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="0.025"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="0.05"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="0.1"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="0.25"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="0.5"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="1"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="2.5"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="5"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="10"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="delete",le="+Inf"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_sum{operation="delete"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_count{operation="delete"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="0.005"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="0.01"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="0.025"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="0.05"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="0.1"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="0.25"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="0.5"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="1"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="2.5"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="5"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="10"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_bucket{operation="modify",le="+Inf"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_sum{operation="modify"} 0
antrea_agent_ovs_flow_ops_latency_milliseconds_count{operation="modify"} 0

So the metric exposition behavior is fine; the issue is on our test code side. (Sorry, I should have provided clearer information.)

Let me take antrea_agent_ovs_flow_ops_latency_milliseconds as an example.
In testPrometheusMetricsOnPods in the e2e suite, we use the expfmt library to parse the metrics from the pod:
https://github.com/vmware-tanzu/antrea/blob/a858c9ad25c215418003e4cb92a68ec62c7fddbf/test/e2e/prometheus_test.go#L162

TextToMetricFamilies returns the base name of each metric, so the parsed result contains antrea_agent_ovs_flow_ops_latency_milliseconds.

In testMetricsFromPrometheusServer, we get the metrics from Prometheus and parse the response with the json library, so we end up with <basename>_bucket, <basename>_sum and <basename>_count.

That is why one of the two tests fails, depending on which expected metric names are used.

I think it would be better if testMetricsFromPrometheusServer could also use base names as the expected metrics. I'm looking through the documentation for a way to get the base name from the Prometheus API; if that's not possible, I'll check whether it's OK to simply trim _bucket from the metric names. (A sketch of the pod-side parsing is below.)
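A minimal sketch of the pod-side parsing path described above, using the expfmt library (the function and variable names here are illustrative, not the exact test code):

import (
    "strings"

    "github.com/prometheus/common/expfmt"
)

// parsePodMetricNames returns the metric family base names found in a raw
// /metrics exposition, the way testPrometheusMetricsOnPods sees them: a
// histogram appears once under its base name (e.g.
// antrea_agent_ovs_flow_ops_latency_milliseconds), not as _bucket/_sum/_count.
func parsePodMetricNames(exposition string) (map[string]bool, error) {
    var parser expfmt.TextParser
    families, err := parser.TextToMetricFamilies(strings.NewReader(exposition))
    if err != nil {
        return nil, err
    }
    names := make(map[string]bool, len(families))
    for name := range families {
        names[name] = true
    }
    return names, nil
}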

@yktsubo (Contributor, Author) commented Jul 20, 2020

I think expfmt gets the base name by trimming _bucket:

https://github.com/prometheus/common/blob/546f1fd8d7df61d94633b254641f9f8f48248ada/expfmt/text_parse.go#L665

So I think we can trim _bucket from the metrics fetched from Prometheus.
@srikartati, let me know if that doesn't sound good.

@srikartati (Member)

I think expfmt gets the base name by trimming _bucket:

https://github.com/prometheus/common/blob/546f1fd8d7df61d94633b254641f9f8f48248ada/expfmt/text_parse.go#L665

So I think we can trim _bucket from the metrics fetched from Prometheus.
@srikartati, let me know if that doesn't sound good.

Sounds good. Keeping the behavior consistent between the Antrea components and the Prometheus server in the test code would be good.

yktsubo pushed a commit to yktsubo/antrea that referenced this pull request Jul 20, 2020
- Number of OVS flow operations, partitioned by operations (add, modify and delete)
- Number of OVS flow operation errors, partitioned by operations (add, modify and delete)
- The latency of OVS flow operations, partitioned by operations (add, modify and delete)
- Use the prometheus v2.19.2 image to use the target metadata query API in the e2e test

Signed-off-by: Yuki Tsuboi <ytsuboi@vmware.com>
@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from 6903963 to 4fbe993 on July 20, 2020 16:34
@yktsubo (Contributor, Author) commented Jul 20, 2020

On second thought, if we can use Prometheus v2.19.2, we can get the base names without trimming the metric names.
Since some of our metrics already use a _count suffix, I think we should use the target metadata query API to list the scraped metrics:
https://prometheus.io/docs/prometheus/latest/querying/api/#querying-target-metadata

@antoninbas @ksamoray, do we have any specific reason to use Prometheus v2.2.1?
If not, I'd like to upgrade to v2.19.2 so the e2e test can use this newer API.

@srikartati (Member)

On second thought, if we can use Prometheus v2.19.2, we can get the base names without trimming the metric names.
Since some of our metrics already use a _count suffix, I think we should use the target metadata query API to list the scraped metrics:
https://prometheus.io/docs/prometheus/latest/querying/api/#querying-target-metadata

@antoninbas @ksamoray, do we have any specific reason to use Prometheus v2.2.1?
If not, I'd like to upgrade to v2.19.2 so the e2e test can use this newer API.

Is this the Prometheus server version or the Prometheus library version? Maybe it's better to do the version change in a separate PR and just trim the metric names for now?

@yktsubo (Contributor, Author) commented Jul 22, 2020

Thank you for your response.
I'll create a separate PR for the Prometheus version update.

yktsubo pushed a commit to yktsubo/antrea that referenced this pull request Jul 22, 2020
@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from 4fbe993 to 4708fed on July 22, 2020 00:32
@srikartati (Member) left a comment:

Just wondering whether you ran the Prometheus tests on the Vagrant setup with the following commands:

To load Antrea into the cluster with Prometheus enabled: ./infra/vagrant/push_antrea.sh --prometheus
To run the Prometheus tests within the e2e suite: go test -v github.com/vmware-tanzu/antrea/test/e2e --prometheus

@@ -276,7 +279,12 @@ func testMetricsFromPrometheusServer(t *testing.T, data *TestData, prometheusJob
// Create a map of all the metrics which were found on the server
testMap := make(map[string]bool)
for _, metric := range output.Data {
testMap[metric["__name__"]] = true
name := metric["__name__"]
switch {
Member:

Isn't "if" sufficient here?

@yktsubo (Contributor, Author):

Sorry, yes, 'if' is enough. At first I thought we had to consider summaries as well, but that's not required. I'll make the change as you suggested.

testMap[metric["__name__"]] = true
name := metric["__name__"]
switch {
case isBucket(name):
Member:

strings.Contains and strings.TrimSuffix could be used here to make this more readable.

@yktsubo (Contributor, Author):

That makes sense. I'll make the change as you suggested (a sketch follows).
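A sketch of the suggested simplification in testMetricsFromPrometheusServer, replacing the switch with an if plus strings.TrimSuffix (based on the diff excerpt above; the exact final code may differ):

// Create a map of all the metrics which were found on the server, collapsing
// histogram bucket series back to their base family name so the expected
// metric list can use base names for both the pods and the Prometheus server.
testMap := make(map[string]bool)
for _, metric := range output.Data {
    name := metric["__name__"]
    if strings.HasSuffix(name, "_bucket") {
        name = strings.TrimSuffix(name, "_bucket")
    }
    testMap[name] = true
}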

@yktsubo (Contributor, Author) commented Jul 23, 2020

Just wondering whether you ran the Prometheus tests on the Vagrant setup with the following commands:

To load Antrea into the cluster with Prometheus enabled: ./infra/vagrant/push_antrea.sh --prometheus
To run the Prometheus tests within the e2e suite: go test -v github.com/vmware-tanzu/antrea/test/e2e --prometheus

Sorry, I wasn't aware that I could run the e2e tests on my testbed.
I'll make sure e2e passes before pushing my code.

yktsubo pushed a commit to yktsubo/antrea that referenced this pull request Jul 23, 2020
@yktsubo force-pushed the add_ovs_flow_operation_metrics_on_agent branch from 4708fed to a0d6530 on July 23, 2020 05:59
@srikartati previously approved these changes on Jul 23, 2020
@srikartati (Member) left a comment:

Small nit, otherwise LGTM.

test/e2e/prometheus_test.go (outdated, resolved comment)
@srikartati (Member)

/test-all

@srikartati (Member)

/test-windows-conformance

1 similar comment
@srikartati (Member)

/test-windows-conformance

@yktsubo (Contributor, Author) commented Jul 24, 2020

@srikartati, thank you so much for your help.

@srikartati (Member)

Thanks for working on the PR. Merging this.

@srikartati merged commit 33eee8c into antrea-io:master on Jul 24, 2020
GraysonWu pushed a commit to GraysonWu/antrea that referenced this pull request Sep 22, 2020