Change prometheus metrics type from summary to histogram #1202

dreamtalen · 2020-09-02T18:30:05Z

The summary types are tagged for deprecation, Kubernetes recommended to use histograms instead of summaries. The main advantages of histogram types are aggregation and inexpensive. In this commit, we changed three Antrea controller metrics from summary to histogram type. They are DurationAppliedToGroupSyncing, DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing.

Did the manually test to check the type of these metrics using following commands:

make
./test/e2e/infra/vagrant/push_antrea.sh --prometheus
kubectl exec -it <antrea_controller_pod> -n kube-system -- bash
TOKEN=$(cat /var/run/antrea/apiserver/loopback-client-token)
curl --insecure --header "Authorization: Bearer $TOKEN" https://127.0.0.1:10349/metrics

Test results:
# HELP antrea_controller_applied_to_group_sync_duration_milliseconds [ALPHA] The duration of syncing applied-to-group
# TYPE antrea_controller_applied_to_group_sync_duration_milliseconds histogram
# HELP antrea_controller_address_group_sync_duration_milliseconds [ALPHA] The duration of syncing address-group
# TYPE antrea_controller_address_group_sync_duration_milliseconds histogram
# HELP antrea_controller_network_policy_sync_duration_milliseconds [ALPHA] The duration of syncing internal-networkpolicy
# TYPE antrea_controller_network_policy_sync_duration_milliseconds histogram

antrea-bot · 2020-09-02T18:30:18Z

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

/test-e2e: to trigger e2e tests.
/skip-e2e: to skip e2e tests.
/test-conformance: to trigger conformance tests.
/skip-conformance: to skip conformance tests.
/test-whole-conformance: to trigger all conformance tests on linux.
/skip-whole-conformance: to skip all conformance tests on linux.
/test-networkpolicy: to trigger networkpolicy tests.
/skip-networkpolicy: to skip networkpolicy tests.
/test-windows-conformance: to trigger windows conformance tests.
/skip-windows-conformance: to skip windows conformance tests.
/test-windows-networkpolicy: to trigger windows networkpolicy tests.
/skip-windows-networkpolicy: to skip windows networkpolicy tests.
/test-hw-offload: to trigger ovs hardware offload test.
/skip-hw-offload: to skip ovs hardware offload test.
/test-all: to trigger all tests (except whole conformance).
/skip-all: to skip all tests (except whole conformance).

codecov-commenter · 2020-09-02T18:34:09Z

Codecov Report

Merging #1202 into master will increase coverage by 0.19%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1202      +/-   ##
==========================================
+ Coverage   56.06%   56.26%   +0.19%     
==========================================
  Files         108      108              
  Lines       11887    12027     +140     
==========================================
+ Hits         6665     6767     +102     
- Misses       4634     4671      +37     
- Partials      588      589       +1

Flag	Coverage Δ
#integration-tests	`46.12% <ø> (-0.21%)`	⬇️
#unit-tests	`42.51% <ø> (+0.27%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
pkg/agent/types/networkpolicy.go	`16.66% <0.00%> (-8.34%)`	⬇️
pkg/agent/openflow/network_policy.go	`73.61% <0.00%> (-0.84%)`	⬇️
...ntroller/networkpolicy/networkpolicy_controller.go	`72.92% <0.00%> (-0.13%)`	⬇️
pkg/agent/openflow/client.go	`35.60% <0.00%> (ø)`
pkg/agent/openflow/pipeline.go	`57.11% <0.00%> (+0.31%)`	⬆️
pkg/ovs/openflow/ofctrl_bridge.go	`72.24% <0.00%> (+0.66%)`	⬆️
pkg/ovs/openflow/ofctrl_action.go	`83.79% <0.00%> (+1.58%)`	⬆️
pkg/ovs/openflow/ofctrl_builder.go	`74.91% <0.00%> (+2.60%)`	⬆️

srikartati

Please use "Fixes: #" in the PR description, so that the issue could be linked and closed automatically when PR is merged.
LGTM otherwise.

This reminds me that we could enhance testing for Prometheus metrics at the controller like agent (#799 ).. possibly using network policy controller unit tests (networkpolicy_controller_test.go). I will update the issue.

antoninbas

LGTM but I think the PR description should mention the effect (if any) of changing the metrics type on the consumers of these metrics. If I have metrics enabled and I upgrade my version of Antrea, will there be any issue in the Prometheus server, on in a dashboard like Grafana?

ksamoray · 2020-09-09T08:48:20Z

LGTM but I think the PR description should mention the effect (if any) of changing the metrics type on the consumers of these metrics. If I have metrics enabled and I upgrade my version of Antrea, will there be any issue in the Prometheus server, on in a dashboard like Grafana?

For that purpose, K8S metrics wrapper has deprecation capabilities. We could mark the summary metric as deprecated and add the histogram.
At a given version, dispose the deprecated metrics.
Not sure if we should go through this at this stage before we have a solid install base. Maybe just document this change in the RN.

srikartati · 2020-09-09T19:54:55Z

For that purpose, K8S metrics wrapper has deprecation capabilities. We could mark the summary metric as deprecated and add the histogram.
At a given version, dispose the deprecated metrics.
Not sure if we should go through this at this stage before we have a solid install base. Maybe just document this change in the RN.

@ksamoray Yes, you are right. Ideally, we should go through the process of K8s deprecation capabilities. As mentioned by @antoninbas, some consumers may get affected, if they have predefined Grafana dashboards or query scripts. They may need to update their scripts.
Therefore, we posted this change in slack/google groups to notify if anyone is consuming the metric as a Summary metric and they are affected by it. Since we did not get any response, we thought we could go ahead with this change.

What is RN?

ksamoray · 2020-09-10T05:40:12Z

What is RN?

Release notes

srikartati · 2020-09-10T07:19:39Z

/test-all

tnqn · 2020-09-10T14:37:05Z

/skip-hw-offload as it's not related.

tnqn · 2020-09-10T14:44:51Z

@dreamtalen the commit message is not well wrapped, could you follow https://github.com/golang/go/wiki/CommitMessage to revise?

the text should be wrapped to ~76 characters (to appease git viewing tools, mainly), unless you really need longer lines (e.g. for ASCII art, tables, or long links)

dreamtalen · 2020-09-10T19:12:09Z

@dreamtalen the commit message is not well wrapped, could you follow https://github.com/golang/go/wiki/CommitMessage to revise?

the text should be wrapped to ~76 characters (to appease git viewing tools, mainly), unless you really need longer lines (e.g. for ASCII art, tables, or long links)

Thanks for pointing out. I will revise it.

tnqn · 2020-09-11T05:41:14Z

Sorry for being nit-picking, the commit message seems not wrapped neatly, and a newline is usually inserted between paragraphs.
This is the auto-wrapped message in my editor (I use 72 chars per line):

The summary types are tagged for deprecation, Kubernetes recommended to
use histograms instead of summaries. The main advantages of histogram
types are aggregation and inexpensive.

In this commit, we changed three Antrea controller metrics from summary
to histogram type. They are DurationAppliedToGroupSyncing,
DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing.

Fixes #905

The summary types are tagged for deprecation, Kubernetes recommended to use histograms instead of summaries. The main advantages of histogram types are aggregation and inexpensive. In this commit, we changed three Antrea controller metrics from summary to histogram type. They are DurationAppliedToGroupSyncing, DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing. Fixes antrea-io#905

dreamtalen · 2020-09-11T19:48:27Z

Sorry for being nit-picking, the commit message seems not wrapped neatly, and a newline is usually inserted between paragraphs.
This is the auto-wrapped message in my editor (I use 72 chars per line):
The summary types are tagged for deprecation, Kubernetes recommended to
use histograms instead of summaries. The main advantages of histogram
types are aggregation and inexpensive.

In this commit, we changed three Antrea controller metrics from summary
to histogram type. They are DurationAppliedToGroupSyncing,
DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing.

Fixes #905

Never mind. Thanks for your advice.

srikartati · 2020-09-11T19:51:03Z

/test-all

srikartati · 2020-09-11T20:32:09Z

/test-windows-conformance

tnqn

thanks, LGTM

srikartati · 2020-09-14T16:54:52Z

/skip-hw-offload as this is not related test. Merging the PR.

) The summary types are tagged for deprecation, Kubernetes recommended to use histograms instead of summaries. The main advantages of histogram types are aggregation and inexpensive. In this commit, we changed three Antrea controller metrics from summary to histogram type. They are DurationAppliedToGroupSyncing, DurationAddressGroupSyncing, and DurationInternalNetworkPolicySyncing. Fixes antrea-io#905 Co-authored-by: Yongming Ding <dyongming@vmware.com>

vmwclabot added the cla-not-required label Sep 2, 2020

antoninbas requested review from ksamoray, yktsubo and srikartati September 2, 2020 21:07

srikartati previously approved these changes Sep 2, 2020

View reviewed changes

antoninbas previously approved these changes Sep 3, 2020

View reviewed changes

antoninbas mentioned this pull request Sep 10, 2020

Deprecate prometheus run time info metrics #1217

Merged

ksamoray previously approved these changes Sep 10, 2020

View reviewed changes

dreamtalen dismissed stale reviews from ksamoray, antoninbas, and srikartati via 237c921 September 10, 2020 19:20

dreamtalen force-pushed the local_branch branch 3 times, most recently from 00e3cd0 to fe66434 Compare September 11, 2020 01:34

dreamtalen force-pushed the local_branch branch from fe66434 to 9f64d5c Compare September 11, 2020 19:47

tnqn approved these changes Sep 14, 2020

View reviewed changes

srikartati merged commit 7d06431 into antrea-io:master Sep 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change prometheus metrics type from summary to histogram #1202

Change prometheus metrics type from summary to histogram #1202

dreamtalen commented Sep 2, 2020 •

edited

Loading

antrea-bot commented Sep 2, 2020

codecov-commenter commented Sep 2, 2020 •

edited

Loading

srikartati left a comment

antoninbas left a comment

ksamoray commented Sep 9, 2020

srikartati commented Sep 9, 2020

ksamoray commented Sep 10, 2020

srikartati commented Sep 10, 2020

tnqn commented Sep 10, 2020

tnqn commented Sep 10, 2020

dreamtalen commented Sep 10, 2020

tnqn commented Sep 11, 2020

dreamtalen commented Sep 11, 2020

srikartati commented Sep 11, 2020

srikartati commented Sep 11, 2020

tnqn left a comment

srikartati commented Sep 14, 2020 •

edited

Loading

Change prometheus metrics type from summary to histogram #1202

Change prometheus metrics type from summary to histogram #1202

Conversation

dreamtalen commented Sep 2, 2020 • edited Loading

antrea-bot commented Sep 2, 2020

codecov-commenter commented Sep 2, 2020 • edited Loading

Codecov Report

srikartati left a comment

Choose a reason for hiding this comment

antoninbas left a comment

Choose a reason for hiding this comment

ksamoray commented Sep 9, 2020

srikartati commented Sep 9, 2020

ksamoray commented Sep 10, 2020

srikartati commented Sep 10, 2020

tnqn commented Sep 10, 2020

tnqn commented Sep 10, 2020

dreamtalen commented Sep 10, 2020

tnqn commented Sep 11, 2020

dreamtalen commented Sep 11, 2020

srikartati commented Sep 11, 2020

srikartati commented Sep 11, 2020

tnqn left a comment

Choose a reason for hiding this comment

srikartati commented Sep 14, 2020 • edited Loading

dreamtalen commented Sep 2, 2020 •

edited

Loading

codecov-commenter commented Sep 2, 2020 •

edited

Loading

srikartati commented Sep 14, 2020 •

edited

Loading