Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Common stats reporting for validation and mutation #976

Merged
merged 1 commit into from
Dec 10, 2020

Conversation

mmirecki
Copy link
Contributor

The PR is opened for discussion on this comment.

It extracts some common code for validation and mutation stats reporting.

Depends on #881

@codecov-io
Copy link

codecov-io commented Nov 20, 2020

Codecov Report

Merging #976 (bba2dba) into master (d6c5389) will increase coverage by 0.49%.
The diff coverage is 70.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #976      +/-   ##
==========================================
+ Coverage   46.70%   47.19%   +0.49%     
==========================================
  Files          62       62              
  Lines        3987     3994       +7     
==========================================
+ Hits         1862     1885      +23     
+ Misses       1879     1869      -10     
+ Partials      246      240       -6     
Flag Coverage Δ
unittests 47.19% <70.83%> (+0.49%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/webhook/common.go 48.27% <ø> (ø)
pkg/webhook/mutation.go 17.52% <0.00%> (+0.17%) ⬆️
pkg/webhook/policy.go 24.32% <0.00%> (+0.13%) ⬆️
pkg/webhook/stats_reporter.go 86.20% <94.44%> (+2.53%) ⬆️
pkg/readiness/object_tracker.go 79.14% <0.00%> (+1.22%) ⬆️
...onstrainttemplate/constrainttemplate_controller.go 56.49% <0.00%> (+3.89%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d6c5389...bba2dba. Read the comment docs.

Copy link
Contributor

@maxsmythe maxsmythe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some thoughts. Adding @sozercan, as I know he is interested in metrics

mutationSkipResponse mutationResponse = "skip"
mutationUnknownResponse mutationResponse = "unknown"
mutationSkipResponse requestResponse = "skip"
mutationUnknownResponse requestResponse = "unknown"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we also need error and success?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

// Captures req count metric, recording the count and the duration
func (r *reporter) ReportMutationRequest(response mutationResponse, d time.Duration) error {
func (r *reporter) ReportMutationRequest(response requestResponse, d time.Duration) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should create a separate metric: mutation_request_count, the equivalent of request_count but for mutation.

Another option would be to add a separate tag, called "webhook" that returns either "mutation" or "validation"

The second is probably cleaner, the first is backwards-compatible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are considering changing at the risk of backward compatibility, I would vote for validating_request_count to replace request_count and validating_requestDurationMetricName to replace requestDurationMetricName. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would probably be less painful for users if we kept request_count and all and added special mutation_ metric names if we go the two separate metrics route.

This is because users may have developed alerts and aggregation rules for prometheus that depend on the old metric names.

If we want to migrate, maybe we should have some sort of deprecation pathway?

Copy link
Member

@sozercan sozercan Dec 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on deprecation pathway if we are going to change names

should we publish all 3 (no prefix, validation and mutation) and then deprecate non prefix (which is same as validation)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works for me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on 3 and deprecate request_count later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added validation_request_count/ validation_request_duration_seconds
The validating webhook will now report both: here
Please confirm this is what we want.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added #1010 to track this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm late to this discussion but have just been going over the metrics code.
Backwards compatibility aside, I'm not sure separate metrics for separate webhooks is internally consistent. Controller-runtime already exposes a number of metrics for webhooks automatically, and they use labels for distinguishing between them:

# HELP controller_runtime_webhook_requests_in_flight Current number of admission requests being served.
# TYPE controller_runtime_webhook_requests_in_flight gauge
controller_runtime_webhook_requests_in_flight{webhook="/v1/admit"} 0
controller_runtime_webhook_requests_in_flight{webhook="/v1/admitlabel"} 0

Additionally, with a single metric it is easier to perform aggregation across all webhooks when desired.

Do we have a sense of how we want to consume these metrics and what kind of questions they will help answer? That might help understand the aggregation use case better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per my original comment, I agree the extra label/tag solution is cleaner. I also don't see a downside internally as I don't think there is much difference between a new metric and a new tag. New tag may actually have slightly less overhead.

My biggest concern would be people who may have built up Prometheus alerts and/or aggregation rules around the metrics as they currently exist. Does adding data from mutation affect how they've tuned their alerts? What are the backwards compatibility expectations for metrics?

@maxsmythe maxsmythe requested a review from sozercan November 24, 2020 02:46
@mmirecki mmirecki force-pushed the commonStatsReporing branch from c61f415 to f983d24 Compare November 24, 2020 14:58
@mmirecki mmirecki changed the title Common stats reporing for validation and mutation Common stats reporting for validation and mutation Nov 25, 2020
return r.reportRequest(response, mutationStatusKey, mutationResponseTimeInSecM.M(d.Seconds()))
}

// TODO: should this be exported, and ReportAdmissionRequest and ReportMutationRequest removed?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on exporting this function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should have left this as two different functions. My reasoning from a different comment:

We should collapse these into two functions:

ReportValidationRequest

and

ReportMutationRequest

That way we can do the fanout into two metrics internally in the metrics interface. That will make a future removal of the deprecated metric easier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved back to having different functions for each.
Note that we have 3 functions now (once deprected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can ReportValidationRequest() increment both the new and the deprecated metric? That would avoid code duplication.

Then once we deprecate the old metric we only need to remove the one line.

@sozercan sozercan mentioned this pull request Dec 1, 2020
@mmirecki mmirecki force-pushed the commonStatsReporing branch 5 times, most recently from c3c17fe to 1c73bf6 Compare December 4, 2020 15:17
Copy link
Contributor

@maxsmythe maxsmythe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the outccome looks basically right.

My biggest comment would be we should change the interface of the stats reporter to hide which metrics actually get reported from the webhook code.

mutationErrorResponse mutationResponse = "error"
mutationErrorResponse requestResponse = "error"
mutationSuccessResponse requestResponse = "success"
mutationSkipResponse requestResponse = "skip"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we using the "skip" response?

Copy link
Contributor Author

@mmirecki mmirecki Dec 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, here
We use it when skipping because of namespace exclusion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@mmirecki mmirecki force-pushed the commonStatsReporing branch from 1c73bf6 to 1eff92b Compare December 7, 2020 14:31
Copy link
Member

@sozercan sozercan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mmirecki mmirecki force-pushed the commonStatsReporing branch from 1eff92b to ee7fbe2 Compare December 8, 2020 14:18
tag.Insert(admissionStatusKey, string(response)),
)
if err != nil {
func (r *reporter) ReportAdmissionRequest(response requestResponse, d time.Duration) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this looks like it can go away now that it's unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mmirecki mmirecki force-pushed the commonStatsReporing branch from ee7fbe2 to 35a8fb2 Compare December 8, 2020 19:46
Copy link
Contributor

@maxsmythe maxsmythe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@maxsmythe
Copy link
Contributor

@sozercan Still LGTY after the changes?

Copy link
Member

@ritazh ritazh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: mmirecki <mmirecki@redhat.com>
@mmirecki mmirecki force-pushed the commonStatsReporing branch from 35a8fb2 to bba2dba Compare December 9, 2020 09:09
@mmirecki
Copy link
Contributor Author

mmirecki commented Dec 9, 2020

Rebased on master

@maxsmythe maxsmythe merged commit 321dd44 into open-policy-agent:master Dec 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants