-
Notifications
You must be signed in to change notification settings - Fork 780
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Common stats reporting for validation and mutation #976
Common stats reporting for validation and mutation #976
Conversation
Codecov Report
@@ Coverage Diff @@
## master #976 +/- ##
==========================================
+ Coverage 46.70% 47.19% +0.49%
==========================================
Files 62 62
Lines 3987 3994 +7
==========================================
+ Hits 1862 1885 +23
+ Misses 1879 1869 -10
+ Partials 246 240 -6
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added some thoughts. Adding @sozercan, as I know he is interested in metrics
pkg/webhook/mutation.go
Outdated
mutationSkipResponse mutationResponse = "skip" | ||
mutationUnknownResponse mutationResponse = "unknown" | ||
mutationSkipResponse requestResponse = "skip" | ||
mutationUnknownResponse requestResponse = "unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we also need error
and success
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
// Captures req count metric, recording the count and the duration | ||
func (r *reporter) ReportMutationRequest(response mutationResponse, d time.Duration) error { | ||
func (r *reporter) ReportMutationRequest(response requestResponse, d time.Duration) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should create a separate metric: mutation_request_count
, the equivalent of request_count
but for mutation.
Another option would be to add a separate tag, called "webhook" that returns either "mutation" or "validation"
The second is probably cleaner, the first is backwards-compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are considering changing at the risk of backward compatibility, I would vote for validating_request_count
to replace request_count
and validating_requestDurationMetricName
to replace requestDurationMetricName
. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would probably be less painful for users if we kept request_count
and all and added special mutation_
metric names if we go the two separate metrics route.
This is because users may have developed alerts and aggregation rules for prometheus that depend on the old metric names.
If we want to migrate, maybe we should have some sort of deprecation pathway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on deprecation pathway if we are going to change names
should we publish all 3 (no prefix, validation and mutation) and then deprecate non prefix (which is same as validation)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on 3 and deprecate request_count
later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added validation_request_count/ validation_request_duration_seconds
The validating webhook will now report both: here
Please confirm this is what we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added #1010 to track this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm late to this discussion but have just been going over the metrics code.
Backwards compatibility aside, I'm not sure separate metrics for separate webhooks is internally consistent. Controller-runtime already exposes a number of metrics for webhooks automatically, and they use labels for distinguishing between them:
# HELP controller_runtime_webhook_requests_in_flight Current number of admission requests being served.
# TYPE controller_runtime_webhook_requests_in_flight gauge
controller_runtime_webhook_requests_in_flight{webhook="/v1/admit"} 0
controller_runtime_webhook_requests_in_flight{webhook="/v1/admitlabel"} 0
Additionally, with a single metric it is easier to perform aggregation across all webhooks when desired.
Do we have a sense of how we want to consume these metrics and what kind of questions they will help answer? That might help understand the aggregation use case better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per my original comment, I agree the extra label/tag solution is cleaner. I also don't see a downside internally as I don't think there is much difference between a new metric and a new tag. New tag may actually have slightly less overhead.
My biggest concern would be people who may have built up Prometheus alerts and/or aggregation rules around the metrics as they currently exist. Does adding data from mutation affect how they've tuned their alerts? What are the backwards compatibility expectations for metrics?
c61f415
to
f983d24
Compare
pkg/webhook/stats_reporter.go
Outdated
return r.reportRequest(response, mutationStatusKey, mutationResponseTimeInSecM.M(d.Seconds())) | ||
} | ||
|
||
// TODO: should this be exported, and ReportAdmissionRequest and ReportMutationRequest removed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on exporting this function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should have left this as two different functions. My reasoning from a different comment:
We should collapse these into two functions:
ReportValidationRequest
and
ReportMutationRequest
That way we can do the fanout into two metrics internally in the metrics interface. That will make a future removal of the deprecated metric easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved back to having different functions for each.
Note that we have 3 functions now (once deprected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can ReportValidationRequest()
increment both the new and the deprecated metric? That would avoid code duplication.
Then once we deprecate the old metric we only need to remove the one line.
c3c17fe
to
1c73bf6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the outccome looks basically right.
My biggest comment would be we should change the interface of the stats reporter to hide which metrics actually get reported from the webhook code.
pkg/webhook/mutation.go
Outdated
mutationErrorResponse mutationResponse = "error" | ||
mutationErrorResponse requestResponse = "error" | ||
mutationSuccessResponse requestResponse = "success" | ||
mutationSkipResponse requestResponse = "skip" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we using the "skip" response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, here
We use it when skipping because of namespace exclusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
1c73bf6
to
1eff92b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1eff92b
to
ee7fbe2
Compare
pkg/webhook/stats_reporter.go
Outdated
tag.Insert(admissionStatusKey, string(response)), | ||
) | ||
if err != nil { | ||
func (r *reporter) ReportAdmissionRequest(response requestResponse, d time.Duration) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this looks like it can go away now that it's unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
ee7fbe2
to
35a8fb2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@sozercan Still LGTY after the changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Signed-off-by: mmirecki <mmirecki@redhat.com>
35a8fb2
to
bba2dba
Compare
Rebased on master |
The PR is opened for discussion on this comment.
It extracts some common code for validation and mutation stats reporting.
Depends on #881