-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a new config and metric for reporting ruler query execution wall time. #4317
Conversation
…time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
… than after the if block. Add a unit test to validate we're tracking time correctly. Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
pkg/ruler/compat_test.go
Outdated
queryTime := prometheus.NewCounterVec(prometheus.CounterOpts{}, []string{"user"}) | ||
|
||
mockFunc := func(ctx context.Context, q string, t time.Time) (promql.Vector, error) { | ||
time.Sleep(1 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen we use sleep in other unit tests, but it still feels kind of wrong. Is there a better way to simulate time passing?
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Sorry but we have begun the process of cutting a new release; please rebase from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! I left few comments.
pkg/ruler/compat.go
Outdated
startTime = time.Now() | ||
defer func() { queryTime.WithLabelValues(userID).Add(time.Since(startTime).Seconds()) }() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use prometheus.NewTimer()
for this purpose, instead of tracking manually.
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@treid314 re: your comment about faking time pasing, we have this package: https://github.com/weaveworks/common/blob/master/mtime/mtime.go
I trust you've all agreed there's a reason to add this metric, but just off the top of my head I'm questioning how useful a metric for the total time spent executing queries in the ruler for a user is. What scenario is this metric here to help diagnose? Would a histogram with only a few buckets not be better? Or logging when the ruler finishes evaluating a rule with a field for how long that took?
It's for billing purposes (reason why it's a counter and not an histogram). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (modulo a nit)
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job. I would suggest to implement this tracking in separate function, to avoid breaking existing clients of MetricsQueryFunc
ouside of Cortex.
…uler query time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
pkg/ruler/compat.go
Outdated
func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc { | ||
return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) { | ||
// If we've been passed a counter we want to record the wall time spent executing this request. | ||
if queryTime != nil { | ||
timer := prometheus.NewTimer(nil) | ||
defer func() { | ||
querySeconds := timer.ObserveDuration().Seconds() | ||
queryTime.Add(querySeconds) | ||
|
||
// Log ruler query stats. | ||
logMessage := append([]interface{}{ | ||
"msg", "ruler query stats", | ||
"cortex_ruler_query_seconds_total", querySeconds, | ||
}, qs) | ||
level.Info(util_log.WithContext(ctx, logger)).Log(logMessage...) | ||
}() | ||
} | ||
|
||
result, err := qf(ctx, qs, t) | ||
return result, err | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can simplify this function:
- we don't need to return new function, if
queryTime
is nil. No need to check it each time. - not sure what's going on with logging here, but we can just call
Log
, right?
func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc { | |
return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) { | |
// If we've been passed a counter we want to record the wall time spent executing this request. | |
if queryTime != nil { | |
timer := prometheus.NewTimer(nil) | |
defer func() { | |
querySeconds := timer.ObserveDuration().Seconds() | |
queryTime.Add(querySeconds) | |
// Log ruler query stats. | |
logMessage := append([]interface{}{ | |
"msg", "ruler query stats", | |
"cortex_ruler_query_seconds_total", querySeconds, | |
}, qs) | |
level.Info(util_log.WithContext(ctx, logger)).Log(logMessage...) | |
}() | |
} | |
result, err := qf(ctx, qs, t) | |
return result, err | |
} | |
} | |
func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc { | |
if queryTime == nil { | |
return qf | |
} | |
return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) { | |
timer := prometheus.NewTimer(nil) | |
defer func() { | |
querySeconds := timer.ObserveDuration().Seconds() | |
queryTime.Add(querySeconds) | |
level.Info(util_log.WithContext(ctx, logger)).Log("msg", "ruler query stats", "cortex_ruler_query_seconds_total", querySeconds, "query", qs) | |
}() | |
result, err := qf(ctx, qs, t) | |
return result, err | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what's going on with logging here, but we can just call Log, right?
I'm trying to follow what we do for query front-end logging in case a user would also want to record this log message in the ruler and use it for billing instead of the added metric.
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Not going to merge until @pstibrany has had a chance to see the changes.
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
…time. (cortexproject#4317) * Add a new config and metric for reporting ruler query execution wall time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Spacing and PR number fixup Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Wrap the defer in a function to make it defer after the return rather than after the if block. Add a unit test to validate we're tracking time correctly. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Use seconds for our duration rather than nanoseconds Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Review comment fixes Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Update config flag in the config docs Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Pass counter rather than counter vector for metrics query function Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Fix comment in MetricsQueryFunction Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Move query metric and log to separate function. Add log message for ruler query time. Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Update config file and change log to show this a per user metric Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * code review fixes Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * update log message for ruler query metrics Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Remove append and just use the array for key values in the log messag Signed-off-by: Tyler Reid <tyler.reid@grafana.com> * Add query-frontend component to front end log message Signed-off-by: Tyler Reid <tyler.reid@grafana.com> Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Signed-off-by: Tyler Reid tyler.reid@grafana.com
What this PR does:
This PR adds a new metric for tracking the execution wall time of queries in the ruler. We have a similar metric for query execution time on query front-ends today but we also wish to track query execution time in the rulers. This metric is called
cortex_ruler_query_seconds_total
and has a label ofuser
containing the UserID of the tenant who is tagged in the rule. This metric can be enabled with the-ruler.query-stats-enabled
boolean config option.Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]