Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new config and metric for reporting ruler query execution wall time. #4317

Merged
merged 16 commits into from
Jul 9, 2021

Conversation

treid314
Copy link
Contributor

@treid314 treid314 commented Jun 24, 2021

Signed-off-by: Tyler Reid tyler.reid@grafana.com

What this PR does:
This PR adds a new metric for tracking the execution wall time of queries in the ruler. We have a similar metric for query execution time on query front-ends today but we also wish to track query execution time in the rulers. This metric is called cortex_ruler_query_seconds_total and has a label of user containing the UserID of the tenant who is tagged in the rule. This metric can be enabled with the -ruler.query-stats-enabled boolean config option.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
… than after the if block. Add a unit test to validate we're tracking time correctly.

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
pkg/ruler/compat.go Outdated Show resolved Hide resolved
queryTime := prometheus.NewCounterVec(prometheus.CounterOpts{}, []string{"user"})

mockFunc := func(ctx context.Context, q string, t time.Time) (promql.Vector, error) {
time.Sleep(1 * time.Millisecond)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen we use sleep in other unit tests, but it still feels kind of wrong. Is there a better way to simulate time passing?

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
@treid314 treid314 marked this pull request as ready for review June 24, 2021 21:58
@bboreham
Copy link
Contributor

Sorry but we have begun the process of cutting a new release; please rebase from master and move your CHANGELOG entry to the top under ## master / unreleased

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I left few comments.

CHANGELOG.md Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
pkg/ruler/compat.go Outdated Show resolved Hide resolved
pkg/ruler/ruler.go Outdated Show resolved Hide resolved
Comment on lines 153 to 154
startTime = time.Now()
defer func() { queryTime.WithLabelValues(userID).Add(time.Since(startTime).Seconds()) }()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use prometheus.NewTimer() for this purpose, instead of tracking manually.

Tyler Reid added 2 commits June 29, 2021 08:54
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
pkg/ruler/compat.go Outdated Show resolved Hide resolved
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Copy link
Contributor

@cstyan cstyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@treid314 re: your comment about faking time pasing, we have this package: https://github.com/weaveworks/common/blob/master/mtime/mtime.go

I trust you've all agreed there's a reason to add this metric, but just off the top of my head I'm questioning how useful a metric for the total time spent executing queries in the ruler for a user is. What scenario is this metric here to help diagnose? Would a histogram with only a few buckets not be better? Or logging when the ruler finishes evaluating a rule with a field for how long that took?

@pracucci
Copy link
Contributor

pracucci commented Jul 1, 2021

What scenario is this metric here to help diagnose?

It's for billing purposes (reason why it's a counter and not an histogram).

Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (modulo a nit)

pkg/ruler/compat.go Outdated Show resolved Hide resolved
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Copy link
Contributor

@pstibrany pstibrany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job. I would suggest to implement this tracking in separate function, to avoid breaking existing clients of MetricsQueryFunc ouside of Cortex.

pkg/ruler/compat.go Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
…uler query time.

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Comment on lines 166 to 187
func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc {
return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) {
// If we've been passed a counter we want to record the wall time spent executing this request.
if queryTime != nil {
timer := prometheus.NewTimer(nil)
defer func() {
querySeconds := timer.ObserveDuration().Seconds()
queryTime.Add(querySeconds)

// Log ruler query stats.
logMessage := append([]interface{}{
"msg", "ruler query stats",
"cortex_ruler_query_seconds_total", querySeconds,
}, qs)
level.Info(util_log.WithContext(ctx, logger)).Log(logMessage...)
}()
}

result, err := qf(ctx, qs, t)
return result, err
}
}
Copy link
Contributor

@pstibrany pstibrany Jul 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify this function:

  1. we don't need to return new function, if queryTime is nil. No need to check it each time.
  2. not sure what's going on with logging here, but we can just call Log, right?
Suggested change
func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc {
return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) {
// If we've been passed a counter we want to record the wall time spent executing this request.
if queryTime != nil {
timer := prometheus.NewTimer(nil)
defer func() {
querySeconds := timer.ObserveDuration().Seconds()
queryTime.Add(querySeconds)
// Log ruler query stats.
logMessage := append([]interface{}{
"msg", "ruler query stats",
"cortex_ruler_query_seconds_total", querySeconds,
}, qs)
level.Info(util_log.WithContext(ctx, logger)).Log(logMessage...)
}()
}
result, err := qf(ctx, qs, t)
return result, err
}
}
func RecordAndReportRuleQueryMetrics(qf rules.QueryFunc, queryTime prometheus.Counter, logger log.Logger) rules.QueryFunc {
if queryTime == nil {
return qf
}
return func(ctx context.Context, qs string, t time.Time) (promql.Vector, error) {
timer := prometheus.NewTimer(nil)
defer func() {
querySeconds := timer.ObserveDuration().Seconds()
queryTime.Add(querySeconds)
level.Info(util_log.WithContext(ctx, logger)).Log("msg", "ruler query stats", "cortex_ruler_query_seconds_total", querySeconds, "query", qs)
}()
result, err := qf(ctx, qs, t)
return result, err
}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what's going on with logging here, but we can just call Log, right?

I'm trying to follow what we do for query front-end logging in case a user would also want to record this log message in the ruler and use it for billing instead of the added metric.

pkg/ruler/compat_test.go Outdated Show resolved Hide resolved
pkg/ruler/compat_test.go Outdated Show resolved Hide resolved
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
pkg/ruler/compat.go Show resolved Hide resolved
pkg/ruler/compat.go Outdated Show resolved Hide resolved
pkg/ruler/compat.go Outdated Show resolved Hide resolved
pkg/ruler/compat.go Outdated Show resolved Hide resolved
pkg/ruler/compat_test.go Outdated Show resolved Hide resolved
Tyler Reid added 2 commits July 7, 2021 08:37
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Copy link
Contributor

@tomwilkie tomwilkie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Not going to merge until @pstibrany has had a chance to see the changes.

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
@pracucci pracucci merged commit 4a9adff into cortexproject:master Jul 9, 2021
alvinlin123 pushed a commit to ac1214/cortex that referenced this pull request Jan 14, 2022
…time. (cortexproject#4317)

* Add a new config and metric for reporting ruler query execution wall time.

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Spacing and PR number fixup

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Wrap the defer in a function to make it defer after the return rather than after the if block. Add a unit test to validate we're tracking time correctly.

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Use seconds for our duration rather than nanoseconds

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Review comment fixes

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Update config flag in the config docs

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Pass counter rather than counter vector for metrics query function

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Fix comment in MetricsQueryFunction

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Move query metric and log to separate function. Add log message for ruler query time.

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Update config file and change log to show this a per user metric

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* code review fixes

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* update log message for ruler query metrics

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Remove append and just use the array for key values in the log messag

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>

* Add query-frontend component to front end log message

Signed-off-by: Tyler Reid <tyler.reid@grafana.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants