Skip to content

Commit

Permalink
Query: add query metrics to calls going through the Store API (#5741)
Browse files Browse the repository at this point in the history
* Implement granular query performance metrics for Thanos Query

These are grabbed from the data returned by multiple Store APIs after execution of a query.

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix some linter warnings

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Remove useless logs

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Refactor query tests

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix long function definition (newQuerier)

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Remove TODO comment

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix query tests

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Reformat query docs

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Remove useless return

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Put back old query docs

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update query docs again

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix e2e env name

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Add missing copyright notice.

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Bump wait time to twice scrape interval

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Attempt to fix randomly failing test

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Checking more metrics to ensure the store is ready

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Clean up test

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Do not record store api metrics when didn't touch series or samples

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Retrigger CI

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Also skip store api metrics on zero chunks touched

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update changelog

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Fix broken changelog after merge

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Remove extra empty line

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Refactor names and (un)exported types and fields

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Start listing metrics exported by Thanos Query

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Rename pkg/store/metrics -> pkg/store/telemetry

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Get rid of the pkg/store/telemetry package

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: Matej Gera <38492574+matej-g@users.noreply.github.com>
Co-authored-by: Matej Gera <38492574+matej-g@users.noreply.github.com>
  • Loading branch information
douglascamata and matej-g authored Oct 18, 2022
1 parent a3e246c commit be95bb1
Show file tree
Hide file tree
Showing 12 changed files with 478 additions and 51 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
- [#5734](https://github.com/thanos-io/thanos/pull/5734) Store: Support disable block viewer UI.
- [#5411](https://github.com/thanos-io/thanos/pull/5411) Tracing: Add OpenTelemetry Protocol exporter.
- [#5779](https://github.com/thanos-io/thanos/pull/5779) Objstore: Support specifying S3 storage class.
- [#5741](https://github.com/thanos-io/thanos/pull/5741) Query: add metrics on how much data is being selected by downstream Store APIs.
- [#5673](https://github.com/thanos-io/thanos/pull/5673) Receive: Reload tenant limit configuration on file change.

### Changed
Expand Down
19 changes: 18 additions & 1 deletion cmd/thanos/query.go
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,8 @@ import (
"github.com/prometheus/prometheus/discovery/targetgroup"
"github.com/prometheus/prometheus/model/labels"
"github.com/prometheus/prometheus/promql"
"google.golang.org/grpc"

v1 "github.com/prometheus/prometheus/web/api/v1"
"github.com/thanos-community/promql-engine/engine"
apiv1 "github.com/thanos-io/thanos/pkg/api/query"
Expand Down Expand Up @@ -54,7 +56,6 @@ import (
"github.com/thanos-io/thanos/pkg/targets"
"github.com/thanos-io/thanos/pkg/tls"
"github.com/thanos-io/thanos/pkg/ui"
"google.golang.org/grpc"
)

const (
Expand Down Expand Up @@ -194,6 +195,10 @@ func registerQuery(app *extkingpin.App) {
alertQueryURL := cmd.Flag("alert.query-url", "The external Thanos Query URL that would be set in all alerts 'Source' field.").String()
grpcProxyStrategy := cmd.Flag("grpc.proxy-strategy", "Strategy to use when proxying Series requests to leaf nodes. Hidden and only used for testing, will be removed after lazy becomes the default.").Default(string(store.EagerRetrieval)).Hidden().Enum(string(store.EagerRetrieval), string(store.LazyRetrieval))

queryTelemetryDurationQuantiles := cmd.Flag("query.telemetry.request-duration-seconds-quantiles", "The quantiles for exporting metrics about the request duration quantiles.").Default("0.1", "0.25", "0.75", "1.25", "1.75", "2.5", "3", "5", "10").Float64List()
queryTelemetrySamplesQuantiles := cmd.Flag("query.telemetry.request-samples-quantiles", "The quantiles for exporting metrics about the samples count quantiles.").Default("100", "1000", "10000", "100000", "1000000").Int64List()
queryTelemetrySeriesQuantiles := cmd.Flag("query.telemetry.request-series-seconds-quantiles", "The quantiles for exporting metrics about the series count quantiles.").Default("10", "100", "1000", "10000", "100000").Int64List()

cmd.Setup(func(g *run.Group, logger log.Logger, reg *prometheus.Registry, tracer opentracing.Tracer, _ <-chan struct{}, _ bool) error {
selectorLset, err := parseFlagLabels(*selectorLabels)
if err != nil {
Expand Down Expand Up @@ -305,6 +310,9 @@ func registerQuery(app *extkingpin.App) {
*alertQueryURL,
*grpcProxyStrategy,
component.Query,
*queryTelemetryDurationQuantiles,
*queryTelemetrySamplesQuantiles,
*queryTelemetrySeriesQuantiles,
promqlEngineType(*promqlEngine),
)
})
Expand Down Expand Up @@ -377,6 +385,9 @@ func runQuery(
alertQueryURL string,
grpcProxyStrategy string,
comp component.Component,
queryTelemetryDurationQuantiles []float64,
queryTelemetrySamplesQuantiles []int64,
queryTelemetrySeriesQuantiles []int64,
promqlEngine promqlEngineType,
) error {
if alertQueryURL == "" {
Expand Down Expand Up @@ -680,6 +691,12 @@ func runQuery(
extprom.WrapRegistererWithPrefix("thanos_query_concurrent_", reg),
maxConcurrentQueries,
),
store.NewSeriesStatsAggregator(
reg,
queryTelemetryDurationQuantiles,
queryTelemetrySamplesQuantiles,
queryTelemetrySeriesQuantiles,
),
reg,
)

Expand Down
19 changes: 19 additions & 0 deletions docs/components/query.md
Original file line number Diff line number Diff line change
Expand Up @@ -378,6 +378,15 @@ Flags:
be able to query without deduplication using
'dedup=false' parameter. Data includes time
series, recording rules, and alerting rules.
--query.telemetry.request-duration-seconds-quantiles=0.1... ...
The quantiles for exporting metrics about the
request duration quantiles.
--query.telemetry.request-samples-quantiles=100... ...
The quantiles for exporting metrics about the
samples count quantiles.
--query.telemetry.request-series-seconds-quantiles=10... ...
The quantiles for exporting metrics about the
series count quantiles.
--query.timeout=2m Maximum time to process query by query node.
--request.logging-config=<content>
Alternative to 'request.logging-config-file'
Expand Down Expand Up @@ -460,3 +469,13 @@ Flags:
of Prometheus.
```

## Exported metrics

Thanos Query also exports metrics about its own performance. You can find a list with these metrics below.

**Disclaimer**: this list is incomplete. The remaining metrics will be added over time.

| Name | Type | Labels | Description |
|-----------------------------------------|-----------|-----------------------|-------------------------------------------------------------------------------------------------------------------|
| thanos_store_api_query_duration_seconds | Histogram | samples_le, series_le | Duration of the Thanos Store API select phase for a query according to the amount of samples and series selected. |
2 changes: 2 additions & 0 deletions pkg/api/query/grpc.go
Original file line number Diff line number Diff line change
Expand Up @@ -94,6 +94,7 @@ func (g *GRPCAPI) Query(request *querypb.QueryRequest, server querypb.Query_Quer
request.EnableQueryPushdown,
false,
request.ShardInfo,
query.NoopSeriesStatsReporter,
)
qry, err := g.queryEngine.NewInstantQuery(queryable, &promql.QueryOpts{LookbackDelta: lookbackDelta}, request.Query, ts)
if err != nil {
Expand Down Expand Up @@ -168,6 +169,7 @@ func (g *GRPCAPI) QueryRange(request *querypb.QueryRangeRequest, srv querypb.Que
request.EnableQueryPushdown,
false,
request.ShardInfo,
query.NoopSeriesStatsReporter,
)

startTime := time.Unix(request.StartTimeSeconds, 0)
Expand Down
99 changes: 88 additions & 11 deletions pkg/api/query/v1.go
Original file line number Diff line number Diff line change
Expand Up @@ -41,10 +41,8 @@ import (
"github.com/prometheus/prometheus/promql"
"github.com/prometheus/prometheus/promql/parser"
"github.com/prometheus/prometheus/storage"
v1 "github.com/prometheus/prometheus/web/api/v1"

"github.com/prometheus/prometheus/util/stats"

v1 "github.com/prometheus/prometheus/web/api/v1"
"github.com/thanos-io/thanos/pkg/api"
"github.com/thanos-io/thanos/pkg/exemplars"
"github.com/thanos-io/thanos/pkg/exemplars/exemplarspb"
Expand All @@ -57,6 +55,7 @@ import (
"github.com/thanos-io/thanos/pkg/rules"
"github.com/thanos-io/thanos/pkg/rules/rulespb"
"github.com/thanos-io/thanos/pkg/runutil"
"github.com/thanos-io/thanos/pkg/store"
"github.com/thanos-io/thanos/pkg/store/storepb"
"github.com/thanos-io/thanos/pkg/targets"
"github.com/thanos-io/thanos/pkg/targets/targetspb"
Expand Down Expand Up @@ -107,6 +106,13 @@ type QueryAPI struct {
defaultMetadataTimeRange time.Duration

queryRangeHist prometheus.Histogram

seriesStatsAggregator seriesQueryPerformanceMetricsAggregator
}

type seriesQueryPerformanceMetricsAggregator interface {
Aggregate(seriesStats storepb.SeriesStatsCounter)
Observe(duration float64)
}

// NewQueryAPI returns an initialized QueryAPI type.
Expand Down Expand Up @@ -134,8 +140,12 @@ func NewQueryAPI(
defaultMetadataTimeRange time.Duration,
disableCORS bool,
gate gate.Gate,
statsAggregator seriesQueryPerformanceMetricsAggregator,
reg *prometheus.Registry,
) *QueryAPI {
if statsAggregator == nil {
statsAggregator = &store.NoopSeriesStatsAggregator{}
}
return &QueryAPI{
baseAPI: api.NewBaseAPI(logger, disableCORS, flagsMap),
logger: logger,
Expand All @@ -160,6 +170,7 @@ func NewQueryAPI(
defaultInstantQueryMaxSourceResolution: defaultInstantQueryMaxSourceResolution,
defaultMetadataTimeRange: defaultMetadataTimeRange,
disableCORS: disableCORS,
seriesStatsAggregator: statsAggregator,

queryRangeHist: promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
Name: "thanos_query_range_requested_timespan_duration_seconds",
Expand Down Expand Up @@ -396,7 +407,24 @@ func (qapi *QueryAPI) query(r *http.Request) (interface{}, []error, *api.ApiErro
span, ctx := tracing.StartSpan(ctx, "promql_instant_query")
defer span.Finish()

qry, err := qapi.queryEngine.NewInstantQuery(qapi.queryableCreate(enableDedup, replicaLabels, storeDebugMatchers, maxSourceResolution, enablePartialResponse, qapi.enableQueryPushdown, false, shardInfo), &promql.QueryOpts{LookbackDelta: lookbackDelta}, r.FormValue("query"), ts)
var seriesStats []storepb.SeriesStatsCounter
qry, err := qapi.queryEngine.NewInstantQuery(
qapi.queryableCreate(
enableDedup,
replicaLabels,
storeDebugMatchers,
maxSourceResolution,
enablePartialResponse,
qapi.enableQueryPushdown,
false,
shardInfo,
query.NewAggregateStatsReporter(&seriesStats),
),
&promql.QueryOpts{LookbackDelta: lookbackDelta},
r.FormValue("query"),
ts,
)

if err != nil {
return nil, nil, &api.ApiError{Typ: api.ErrorBadData, Err: err}, func() {}
}
Expand All @@ -409,6 +437,7 @@ func (qapi *QueryAPI) query(r *http.Request) (interface{}, []error, *api.ApiErro
}
defer qapi.gate.Done()

beforeRange := time.Now()
res := qry.Exec(ctx)
if res.Err != nil {
switch res.Err.(type) {
Expand All @@ -421,6 +450,10 @@ func (qapi *QueryAPI) query(r *http.Request) (interface{}, []error, *api.ApiErro
}
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: res.Err}, qry.Close
}
for i := range seriesStats {
qapi.seriesStatsAggregator.Aggregate(seriesStats[i])
}
qapi.seriesStatsAggregator.Observe(time.Since(beforeRange).Seconds())

// Optional stats field in response if parameter "stats" is not empty.
var qs stats.QueryStats
Expand Down Expand Up @@ -525,8 +558,19 @@ func (qapi *QueryAPI) queryRange(r *http.Request) (interface{}, []error, *api.Ap
span, ctx := tracing.StartSpan(ctx, "promql_range_query")
defer span.Finish()

var seriesStats []storepb.SeriesStatsCounter
qry, err := qapi.queryEngine.NewRangeQuery(
qapi.queryableCreate(enableDedup, replicaLabels, storeDebugMatchers, maxSourceResolution, enablePartialResponse, qapi.enableQueryPushdown, false, shardInfo),
qapi.queryableCreate(
enableDedup,
replicaLabels,
storeDebugMatchers,
maxSourceResolution,
enablePartialResponse,
qapi.enableQueryPushdown,
false,
shardInfo,
query.NewAggregateStatsReporter(&seriesStats),
),
&promql.QueryOpts{LookbackDelta: lookbackDelta},
r.FormValue("query"),
start,
Expand All @@ -545,6 +589,7 @@ func (qapi *QueryAPI) queryRange(r *http.Request) (interface{}, []error, *api.Ap
}
defer qapi.gate.Done()

beforeRange := time.Now()
res := qry.Exec(ctx)
if res.Err != nil {
switch res.Err.(type) {
Expand All @@ -555,6 +600,10 @@ func (qapi *QueryAPI) queryRange(r *http.Request) (interface{}, []error, *api.Ap
}
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: res.Err}, qry.Close
}
for i := range seriesStats {
qapi.seriesStatsAggregator.Aggregate(seriesStats[i])
}
qapi.seriesStatsAggregator.Observe(time.Since(beforeRange).Seconds())

// Optional stats field in response if parameter "stats" is not empty.
var qs stats.QueryStats
Expand Down Expand Up @@ -600,8 +649,17 @@ func (qapi *QueryAPI) labelValues(r *http.Request) (interface{}, []error, *api.A
matcherSets = append(matcherSets, matchers)
}

q, err := qapi.queryableCreate(true, nil, storeDebugMatchers, 0, enablePartialResponse, qapi.enableQueryPushdown, true, nil).
Querier(ctx, timestamp.FromTime(start), timestamp.FromTime(end))
q, err := qapi.queryableCreate(
true,
nil,
storeDebugMatchers,
0,
enablePartialResponse,
qapi.enableQueryPushdown,
true,
nil,
query.NoopSeriesStatsReporter,
).Querier(ctx, timestamp.FromTime(start), timestamp.FromTime(end))
if err != nil {
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: err}, func() {}
}
Expand Down Expand Up @@ -687,8 +745,18 @@ func (qapi *QueryAPI) series(r *http.Request) (interface{}, []error, *api.ApiErr
return nil, nil, apiErr, func() {}
}

q, err := qapi.queryableCreate(enableDedup, replicaLabels, storeDebugMatchers, math.MaxInt64, enablePartialResponse, qapi.enableQueryPushdown, true, nil).
Querier(r.Context(), timestamp.FromTime(start), timestamp.FromTime(end))
q, err := qapi.queryableCreate(
enableDedup,
replicaLabels,
storeDebugMatchers,
math.MaxInt64,
enablePartialResponse,
qapi.enableQueryPushdown,
true,
nil,
query.NoopSeriesStatsReporter,
).Querier(r.Context(), timestamp.FromTime(start), timestamp.FromTime(end))

if err != nil {
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: err}, func() {}
}
Expand Down Expand Up @@ -737,8 +805,17 @@ func (qapi *QueryAPI) labelNames(r *http.Request) (interface{}, []error, *api.Ap
matcherSets = append(matcherSets, matchers)
}

q, err := qapi.queryableCreate(true, nil, storeDebugMatchers, 0, enablePartialResponse, qapi.enableQueryPushdown, true, nil).
Querier(r.Context(), timestamp.FromTime(start), timestamp.FromTime(end))
q, err := qapi.queryableCreate(
true,
nil,
storeDebugMatchers,
0,
enablePartialResponse,
qapi.enableQueryPushdown,
true,
nil,
query.NoopSeriesStatsReporter,
).Querier(r.Context(), timestamp.FromTime(start), timestamp.FromTime(end))
if err != nil {
return nil, nil, &api.ApiError{Typ: api.ErrorExec, Err: err}, func() {}
}
Expand Down
6 changes: 4 additions & 2 deletions pkg/api/query/v1_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -44,9 +44,8 @@ import (
"github.com/prometheus/prometheus/tsdb/tsdbutil"
promgate "github.com/prometheus/prometheus/util/gate"
"github.com/prometheus/prometheus/util/stats"
"github.com/thanos-io/thanos/pkg/compact"

baseAPI "github.com/thanos-io/thanos/pkg/api"
"github.com/thanos-io/thanos/pkg/compact"
"github.com/thanos-io/thanos/pkg/component"
"github.com/thanos-io/thanos/pkg/gate"
"github.com/thanos-io/thanos/pkg/query"
Expand Down Expand Up @@ -198,6 +197,7 @@ func TestQueryEndpoints(t *testing.T) {
queryRangeHist: promauto.With(prometheus.NewRegistry()).NewHistogram(prometheus.HistogramOpts{
Name: "query_range_hist",
}),
seriesStatsAggregator: &store.NoopSeriesStatsAggregator{},
}

start := time.Unix(0, 0)
Expand Down Expand Up @@ -737,6 +737,7 @@ func TestMetadataEndpoints(t *testing.T) {
queryRangeHist: promauto.With(prometheus.NewRegistry()).NewHistogram(prometheus.HistogramOpts{
Name: "query_range_hist",
}),
seriesStatsAggregator: &store.NoopSeriesStatsAggregator{},
}
apiWithLabelLookback := &QueryAPI{
baseAPI: &baseAPI.BaseAPI{
Expand All @@ -750,6 +751,7 @@ func TestMetadataEndpoints(t *testing.T) {
queryRangeHist: promauto.With(prometheus.NewRegistry()).NewHistogram(prometheus.HistogramOpts{
Name: "query_range_hist",
}),
seriesStatsAggregator: &store.NoopSeriesStatsAggregator{},
}

var tests = []endpointTestCase{
Expand Down
Loading

0 comments on commit be95bb1

Please sign in to comment.