
query: v0.32.1+ remote-write prometheus + layered queriers: "vector cannot contain metrics with the same labelset" #6677

Closed
tekicode opened this issue Aug 29, 2023 · 22 comments · Fixed by #6697

Comments

@tekicode

After upgrading to v0.32.0 and later to v0.32.1+, our setup no longer functions, intermittently returning a labelset error. We're using a layered thanos-query setup, with the top layer being the target for thanos-frontend.

Thanos, Prometheus and Golang version used: Since v0.31, currently main-2023-08-28-32412dc

Object Storage Provider:
gcs

What happened:
When executing an instant query on our setup, after upgrade from 0.30.2 to 0.32.1+ (main), we're seeing this error:
{"status":"error","errorType":"execution","error":"vector cannot contain metrics with the same labelset"}

What you expected to happen:
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5m]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693271364.906,"580"]}]}}

A value greater than 500.

How to reproduce it (as minimally and precisely as possible):

kubectl port-forward directly to a thanos-ring pod and execute an instant query using the HTTP API. This rules out thanos-frontend as a factor.

❯ kubectl port-forward -n org-monitoring deployment/thanos-ring 10902
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5m]))'
{"status":"error","errorType":"execution","error":"vector cannot contain metrics with the same labelset"}

Our Setup:
Queries are made against thanos-frontend, which passes them directly to thanos-ring.

The ring component performs query fanout against all thanos-query targets in our fleet. It is configured like so:

query 
--endpoint=dns+thanos-query.org-monitoring.svc.clusterset.local:10901 
--query.replica-label=replica 
--query.replica-label=prometheus_replica 
--query.replica-label=rule_replica 

thanos-query.org-monitoring.svc.clusterset.local:10901 is a GKE MultiClusterServices target. It resolves to the list of thanos-query endpoints discovered across the fleet, one entry per running thanos-query replica.
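
To sanity-check the discovered endpoint list, the dns+ target can be resolved directly; a minimal sketch, assuming dig is available wherever it is run:

❯ dig +short thanos-query.org-monitoring.svc.clusterset.local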

Each discovered thanos-query is responsible for the endpoints in its own region and is configured like so:

query 
--endpoint=dns+prometheus.org-monitoring.svc:10901 
--endpoint=dnssrv+_grpc._tcp.thanos-receive 
--endpoint=dnssrv+_grpc._tcp.thanos-store-shard-0 
--endpoint=dnssrv+_grpc._tcp.thanos-store-shard-1 
--endpoint=dnssrv+_grpc._tcp.thanos-store-shard-2 
--query.auto-downsampling 
--query.replica-label=replica 
--query.replica-label=prometheus_replica 
--query.replica-label=rule_replica 
--query.timeout=5m 
--store.response-timeout=5s 
--store.sd-dns-interval=5s 
--store.unhealthy-timeout=1m 

This is a layered querier setup as described in: https://thanos.io/tip/components/query.md/#global-view
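
For anyone reproducing this outside Kubernetes, a minimal two-layer sketch of the same topology (the ports and the loopback endpoint are illustrative assumptions, not our production values):

# Leaf querier, talking to the actual stores.
thanos query --grpc-address=0.0.0.0:11901 --http-address=0.0.0.0:11902 --endpoint=dnssrv+_grpc._tcp.thanos-receive --query.replica-label=prometheus_replica

# Top-level "ring" querier, fanning out to the leaf querier.
thanos query --grpc-address=0.0.0.0:10901 --http-address=0.0.0.0:10902 --endpoint=127.0.0.1:11901 --query.replica-label=prometheus_replica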

Full logs to relevant components:

thanos-ring log

{"caller":"main.go:67","level":"debug","msg":"maxprocs: Updating GOMAXPROCS=[8]: determined from CPU quota","ts":"2023-08-29T02:20:10.989916878Z"}
{"caller":"options.go:26","level":"info","msg":"disabled TLS, key and cert must be set to enable","protocol":"gRPC","ts":"2023-08-29T02:20:10.995682676Z"}
{"caller":"query.go:842","level":"info","msg":"starting query node","ts":"2023-08-29T02:20:10.997116884Z"}
{"cachedEndpoints":0,"caller":"endpointset.go:354","component":"endpointset","level":"debug","msg":"starting to update API endpoints","ts":"2023-08-29T02:20:10.998004757Z"}
{"activeEndpoints":0,"caller":"endpointset.go:433","component":"endpointset","level":"debug","msg":"updated endpoints","ts":"2023-08-29T02:20:10.998508706Z"}
{"caller":"intrumentation.go:56","level":"info","msg":"changing probe status","status":"ready","ts":"2023-08-29T02:20:10.998376499Z"}
{"address":"0.0.0.0:10901","caller":"grpc.go:131","component":"query","level":"info","msg":"listening for serving gRPC","service":"gRPC/server","ts":"2023-08-29T02:20:10.999831529Z"}
{"caller":"intrumentation.go:75","level":"info","msg":"changing probe status","status":"healthy","ts":"2023-08-29T02:20:11.000108915Z"}
{"address":"0.0.0.0:10902","caller":"http.go:73","component":"query","level":"info","msg":"listening for requests and metrics","service":"http/server","ts":"2023-08-29T02:20:11.000234749Z"}
{"address":":10902","caller":"tls_config.go:274","component":"query","level":"info","msg":"Listening on","service":"http/server","ts":"2023-08-29T02:20:11.001127239Z"}
{"address":":10902","caller":"tls_config.go:277","component":"query","http2":false,"level":"info","msg":"TLS is disabled.","service":"http/server","ts":"2023-08-29T02:20:11.001827998Z"}
{"cachedEndpoints":0,"caller":"endpointset.go:354","component":"endpointset","level":"debug","msg":"starting to update API endpoints","ts":"2023-08-29T02:20:16.001726936Z"}
{"address":"10.69.98.8:10901","caller":"endpointset.go:392","component":"endpointset","err":"dialing connection: context deadline exceeded","level":"warn","msg":"new endpoint creation failed","ts":"2023-08-29T02:20:21.003893904Z"}
{"address":"10.22.227.9:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-northamerica-northeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"northamerica-northeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-northamerica-northeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"northamerica-northeast1\", tier=\"engineering\"},{receive_cluster=\"northamerica-northeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004178958Z"}
{"address":"10.110.40.9:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-east1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-east1\", tier=\"engineering\"},{receive_cluster=\"us-east1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004265675Z"}
{"address":"10.17.228.3:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west4\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west4\", tier=\"engineering\"},{receive_cluster=\"europe-west4\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004337318Z"}
{"address":"10.49.224.20:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-central1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-central1\", tier=\"engineering\"},{receive_cluster=\"us-central1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004393088Z"}
{"address":"10.110.37.23:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-east1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-east1\", tier=\"engineering\"},{receive_cluster=\"us-east1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004428556Z"}
{"address":"10.17.230.32:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west4\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west4\", tier=\"engineering\"},{receive_cluster=\"europe-west4\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004476601Z"}
{"address":"10.49.240.50:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-central1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-central1\", tier=\"engineering\"},{receive_cluster=\"us-central1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004523324Z"}
{"address":"10.9.161.166:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-west1\", tier=\"engineering\"},{receive_cluster=\"us-west1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004587081Z"}
{"address":"10.69.99.10:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west1\", tier=\"engineering\"},{receive_cluster=\"europe-west1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004686835Z"}
{"address":"10.53.197.12:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"australia-southeast1\", tier=\"engineering\"},{receive_cluster=\"australia-southeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004730393Z"}
{"address":"10.9.163.66:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-west1\", tier=\"engineering\"},{receive_cluster=\"us-west1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004783825Z"}
{"address":"10.53.193.5:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"australia-southeast1\", tier=\"engineering\"},{receive_cluster=\"australia-southeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004852742Z"}
{"address":"10.16.35.9:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"asia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"asia-southeast1\", tier=\"engineering\"},{receive_cluster=\"asia-southeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004886759Z"}
{"address":"10.16.33.3:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"asia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"asia-southeast1\", tier=\"engineering\"},{receive_cluster=\"asia-southeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004938184Z"}
...

{"caller":"proxy.go:318","level":"debug","msg":"Tenant info in Series()","tenant":"default-tenant","ts":"2023-08-29T02:21:07.636981742Z"}
{"caller":"proxy.go:364","component":"proxy","level":"debug","msg":"Series: started fanout streams","request":"min_time:1693275367636 max_time:1693275667636 matchers:<name:\"__name__\" value:\"prometheus_build_info\" > aggregates:COUNT aggregates:SUM without_replica_labels:\"replica\" without_replica_labels:\"prometheus_replica\" without_replica_labels:\"rule_replica\" ","status":"store Addr: 10.22.227.9:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-northamerica-northeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"northamerica-northeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-northamerica-northeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"northamerica-northeast1\", tier=\"engineering\"},{receive_cluster=\"northamerica-northeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.69.99.10:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west1\", tier=\"engineering\"},{receive_cluster=\"europe-west1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.53.193.5:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"australia-southeast1\", tier=\"engineering\"},{receive_cluster=\"australia-southeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.17.228.3:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west4\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west4\", tier=\"engineering\"},{receive_cluster=\"europe-west4\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.53.197.12:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"australia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"australia-southeast1\", tier=\"engineering\"},{receive_cluster=\"australia-southeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.49.224.20:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-central1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", 
envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-central1\", tier=\"engineering\"},{receive_cluster=\"us-central1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.16.33.3:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"asia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"asia-southeast1\", tier=\"engineering\"},{receive_cluster=\"asia-southeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.9.161.166:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-west1\", tier=\"engineering\"},{receive_cluster=\"us-west1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.49.240.50:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-central1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-central1\", tier=\"engineering\"},{receive_cluster=\"us-central1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.16.35.9:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"asia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"asia-southeast1\", tier=\"engineering\"},{receive_cluster=\"asia-southeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.110.40.9:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-east1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-east1\", tier=\"engineering\"},{receive_cluster=\"us-east1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.110.37.23:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-east1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", 
prometheus_replica=\"prometheus-1\", region=\"us-east1\", tier=\"engineering\"},{receive_cluster=\"us-east1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.17.230.32:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west4\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west4\", tier=\"engineering\"},{receive_cluster=\"europe-west4\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.9.163.66:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-west1\", tier=\"engineering\"},{receive_cluster=\"us-west1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried","ts":"2023-08-29T02:21:07.638288025Z"}
{"caller":"proxy.go:318","level":"debug","msg":"Tenant info in Series()","tenant":"default-tenant","ts":"2023-08-29T02:21:08.936129958Z"}
{"caller":"proxy.go:318","level":"debug","msg":"Tenant info in Series()","tenant":"default-tenant","ts":"2023-08-29T02:21:08.936305335Z"}

Anything else we need to know:

Most of our remote senders have an external label set like:
replica: prometheus-n
Some newer senders use the more specific prometheus_replica label instead.
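
For reference, the senders set these via Prometheus external_labels; a representative snippet (the cluster value is a placeholder):

global:
  external_labels:
    cluster: foo
    replica: prometheus-0   # newer senders use prometheus_replica instead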

We've tried various combinations of external labels and thanos-query settings to no avail. Reverting just the ring component to v0.30.2 resolves the issue.

Also, the query sometimes works if the range is kept very short:

❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275793.558,"14"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275795.497,"16"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275798.886,"9"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275800.061,"11"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275801.252,"12"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275804.402,"195"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275806.031,"193"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"error","errorType":"execution","error":"vector cannot contain metrics with the same labelset"}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693276058.426,"121"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693276059.51,"139"]}]}}
@MichaHoffmann
Contributor

Are stores and receivers version 0.32.1 too?

@tekicode
Author

Are stores and receivers version 0.32.1 too?

Yes, they match the version of the leaf thanos-query.

@GiedriusS
Member

Maybe you could write out the labels of each series when you query prometheus_build_info?

@tekicode
Author

I don't understand what you mean. The only external label we add that differs between replicas (for deduplication) is the replica label. We run 2-replica StatefulSets, so each Prometheus StatefulSet should produce the following series:

prometheus_build_info{cluster="foo",replica="prometheus-0"}
prometheus_build_info{cluster="foo",replica="prometheus-1"}

We deduplicate these by specifying replica as a replica label. In some other clusters the label is prometheus_replica, which is the preferred nomenclature, so we deduplicate on both labels.

@GiedriusS
Member

I mean: could you please execute the instant query prometheus_build_info[5m] with deduplication disabled and paste the result here, so that we can understand exactly what labels you are getting?
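
Something along these lines should do; a sketch assuming the Thanos Query HTTP API's dedup parameter:

❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=prometheus_build_info[5m]' --data-urlencode 'dedup=false'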

@mwennrich

Same error message as in #6495.

@farodin91
Contributor

{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.46:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", replica="thanos-receive-1", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}
0
{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.22:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}
0
{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.46:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}

Here is the result without dedup. The same query explicitly fails with dedup activated.

@MichaHoffmann
Contributor

If I'm not wrong: once the replica labels are dropped, we have two series with the same instance "10.15.224.46" (one from a Prometheus replica and one from a Thanos Receive replica), right?

@GiedriusS
Member

@farodin91 and what kind of parameters do you have on Thanos Query?

@farodin91
Contributor

rate(prometheus_build_info[330h])

@farodin91
Contributor

farodin91 commented Aug 29, 2023

- query
- '--log.level=info'
- '--log.format=json'
- '--grpc-address=0.0.0.0:10901'
- '--http-address=0.0.0.0:10902'
- '--query.replica-label=replica'
- '--query.replica-label=prometheus_replica'
- '--query.replica-label=thanos_ruler_replica'
- '--endpoint=dnssrv+_grpc._tcp.thanos-storegateway'
- '--endpoint=dnssrv+_grpc._tcp.thanos-receive-headless'
- '--query.default-step=15s'
- '--query.promql-engine=thanos'
- '--grpc-compression=snappy'
- '--endpoint=dns+thanos-sidecar.bla:10901'
- '--endpoint=dns+thanos-ruler-operated:10901'
- '--alert.query-url=https://bla'

@MichaHoffmann
Contributor

Hey, are you able to provide some offending blocks? That would help me reproduce and understand this!

@saswatamcode
Member

@farodin91 is there a reason why these series have receive="true" but don't have any replica label? Could you share some of your receive configs?

{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.22:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}
0
{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.46:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}

@farodin91
Contributor

- receive
- '--log.level=info'
- '--log.format=json'
- '--grpc-address=0.0.0.0:10901'
- '--http-address=0.0.0.0:10902'
- '--remote-write.address=0.0.0.0:19291'
- '--objstore.config=$(OBJSTORE_CONFIG)'
- '--tsdb.path=/var/thanos/receive'
- '--label=replica="$(NAME)"'
- '--label=receive="true"'
- '--tsdb.retention=12h'
- '--receive.local-endpoint=$(NAME).thanos-receive-headless.$(NAMESPACE).svc.cluster.local:10901'
- '--tsdb.out-of-order.time-window=120s'
- '--tsdb.too-far-in-future.time-window=60s'
- '--tsdb.max-exemplars=1000'

Isn't the replica label removed in the store gateway?

@MichaHoffmann
Copy link
Contributor


Isn't the replica label removed in the store gateway?

Only after compaction, I think. New blocks will still have it.
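
For context, the compactor only drops replica labels when vertical-compaction deduplication is configured for them; a hedged sketch of the relevant flags (paths are placeholders, not necessarily the setup here):

# Vertical compaction dedups blocks on the given replica label at compaction time.
thanos compact --data-dir=/var/thanos/compact --objstore.config-file=/etc/thanos/objstore.yml --compact.enable-vertical-compaction --deduplication.replica-label=replica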

@farodin91
Contributor

Within 330h there should have been a compaction.

@farodin91
Contributor

farodin91 commented Sep 5, 2023

I updated all Thanos components to main-2023-09-05-d1edf74. No improvement; I see the same issue.

@GiedriusS GiedriusS reopened this Sep 5, 2023
@farodin91
Contributor

@GiedriusS Any ideas on how I can help you? How can I debug this issue?

@farodin91
Contributor

Now it also fails for some queries, whether dedup is activated or not.

@MichaHoffmann
Contributor

I'm fairly sure it might be this: #6702.

@farodin91
Contributor

@MichaHoffmann Do you want to release it as 0.32.3?

@tekicode
Author

This is running well for us now.
