
query: v0.32.1+ remote-write prometheus + layered queriers: "vector cannot contain metrics with the same labelset" #6677

Closed
tekicode opened this issue Aug 29, 2023 · 22 comments · Fixed by #6697

Comments

@tekicode

After upgrading to v0.32.0 and later to v0.32.1+, our setup no longer functions, intermittently returning a labelset error. We're using a layered thanos-query setup, with the top layer being the target for thanos-frontend.

Thanos, Prometheus and Golang version used: Since v0.31, currently main-2023-08-28-32412dc

Object Storage Provider:
gcs

What happened:
When executing an instant query on our setup, after upgrade from 0.30.2 to 0.32.1+ (main), we're seeing this error:
{"status":"error","errorType":"execution","error":"vector cannot contain metrics with the same labelset"}

What you expected to happen:
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5m]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693271364.906,"580"]}]}}

A value greater than 500.

How to reproduce it (as minimally and precisely as possible):

kubectl port-forward directly to a thanos-ring pod and execute an instant query using the HTTP API. This rules out thanos-frontend as a factor.

❯ kubectl port-forward -n org-monitoring deployment/thanos-ring 10902
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5m]))'
{"status":"error","errorType":"execution","error":"vector cannot contain metrics with the same labelset"}

Our Setup:
Queries are made against thanos-frontend, which passes them directly to thanos-ring.

The ring component performs query fanout against all thanos-query targets in our fleet. It is configured like so:

query 
--endpoint=dns+thanos-query.org-monitoring.svc.clusterset.local:10901 
--query.replica-label=replica 
--query.replica-label=prometheus_replica 
--query.replica-label=rule_replica 

thanos-query.org-monitoring.svc.clusterset.local:10901 is a GKE MultiClusterServices target. It resolves to the list of thanos-query endpoints discovered across the fleet, one entry per running thanos-query replica.
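
To sanity-check the discovered endpoint list, the dns+ target can be resolved directly; a minimal sketch, assuming dig is available wherever it is run:

❯ dig +short thanos-query.org-monitoring.svc.clusterset.local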

Each discovered thanos-query is responsible for the endpoints in its own region and is configured like so:

query 
--endpoint=dns+prometheus.org-monitoring.svc:10901 
--endpoint=dnssrv+_grpc._tcp.thanos-receive 
--endpoint=dnssrv+_grpc._tcp.thanos-store-shard-0 
--endpoint=dnssrv+_grpc._tcp.thanos-store-shard-1 
--endpoint=dnssrv+_grpc._tcp.thanos-store-shard-2 
--query.auto-downsampling 
--query.replica-label=replica 
--query.replica-label=prometheus_replica 
--query.replica-label=rule_replica 
--query.timeout=5m 
--store.response-timeout=5s 
--store.sd-dns-interval=5s 
--store.unhealthy-timeout=1m 

This is a layered querier setup as described in: https://thanos.io/tip/components/query.md/#global-view
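
For anyone reproducing this outside Kubernetes, a minimal two-layer sketch of the same topology (the ports and the loopback endpoint are illustrative assumptions, not our production values):

# Leaf querier, talking to the actual stores.
thanos query --grpc-address=0.0.0.0:11901 --http-address=0.0.0.0:11902 --endpoint=dnssrv+_grpc._tcp.thanos-receive --query.replica-label=prometheus_replica

# Top-level "ring" querier, fanning out to the leaf querier.
thanos query --grpc-address=0.0.0.0:10901 --http-address=0.0.0.0:10902 --endpoint=127.0.0.1:11901 --query.replica-label=prometheus_replica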

Full logs to relevant components:

thanos-ring log

{"caller":"main.go:67","level":"debug","msg":"maxprocs: Updating GOMAXPROCS=[8]: determined from CPU quota","ts":"2023-08-29T02:20:10.989916878Z"}
{"caller":"options.go:26","level":"info","msg":"disabled TLS, key and cert must be set to enable","protocol":"gRPC","ts":"2023-08-29T02:20:10.995682676Z"}
{"caller":"query.go:842","level":"info","msg":"starting query node","ts":"2023-08-29T02:20:10.997116884Z"}
{"cachedEndpoints":0,"caller":"endpointset.go:354","component":"endpointset","level":"debug","msg":"starting to update API endpoints","ts":"2023-08-29T02:20:10.998004757Z"}
{"activeEndpoints":0,"caller":"endpointset.go:433","component":"endpointset","level":"debug","msg":"updated endpoints","ts":"2023-08-29T02:20:10.998508706Z"}
{"caller":"intrumentation.go:56","level":"info","msg":"changing probe status","status":"ready","ts":"2023-08-29T02:20:10.998376499Z"}
{"address":"0.0.0.0:10901","caller":"grpc.go:131","component":"query","level":"info","msg":"listening for serving gRPC","service":"gRPC/server","ts":"2023-08-29T02:20:10.999831529Z"}
{"caller":"intrumentation.go:75","level":"info","msg":"changing probe status","status":"healthy","ts":"2023-08-29T02:20:11.000108915Z"}
{"address":"0.0.0.0:10902","caller":"http.go:73","component":"query","level":"info","msg":"listening for requests and metrics","service":"http/server","ts":"2023-08-29T02:20:11.000234749Z"}
{"address":":10902","caller":"tls_config.go:274","component":"query","level":"info","msg":"Listening on","service":"http/server","ts":"2023-08-29T02:20:11.001127239Z"}
{"address":":10902","caller":"tls_config.go:277","component":"query","http2":false,"level":"info","msg":"TLS is disabled.","service":"http/server","ts":"2023-08-29T02:20:11.001827998Z"}
{"cachedEndpoints":0,"caller":"endpointset.go:354","component":"endpointset","level":"debug","msg":"starting to update API endpoints","ts":"2023-08-29T02:20:16.001726936Z"}
{"address":"10.69.98.8:10901","caller":"endpointset.go:392","component":"endpointset","err":"dialing connection: context deadline exceeded","level":"warn","msg":"new endpoint creation failed","ts":"2023-08-29T02:20:21.003893904Z"}
{"address":"10.22.227.9:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-northamerica-northeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"northamerica-northeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-northamerica-northeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"northamerica-northeast1\", tier=\"engineering\"},{receive_cluster=\"northamerica-northeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004178958Z"}
{"address":"10.110.40.9:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-east1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-east1\", tier=\"engineering\"},{receive_cluster=\"us-east1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004265675Z"}
{"address":"10.17.228.3:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west4\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west4\", tier=\"engineering\"},{receive_cluster=\"europe-west4\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004337318Z"}
{"address":"10.49.224.20:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-central1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-central1\", tier=\"engineering\"},{receive_cluster=\"us-central1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004393088Z"}
{"address":"10.110.37.23:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-east1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-east1\", tier=\"engineering\"},{receive_cluster=\"us-east1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004428556Z"}
{"address":"10.17.230.32:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west4\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west4\", tier=\"engineering\"},{receive_cluster=\"europe-west4\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004476601Z"}
{"address":"10.49.240.50:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-central1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-central1\", tier=\"engineering\"},{receive_cluster=\"us-central1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004523324Z"}
{"address":"10.9.161.166:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-west1\", tier=\"engineering\"},{receive_cluster=\"us-west1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004587081Z"}
{"address":"10.69.99.10:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west1\", tier=\"engineering\"},{receive_cluster=\"europe-west1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004686835Z"}
{"address":"10.53.197.12:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"australia-southeast1\", tier=\"engineering\"},{receive_cluster=\"australia-southeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004730393Z"}
{"address":"10.9.163.66:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-west1\", tier=\"engineering\"},{receive_cluster=\"us-west1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004783825Z"}
{"address":"10.53.193.5:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"australia-southeast1\", tier=\"engineering\"},{receive_cluster=\"australia-southeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004852742Z"}
{"address":"10.16.35.9:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"asia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"asia-southeast1\", tier=\"engineering\"},{receive_cluster=\"asia-southeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004886759Z"}
{"address":"10.16.33.3:10901","caller":"endpointset.go:425","component":"endpointset","extLset":"{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"asia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"asia-southeast1\", tier=\"engineering\"},{receive_cluster=\"asia-southeast1\", tenant_id=\"default-tenant\"}","level":"info","msg":"adding new query with [storeEndpoints rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]","ts":"2023-08-29T02:20:21.004938184Z"}
...

{"caller":"proxy.go:318","level":"debug","msg":"Tenant info in Series()","tenant":"default-tenant","ts":"2023-08-29T02:21:07.636981742Z"}
{"caller":"proxy.go:364","component":"proxy","level":"debug","msg":"Series: started fanout streams","request":"min_time:1693275367636 max_time:1693275667636 matchers:<name:\"__name__\" value:\"prometheus_build_info\" > aggregates:COUNT aggregates:SUM without_replica_labels:\"replica\" without_replica_labels:\"prometheus_replica\" without_replica_labels:\"rule_replica\" ","status":"store Addr: 10.22.227.9:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-northamerica-northeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"northamerica-northeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-northamerica-northeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"northamerica-northeast1\", tier=\"engineering\"},{receive_cluster=\"northamerica-northeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.69.99.10:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west1\", tier=\"engineering\"},{receive_cluster=\"europe-west1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.53.193.5:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"australia-southeast1\", tier=\"engineering\"},{receive_cluster=\"australia-southeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.17.228.3:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west4\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west4\", tier=\"engineering\"},{receive_cluster=\"europe-west4\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.53.197.12:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"australia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-australia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"australia-southeast1\", tier=\"engineering\"},{receive_cluster=\"australia-southeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.49.224.20:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-central1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", 
envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-central1\", tier=\"engineering\"},{receive_cluster=\"us-central1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.16.33.3:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"asia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"asia-southeast1\", tier=\"engineering\"},{receive_cluster=\"asia-southeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.9.161.166:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-west1\", tier=\"engineering\"},{receive_cluster=\"us-west1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.49.240.50:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-central1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-central1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-central1\", tier=\"engineering\"},{receive_cluster=\"us-central1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.16.35.9:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"asia-southeast1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-asia-southeast1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"asia-southeast1\", tier=\"engineering\"},{receive_cluster=\"asia-southeast1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.110.40.9:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-east1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-east1\", tier=\"engineering\"},{receive_cluster=\"us-east1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.110.37.23:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-east1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-east1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", 
prometheus_replica=\"prometheus-1\", region=\"us-east1\", tier=\"engineering\"},{receive_cluster=\"us-east1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.17.230.32:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"europe-west4\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-europe-west4\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"europe-west4\", tier=\"engineering\"},{receive_cluster=\"europe-west4\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried;store Addr: 10.9.163.66:10901 LabelSets: {can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-0\", region=\"us-west1\", tier=\"engineering\"},{can_alert=\"false\", cluster=\"pre-prod-monitor-us-west1\", envreg=\"pre-prod-monitor\", gcp_project_id=\"sanitized\", prometheus_replica=\"prometheus-1\", region=\"us-west1\", tier=\"engineering\"},{receive_cluster=\"us-west1\", tenant_id=\"default-tenant\"} MinTime: -62167219200000 MaxTime: 9223372036854775807 queried","ts":"2023-08-29T02:21:07.638288025Z"}
{"caller":"proxy.go:318","level":"debug","msg":"Tenant info in Series()","tenant":"default-tenant","ts":"2023-08-29T02:21:08.936129958Z"}
{"caller":"proxy.go:318","level":"debug","msg":"Tenant info in Series()","tenant":"default-tenant","ts":"2023-08-29T02:21:08.936305335Z"}

Anything else we need to know:

Most of our remote senders have an external label set like:
replica: prometheus-n
Some newer senders use the more specific prometheus_replica label instead.
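
For reference, the senders set these via Prometheus external_labels; a representative snippet (the cluster value is a placeholder):

global:
  external_labels:
    cluster: foo
    replica: prometheus-0   # newer senders use prometheus_replica instead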

We've tried various combinations of external labels and thanos-query settings to no avail. Reverting just the ring component to v0.30.2 resolves the issue.

Also, the query sometimes works if the range is kept very short:

❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275793.558,"14"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275795.497,"16"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275798.886,"9"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275800.061,"11"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[1s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275801.252,"12"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275804.402,"195"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693275806.031,"193"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"error","errorType":"execution","error":"vector cannot contain metrics with the same labelset"}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693276058.426,"121"]}]}}
❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=count(last_over_time(prometheus_build_info[5s]))'
{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[1693276059.51,"139"]}]}}
@MichaHoffmann
Contributor

Are stores and receivers version 0.32.1 too?

@tekicode
Author

Are stores and receivers version 0.32.1 too?

Yes, they match the version of the leaf thanos-query.

@GiedriusS
Member

Maybe you could write out the labels of each series when you query prometheus_build_info?

@tekicode
Author

I don't understand what you mean. The only external label we add that differs between replicas (for deduplication) is the replica label. We run 2-replica StatefulSets, so each Prometheus StatefulSet should produce the following series:

prometheus_build_info{cluster="foo",replica="prometheus-0"}
prometheus_build_info{cluster="foo",replica="prometheus-1"}

We deduplicate these by specifying replica as a replica label. In some other clusters the label is prometheus_replica, which is the preferred nomenclature, so we deduplicate on both labels.

@GiedriusS
Member

I mean: could you please execute the instant query prometheus_build_info[5m] with deduplication disabled and paste the result here, so that we can understand exactly what labels you are getting?
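
Something along these lines should do; a sketch assuming the Thanos Query HTTP API's dedup parameter:

❯ curl http://localhost:10902/api/v1/query --data-urlencode 'query=prometheus_build_info[5m]' --data-urlencode 'dedup=false'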

@mwennrich

Same error message as in #6495.

@farodin91
Contributor

{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.46:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", replica="thanos-receive-1", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}
0
{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.22:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}
0
{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.46:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}

Here is the result without dedup. The same query explicitly fails with dedup activated.

@MichaHoffmann
Contributor

If I'm not wrong: once the replica labels are dropped, we have two series with the same instance "10.15.224.46" (one from a Prometheus replica and one from a Thanos Receive replica), right?

@GiedriusS
Member

@farodin91 and what kind of parameters do you have on Thanos Query?

@farodin91
Contributor

rate(prometheus_build_info[330h])

@farodin91
Contributor

farodin91 commented Aug 29, 2023

- query
- '--log.level=info'
- '--log.format=json'
- '--grpc-address=0.0.0.0:10901'
- '--http-address=0.0.0.0:10902'
- '--query.replica-label=replica'
- '--query.replica-label=prometheus_replica'
- '--query.replica-label=thanos_ruler_replica'
- '--endpoint=dnssrv+_grpc._tcp.thanos-storegateway'
- '--endpoint=dnssrv+_grpc._tcp.thanos-receive-headless'
- '--query.default-step=15s'
- '--query.promql-engine=thanos'
- '--grpc-compression=snappy'
- '--endpoint=dns+thanos-sidecar.bla:10901'
- '--endpoint=dns+thanos-ruler-operated:10901'
- '--alert.query-url=https://bla'

@MichaHoffmann
Contributor

Hey, are you able to provide some offending blocks? That would help me reproduce and understand this!

@saswatamcode
Member

@farodin91 is there a reason why these series have receive="true" but don't have any replica label? Could you share some of your receive configs?

{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.22:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}
0
{branch="HEAD", container="prometheus", endpoint="http-web", goarch="amd64", goos="linux", goversion="go1.20.5", instance="10.15.224.46:9090", job="ops-system/monitoring-stack-kube-prom-prometheus", location="global", namespace="ops-system", pod="prom-agent-monitoring-stack-kube-prom-prometheus-0", prometheus="ops-system/monitoring-stack-kube-prom-prometheus", prometheus_replica="prom-agent-monitoring-stack-kube-prom-prometheus-0", receive="true", region="core", revision="8ef767e396bf8445f009f945b0162fd71827f445", tags="netgo,builtinassets,stringlabels", tenant_id="k3s", version="2.45.0"}

@farodin91
Contributor

- receive
- '--log.level=info'
- '--log.format=json'
- '--grpc-address=0.0.0.0:10901'
- '--http-address=0.0.0.0:10902'
- '--remote-write.address=0.0.0.0:19291'
- '--objstore.config=$(OBJSTORE_CONFIG)'
- '--tsdb.path=/var/thanos/receive'
- '--label=replica="$(NAME)"'
- '--label=receive="true"'
- '--tsdb.retention=12h'
- '--receive.local-endpoint=$(NAME).thanos-receive-headless.$(NAMESPACE).svc.cluster.local:10901'
- '--tsdb.out-of-order.time-window=120s'
- '--tsdb.too-far-in-future.time-window=60s'
- '--tsdb.max-exemplars=1000'

Isn't the replica label removed in the store gateway?

@MichaHoffmann
Copy link
Contributor


Isn't the replica label removed in the store gateway?

Only after compaction, I think. New blocks will still have it.
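
For context, the compactor only drops replica labels when vertical-compaction deduplication is configured for them; a hedged sketch of the relevant flags (paths are placeholders, not necessarily the setup here):

# Vertical compaction dedups blocks on the given replica label at compaction time.
thanos compact --data-dir=/var/thanos/compact --objstore.config-file=/etc/thanos/objstore.yml --compact.enable-vertical-compaction --deduplication.replica-label=replica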

@farodin91
Contributor

Within 330h there should have been a compaction.

@farodin91
Contributor

farodin91 commented Sep 5, 2023

I updated all Thanos components to main-2023-09-05-d1edf74. No improvement; I see the same issue.

@GiedriusS GiedriusS reopened this Sep 5, 2023
@farodin91
Contributor

@GiedriusS Any ideas on how I can help you? How can I debug this issue?

@farodin91
Contributor

Now it also fails for some queries, whether dedup is activated or not.

@MichaHoffmann
Contributor

I'm fairly sure it might be this: #6702.

@farodin91
Contributor

@MichaHoffmann Do you want to release it as 0.32.3?

@tekicode
Author

This is running well for us now.
