Thanos store gateway CrashLoopBackOff when updating to v0.30.0-rc.0 #6006

Closed
zhangrj opened this issue Dec 29, 2022 · 4 comments · Fixed by #6009

zhangrj commented Dec 29, 2022

Thanos, Prometheus and Golang version used:

Thanos: updated from v0.29.0 to v0.30.0-rc.0

redis version redis:6.2.6

Object Storage Provider: S3

What happened:

The Thanos store gateway goes into CrashLoopBackOff after updating from v0.29.0 to v0.30.0-rc.0.

$ kubectl get po -n thanos-prod                                                    
NAME                                               READY   STATUS             RESTARTS   AGE
redis-master-0                                     1/1     Running            0          2d16h
thanos-prod-bucketweb-7754c688d5-zd89f             1/1     Running            0          2d18h
thanos-prod-compactor-67964d97f-prbwt              1/1     Running            0          2d18h
thanos-prod-query-7cf87f67d5-2t2cn                 1/1     Running            0          42h
thanos-prod-query-7cf87f67d5-92tnh                 1/1     Running            0          42h
thanos-prod-query-7cf87f67d5-t5nq4                 1/1     Running            0          42h
thanos-prod-query-frontend-69d9796878-bwvd8        1/1     Running            0          2d18h
thanos-prod-receive-0                              1/1     Running            0          2d18h
thanos-prod-receive-1                              1/1     Running            0          2d17h
thanos-prod-receive-2                              1/1     Running            0          2d17h
thanos-prod-receive-distributor-7ff99cdc64-654f9   1/1     Running            0          2d17h
thanos-prod-receive-distributor-7ff99cdc64-fsq6l   1/1     Running            0          2d17h
thanos-prod-receive-distributor-7ff99cdc64-rbzzz   1/1     Running            0          2d18h
thanos-prod-ruler-0                                2/2     Running            0          2d17h
thanos-prod-ruler-1                                2/2     Running            0          2d18h
thanos-prod-storegateway-0                         1/1     Running            0          43h
thanos-prod-storegateway-1                         0/1     CrashLoopBackOff   502        42h

error log:

$ kubectl logs thanos-prod-storegateway-1 -n thanos-prod
level=info ts=2022-12-29T02:37:44.296395835Z caller=factory.go:52 msg="loading bucket configuration"
level=info ts=2022-12-29T02:37:44.296812118Z caller=caching_bucket_factory.go:76 msg="loading caching bucket configuration"
level=info ts=2022-12-29T02:37:44.301647033Z caller=redis.go:48 msg="created redis cache"
level=info ts=2022-12-29T02:37:44.301824396Z caller=factory.go:35 msg="loading index cache configuration"
panic: duplicate metrics collector registration attempted

goroutine 1 [running]:
github.com/prometheus/client_golang/prometheus.(*wrappingRegisterer).MustRegister(0xc00065e6c0, {0xc00021c950?, 0x1, 0x0?})
        /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/wrap.go:106 +0x151
github.com/prometheus/client_golang/prometheus/promauto.Factory.NewGauge({{0x2beb8d0?, 0xc00065e6c0?}}, {{0x0, 0x0}, {0x0, 0x0}, {0x26a5ded, 0x10}, {0x26e5307, 0x25}, ...})
        /go/pkg/mod/github.com/prometheus/client_golang@v1.14.0/prometheus/promauto/auto.go:297 +0xfd
github.com/thanos-io/thanos/pkg/gate.New({0x2beb8d0, 0xc00065e6c0}, 0x64)
        /app/pkg/gate/gate.go:86 +0x7e
github.com/thanos-io/thanos/pkg/cacheutil.NewRedisClientWithConfig({0x2bd9640?, _}, {_, _}, {{0xc000055920, 0x2f}, {0x0, 0x0}, {0xc000449b10, 0x10}, ...}, ...)
        /app/pkg/cacheutil/redis_client.go:217 +0x33f
github.com/thanos-io/thanos/pkg/cacheutil.NewRedisClient({0x2bd9640, 0xc0002473b0}, {0x269a989, 0xb}, {0xc0001ac600?, 0x5f3f01?, 0xc0001ac200?}, {0x2beb8a0, 0xc000247e50})
        /app/pkg/cacheutil/redis_client.go:167 +0x191
github.com/thanos-io/thanos/pkg/store/cache.NewIndexCache({0x2bd9640, 0xc0002473b0}, {0xc00057a000, 0x172, 0x180}, {0x2beb8a0, 0xc000247e50})
        /app/pkg/store/cache/factory.go:58 +0x229
main.runStore(_, {_, _}, _, {_, _}, {_, _, _}, {0xc0004840a0, ...}, ...)
        /app/cmd/thanos/store.go:304 +0x945
main.registerStore.func1(0x237eb40?, {0x2bd9640, 0xc0002473b0}, 0x6?, {0x2beb7b0, 0x417d8a0}, 0x414d2e0?, 0x0)
        /app/cmd/thanos/store.go:210 +0x2ae
main.main()
        /app/cmd/thanos/main.go:133 +0x1235

What you expected to happen:

Thanos store runs correctly.

relevant yaml:

thanos store

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: thanos-prod-storegateway
  namespace: "thanos-prod"
  labels:
    app.kubernetes.io/name: thanos
    app.kubernetes.io/instance: thanos-prod
    app.kubernetes.io/component: storegateway
spec:
  replicas: 2
  podManagementPolicy: OrderedReady
  serviceName: thanos-prod-storegateway-headless
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: thanos
      app.kubernetes.io/instance: thanos-prod
      app.kubernetes.io/component: storegateway
  template:
    metadata:
      labels:
        app.kubernetes.io/name: thanos
        app.kubernetes.io/instance: thanos-prod
        app.kubernetes.io/component: storegateway
    spec:
      hostAliases:     
      - ip: "10.12.32.100"
        hostnames:
        - "s3-qos.iot-st-armtest.qiniu-solutions.com"
      serviceAccount: thanos-prod-storegateway
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: thanos
                    app.kubernetes.io/instance: thanos-prod
                    app.kubernetes.io/component: storegateway
                namespaces:
                  - "thanos-prod"
                topologyKey: kubernetes.io/hostname
              weight: 1
      securityContext:
        fsGroup: 1001
      containers:
        - name: storegateway
          image: thanosio/thanos:v0.30.0-rc.0
          imagePullPolicy: "IfNotPresent"
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: false
            runAsNonRoot: true
            runAsUser: 1001
          args:
            - store
            - --log.level=info
            - --log.format=logfmt
            - --grpc-address=0.0.0.0:10901
            - --http-address=0.0.0.0:10902
            - --data-dir=/data
            - --objstore.config-file=/conf/objstore.yml
            - --index-cache.config-file=/cache_conf/index-cache.yml
            - --store.caching-bucket.config-file=/cache_conf/bucket-cache.yml
          ports:
            - name: http
              containerPort: 10902
              protocol: TCP
            - name: grpc
              containerPort: 10901
              protocol: TCP
          env:
            - name: NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          livenessProbe:
            failureThreshold: 6
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 30
            httpGet:
              path: /-/healthy
              port: http
          readinessProbe:
            failureThreshold: 6
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 30
            httpGet:
              path: /-/ready
              port: http
          resources:
            limits: {}
            requests: {}
          volumeMounts:
            - name: objstore-config
              mountPath: /conf
            - name: cache-config
              mountPath: /cache_conf
            - name: data
              mountPath: /data
      volumes:
        - name: objstore-config
          configMap:
            name: thanos-prod-objstore-configmap
        - name: cache-config
          configMap:
            name: thanos-prod-storegateway-cache-configmap
        - name: data
          emptyDir: {}

object store conf

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-prod-objstore-configmap
  namespace: thanos-prod
  labels:
    app.kubernetes.io/name: thanos
    app.kubernetes.io/instance: thanos-prod
data:
  objstore.yml: |-
    type: s3
    config:
      bucket: *
      endpoint: *
      access_key: *
      secret_key: *
      insecure: true
      list_objects_version: "v1"

cache conf

apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-prod-storegateway-cache-configmap
  namespace: thanos-prod
  labels:
    app.kubernetes.io/name: thanos
    app.kubernetes.io/instance: thanos-prod
data:
  index-cache.yml: |-
    type: REDIS
    config:
      addr: redis-master.thanos-prod.svc.cluster.local:6379
      password: *
      db: 0
      dial_timeout: 5s
      read_timeout: 3s
      write_timeout: 3s
      pool_size: 100
      min_idle_conns: 10
      idle_timeout: 5m0s
      max_conn_age: 0s
      max_get_multi_concurrency: 100
      get_multi_batch_size: 100
      max_set_multi_concurrency: 100
      set_multi_batch_size: 100
  bucket-cache.yml: |-
    type: REDIS
    config:
      addr: redis-master.thanos-prod.svc.cluster.local:6379
      password: *
      db: 1
      dial_timeout: 5s
      read_timeout: 3s
      write_timeout: 3s
      pool_size: 100
      min_idle_conns: 10
      idle_timeout: 5m0s
      max_conn_age: 0s
      max_get_multi_concurrency: 100
      get_multi_batch_size: 100
      max_set_multi_concurrency: 100
      set_multi_batch_size: 100
    chunk_subrange_size: 16000
    max_chunks_get_range_requests: 3
    chunk_object_attrs_ttl: 24h
    chunk_subrange_ttl: 24h
    blocks_iter_ttl: 5m
    metafile_exists_ttl: 2h
    metafile_doesnt_exist_ttl: 15m
    metafile_content_ttl: 24h
    metafile_max_size: 1MiB

remark:

I tested all the images at https://hub.docker.com/r/thanosio/thanos/tags and found that this issue occurs in thanos:main-2022-12-20-e85bc1f and every later image, so it appears to be related to commit e85bc1f. @GiedriusS @bwplotka

yeya24 added the bug label Dec 31, 2022

yeya24 commented Dec 31, 2022

The bug seems to be that we register the same metrics twice when creating the redis cache client. To avoid the duplicate registration, we can wrap the metrics registry with a constant label, similar to what the memcached client does: https://github.com/thanos-io/thanos/blob/main/pkg/cacheutil/memcached_client.go#L248.
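
To illustrate the idea, here is a minimal stdlib-only sketch. It is a toy model, not the Thanos or client_golang code: the `registry` type, the `newClient` helper, the metric name, and the label values are all hypothetical. It assumes, as the log above suggests, that two redis clients (one for the caching bucket, one for the index cache) end up registering the same gauge on one shared registry. The toy returns an error where the real `MustRegister` panics. In client_golang, the analogous wrapper is `prometheus.WrapRegistererWith`.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// registry is a toy stand-in for a metrics registry: it rejects a second
// collector that has the same name and the same constant labels.
type registry struct {
	seen map[string]bool
}

func newRegistry() *registry { return &registry{seen: map[string]bool{}} }

// identity builds a stable key from the metric name and its sorted labels.
func identity(name string, labels map[string]string) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, k+"="+v)
	}
	sort.Strings(pairs)
	return name + "{" + strings.Join(pairs, ",") + "}"
}

// register returns an error when the same metric identity is seen twice.
func (r *registry) register(name string, labels map[string]string) error {
	id := identity(name, labels)
	if r.seen[id] {
		return fmt.Errorf("duplicate metrics collector registration attempted: %s", id)
	}
	r.seen[id] = true
	return nil
}

// newClient mimics a cache client constructor that registers one gauge.
// The metric name here is hypothetical.
func newClient(r *registry, constLabels map[string]string) error {
	return r.register("gate_queries_in_flight", constLabels)
}

func main() {
	// Two clients on one registry with no distinguishing label:
	// the second registration is rejected.
	plain := newRegistry()
	fmt.Println(newClient(plain, nil))
	fmt.Println(newClient(plain, nil))

	// The same two clients, each given its own constant label:
	// the identities differ, so both registrations succeed.
	wrapped := newRegistry()
	fmt.Println(newClient(wrapped, map[string]string{"name": "caching-bucket"}))
	fmt.Println(newClient(wrapped, map[string]string{"name": "index-cache"}))
}
```

The point of the constant label is that it becomes part of each metric's identity, so both clients can keep the same metric name while still being distinguishable in queries.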

kama910 commented Dec 31, 2022

Hello, may I take this issue please?

yeya24 commented Dec 31, 2022

@kama910 It is yours!

yeya24 commented Jan 2, 2023

Let's not forget to include this fix in the v0.30.0 release.
