Thanos Receive Memory Usage very high OOMKilled #6100

Closed
caoimheharvey opened this issue Feb 3, 2023 · 5 comments
caoimheharvey commented Feb 3, 2023

Hi, I've been experiencing a lot of problems with Thanos Receive getting OOMKilled and going into a CrashLoopBackOff every time it tries to restart.

Thanos is deployed via the Bitnami Helm chart on EKS t3.2xlarge VMs (8 vCPU, 32 GB RAM) across 6 nodes. The Thanos Receive deployment is configured to autoscale from 3 to 6 replicas; each replica has 20Gi of memory and is set to scale out when memory usage hits 70%.

Almost instantly, Thanos Receive scales out to 6 pods and shortly afterwards enters a CrashLoopBackOff, with OOMKilled as the termination reason.

k top pod -n thanos
NAME                                     CPU(cores)   MEMORY(bytes)
thanos-query-5445b5dc6d-lrqnq            1m           13Mi
thanos-query-5445b5dc6d-m65fr            1m           14Mi
thanos-query-5445b5dc6d-wqc9k            1m           14Mi
thanos-query-frontend-55d897f4dc-4fndg   1m           49Mi
thanos-receive-0                         94m          17907Mi
thanos-receive-1                         167m         17903Mi
thanos-receive-2                         501m         4667Mi
thanos-receive-3                         500m         3830Mi
thanos-receive-4                         84m          15356Mi
thanos-receive-5                         69m          13076Mi
thanos-storegateway-0                    1m           124Mi
thanos-storegateway-1                    4m           96Mi
thanos-storegateway-2                    1m           123Mi
k get pod thanos-receive-0 -o yaml
 containerStatuses:
  - containerID: containerd://fbcb35fa7b97ee38ab4ff480d4da33ab207d934d9f226fb3f3f74b8c580bbaf1
    image: docker.io/bitnami/thanos:0.29.0-scratch-r0
    imageID: docker.io/bitnami/thanos@sha256:e239696f575f201cd7f801e80945964fda3f731cd62be70772f88480bb428fcd
    lastState:
      terminated:
        containerID: containerd://10cc94abc68bc71a0fb6b59c762a25d3a9ba6521cbb684a511be47e77b3e9ecb
        exitCode: 137
        finishedAt: "2023-02-03T03:02:47Z"
        reason: OOMKilled
        startedAt: "2023-02-03T02:58:42Z"

In addition, when scraping pod metrics with the OTEL Collector and sending them to Thanos Receive, after Thanos Receive scales up to 6 pods the logs show the TSDB not being ready, which also appears to be a consequence of the OOMKills.
OTEL Collector

me": "prometheusremotewrite", "error": "Permanent error: remote write returned HTTP status 500 Internal Server Error; err = %!w(<nil>): store locally for endpoint 127.0.0.1:10901: get appender: TSDB not ready\n", "dropped_items": 407}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/queued_retry.go:394
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/metrics.go:135
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
	go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/internal/bounded_memory_queue.go:61
2023-02-03T03:09:54.328Z	error	exporterhelper/queued_retry.go:394	Exporting failed. The error is not retryable. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: remote write returned HTTP status 502 Bad Gateway; err = %!w(<nil>): <html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n</body>\r\n</html>\r\n", "dropped_items": 414}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/queued_retry.go:394
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
	go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/metrics.go:135
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/queued_retry.go:205
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
	go.opentelemetry.io/collector@v0.66.0/exporter/exporterhelper/internal/bounded_memory_queue.go:61

Thanos Receive

level=debug ts=2023-02-03T03:10:29.706596677Z caller=main.go:67 msg="maxprocs: Updating GOMAXPROCS=[1]: using minimum allowed GOMAXPROCS"
level=info ts=2023-02-03T03:10:29.707099181Z caller=receive.go:125 component=receive mode=RouterIngestor msg="running receive"
level=info ts=2023-02-03T03:10:29.707127419Z caller=options.go:26 component=receive protocol=HTTP msg="disabled TLS, key and cert must be set to enable"
level=info ts=2023-02-03T03:10:29.707179527Z caller=factory.go:52 component=receive msg="loading bucket configuration"
level=info ts=2023-02-03T03:10:29.707534281Z caller=receive.go:702 component=receive msg="default tenant data dir already present, not attempting to migrate storage"
level=debug ts=2023-02-03T03:10:29.708346939Z caller=receive.go:264 component=receive msg="setting up TSDB"
level=debug ts=2023-02-03T03:10:29.708380296Z caller=receive.go:551 component=receive msg="removing storage lock files if any"
level=info ts=2023-02-03T03:10:29.711845832Z caller=multitsdb.go:415 component=receive component=multi-tsdb msg="a leftover lockfile found and removed" tenant=default-tenant
level=info ts=2023-02-03T03:10:29.711894552Z caller=receive.go:638 component=receive component=uploader msg="upload enabled, starting initial sync"
level=debug ts=2023-02-03T03:10:29.71190348Z caller=receive.go:626 component=receive component=uploader msg="upload phase starting"
level=debug ts=2023-02-03T03:10:29.711913101Z caller=receive.go:634 component=receive component=uploader msg="upload phase done" uploaded=0 elapsed=2.672µs
level=info ts=2023-02-03T03:10:29.711940043Z caller=receive.go:642 component=receive component=uploader msg="initial sync done"
level=debug ts=2023-02-03T03:10:29.711954049Z caller=receive.go:272 component=receive msg="setting up hashring"
level=debug ts=2023-02-03T03:10:29.712128372Z caller=receive.go:279 component=receive msg="setting up HTTP server"
level=debug ts=2023-02-03T03:10:29.71215783Z caller=receive.go:297 component=receive msg="setting up gRPC server"
level=info ts=2023-02-03T03:10:29.712170937Z caller=options.go:26 component=receive protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=debug ts=2023-02-03T03:10:29.712567297Z caller=receive.go:362 component=receive msg="setting up receive HTTP handler"
level=debug ts=2023-02-03T03:10:29.712588469Z caller=receive.go:391 component=receive msg="setting up periodic tenant pruning"
level=info ts=2023-02-03T03:10:29.712603814Z caller=receive.go:422 component=receive msg="starting receiver"
level=info ts=2023-02-03T03:10:29.727142426Z caller=receive.go:460 component=receive msg="the hashring initialized with config watcher."
level=debug ts=2023-02-03T03:10:29.727324029Z caller=config.go:248 component=receive component=config-watcher msg="refreshed hashring config"
level=warn ts=2023-02-03T03:10:29.727358993Z caller=intrumentation.go:67 component=receive msg="changing probe status" status=not-ready reason="hashring has changed; server is not ready to receive requests"
level=info ts=2023-02-03T03:10:29.72738477Z caller=receive.go:597 component=receive msg="hashring has changed; server is not ready to receive requests"
level=info ts=2023-02-03T03:10:29.727391568Z caller=receive.go:599 component=receive msg="updating storage"
level=info ts=2023-02-03T03:10:29.727469101Z caller=multitsdb.go:498 component=receive component=multi-tsdb tenant=default-tenant msg="opening TSDB"
level=info ts=2023-02-03T03:10:29.72778871Z caller=intrumentation.go:75 component=receive msg="changing probe status" status=healthy
level=info ts=2023-02-03T03:10:29.727822311Z caller=http.go:73 component=receive service=http/server component=receive msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2023-02-03T03:10:29.728156151Z caller=tls_config.go:195 component=receive service=http/server component=receive msg="TLS is disabled." http2=false
level=info ts=2023-02-03T03:10:29.728208198Z caller=receive.go:349 component=receive msg="listening for StoreAPI and WritableStoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2023-02-03T03:10:29.728241513Z caller=intrumentation.go:75 component=receive msg="changing probe status" status=healthy
level=info ts=2023-02-03T03:10:29.728280285Z caller=grpc.go:131 component=receive service=gRPC/server component=receive msg="listening for serving gRPC" address=0.0.0.0:10901
level=info ts=2023-02-03T03:10:29.728312044Z caller=handler.go:312 component=receive component=receive-handler msg="Start listening for connections" address=0.0.0.0:19291
level=info ts=2023-02-03T03:10:29.728407976Z caller=handler.go:339 component=receive component=receive-handler msg="Serving plain HTTP" address=0.0.0.0:19291
level=info ts=2023-02-03T03:10:29.749651877Z caller=head.go:551 component=receive component=multi-tsdb tenant=default-tenant msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2023-02-03T03:10:29.749698686Z caller=head.go:595 component=receive component=multi-tsdb tenant=default-tenant msg="On-disk memory mappable chunks replay completed" duration=2.079µs
level=info ts=2023-02-03T03:10:29.749707636Z caller=head.go:601 component=receive component=multi-tsdb tenant=default-tenant msg="Replaying WAL, this may take a while"
level=debug ts=2023-02-03T03:10:34.415416434Z caller=handler.go:660 component=receive component=receive-handler msg="local tsdb write failed" err="get appender: TSDB not ready"
level=debug ts=2023-02-03T03:10:34.415521532Z caller=handler.go:660 component=receive component=receive-handler msg="local tsdb write failed" err="get appender: TSDB not ready"
level=debug ts=2023-02-03T03:10:34.415584461Z caller=handler.go:660 component=receive component=receive-handler msg="local tsdb write failed" err="get appender: TSDB not ready"
level=debug ts=2023-02-03T03:10:34.415607355Z caller=handler.go:515 component=receive component=receive-handler tenant=default-tenant msg="failed to handle request" err="store locally for endpoint 127.0.0.1:10901: get appender: TSDB not ready"
level=error ts=2023-02-03T03:10:34.415631376Z caller=handler.go:526 component=receive component=receive-handler tenant=default-tenant err="store locally for endpoint 127.0.0.1:10901: get appender: TSDB not ready" msg="internal server error"
level=debug ts=2023-02-03T03:10:34.415796197Z caller=handler.go:515 component=receive component=receive-handler tenant=default-tenant msg="failed to handle request" err="store locally for endpoint 127.0.0.1:10901: get appender: TSDB not ready"
level=error ts=2023-02-03T03:10:34.415824016Z caller=handler.go:526 component=receive component=receive-handler tenant=default-tenant err="store locally for endpoint 127.0.0.1:10901: get appender: TSDB not ready" msg="internal server error"
level=debug ts=2023-02-03T03:10:34.416297406Z caller=handler.go:515 component=receive component=receive-handler tenant=default-tenant msg="failed to handle request" err="store locally for endpoint 127.0.0.1:10901: get appender: TSDB not ready"

Please find the values.yaml configuration for Thanos Receive (Bitnami Helm chart) below.

      receive:
        enabled: true
        livenessProbe:
          enabled: false
        readinessProbe:
          enabled: false
        resources:
          limits:
            memory: 18Gi
            cpu: 500m
        autoscaling:
          enabled: true
          minReplicas: 3
          maxReplicas: 6
          targetCPU: 80
          targetMemory: 70
        logLevel: info
        tsdbRetention: 2h
        persistence:
          size: 12Gi
        serviceAccount:
          annotations:
            eks.amazonaws.com/role-arn: arn:aws:iam::some_account_id:role/thanos-receive
@GiedriusS
Member

Maybe you could try out the newest main version with the feature flag from #5926 to see whether it helps. What do the results look like?

@fpetkovski
Contributor

fpetkovski commented Feb 6, 2023

It is hard to say whether this is a problem without knowing how much data you are sending to receivers. TSDB not being ready is highly unlikely to be the culprit.

@caoimheharvey
Author

It is hard to say whether this is a problem without knowing how much data you are sending to receivers. TSDB not being ready is highly unlikely to be the culprit.

We have the OTEL Collector set up as a DaemonSet across 38 nodes (multiple different clusters), collecting metrics from pods, EKS nodes, and the kube API. Unfortunately I don't have exact numbers on the volume.
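
For context, a minimal sketch of the exporter side of our collector config, assuming the collector writes to a thanos-receive Service on the remote-write port 19291 shown in the logs above (the Service name and scrape config are illustrative, not our exact setup):

    exporters:
      prometheusremotewrite:
        # Hypothetical Service name; port 19291 and /api/v1/receive are the
        # Thanos Receive remote-write handler address and path.
        endpoint: http://thanos-receive.thanos.svc.cluster.local:19291/api/v1/receive

    receivers:
      prometheus:
        config:
          scrape_configs:
            # Illustrative self-scrape; the real config scrapes pods, nodes and the kube API.
            - job_name: otel-collector
              static_configs:
                - targets: ["127.0.0.1:8888"]

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [prometheusremotewrite]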

@caoimheharvey
Author

It seems that it was related to the amount of data being received. I changed the OTEL Collector from a DaemonSet to a Deployment with 5 replicas, and the memory issues seem to have resolved themselves. Thanks for helping :)
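
If it helps anyone else, the change was roughly the following, assuming the opentelemetry-collector Helm chart (values are illustrative, not our full config):

    # Switch the collector from one pod per node to a fixed-size pool.
    mode: deployment        # previously: daemonset
    replicaCount: 5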

@fpetkovski
Contributor

That makes sense, thanks for providing the reason 👍
