
Thanos Compact gets internal server error when facing context canceled from Storage Provider #4689

Closed
nicolastakashi opened this issue Sep 21, 2021 · 10 comments


@nicolastakashi (Contributor) commented Sep 21, 2021

Thanos, Prometheus and Golang version used:

Thanos: 0.22
Prometheus: 2.27

Object Storage Provider: Microsoft Azure

What happened: Thanos Compact gets an internal server error when facing a context canceled error from the Storage Provider.

What you expected to happen: Retry and keep healthy.

How to reproduce it (as minimally and precisely as possible): There is no specific scenario; I just started facing this error after updating to version 0.22.

Full logs to relevant components:

level=info ts=2021-09-21T13:37:30.172907505Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2021-09-21T13:37:30.679158689Z caller=compact.go:556 msg="starting compact node"
level=info ts=2021-09-21T13:37:30.679202292Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2021-09-21T13:37:30.679450812Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2021-09-21T13:37:30.679477414Z caller=http.go:63 service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2021-09-21T13:37:30.679994754Z caller=tls_config.go:191 service=http/server component=compact msg="TLS is disabled." http2=false
level=info ts=2021-09-21T13:37:30.680282777Z caller=compact.go:948 msg="start sync of metas"
level=info ts=2021-09-21T13:40:49.566485486Z caller=main.go:167 msg="caught signal. Exiting." signal=terminated
level=warn ts=2021-09-21T13:40:49.56655559Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
level=info ts=2021-09-21T13:40:49.566569191Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err=null
level=info ts=2021-09-21T13:40:50.065939201Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err=null
level=info ts=2021-09-21T13:40:50.065987104Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null
level=error ts=2021-09-21T13:40:50.27341152Z caller=compact.go:488 msg="retriable error" err="compaction: sync: BaseFetcher: iter bucket: context canceled"
level=error ts=2021-09-21T13:40:50.273894348Z caller=runutil.go:101 msg="function failed. Retrying in next tick" err="BaseFetcher: iter bucket: cannot list hierarchy blobs with prefix  (iteration #1): Get \"https://storage.blob.core.windows.net/container?comp=list&delimiter=%2F&restype=container&timeout=61\": context canceled"
level=info ts=2021-09-21T13:40:50.273937051Z caller=main.go:159 msg=exiting

Anything else we need to know:

@Aaron-ML

I'm seeing something similar to this on 0.22 as well

level=info ts=2021-09-14T00:42:00.064565662Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=805.474531ms cached=1284 returned=1284 partial=0
level=warn ts=2021-09-14T00:42:03.017083892Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
level=info ts=2021-09-14T00:42:03.017118393Z caller=http.go:74 service=http/server component=compact msg="internal server is shutting down" err=null
level=error ts=2021-09-14T00:42:03.017273796Z caller=compact.go:488 msg="retriable error" err="syncing metas: filter metas: filter blocks marked for deletion: get file: 01FFC1B5V5CMHQN05DMBC9105N/deletion-mark.json: cannot get blob reader: 01FFC1B5V5CMHQN05DMBC9105N/deletion-mark.json: cannot get properties for Azure blob, address: 01FFC1B5V5CMHQN05DMBC9105N/deletion-mark.json: Head \"https://<redacted>.blob.core.windows.net/thanos/01FFC1B5V5CMHQN05DMBC9105N/deletion-mark.json?timeout=61\": dial tcp 52.241.88.36:443: connect: cannot assign requested address"
level=info ts=2021-09-14T00:42:03.019124632Z caller=http.go:93 service=http/server component=compact msg="internal server is shutdown gracefully" err=null
level=info ts=2021-09-14T00:42:03.019148232Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null
level=info ts=2021-09-14T00:42:03.019185833Z caller=main.go:159 msg=exiting

This doesn't happen on 0.21, very odd.

@hanjm (Member) commented Sep 26, 2021

By default thanos compact will run to completion which makes it possible to execute in a cronjob. Using the arguments --wait and --wait-interval=5m it’s possible to keep it running.

https://thanos.io/tip/components/compact.md/#compactor
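
A minimal sketch of those flags as container args for a long-running deployment (the paths and the 5m interval are just example values):

  - args:
    - compact
    - --objstore.config-file=/etc/config/object-store.yaml
    - --data-dir=/var/thanos/compact
    - --wait
    - --wait-interval=5m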

@Aaron-ML commented Sep 27, 2021

> By default thanos compact will run to completion which makes it possible to execute in a cronjob. Using the arguments --wait and --wait-interval=5m it’s possible to keep it running.
>
> https://thanos.io/tip/components/compact.md/#compactor

@hanjm I'm already running with the --wait flag and have been for quite a while; not sure if that changed in 0.22? This isn't running as a cronjob but rather as a deployment.

  - args:
    - compact
    - --log.level=info
    - --log.format=logfmt
    - --http-address=0.0.0.0:10902
    - --objstore.config-file=/etc/config/object-store.yaml
    - --data-dir=/var/thanos/compact
    - --consistency-delay=30m
    - --retention.resolution-raw=30d
    - --retention.resolution-5m=120d
    - --retention.resolution-1h=1y
    - --block-sync-concurrency=20
    - --compact.concurrency=1
    - --wait
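
For context, the object-store.yaml referenced above follows the Thanos Azure objstore format; a rough sketch with placeholder values:

    type: AZURE
    config:
      storage_account: "<storage-account-name>"
      storage_account_key: "<storage-account-key>"
      container: "<container-name>"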

@nicolastakashi (Contributor, Author)

I'm using the same strategy as @Aaron-ML, and even on version 0.20.1 I'm facing errors that could be handled by retrying the operation.

level=info ts=2021-09-30T07:02:29.072136538Z caller=factory.go:46 msg="loading bucket configuration"
level=info ts=2021-09-30T07:02:29.56787006Z caller=compact.go:539 msg="starting compact node"
level=info ts=2021-09-30T07:02:29.567916963Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2021-09-30T07:02:29.568176778Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2021-09-30T07:02:29.568213681Z caller=http.go:62 service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2021-09-30T07:02:29.568552701Z caller=compact.go:942 msg="start sync of metas"
level=info ts=2021-09-30T07:02:39.183579477Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=9.614692056s cached=348 returned=348 partial=0
level=info ts=2021-09-30T07:02:57.97161771Z caller=fetcher.go:476 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=18.698576374s cached=348 returned=348 partial=0
level=info ts=2021-09-30T07:03:11.869691052Z caller=main.go:167 msg="caught signal. Exiting." signal=terminated
level=warn ts=2021-09-30T07:03:11.869753456Z caller=intrumentation.go:54 msg="changing probe status" status=not-ready reason=null
level=info ts=2021-09-30T07:03:11.869770557Z caller=http.go:69 service=http/server component=compact msg="internal server is shutting down" err=null
level=info ts=2021-09-30T07:03:11.874936563Z caller=http.go:88 service=http/server component=compact msg="internal server is shutdown gracefully" err=null
level=info ts=2021-09-30T07:03:11.874965165Z caller=intrumentation.go:66 msg="changing probe status" status=not-healthy reason=null
level=error ts=2021-09-30T07:03:12.368622077Z caller=compact.go:468 msg="retriable error" err="compaction: sync: filter metas: filter blocks marked for no compaction: get file: 01G4E4NFP77YFEJES4E63TB07S/no-compact-mark.json: cannot get blob reader: 01F4E3NCP77YFEJES3E63TF07S/no-compact-mark.json: cannot get properties for Azure blob, address: 01F4E3NCP77YFEJES4E63TB07S/no-compact-mark.json: -> github.com/Azure/azure-pipeline-go/pipeline.NewError, /home/circleci/go/pkg/mod/github.com/!azure/azure-pipeline-go@v0.2.2/pipeline/error.go:154\nHTTP request failed\n\nHead \"https://<storage>.blob.core.windows.net/container/01F4E3NCP77YFEJES4E63TB07S/no-compact-mark.json?timeout=61\": context canceled\n"
level=info ts=2021-09-30T07:03:12.369028801Z caller=main.go:159 msg=exiting

@ahurtaud (Contributor) commented Oct 7, 2021

This can be a duplicate of #4466, which is still valid for me.
@nicolastakashi @Aaron-ML Can you try to run thanos tools bucket inspect?
As per my comment on the other issue, it works fine in 0.21.0 but has failed since 0.22. Still valid in 0.23.1.

@nicolastakashi (Contributor, Author)

Hi @ahurtaud, sorry, but I don't see how this issue relates to thanos tools bucket inspect. Could you please provide a little more context?

@ahurtaud (Contributor)

@nicolastakashi I think some code is shared for listing buckets/meta.
I suspect this issue lies in the Azure API; I am seeing the same regression between 0.21 and 0.22+ for Azure blob storage, and thanos tools bucket inspect is an easy way to reproduce it with the CLI (see the sketch below).

So this might be the same root cause.
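
A quick way to try that against the same bucket (the config file path is just an example):

  thanos tools bucket inspect --objstore.config-file=/etc/config/object-store.yaml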

@nicolastakashi (Contributor, Author)

Hey peeps!
A quick update.
Over the last week, I reviewed my pod resource requests and limits, and I noticed a lot of CPU throttling on the Thanos Compact pod.

After I adjusted the resource requests and limits so that Thanos Compact was no longer throttled, it stopped receiving timeouts during storage communication.
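
For anyone hitting the same symptom, the adjustment was on the Kubernetes side; a sketch of the kind of container resources block involved (the values are purely illustrative and depend entirely on your workload):

    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 2000m
        memory: 4Gi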

@yeya24 (Contributor) commented Oct 30, 2021

Is it good to close the issue now? @nicolastakashi

@nicolastakashi (Contributor, Author)

Hi @yeya24, from my side it looks OK!

@yeya24 closed this as completed Nov 2, 2021