Thanos, Prometheus and Golang version used:
Store:
thanos, version 0.11.0-rc.1 (branch: HEAD, revision: f223a1d17c3d6bfa14938ba155660fdcada618fb)
build user: circleci@04cefde2f01f
build date: 20200226-14:55:11
go version: go1.13.1
Compact:
thanos, version 0.10.1 (branch: HEAD, revision: bdcc35842f30ada375f65aaf748a104b43d56672)
build user: circleci@4e51e880cd24
build date: 20200124-07:36:32
go version: go1.13.1
Object Storage Provider: AWS S3
What happened:
After upgrading thanos-store to version 0.11.0-rc.1, I encountered unexpected behavior. After running without issues for some time, the bucket_store_preload_all operation suddenly takes around 160 seconds to complete (it stops due to the timeout, I assume). Partial responses are disabled in my configuration, so the query component also returns an error response code and outputs context deadline exceeded.
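For context, a sketch of how the querier is configured in this respect; the flags and addresses below are assumptions meant only to illustrate the setup, not copied from the actual manifests:
# Hypothetical querier invocation with partial responses disabled; a single failing
# StoreAPI response then aborts the whole query instead of being silently dropped.
thanos query \
  --http-address=0.0.0.0:10904 \
  --grpc-address=0.0.0.0:10903 \
  --query.timeout=2m \
  --no-query.partial-response \
  --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc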
In the store component logs, I see lots of messages like rpc error: code = Aborted desc = fetch series for block {ULID}: preload chunks: read range for 0: get range reader: The specified key does not exist, which start to appear about a minute after the compactor logs msg="deleting compacted block" old_block={ULID}.
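A quick way to confirm the race (the bucket name below is a placeholder; the ULID is the one from the logs further down):
# List the block directory in the bucket; an empty result right after the compactor's
# "deleting compacted block" line matches the "The specified key does not exist" errors.
aws s3 ls s3://<thanos-bucket>/01E234VA29MRC9BTZ065AHT6YX/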
The issue seems to be related to #564, but this doesn't happen with versions 0.10.1 and 0.9.0.
I also noticed that the number of goroutines increases during the issue, from the usual ~30 up to ~500. The most common ones are:
/go/pkg/mod/github.com/minio/minio-go/v6@v6.0.49/retry.go:79
/go/pkg/mod/github.com/minio/minio-go/v6@v6.0.49/api-get-object.go:103
/go/src/github.com/thanos-io/thanos/pkg/store/bucket.go:882
/go/src/github.com/thanos-io/thanos/pkg/store/bucket.go:1843
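The stacks above come from a goroutine dump; assuming the store gateway's HTTP port is the default 10902 and the standard Go pprof endpoints are exposed there, a dump can be captured like this:
# Dump all goroutines from the store gateway and count the ones parked in minio-go,
# which dominate while the issue is ongoing.
curl -s 'http://thanos-store:10902/debug/pprof/goroutine?debug=2' > goroutines.txt
grep -c 'minio-go' goroutines.txt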
It results in an elevated error rate, measured by thanos_objstore_bucket_operation_failures_total{operation="get_range"}, and lasts for 2-3 hours on average, until the store runs the SyncBlocks process.
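For reference, one way to watch that failure window (the Prometheus address is a placeholder):
# Per-second rate of failed get_range object storage operations reported by the store gateway.
curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=rate(thanos_objstore_bucket_operation_failures_total{operation="get_range"}[5m])'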
This behavior was also noticed in some intermediate master builds between 0.10.1 and 0.11.0-rc.1.
What you expected to happen:
The store component should handle a missing block gracefully and resync its blocks automatically.
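The store gateway already resyncs its view of the bucket periodically; a sketch of the relevant knob, assuming --sync-block-duration is the block-sync interval flag in this version (worth double-checking against thanos store --help):
# Shorten the periodic bucket sync so a deleted block drops out of the store's view sooner.
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --sync-block-duration=3m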
How to reproduce it (as minimally and precisely as possible):
Deploy the thanos store component version 0.11.0-rc.1, wait for the compactor to delete some old blocks after compaction, and query data that was stored in those deleted blocks.
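A minimal sketch of that setup, with placeholder paths and an object storage config file assumed:
# 1. Store gateway on 0.11.0-rc.1 against the bucket.
thanos store --data-dir=/var/thanos/store --objstore.config-file=/etc/thanos/objstore.yml
# 2. Compactor against the same bucket; --wait keeps it running so it compacts and then
#    deletes the source blocks.
thanos compact --wait --data-dir=/var/thanos/compact --objstore.config-file=/etc/thanos/objstore.yml
# 3. Through thanos query (partial responses disabled), request a time range covered by the
#    deleted source blocks and watch the store logs for "The specified key does not exist".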
Full logs to relevant components:
Log messages from the store and compact components
err="Addr: 172.23.81.1:10901 LabelSets:
[name:\"prometheus_replica\" value:\"monitoring/thanos-rule-0\" ]
[name:\"prometheus_replica\" value:\"monitoring/thanos-rule-1\" ]
Mint: 1566432000000 Maxt: 1582826400000:
receive series from Addr: 172.23.81.1:10901 LabelSets:
[name:\"prometheus_replica\" value:\"monitoring/thanos-rule-0\" ]
[name:\"prometheus_replica\" value:\"monitoring/thanos-rule-1\" ]
Mint: 1566432000000 Maxt: 1582826400000:
rpc error: code = Aborted desc = fetch series for block 01E234VA29MRC9BTZ065AHT6YX:
preload chunks: read range for 0: get range reader: The specified key does not exist."
thanos-compact-0 - level=info ts=2020-02-27T19:32:38.493845908Z caller=compact.go:834
compactionGroup=0@1170028431517605376 msg="deleting compacted block"
old_block=01E234VA29MRC9BTZ065AHT6YX
thanos-compact-0 - level=info ts=2020-02-27T19:32:38.20204386Z caller=compact.go:441
msg="compact blocks" count=4 mint=1582790400000 maxt=1582819200000
ulid=01E2425XYTVXTS5QY8EJA43GQJ sources="[01E234VA29MRC9BTZ065AHT6YX
01E23BQ18W2K2T76NEJN19QY18 01E23JJRJYBTXBC9DAJXQ1AJG3
01E23SEFS322B3JN40PSAEHDAM]" duration=224.375018ms
Goroutine profile sample
Additional information
The first time, this issue happened with the --experimental.enable-index-header feature enabled, but it was also reproduced with this feature disabled.
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
We encounter the same issue, exactly as described above. The main problem for us is the number of goroutines, which causes Thanos query to fail some requests.
We are now upgrading to v0.12.2 in the hope that this helps.
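If it helps anyone else reading this: as far as I understand, v0.12 added delayed block deletion on the compactor side (blocks are first marked for deletion and only physically removed after a delay), which should give store gateways time to resync before the objects disappear. The flag below and its default are an assumption as of v0.12.x; verify with thanos compact --help.
thanos compact --wait --delete-delay=48h --objstore.config-file=/etc/thanos/objstore.yml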