Thanos, Prometheus and Golang version used:
Store:
thanos, version 0.11.0-rc.1 (branch: HEAD, revision: f223a1d17c3d6bfa14938ba155660fdcada618fb)
build user: circleci@04cefde2f01f
build date: 20200226-14:55:11
go version: go1.13.1
Compact:
thanos, version 0.10.1 (branch: HEAD, revision: bdcc35842f30ada375f65aaf748a104b43d56672)
build user: circleci@4e51e880cd24
build date: 20200124-07:36:32
go version: go1.13.1
Object Storage Provider: AWS S3
What happened:
After upgrading thanos-store to version 0.11.0-rc.1, I encountered unexpected behavior. After running without issues for some time, the bucket_store_preload_all operation suddenly takes around 160 seconds to complete (it stops due to the timeout, I assume). Partial responses are disabled in my configuration, so the query component also returns an error response code and outputs context deadline exceeded.
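For context, a sketch of how the querier is configured in this respect; the flags and addresses below are assumptions meant only to illustrate the setup, not copied from the actual manifests:
# Hypothetical querier invocation with partial responses disabled; a single failing
# StoreAPI response then aborts the whole query instead of being silently dropped.
thanos query \
  --http-address=0.0.0.0:10904 \
  --grpc-address=0.0.0.0:10903 \
  --query.timeout=2m \
  --no-query.partial-response \
  --store=dnssrv+_grpc._tcp.thanos-store.monitoring.svc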
In the store component logs, I see lots of messages like rpc error: code = Aborted desc = fetch series for block {ULID}: preload chunks: read range for 0: get range reader: The specified key does not exist, which start to appear about a minute after the compactor logs msg="deleting compacted block" old_block={ULID}.
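A quick way to confirm the race (the bucket name below is a placeholder; the ULID is the one from the logs further down):
# List the block directory in the bucket; an empty result right after the compactor's
# "deleting compacted block" line matches the "The specified key does not exist" errors.
aws s3 ls s3://<thanos-bucket>/01E234VA29MRC9BTZ065AHT6YX/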
The issue seems to be related to #564, but this doesn't happen with versions 0.10.1 and 0.9.0.
I also noticed that the number of goroutines increases during the issue, from the usual ~30 up to ~500. The most common ones are:
/go/pkg/mod/github.com/minio/minio-go/v6@v6.0.49/retry.go:79
/go/pkg/mod/github.com/minio/minio-go/v6@v6.0.49/api-get-object.go:103
/go/src/github.com/thanos-io/thanos/pkg/store/bucket.go:882
/go/src/github.com/thanos-io/thanos/pkg/store/bucket.go:1843
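The stacks above come from a goroutine dump; assuming the store gateway's HTTP port is the default 10902 and the standard Go pprof endpoints are exposed there, a dump can be captured like this:
# Dump all goroutines from the store gateway and count the ones parked in minio-go,
# which dominate while the issue is ongoing.
curl -s 'http://thanos-store:10902/debug/pprof/goroutine?debug=2' > goroutines.txt
grep -c 'minio-go' goroutines.txt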
It results in an elevated error rate, measured by thanos_objstore_bucket_operation_failures_total{operation="get_range"}, and lasts for 2-3 hours on average, until the store runs the SyncBlocks process.
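For reference, one way to watch that failure window (the Prometheus address is a placeholder):
# Per-second rate of failed get_range object storage operations reported by the store gateway.
curl -s 'http://prometheus.monitoring.svc:9090/api/v1/query' \
  --data-urlencode 'query=rate(thanos_objstore_bucket_operation_failures_total{operation="get_range"}[5m])'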
This behavior was also noticed in some intermediate master builds between 0.10.1 and 0.11.0-rc.1.
What you expected to happen:
The store component should handle a missing block gracefully and resync its blocks automatically.
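The store gateway already resyncs its view of the bucket periodically; a sketch of the relevant knob, assuming --sync-block-duration is the block-sync interval flag in this version (worth double-checking against thanos store --help):
# Shorten the periodic bucket sync so a deleted block drops out of the store's view sooner.
thanos store \
  --data-dir=/var/thanos/store \
  --objstore.config-file=/etc/thanos/objstore.yml \
  --sync-block-duration=3m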
How to reproduce it (as minimally and precisely as possible):
Deploy the thanos store component version 0.11.0-rc.1, wait for the compactor to delete some old blocks after compaction, and query data that was stored in those deleted blocks.
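A minimal sketch of that setup, with placeholder paths and an object storage config file assumed:
# 1. Store gateway on 0.11.0-rc.1 against the bucket.
thanos store --data-dir=/var/thanos/store --objstore.config-file=/etc/thanos/objstore.yml
# 2. Compactor against the same bucket; --wait keeps it running so it compacts and then
#    deletes the source blocks.
thanos compact --wait --data-dir=/var/thanos/compact --objstore.config-file=/etc/thanos/objstore.yml
# 3. Through thanos query (partial responses disabled), request a time range covered by the
#    deleted source blocks and watch the store logs for "The specified key does not exist".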
Full logs to relevant components:
Log messages from the store and compact components
err="Addr: 172.23.81.1:10901 LabelSets:
[name:\"prometheus_replica\" value:\"monitoring/thanos-rule-0\" ]
[name:\"prometheus_replica\" value:\"monitoring/thanos-rule-1\" ]
Mint: 1566432000000 Maxt: 1582826400000:
receive series from Addr: 172.23.81.1:10901 LabelSets:
[name:\"prometheus_replica\" value:\"monitoring/thanos-rule-0\" ]
[name:\"prometheus_replica\" value:\"monitoring/thanos-rule-1\" ]
Mint: 1566432000000 Maxt: 1582826400000:
rpc error: code = Aborted desc = fetch series for block 01E234VA29MRC9BTZ065AHT6YX:
preload chunks: read range for 0: get range reader: The specified key does not exist."
thanos-compact-0 - level=info ts=2020-02-27T19:32:38.493845908Z caller=compact.go:834
compactionGroup=0@1170028431517605376 msg="deleting compacted block"
old_block=01E234VA29MRC9BTZ065AHT6YX
thanos-compact-0 - level=info ts=2020-02-27T19:32:38.20204386Z caller=compact.go:441
msg="compact blocks" count=4 mint=1582790400000 maxt=1582819200000
ulid=01E2425XYTVXTS5QY8EJA43GQJ sources="[01E234VA29MRC9BTZ065AHT6YX
01E23BQ18W2K2T76NEJN19QY18 01E23JJRJYBTXBC9DAJXQ1AJG3
01E23SEFS322B3JN40PSAEHDAM]" duration=224.375018ms
Goroutine profile sample
Additional information
The first time, this issue happened with the --experimental.enable-index-header feature enabled, but it was also reproduced with this feature disabled.
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
We encounter the same issue, exactly as described above. The main problem for us is the number of goroutines, which causes Thanos query to fail some requests.
We are now upgrading to v0.12.2 in the hope that this helps.
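If it helps anyone else reading this: as far as I understand, v0.12 added delayed block deletion on the compactor side (blocks are first marked for deletion and only physically removed after a delay), which should give store gateways time to resync before the objects disappear. The flag below and its default are an assumption as of v0.12.x; verify with thanos compact --help.
thanos compact --wait --delete-delay=48h --objstore.config-file=/etc/thanos/objstore.yml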