Mimir failed consistency check, unable to query certain blocks #2656

mari-arondeus · 2022-08-05T13:20:04Z

mari-arondeus
Aug 5, 2022

Hello, all.
It appears our Mimir instance has lost the ability to query certain blocks. We're using MinIO as a backend, and I've already restored the missing blocks at an object storage level. We had object versioning enabled, thankfully, but restoring the objects and manually editing the Mimir index to include them did not appear to resolve the issue. Mimir continues to throw a 500 error every time that range is queried, and I find the following logs on each Mimir host:

2022-08-05 09:10:54	level=warn ts=2022-08-05T13:10:54.295809426Z caller=logging.go:72 traceID=23236640d8102c82 msg="POST /prometheus/api/v1/query_range (500) 765.63155ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"internal\\\",\\\"error\\\":\\\"expanding series: the consistency check failed because some blocks were not queried (err-mimir-store-consistency-check-failed). The non-queried blocks are: 01G9N647CH52CN3K73WG453P8Q 01G9N3PJWF12K2TQYQNAP42645 01G9N57ZX2Z0FTFKSZZM6H1SWA 01G9N3PMRR01R8Y1RYKFCBR04A 01G9N3PYWQ50CPYGP7DVQNCFG0 01G9N57YV4ETEFA6WDBGD7ZNMC\\\"}\" ws: false; Accept-Encoding: gzip; Content-Length: 88; Content-Type: application/x-www-form-urlencoded; User-Agent: Grafana/9.0.5; X-Forwarded-For: 10.100.1.18; X-Forwarded-Port: 3200; X-Forwarded-Proto: https; "
2022-08-05 09:10:54	level=error ts=2022-08-05T13:10:54.29563604Z caller=retry.go:78 user=anonymous msg="error processing request" try=4 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: the consistency check failed because some blocks were not queried (err-mimir-store-consistency-check-failed). The non-queried blocks are: 01G9N647CH52CN3K73WG453P8Q 01G9N3PJWF12K2TQYQNAP42645 01G9N57ZX2Z0FTFKSZZM6H1SWA 01G9N3PMRR01R8Y1RYKFCBR04A 01G9N3PYWQ50CPYGP7DVQNCFG0 01G9N57YV4ETEFA6WDBGD7ZNMC\"}"
2022-08-05 09:10:54	ts=2022-08-05T13:10:54.295325984Z caller=spanlogger.go:80 user=anonymous method=blocksStoreQuerier.selectSorted level=warn user=anonymous msg="failed consistency check" err=null
2022-08-05 09:10:54	level=error ts=2022-08-05T13:10:54.210888731Z caller=retry.go:78 user=anonymous msg="error processing request" try=3 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: the consistency check failed because some blocks were not queried (err-mimir-store-consistency-check-failed). The non-queried blocks are: 01G9N3PJWF12K2TQYQNAP42645 01G9N3PMRR01R8Y1RYKFCBR04A 01G9N57ZX2Z0FTFKSZZM6H1SWA 01G9N57YV4ETEFA6WDBGD7ZNMC 01G9N647CH52CN3K73WG453P8Q 01G9N3PYWQ50CPYGP7DVQNCFG0\"}"
2022-08-05 09:10:54	ts=2022-08-05T13:10:54.210606635Z caller=spanlogger.go:80 user=anonymous method=blocksStoreQuerier.selectSorted level=warn user=anonymous msg="failed consistency check" err=null
2022-08-05 09:10:54	level=error ts=2022-08-05T13:10:54.068575389Z caller=retry.go:78 user=anonymous msg="error processing request" try=2 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: the consistency check failed because some blocks were not queried (err-mimir-store-consistency-check-failed). The non-queried blocks are: 01G9N3PMRR01R8Y1RYKFCBR04A 01G9N3PJWF12K2TQYQNAP42645 01G9N3PYWQ50CPYGP7DVQNCFG0 01G9N57YV4ETEFA6WDBGD7ZNMC 01G9N647CH52CN3K73WG453P8Q 01G9N57ZX2Z0FTFKSZZM6H1SWA\"}"
2022-08-05 09:10:54	ts=2022-08-05T13:10:54.068296583Z caller=spanlogger.go:80 user=anonymous method=blocksStoreQuerier.selectSorted level=warn user=anonymous msg="failed consistency check" err=null
2022-08-05 09:10:54	level=error ts=2022-08-05T13:10:54.000619783Z caller=retry.go:78 user=anonymous msg="error processing request" try=1 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: the consistency check failed because some blocks were not queried (err-mimir-store-consistency-check-failed). The non-queried blocks are: 01G9N3PMRR01R8Y1RYKFCBR04A 01G9N57YV4ETEFA6WDBGD7ZNMC 01G9N57ZX2Z0FTFKSZZM6H1SWA 01G9N3PYWQ50CPYGP7DVQNCFG0 01G9N3PJWF12K2TQYQNAP42645 01G9N647CH52CN3K73WG453P8Q\"}"
2022-08-05 09:10:54	ts=2022-08-05T13:10:54.000248218Z caller=spanlogger.go:80 user=anonymous method=blocksStoreQuerier.selectSorted level=warn user=anonymous msg="failed consistency check" err=null
2022-08-05 09:10:53	level=error ts=2022-08-05T13:10:53.915430267Z caller=retry.go:78 user=anonymous msg="error processing request" try=0 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: the consistency check failed because some blocks were not queried (err-mimir-store-consistency-check-failed). The non-queried blocks are: 01G9N57ZX2Z0FTFKSZZM6H1SWA 01G9N647CH52CN3K73WG453P8Q 01G9N3PJWF12K2TQYQNAP42645 01G9N3PMRR01R8Y1RYKFCBR04A 01G9N3PYWQ50CPYGP7DVQNCFG0 01G9N57YV4ETEFA6WDBGD7ZNMC\"}"
2022-08-05 09:10:53	ts=2022-08-05T13:10:53.915102527Z caller=spanlogger.go:80 user=anonymous method=blocksStoreQuerier.selectSorted level=warn user=anonymous msg="failed consistency check" err=null

We aren't able to query between 1659628830000 (2022-08-04T16:00:30 UTC) and 1659636300000 (2022-08-04T18:05:00 UTC). It appears to be just these 6 blocks that are causing an issue. No updates have been done on our Mimir cluster recently, and no changes have been made to the configuration. I'm at a bit of a loss as to why these 6 objects would be deleted, and why Mimir would not be willing to query them once restored to object storage and the index. Any guidance would be greatly appreciated.

I've also included a copy of our config, in case it's helpful. We have Mimir deployed as 4 individual single-node services on Docker Swarm Mode.

mimir.txt
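
For anyone triaging the same error, the distinct block IDs and the affected time range can be pulled out of log lines like the ones above with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the sample lines are shortened copies of the retry logs quoted in this post.

```python
import re
from datetime import datetime, timezone

# ULIDs are 26 characters of Crockford base32 (no I, L, O or U).
ULID_RE = re.compile(r"\b[0-9A-HJKMNP-TV-Z]{26}\b")
MARKER = "The non-queried blocks are:"


def non_queried_blocks(log_lines):
    """Return the sorted, de-duplicated block IDs listed after MARKER."""
    blocks = set()
    for line in log_lines:
        _, found, tail = line.partition(MARKER)
        if found:
            blocks.update(ULID_RE.findall(tail))
    return sorted(blocks)


def ms_to_utc(ms):
    """Render a Unix timestamp in milliseconds as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc).isoformat()


# Shortened copies of the try=0 and try=1 lines (same IDs, different order).
sample = [
    "try=0 ... The non-queried blocks are: 01G9N57ZX2Z0FTFKSZZM6H1SWA "
    "01G9N647CH52CN3K73WG453P8Q 01G9N3PJWF12K2TQYQNAP42645 "
    "01G9N3PMRR01R8Y1RYKFCBR04A 01G9N3PYWQ50CPYGP7DVQNCFG0 "
    "01G9N57YV4ETEFA6WDBGD7ZNMC",
    "try=1 ... The non-queried blocks are: 01G9N3PMRR01R8Y1RYKFCBR04A "
    "01G9N57YV4ETEFA6WDBGD7ZNMC 01G9N57ZX2Z0FTFKSZZM6H1SWA "
    "01G9N3PYWQ50CPYGP7DVQNCFG0 01G9N3PJWF12K2TQYQNAP42645 "
    "01G9N647CH52CN3K73WG453P8Q",
]

print(len(non_queried_blocks(sample)))  # 6
print(ms_to_utc(1659628830000))         # 2022-08-04T16:00:30+00:00
print(ms_to_utc(1659636300000))         # 2022-08-04T18:05:00+00:00
```

Each retry lists the same six blocks in a different order, so de-duplicating them makes it easier to see that the set never changes between attempts.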

Answered by pracucci

pracucci · 2022-08-08T16:52:38Z

pracucci
Aug 8, 2022
Maintainer

I think there are two different things to investigate:

Why were these blocks deleted from the object storage?
Why aren't these blocks queried after manually restoring them?

Why were these blocks deleted from the object storage?

Do you have the compactor logs from around the time these blocks were deleted? Can you find any related log message? I would like to better understand whether they were deleted by the compactor (no other Mimir component can delete blocks, so if it wasn't the compactor then the deletion was caused by something outside Mimir's control).

Why aren't these blocks queried after manually restoring them?

Queriers look up blocks through the bucket index. The bucket index is updated periodically by the compactor (by default every -compactor.cleanup-interval=15m). So, within 15m of the blocks being manually restored, we expect the bucket index to be updated and to contain the restored blocks too.

You can manually look up the bucket index in the object storage: it's stored at the path /<tenant id>/bucket-index.json.gz. Does it contain the restored blocks? If not, can you look at the compactor logs to see why (e.g. do the restored blocks have the meta.json file)?
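
As a rough illustration of that check, the sketch below reads a bucket index that has already been downloaded to local disk and reports whether given block IDs appear in it. The JSON field names (blocks, block_deletion_marks, block_id) are assumptions about the index layout, so compare them against your own file before relying on the result. The helper names are hypothetical too.

```python
import gzip
import json


def load_bucket_index(path):
    """Read a locally downloaded bucket-index.json.gz into a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def block_status(index, block_ids):
    """Map each block ID to 'listed', 'marked for deletion' or 'missing'."""
    # Assumed layout: {"blocks": [{"block_id": ...}, ...],
    #                  "block_deletion_marks": [{"block_id": ...}, ...]}
    listed = {b.get("block_id") for b in index.get("blocks") or []}
    marked = {m.get("block_id") for m in index.get("block_deletion_marks") or []}
    status = {}
    for block_id in block_ids:
        if block_id in marked:
            status[block_id] = "marked for deletion"
        elif block_id in listed:
            status[block_id] = "listed"
        else:
            status[block_id] = "missing"
    return status


# Example call, once the index has been copied out of the bucket:
#   index = load_bucket_index("bucket-index.json.gz")
#   print(block_status(index, ["01G9N647CH52CN3K73WG453P8Q"]))
```

A block reported as "missing" here would not be visible to queriers through the index, which narrows the problem down to the compactor's index update rather than the query path.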

7 replies

mari-arondeus Aug 8, 2022
Author

OK, I'm starting to think something larger is at play here. I've checked a number of blocks that - as far as I know - are working fine, and they all report the same deletion:

sh-4.4# mc ls --versions minio/mimir/anonymous/01G9THFBKCC58WYP9X1BRCBZT1
[2022-08-07 10:17:23 UTC]     0B STANDARD 0157bad9-76bf-4ec2-a631-9bd766d2d5a0 v2 DEL deletion-mark.json
[2022-08-06 22:10:31 UTC]   112B STANDARD 9ad67d6d-2eaa-4249-a1e5-25210be90abe v1 PUT deletion-mark.json
[2022-08-07 10:17:23 UTC]     0B STANDARD 91362ecc-14df-4db9-8cca-3b39dbad5c3e v2 DEL index
[2022-08-06 21:37:43 UTC] 347KiB STANDARD b4dd0069-97e2-4e2a-b507-332e2885364e v1 PUT index
[2022-08-07 10:17:23 UTC]     0B STANDARD 62e5d33b-0471-4fe5-9cf8-074fc7e960e8 v2 DEL meta.json
[2022-08-06 21:37:43 UTC]   797B STANDARD a738b31f-b3d8-4e61-8177-8fae2b67c1a6 v1 PUT meta.json
[2022-08-08 17:31:50 UTC]     0B chunks/
sh-4.4# mc ls --versions minio/mimir/anonymous/01G9THFBKCC58WYP9X1BRCBZT1/chunks/
[2022-08-07 10:17:23 UTC]     0B STANDARD ce148c53-76d4-4dc1-8b20-aab96399ba1b v2 DEL 000001
[2022-08-06 21:37:43 UTC] 739KiB STANDARD c7f2148e-8eae-4f0c-a9b7-98905058d455 v1 PUT 000001

now I'm even more confused...
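
One way to read a listing like this: for each key, look at the newest version, and if it is a DEL entry the object is currently deleted and only survives as an older version. The sketch below does that for pasted `mc ls --versions` output. It assumes the column layout shown above and that a higher vN means a newer version; the function names are illustrative.

```python
import re

# [timestamp] size storage-class version-id vN PUT|DEL key
LINE_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\]\s+(?P<size>\S+)\s+(?P<tier>\S+)\s+"
    r"(?P<version_id>\S+)\s+v(?P<n>\d+)\s+(?P<kind>PUT|DEL)\s+(?P<key>.+)$"
)


def latest_versions(listing):
    """Return {key: (version number, kind)} for the newest version of each key."""
    latest = {}
    for line in listing.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # shell prompts, directory entries, blank lines
        n = int(m.group("n"))
        key = m.group("key")
        if key not in latest or n > latest[key][0]:
            latest[key] = (n, m.group("kind"))
    return latest


def deleted_keys(listing):
    """Keys whose newest version is a delete marker (DEL)."""
    return sorted(
        key for key, (_, kind) in latest_versions(listing).items() if kind == "DEL"
    )


listing = """\
[2022-08-07 10:17:23 UTC]     0B STANDARD 0157bad9-76bf-4ec2-a631-9bd766d2d5a0 v2 DEL deletion-mark.json
[2022-08-06 22:10:31 UTC]   112B STANDARD 9ad67d6d-2eaa-4249-a1e5-25210be90abe v1 PUT deletion-mark.json
[2022-08-07 10:17:23 UTC]     0B STANDARD 91362ecc-14df-4db9-8cca-3b39dbad5c3e v2 DEL index
[2022-08-06 21:37:43 UTC] 347KiB STANDARD b4dd0069-97e2-4e2a-b507-332e2885364e v1 PUT index
[2022-08-07 10:17:23 UTC]     0B STANDARD 62e5d33b-0471-4fe5-9cf8-074fc7e960e8 v2 DEL meta.json
[2022-08-06 21:37:43 UTC]   797B STANDARD a738b31f-b3d8-4e61-8177-8fae2b67c1a6 v1 PUT meta.json
[2022-08-08 17:31:50 UTC]     0B chunks/
"""

print(deleted_keys(listing))  # ['deletion-mark.json', 'index', 'meta.json']
```

In this listing every file, including deletion-mark.json itself, received its delete marker at the same moment, roughly 12 hours after deletion-mark.json was first written.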

mari-arondeus Aug 8, 2022
Author

I got some more information that might be useful. When searching Mimir logs for the string "delete", I noticed all these logs:

2022-08-07T04:30:40-04:00	level=info ts=2022-08-07T08:30:40.209018189Z caller=blocks_cleaner.go:398 component=cleaner user=anonymous msg="deleted block marked for deletion" block=01G9TBTAX1S33F4F1H3X89TRKQ
2022-08-07T04:30:40-04:00	level=info ts=2022-08-07T08:30:40.206506187Z caller=blocks_cleaner.go:398 component=cleaner user=anonymous msg="deleted block marked for deletion" block=01G9T9B9XF7NAZFR3X6JCZ4RRX
2022-08-07T04:30:40-04:00	level=info ts=2022-08-07T08:30:40.204620354Z caller=blocks_cleaner.go:398 component=cleaner user=anonymous msg="deleted block marked for deletion" block=01G9TBTAX1Q1FFTS89MFKR7A3A
2022-08-07T04:30:40-04:00	level=info ts=2022-08-07T08:30:40.204135691Z caller=blocks_cleaner.go:398 component=cleaner user=anonymous msg="deleted block marked for deletion" block=01G9TBTAX1XWCXNX3G0PKNEDY6

(etc) Explore-logs-2022-08-08 13 45 46.txt

I previously assumed these log lines were generated whenever the local cache was cleared - not that these were actually events where Mimir was deleting blocks from object storage. This was just over the last 48hrs, but all the blocks I mentioned regarding the data loss event on the 6th were mentioned in these logs. Is it possible Mimir is misconfigured and has been deleting blocks from object storage accidentally?

I'm including my Mimir configuration again, in case it has something damning in it.
mimir.txt

pracucci Aug 11, 2022
Maintainer

Let me take a step back.

The compactor is responsible for compacting smaller blocks into larger ones. Blocks are immutable, so they're not edited in place; new ones are created instead. The way the compactor works is: download the source blocks to compact, compact them locally (on local disk), upload the bigger compacted block(s) to the object storage and finally delete the source blocks from the object storage (because all data of the source blocks is available in the new compacted block(s)).

To sum up, the compactor is expected to delete blocks as part of its normal operations. The deletion is not immediate: the compactor uses a soft-deletion strategy, where blocks to delete are first "marked for deletion" and then, after some time, deleted for real from the object storage. Whenever a block is "marked for deletion", the compactor logs "block has been marked for deletion"; when the block is deleted for real, it logs "deleted block marked for deletion".

Getting back to your issue, to better understand what's going on I suggest we focus on 1 block only. For example, in the original message you reported that a query failed the consistency check because the block "01G9N647CH52CN3K73WG453P8Q" was not queried. If possible, I would like to see all logs related to the block "01G9N647CH52CN3K73WG453P8Q". I don't know if you're collecting your logs, but would you be able to give me all logs containing the string "01G9N647CH52CN3K73WG453P8Q", starting from 1 day before the issue with that block began?
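
A small aid for picking the log window: Mimir block IDs are ULIDs, and the first 10 characters of a ULID encode its creation time in milliseconds since the Unix epoch, written in Crockford base32. The sketch below decodes that timestamp; the helper names are illustrative and not part of any Mimir tooling.

```python
from datetime import datetime, timedelta, timezone

# Crockford base32 alphabet used by ULIDs (no I, L, O or U).
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_timestamp_ms(ulid):
    """Decode the 48-bit millisecond timestamp from the first 10 characters."""
    if len(ulid) != 26:
        raise ValueError("a ULID has 26 characters")
    value = 0
    for ch in ulid[:10].upper():
        value = value * 32 + CROCKFORD.index(ch)
    return value


def ulid_time(ulid):
    """Return the ULID's creation time as a timezone-aware UTC datetime."""
    ms = ulid_timestamp_ms(ulid)
    # Integer arithmetic keeps the millisecond part exact.
    base = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    return base + timedelta(milliseconds=ms % 1000)


created = ulid_time("01G9N647CH52CN3K73WG453P8Q")
print(created.isoformat(timespec="milliseconds"))
# 2022-08-04T19:43:11.249+00:00
```

Searching logs from a day before that timestamp onward should cover the block's whole lifetime, from creation through any later deletion.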

mari-arondeus Aug 11, 2022
Author

I really appreciate you explaining to me how Mimir treats compaction - I realize this is an open-source project, but it's still very kind of you to take the time to educate me on this. I think your approach - focusing on a single block for troubleshooting - makes the most sense, and I was likely starting to overthink things since my understanding of Mimir is relatively limited. Here are all the logs matching "01G9N647CH52CN3K73WG453P8Q", pulled via Grafana Loki:

2022-08-05 08:18:24 level=warn ts=2022-08-05T12:18:24.737228401Z caller=logging.go:72 traceID=6d26558b70d68e1e msg="POST /prometheus/api/v1/query_range (500) 540.161287ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"errorType\\\":\\\"internal\\\",\\\"error\\\":\\\"expanding series: the consistency check failed because some blocks were not queried (err-mimir-store-consistency-check-failed). The non-queried blocks are: 01G9N57YV4ETEFA6WDBGD7ZNMC 01G9N3PJWF12K2TQYQNAP42645 01G9N57ZX2Z0FTFKSZZM6H1SWA 01G9N3PYWQ50CPYGP7DVQNCFG0 01G9N647CH52CN3K73WG453P8Q 01G9N3PMRR01R8Y1RYKFCBR04A\\\"}\" ws: false; Accept-Encoding: gzip; Content-Length: 88; Content-Type: application/x-www-form-urlencoded; User-Agent: Grafana/9.0.5; X-Forwarded-For: 10.100.1.18; X-Forwarded-Port: 3200; X-Forwarded-Proto: https; "
2022-08-05 08:18:24 level=error ts=2022-08-05T12:18:24.73695721Z caller=retry.go:78 user=anonymous msg="error processing request" try=4 err="rpc error: code = Code(500) desc = {\"status\":\"error\",\"errorType\":\"internal\",\"error\":\"expanding series: the consistency check failed because some blocks were not queried (err-mimir-store-consistency-check-failed). The non-queried blocks are: 01G9N57YV4ETEFA6WDBGD7ZNMC 01G9N3PJWF12K2TQYQNAP42645 01G9N57ZX2Z0FTFKSZZM6H1SWA 01G9N3PYWQ50CPYGP7DVQNCFG0 01G9N647CH52CN3K73WG453P8Q 01G9N3PMRR01R8Y1RYKFCBR04A\"}"
2022-08-05 08:16:00 level=warn ts=2022-08-05T12:16:00.81350076Z caller=bucket.go:394 user=anonymous msg="loading block failed" elapsed=13.100386ms id=01G9N647CH52CN3K73WG453P8Q err="create index header reader: write index header: new index reader: get object attributes of 01G9N647CH52CN3K73WG453P8Q/index: The specified key does not exist."
2022-08-05 03:43:49 level=info ts=2022-08-05T07:43:49.030360979Z caller=blocks_cleaner.go:398 component=cleaner user=anonymous msg="deleted block marked for deletion" block=01G9N647CH52CN3K73WG453P8Q
2022-08-04 15:43:14 level=info ts=2022-08-04T19:43:14.516445536Z caller=block.go:203 component=compactor user=anonymous groupKey=0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000 minTime="2022-08-04 16:00:00.241 +0000 UTC" maxTime="2022-08-04 18:00:00 +0000 UTC" msg="block has been marked for deletion" block=01G9N647CH52CN3K73WG453P8Q
2022-08-04 15:43:14 level=info ts=2022-08-04T19:43:14.465270149Z caller=bucket_compactor.go:590 component=compactor user=anonymous groupKey=0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000 minTime="2022-08-04 16:00:00.241 +0000 UTC" maxTime="2022-08-04 18:00:00 +0000 UTC" msg="marking compacted block for deletion" old_block=01G9N647CH52CN3K73WG453P8Q
2022-08-04 15:43:14 level=info ts=2022-08-04T19:43:14.294341896Z caller=bucket_compactor.go:390 component=compactor user=anonymous groupKey=0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000 minTime="2022-08-04 16:00:00.241 +0000 UTC" maxTime="2022-08-04 18:00:00 +0000 UTC" msg="compacted blocks" new=[01G9N64A9WMGA02KCAE9652JG6] blocks="[data-compactor/compact/0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000/01G9N57YV4ETEFA6WDBGD7ZNMC data-compactor/compact/0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000/01G9N57ZX2Z0FTFKSZZM6H1SWA data-compactor/compact/0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000/01G9N647CH52CN3K73WG453P8Q]" duration=59.494154ms duration_ms=59
2022-08-04 15:43:14 level=info ts=2022-08-04T19:43:14.294070037Z caller=compact.go:510 component=compactor msg="compact blocks" count=3 mint=1659628800241 maxt=1659636000000 ulid=01G9N64A9WMGA02KCAE9652JG6 sources="[01G9N57YV4ETEFA6WDBGD7ZNMC 01G9N57ZX2Z0FTFKSZZM6H1SWA 01G9N647CH52CN3K73WG453P8Q]" duration=59.214495ms shard=1_of_1
2022-08-04 15:43:14 level=info ts=2022-08-04T19:43:14.23478491Z caller=bucket_compactor.go:360 component=compactor user=anonymous groupKey=0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000 minTime="2022-08-04 16:00:00.241 +0000 UTC" maxTime="2022-08-04 18:00:00 +0000 UTC" msg="downloaded and verified blocks; compacting blocks" blocks=3 plan="[data-compactor/compact/0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000/01G9N57YV4ETEFA6WDBGD7ZNMC data-compactor/compact/0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000/01G9N57ZX2Z0FTFKSZZM6H1SWA data-compactor/compact/0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000/01G9N647CH52CN3K73WG453P8Q]" duration=89.466847ms duration_ms=89
2022-08-04 15:43:14 level=info ts=2022-08-04T19:43:14.145288316Z caller=bucket_compactor.go:312 component=compactor user=anonymous groupKey=0@17241709254077376921-merge-6_of_8-1659628800000-1659636000000 minTime="2022-08-04 16:00:00.241 +0000 UTC" maxTime="2022-08-04 18:00:00 +0000 UTC" msg="compaction available and planned; downloading blocks" blocks=3 plan="[01G9N57YV4ETEFA6WDBGD7ZNMC (min time: 1659628800241, max time: 1659636000000) 01G9N57ZX2Z0FTFKSZZM6H1SWA (min time: 1659628800241, max time: 1659636000000) 01G9N647CH52CN3K73WG453P8Q (min time: 1659628800241, max time: 1659636000000)]"
2022-08-04 15:43:12 level=info ts=2022-08-04T19:43:12.12230298Z caller=bucket_compactor.go:435 component=compactor user=anonymous groupKey=0@17241709254077376921-split-4_of_4-1659628800000-1659636000000 minTime="2022-08-04 16:00:00.241 +0000 UTC" maxTime="2022-08-04 18:00:00 +0000 UTC" msg="uploaded block" result_block=01G9N647CH52CN3K73WG453P8Q duration=96.670282ms duration_ms=96 external_labels="{__compactor_shard_id__=\"6_of_8\"}"
2022-08-04 15:43:12 level=info ts=2022-08-04T19:43:12.020888722Z caller=bucket_compactor.go:390 component=compactor user=anonymous groupKey=0@17241709254077376921-split-4_of_4-1659628800000-1659636000000 minTime="2022-08-04 16:00:00.241 +0000 UTC" maxTime="2022-08-04 18:00:00 +0000 UTC" msg="compacted blocks" new="[01G9N647CHPVRE1246AYPAX220 01G9N647CHN2KRQ00TBBPE0ZC1 01G9N647CH5F6AMPKWX49Z1200 01G9N647CH486A5ZE9V81CA1EW 01G9N647CHSN1AA7A9X7Z8V8Q4 01G9N647CH52CN3K73WG453P8Q 01G9N647CHSSSPPN2FATWZWT1H 01G9N647CHF81RS572WPQM8Y7N]" blocks=[data-compactor/compact/0@17241709254077376921-split-4_of_4-1659628800000-1659636000000/01G9N3PYWQ50CPYGP7DVQNCFG0] duration=772.007514ms duration_ms=772
2022-08-04 15:43:12 level=info ts=2022-08-04T19:43:12.020386985Z caller=compact.go:510 component=compactor msg="compact blocks" count=1 mint=1659628800241 maxt=1659636000000 ulid=01G9N647CH52CN3K73WG453P8Q sources=[01G9N3PYWQ50CPYGP7DVQNCFG0] duration=771.516402ms shard=6_of_8

I went ahead and just pasted these as a code block since there isn't much to look at. I deduped the final warnings and errors since they happened after block deletion (and there were a lot of them). Also, it's worth noting that the timestamps at the beginning are in US Eastern time (UTC-4 here, i.e. EDT) and the 'ts' logfmt label is in UTC. We're located on the US East Coast, so we use Eastern time for most everything. MinIO uses UTC for everything, so of course that's one extra thing to keep track of, haha.
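
The soft-deletion sequence is visible in these logs: one line says the block was marked for deletion and a later one says it was deleted. A minimal sketch for measuring the gap between the two, using the UTC 'ts' label rather than the local prefix, could look like this. The function names are illustrative, and the two sample lines are shortened copies of the ones above.

```python
import re
from datetime import datetime, timezone

# Capture the 'ts=' label down to whole seconds; fractional digits are ignored.
TS_RE = re.compile(r"\bts=(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")
MARKED = 'msg="block has been marked for deletion"'
DELETED = 'msg="deleted block marked for deletion"'


def parse_ts(line):
    """Return the UTC 'ts=' timestamp of a logfmt line, truncated to seconds."""
    m = TS_RE.search(line)
    if m is None:
        return None
    parsed = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def deletion_timeline(lines, block_id):
    """Find when block_id was marked for deletion and when it was deleted."""
    events = {}
    for line in lines:
        if f"block={block_id}" not in line:
            continue
        ts = parse_ts(line)
        if ts is None:
            continue
        if MARKED in line:
            events["marked"] = ts
        elif DELETED in line:
            events["deleted"] = ts
    return events


logs = [
    'level=info ts=2022-08-05T07:43:49.030360979Z caller=blocks_cleaner.go:398 '
    'component=cleaner user=anonymous msg="deleted block marked for deletion" '
    'block=01G9N647CH52CN3K73WG453P8Q',
    'level=info ts=2022-08-04T19:43:14.516445536Z caller=block.go:203 '
    'component=compactor user=anonymous msg="block has been marked for deletion" '
    'block=01G9N647CH52CN3K73WG453P8Q',
]

events = deletion_timeline(logs, "01G9N647CH52CN3K73WG453P8Q")
print(events["deleted"] - events["marked"])  # 12:00:35
```

A gap of about 12 hours between the mark and the real deletion would be consistent with a compactor deletion delay of 12h (the -compactor.deletion-delay setting, which I believe defaults to that value, though it is worth confirming in your own config).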

mari-arondeus Aug 28, 2022
Author

So... I think I may know at least part of the issue. It's not a whole solution, but it explains why the Mimir storage bucket I was using was growing ridiculously large; I said it myself: versioning. I didn't fully understand how Mimir handles storage and compaction (I assumed it was just like Loki with boltdb-shipper), so I had object versioning enabled for my Mimir bucket. Of course, Mimir kept nuking blocks and PUTing them back after compaction. This created a huge number of partial and duplicate blocks due to the object versioning, which explains the tremendous bucket size! That's a pretty massive "oops" on my part, and a good reminder to RTFM.

I'm still not sure what caused block deletion (the reason I opened this ticket/discussion), but I'll close it out for now since I don't want y'all to have open tickets with no way for me to recreate the issue. I've nuked the whole Mimir bucket we were running since the data wasn't prod-ready yet anyway, and it's been running fresh for about 4 days with no issues so far. If I have any more issues by the end of September, I'll be sure to update this.

Thank you again for your guidance on this topic. I'm learning more and more about Mimir (and the LGTM stack in general) as we expand our use-case, so I appreciate your willingness to stick with it and help us out. Have a fantastic rest of your weekend!

deajan · 2023-12-12T19:52:50Z

deajan
Dec 12, 2023

I'm in exactly the same situation as @mari-arondeus.
I had versioning enabled in Minio (my bad), and ended up getting the following message from Minio when Mimir tried to upload objects:
maximum versions exceeded, please delete few versions to proceed

This went on for a full weekend, so it was only on Monday that I realized I couldn't store data from Mimir in Minio because of this error.
I've managed to clean up the versions with minio's mc tool using the following commands, which delete all previous versions of an object:

# Disable versioning on bucket
mc version suspend <alias>/<mimir_blocks_bucket>
# Remove versions, /!\ this command has --dry-run for testing. Remove this option for actual deletion
mc rm -r --non-current --versions --force --dry-run <alias>/<mimir_blocks_bucket>
mc

So far so good: mimir could again store data in my minio S3 bucket.
My problem now is that I've lost some blocks, which are outside the TSDB retention span.
Since then, Grafana of course cannot query that weekend, which is to be expected.

The real problem is that any Grafana query that includes the "weekend of death" fails.
Is there any way to tell mimir that those blocks just don't exist, and have it return "no data" or something, e.g. by reconstructing the indices?

Sorry if I'm kind of misunderstanding this here.

2 replies

mari-arondeus Dec 12, 2023
Author

Modifying Loki, Tempo, & Mimir to allow for unreliable object storage backends seems useful given how many folks are using Grafana products with a 3rd party backend. IMHO, better to let the user specifically opt in via config options.

deajan Dec 12, 2023

Hmmm... It always can happen that a file gets "badly written", or "lost in space".
In my case, I have minio running on a ZFS raid pool which could heal disk issues, but of course not logical "out of disk space" or "won't write because of X" errors.

Modifying Loki, Tempo, & Mimir to allow for unreliable object storage backends seems useful

Is that a suggestion for grafana teams or does this option actually exist ?

From https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#err-mimir-store-consistency-check-failed

Mimir has been designed to guarantee query results correctness and never return partial query results. Either a query succeeds returning fully consistent results or it fails.

Sure, but then again, there can be "holes" in results when no data is present.

IMO there should be something to rewrite the indices of the storage backend, so missing chunks just would end up returning "no data".

@pracucci Sorry to bug you, does something like the above solutions exist? Or any other solution to deal with permanently missing chunks? Thank you.

deajan · 2023-12-12T20:16:44Z

deajan
Dec 12, 2023

Also, how can I increase the 'local' mimir temporary storage to let's say a week in order to prevent such issues ?

Is setting the following sufficient in mimir's config file ?

tsdb:
  retention_period: 7d

I've grepped for a single missing block in the mimir logs; this is the first error message I got for this block (which obviously couldn't be written to minio because of the versioning limit):

déc. 12 11:12:05 mimir01p.local mimir[992]: ts=2023-12-12T10:12:05.580264163Z caller=bucket.go:388 level=warn user=anonymous msg="loading block failed" elapsed=15.01128ms id=01HH4KPB66X1X43ENMFXKJHJT2 err="create index header reader: write index header: new index reader: get object attributes of 01HH4KPB66X1X43ENMFXKJHJT2/index: The specified key does not exist.

Last but not least, sorry to piggyback this discussion ;) Hopefully my issue is identical enough for this to make sense.

2 replies

dimitarvdimitrov Dec 13, 2023
Maintainer

Is setting the following sufficient in mimir's config file ?

also consider setting -querier.query-ingesters-within=7d. This may significantly increase the resource usage of ingesters and queriers. I'd advise keeping an eye on those when making this change.

deajan Dec 13, 2023

Thank you ;)

deajan · 2023-12-12T21:07:32Z

deajan
Dec 12, 2023

Okay... Here's a big WTF moment for me...
My minio instance runs as minio-user.
Just to rule out any problems, I checked my SELinux labels and permissions on all mimir data in the minio file system.
I ended up running chown minio-user:minio-user /path/to/my/mimir/data/bucket
Now my queries work again, even those on my "weekend of death" period.

Unless I became Mr Hyde, I'm pretty sure that I didn't do any file operations in the minio bucket, so I have no explanation as to why there would be any files not owned by my minio service user.
Perhaps the mc command to deal with versions which I ran as root ? Since this basically interacts with the minio service, all file operations should be carried out as minio-user.

Anyway, glad I got everything to work.

As side questions:
If the storage backend goes offline for too long, will the local mimir TSDB hold the data until the storage backend is reachable again ? For how much time ?
What would happen if somehow a chunk would really go missing ? Can mimir reindex the storage to "exclude" the chunk ?

Any insight would be appreciated, and again, sorry if I ask a lot of questions.

3 replies

dimitarvdimitrov Dec 13, 2023
Maintainer

If the storage backend goes offline for too long, will the local mimir TSDB hold the data until the storage backend is reachable again ? For how much time ?

Yes, ingesters will hold onto data until it's shipped to object storage.

What would happen if somehow a chunk would really go missing ? Can mimir reindex the storage to "exclude" the chunk ?

depends on what you mean by chunk. A whole block can go missing and mimir will recover, but individual files within the block can't.

deajan Dec 13, 2023

Yes, ingesters will hold onto data until it's shipped to object storage.

Even if it goes beyond the tsdb.retention_period settings ?

depends on what you mean by chunk. A whole block can go missing and mimir will recover, but individual files within the block can't.

So what happens if a file "goes bad/missing" after a bad fsck/power outage/disk failure/whatever.
AFAIK mimir won't be able to query over any period including the block which has missing files.
So the solution would be to delete the block manually maybe ? If so, any tooling for that perhaps ? Or just check the logs, get the foldername and delete the whole block ?

dimitarvdimitrov Dec 14, 2023
Maintainer

Even if it goes beyond the tsdb.retention_period settings ?

yes, from the docs:

retention_period: TSDB blocks retention in the ingester before a block is removed. If shipping is enabled, the retention will be relative to the time when the block was uploaded to storage. If shipping is disabled then it's relative to the creation time of the block. This should be larger than the -blocks-storage.tsdb.block-ranges-period, -querier.query-store-after and large enough to give store-gateways and queriers enough time to discover newly uploaded blocks.

AFAIK mimir won't be able to query over any period including the block which has missing files.

that's correct until the block has been cleaned up

So the solution would be to delete the block manually maybe ? If so, any tooling for that perhaps ?

you can use markblocks to mark the block for deletion. The query path should recover within 20-30 minutes after that (using default config durations)

Or just check the logs, get the foldername and delete the whole block ?

that will also work just ok; same ~20-30 minute cleanup time applies

deajan · 2024-11-04T19:12:43Z

deajan
Nov 4, 2024

Here's another resolution I've found for this problem.
While upgrading from mimir 2.13 to 2.14, my mimir instance would return errors like

err="create index header reader: write index header: new index reader: get object attributes of 01JATMDP3XVJGDC2FSKDQ7A6V8/index: The specified key does not exist."

or

err="expanding series: error querying tenant_id sometenant: failed to fetch some blocks (err-mimir-store-consistency-check-failed). The failed blocks are: [...]"

I've tried to downgrade from 2.14 to 2.13 without success.
Having tried various things, I upgraded to 2.14.1 which was supposed to resolve some S3 issues (I use minio as s3 backend).
That didn't help. I also tried removing all specific configs from my mimir config.yml file, without success.

My suspicion was that at some point mimir 2.14 did not properly write or delete data in the S3 bucket.

While checking my data, I noticed that although I have versioning disabled on minio, I had a lot of versions of files marked for deletion.
I did the following on my minio server:

mc rm -r --non-current --versions --force <alias>/<mimir blocks bucket>

Then, navigating into the real path where minio stores the block files (as minio-user):

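# chmod first: once ownership is fixed, the "! -user minio-user" filter no longer matches anything.
# Changing files owned by another user typically needs root.
find /path/to/minio-blocks-bucket -type f ! -user minio-user -exec chmod 0640 {} \;
find /path/to/minio-blocks-bucket -type d ! -user minio-user -exec chmod 0750 {} \;
find /path/to/minio-blocks-bucket ! -user minio-user -exec chown minio-user {} \;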

After these commands I restarted mimir, waited 30 min for next compactor cleanup to run, and voilà.

Hope this helps anyone ;)
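
Before changing ownership or permissions in bulk, it can be useful to list what would be touched. The sketch below is a read-only Python counterpart to `find ... ! -user minio-user`: it walks a directory and reports entries whose owner differs from a given uid. The function name and command-line handling are made up for illustration, and the path and user name are whatever applies to your setup.

```python
import os
import pwd
import sys


def not_owned_by(root, uid):
    """Yield (path, owner uid) for root and every entry under it not owned by uid."""
    st = os.lstat(root)
    if st.st_uid != uid:
        yield root, st.st_uid
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: report symlinks themselves, not targets
            if st.st_uid != uid:
                yield path, st.st_uid


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_owner.py /path/to/minio-blocks-bucket minio-user
    bucket_path, username = sys.argv[1], sys.argv[2]
    wanted_uid = pwd.getpwnam(username).pw_uid
    for path, owner in not_owned_by(bucket_path, wanted_uid):
        print(f"{path}\towner uid={owner}")
```

An empty report means ownership is already consistent, in which case the cause of the missing-key errors probably lies elsewhere.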

0 replies
