
Compactor: upgrading to Prometheus 2.47.0 breaks the compactor #6723

Open
FlorisFeddema opened this issue Sep 15, 2023 · 25 comments

@FlorisFeddema

FlorisFeddema commented Sep 15, 2023

Thanos, Prometheus and Golang version used:
Prometheus: 2.47.0
Thanos: 0.32.2

Object Storage Provider:
Azure Blob

What happened:
After upgrading Prometheus to version 2.47.0 the compactor stopped working. All blocks created after the upgrade have out-of-order chunks.
The compactor fails when it reaches one of these new blocks with the error:

err="compaction: group 0@6061977523826161203: blocks with out-of-order chunks are dropped from compaction:  /data/compact/0@6061977523826161203/01HA7QRGQDB0Z4SV0BM2S4687R: 1157/361848 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"

What you expected to happen:
The compactor succeeds in compacting the blocks.

How to reproduce it (as minimally and precisely as possible):
Run Prometheus 2.47.0 with the compactor enabled.

Full logs to relevant components:

k exec --stdin --tty prometheus-stack-thanos-storegateway-0 -- thanos tools bucket verify --objstore.config-file=/conf/objstore.yml

ts=2023-09-15T08:01:10.601379049Z caller=index_issue.go:61 level=warn verifiers=overlapped_blocks,index_known_issues verifier=index_known_issues msg="detected issue" id=01HA90YWB92D35CFAS57RQESJ7 err="538/230171 series have an average of 1.186 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
ts=2023-09-15T08:01:12.606537326Z caller=index_issue.go:61 level=warn verifiers=overlapped_blocks,index_known_issues verifier=index_known_issues msg="detected issue" id=01HA8CBPM19NR869TN69ANARQV err="674/229044 series have an average of 1.221 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
ts=2023-09-15T08:01:14.526769349Z caller=index_issue.go:61 level=warn verifiers=overlapped_blocks,index_known_issues verifier=index_known_issues msg="detected issue" id=01HA8K7DVGHQJ04JTEETER4TP6 err="596/228614 series have an average of 1.215 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"

Anything else we need to know:
After reverting to 2.46.0, the newly created blocks no longer give the error, so the problem appears to be solved.

Already discussed this problem on the CNCF slack: https://cloud-native.slack.com/archives/CK5RSSC10/p1694681247238809
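
As a side note for anyone triaging: the bucket verify run above can, I believe, be narrowed to the specific block named in the compactor error. This is only a sketch; the --issues and --id flags are assumptions on my part, so confirm them against thanos tools bucket verify --help for your version.

# Sketch only: verify a single flagged block with just the index_known_issues check
# (flag names assumed, block ID taken from the compactor error above).
k exec --stdin --tty prometheus-stack-thanos-storegateway-0 -- thanos tools bucket verify \
  --objstore.config-file=/conf/objstore.yml \
  --issues=index_known_issues \
  --id=01HA7QRGQDB0Z4SV0BM2S4687R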

@cablespaghetti

We've been on Prometheus 2.47 and Thanos 0.32 since yesterday and have not yet experienced this issue; I've just checked our logs. We are, however, trying to figure out why the compactor is suddenly using a lot of disk I/O and running constantly.

@FlorisFeddema
Author

FlorisFeddema commented Sep 15, 2023

There might be a difference between the flags we add to the compactor.

We run the compactor with these:

--log.level=info
--log.format=logfmt
--http-address=0.0.0.0:10902
--data-dir=/data
--retention.resolution-raw=90d
--retention.resolution-5m=90d
--retention.resolution-1h=90d
--consistency-delay=30m
--objstore.config-file=/conf/objstore.yml
--compact.enable-vertical-compaction
--deduplication.replica-label=prometheus_replica
--deduplication.func=penalty
--delete-delay=6h
--downsampling.disable

@skpenpen

skpenpen commented Sep 18, 2023

(Quoting the original issue report above.)

We are also on 2.47.0 and the compactor shows the same error messages on one Prometheus (but not on another):

ts=2023-09-15T14:16:11.347496243Z caller=main.go:161 level=error err="group 0@5028037766798890583: blocks with out-of-order chunks are dropped from compaction: /thanos/compact/compact/0@5028037766798890583/01HA4QQB074HFBBW6E48KFQA52: 20/8072687 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)

The Prometheus with the errors does federation; the one that works does not.
We rolled back to the LTS release and the problem disappeared.
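
For readers unfamiliar with the setup being described: federation here means this Prometheus scrapes another Prometheus's /federate endpoint. A typical job looks roughly like the sketch below; the job name, selector, and target are placeholders, not taken from this setup.

# Illustrative federation scrape job (all names and selectors are placeholders).
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'
    static_configs:
      - targets: ["source-prometheus:9090"]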

@kumode

kumode commented Sep 19, 2023

When you rolled back the versions, did you also delete the problematic chunks, which halted the compaction?

@skpenpen

When you rolled back the versions, did you also delete the problematic chunks, which halted the compaction?

We run the compactor with "--compact.skip-block-with-out-of-order-chunks", so the affected blocks are marked as no-compaction; we did not delete anything, to prevent data loss.

It's a rollback, not a solution :)
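
For reference, the mitigation described above is just an additional compactor flag. A minimal sketch, assuming the flag list posted earlier in this thread and adding the skip flag, might look like this (paths and other flags are illustrative):

# Sketch: compactor invocation that marks out-of-order blocks as no-compaction instead of halting.
# --wait keeps the compactor running continuously.
thanos compact \
  --data-dir=/data \
  --objstore.config-file=/conf/objstore.yml \
  --wait \
  --compact.skip-block-with-out-of-order-chunks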

@calvinbui

We've encountered the same problem on a few deployments using AWS S3.

@sylr
Contributor

sylr commented Sep 21, 2023

Got the same issue:

ts=2023-09-21T09:12:33.839849253Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2023-09-21T09:12:33.840674439Z caller=compact.go:393 level=info msg="retention policy of raw samples is enabled" duration=720h0m0s
ts=2023-09-21T09:12:33.840718066Z caller=compact.go:400 level=info msg="retention policy of 5 min aggregated samples is enabled" duration=2160h0m0s
ts=2023-09-21T09:12:33.840729154Z caller=compact.go:403 level=info msg="retention policy of 1 hour aggregated samples is enabled" duration=8760h0m0s
ts=2023-09-21T09:12:33.840743143Z caller=compact.go:643 level=info msg="starting compact node"
ts=2023-09-21T09:12:33.840758713Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2023-09-21T09:12:33.840845524Z caller=intrumentation.go:75 level=info msg="changing probe status" status=healthy
ts=2023-09-21T09:12:33.840866202Z caller=http.go:73 level=info service=http/server component=compact msg="listening for requests and metrics" address=0.0.0.0:10902
ts=2023-09-21T09:12:33.840877557Z caller=compact.go:1414 level=info msg="start sync of metas"
ts=2023-09-21T09:12:33.841054051Z caller=tls_config.go:274 level=info service=http/server component=compact msg="Listening on" address=[::]:10902
ts=2023-09-21T09:12:33.841081378Z caller=tls_config.go:277 level=info service=http/server component=compact msg="TLS is disabled." http2=false address=[::]:10902
ts=2023-09-21T09:12:38.314973248Z caller=fetcher.go:487 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=4.474026609s duration_ms=4474 cached=88 returned=52 partial=0
ts=2023-09-21T09:12:38.315015057Z caller=compact.go:1419 level=info msg="start of GC"
ts=2023-09-21T09:12:38.315167885Z caller=compact.go:1442 level=info msg="start of compactions"
ts=2023-09-21T09:12:38.315321271Z caller=compact.go:1062 level=info group="0@{env=\"infra\", k8s_cluster=\"eks-euc1-infra-01\", replica=\"1\"}" groupKey=0@6056763383787088098 msg="compaction available and planned" plan="[01HAS3YBZCWNQRHT3RVS43JXF2 (min time: 1695196800027, max time: 1695204000000) 01HASAT3Q01A5DZH3CKSSR0EJJ (min time: 1695204000024, max time: 1695211200000) 01HASHNTCDYPG96GQQ7RHK2T4V (min time: 1695211200000, max time: 1695218400000) 01HASRHJ8HB5ZD3WD2Y37K9A4Y (min time: 1695218400010, max time: 1695225600000)]"
ts=2023-09-21T09:12:38.315339484Z caller=compact.go:1071 level=info group="0@{env=\"infra\", k8s_cluster=\"eks-euc1-infra-01\", replica=\"1\"}" groupKey=0@6056763383787088098 msg="finished running pre compaction callback; downloading blocks" plan="[01HAS3YBZCWNQRHT3RVS43JXF2 (min time: 1695196800027, max time: 1695204000000) 01HASAT3Q01A5DZH3CKSSR0EJJ (min time: 1695204000024, max time: 1695211200000) 01HASHNTCDYPG96GQQ7RHK2T4V (min time: 1695211200000, max time: 1695218400000) 01HASRHJ8HB5ZD3WD2Y37K9A4Y (min time: 1695218400010, max time: 1695225600000)]" duration=4.307µs duration_ms=0
ts=2023-09-21T09:12:49.930839658Z caller=intrumentation.go:67 level=warn msg="changing probe status" status=not-ready reason="compaction: group 0@6056763383787088098: blocks with out-of-order chunks are dropped from compaction:  data/compact/0@6056763383787088098/01HASRHJ8HB5ZD3WD2Y37K9A4Y: 128/256707 series have an average of 1.125 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
ts=2023-09-21T09:12:49.930873487Z caller=http.go:91 level=info service=http/server component=compact msg="internal server is shutting down" err="compaction: group 0@6056763383787088098: blocks with out-of-order chunks are dropped from compaction:  data/compact/0@6056763383787088098/01HASRHJ8HB5ZD3WD2Y37K9A4Y: 128/256707 series have an average of 1.125 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
ts=2023-09-21T09:12:49.931001135Z caller=http.go:110 level=info service=http/server component=compact msg="internal server is shutdown gracefully" err="compaction: group 0@6056763383787088098: blocks with out-of-order chunks are dropped from compaction:  data/compact/0@6056763383787088098/01HASRHJ8HB5ZD3WD2Y37K9A4Y: 128/256707 series have an average of 1.125 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
ts=2023-09-21T09:12:49.931017837Z caller=intrumentation.go:81 level=info msg="changing probe status" status=not-healthy reason="compaction: group 0@6056763383787088098: blocks with out-of-order chunks are dropped from compaction:  data/compact/0@6056763383787088098/01HASRHJ8HB5ZD3WD2Y37K9A4Y: 128/256707 series have an average of 1.125 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
ts=2023-09-21T09:12:49.931099743Z caller=main.go:161 level=error err="group 0@6056763383787088098: blocks with out-of-order chunks are dropped from compaction:  data/compact/0@6056763383787088098/01HASRHJ8HB5ZD3WD2Y37K9A4Y: 128/256707 series have an average of 1.125 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)\ncompaction\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:427\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:476\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172\ncompact command failed\nmain.main\n\t/app/cmd/thanos/main.go:161\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_arm64.s:1172"

This happened after we upgraded our Prometheus instances to v2.47.0 and turned on native histograms.
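
For anyone comparing setups: native histograms are behind a Prometheus feature flag in 2.47, so "turned on native-histograms" typically means starting Prometheus with the flag below (a sketch; check the Prometheus feature-flag docs for your version):

# Enabling the (experimental) native histograms feature flag in Prometheus.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --enable-feature=native-histograms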

@skpenpen

We do not have "native-histograms" in our conf
@sylr Do you do federation on your prometheus ?

@sylr
Contributor

sylr commented Sep 22, 2023

We do not have "native-histograms" in our conf @sylr Do you do federation on your prometheus ?

No I do not do federation.

@federicopires

Same issue for us. We tried running the sidecar with Thanos v0.31, since we use the kube-prometheus-stack chart and thought the problem might be in the 0.32 sidecar that comes with the latest chart version, but that didn't help; the issue was still there.

The streams we get from the Thanos receiver are compacted without problems.

@sandeep-sidhu

Just encountered the same in our setup.

Thanos, Prometheus version used:
Prometheus: 2.47.0
Thanos: 0.32.2

object storage: s3

The Thanos compactor was failing due to out-of-order chunks. We tried the skip-block-with-out-of-order-chunks option, but it was hitting the issue with almost every chunk, so we rolled back Prometheus to 2.46.0.

@mickeyzzc
Contributor

mickeyzzc commented Sep 28, 2023

Same issue for me, so I added log prints to find out what the problem was and found that it had already been fixed in prometheus/prometheus#12874, but that fix hasn't been released yet.
(Screenshots attached in the original comment.)

@mickeyzzc
Contributor

I don't want to delete blocks because of this problem; I want a tool that fixes it, so I'm going to add a command to the thanos tool to handle it.
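
Until such a command exists, the closest tooling I'm aware of is the repair mode of thanos tools bucket verify; whether it can repair this particular out-of-order pattern is exactly what is in question here, so treat the following purely as a sketch (flag names assumed, and repairs generally want a backup bucket configured):

# Sketch only: attempt an automated repair of one flagged block, backing up originals first.
# Flag names are assumptions; confirm with: thanos tools bucket verify --help
thanos tools bucket verify \
  --objstore.config-file=/conf/objstore.yml \
  --objstore-backup.config-file=/conf/objstore-backup.yml \
  --issues=index_known_issues \
  --id=01HA7QRGQDB0Z4SV0BM2S4687R \
  --repair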

@sylr
Contributor

sylr commented Sep 28, 2023

Can we upgrade the Prometheus dependency without waiting for a release and tag a patch version of Thanos?

This is a serious issue that merits action.

Mimir already did the upgrade grafana/mimir#6107.

@yeya24
Contributor

yeya24 commented Sep 28, 2023

CC @saswatamcode. I guess we can do a v0.32.4 release for this fix (prometheus/prometheus#12874) and the previous fixes.

@saswatamcode
Member

saswatamcode commented Sep 28, 2023

Ack! I think the upgrade will be a bit more involved though; I'll do it on main first and cherry-pick for v0.32.4.

@yeya24
Contributor

yeya24 commented Sep 28, 2023

@saswatamcode After taking another look, this seems to be a Prometheus-only issue. Thanos is still using an older version of Prometheus, so it is not affected by this bug itself.

It is just the bad blocks created by Prometheus that cause the compactor to fail.

@mickeyzzc
Contributor

mickeyzzc commented Oct 3, 2023

@saswatamcode Is that the fix? I made some code changes in my branch and am testing them, but I'm on vacation here and will check the actual situation after the holiday. If it's fixed, I'll drop the changes I made.

@yeya24
Contributor

yeya24 commented Oct 3, 2023

@mickeyzzc No, since the bad blocks are created by Prometheus, we need a tool to repair the bad blocks. Thanos doesn't cause those blocks to be corrupted.

@mickeyzzc
Contributor

mickeyzzc commented Oct 10, 2023

@saswatamcode @yeya24 I'm testing in my environment, and the anomalous block passes after the fix.

@skpenpen

@saswatamcode @yeya24 I'm testing in my environment, and the anomalous block passes after the fix.

You mean the faulty blocks got compacted?

@mickeyzzc
Contributor

mickeyzzc commented Oct 10, 2023

@saswatamcode @yeya24 I'm testing in my environment, and the anomalous block passes after the fix.

You mean the faulty blocks got compacted?

It's possible in this case.
After rewriting, only two neighboring duplicates of the same sample point will remain. The downsampled samples will also be processed, but it is better to correct them with tools.

@andrejshapal

Hello,
This fix (prometheus-community/helm-charts#3877) does not fix the issue. The out-of-order log messages disappeared, but the problem stays the same: metrics are not pushed to the store.

@rouke-broersma

Hello,
This fix (prometheus-community/helm-charts#3877) does not fix the issue. The out-of-order log messages disappeared, but the problem stays the same: metrics are not pushed to the store.

Sounds like you have a different issue. The issue here was not that blocks are not pushed to the store but that the blocks in the store were out of order and could not be compacted.

@andrejshapal

@rouke-broersma You are right, my apologies. We assumed this issue resulted in the unavailability of data in storage, but as I see now, these are two different issues.
