Compactor can fail with "block with not healthy index found ... series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)" message #3569

Open
pstibrany opened this issue Dec 4, 2020 · 8 comments

@pstibrany
Contributor

The compactor can fail to compact a block with a message like this:

msg="failed to compact user blocks" err="compaction: group 0@8712473450002685162: block with not healthy index found /data/compact/0@8712473450002685162/01EJEXEW6XQ37G17Q4JH9M2KF1; Compaction level 1; Labels: map[__org_id__:...]: 1/457844 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"

When this happens, compaction for the given user will not continue, because the compactor will retry compacting this block over and over, failing each time.
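
For anyone scripting around this, the block ID and the series counts can be pulled out of the error text with a regular expression. The sketch below is only an illustration in Python: the pattern is derived from the message quoted in this issue, the function and field names are made up here, and the message format may differ between versions.

```python
import re

# Pattern derived from the error text quoted in this issue (an assumption:
# other versions may word the message differently).
ERR_RE = re.compile(
    r"block with not healthy index found (?P<path>\S+); "
    r"Compaction level (?P<level>\d+); .*?: "
    r"(?P<bad>\d+)/(?P<total>\d+) series have an average of "
    r"(?P<avg>[\d.]+) out-of-order chunks: "
    r"(?P<dups>[\d.]+) of these are exact duplicates"
)


def parse_unhealthy_block(line):
    """Return a dict describing the unhealthy block, or None if no match."""
    m = ERR_RE.search(line)
    if m is None:
        return None
    return {
        # The block ID is the last path component of the block directory.
        "block_id": m.group("path").rsplit("/", 1)[-1],
        "level": int(m.group("level")),
        "bad_series": int(m.group("bad")),
        "total_series": int(m.group("total")),
        "avg_out_of_order_chunks": float(m.group("avg")),
        "avg_exact_duplicates": float(m.group("dups")),
    }
```

The returned `block_id` is the value needed for any manual action on the block.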

Upon further investigation, this is a 2h block produced by an ingester. It's not clear why out-of-order chunks would be written. This is likely a bug in the Prometheus TSDB code.

Similar bugs in Thanos:

The workaround is to rename the block so that it's not included in the compaction.

@alvinlin123
Contributor

We recently hit this issue as well.

@pracucci
Contributor

pracucci commented Feb 4, 2021

> We recently hit this issue as well.

Could you paste the exact log error you've got?

@alvinlin123
Contributor

caller=compactor.go:450 component=compactor msg="failed to compact user blocks" user=<redacted> err="compaction: group 0@16811904347059316647: block with not healthy index found /data/compactor/compact/0@16811904347059316647/01EPTTT4B3FXVQ5X7WX5XZA13K; Compaction level 1; Labels: map[org_id:<redacted>]: 1/1000000 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"\n"

@alvinlin123
Contributor

alvinlin123 commented Feb 9, 2021

I am wondering, can this issue be caused by prometheus/prometheus#8055 ? Because #8055 seems to introduce out of order samples.

@pracucci
Contributor

> I am wondering, can this issue be caused by prometheus/prometheus#8055 ? Because #8055 seems to introduce out of order samples.

It shouldn't. What you got is chunks out of order within the same block, while the issue you linked is about 2 different blocks overlapping in time.
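
To make the distinction concrete, here is a rough Python sketch of what an out-of-order check within a single block could look like. It is an assumption based on the wording of the error message, not the actual Thanos or Cortex code: each series' chunks are walked in index order, and a chunk that starts at or before the end of the previous one counts either as an exact duplicate (same time range) or as out of order. The function name and the tuple representation are hypothetical.

```python
from typing import List, Tuple

# (min_time, max_time) of one chunk, both inclusive. Hypothetical representation.
Chunk = Tuple[int, int]


def count_out_of_order(chunks: List[Chunk]) -> Tuple[int, int]:
    """Return (out_of_order, exact_duplicates) for one series' chunks,
    given in the order they appear in the block index."""
    out_of_order = 0
    duplicates = 0
    for prev, cur in zip(chunks, chunks[1:]):
        if cur[0] > prev[1]:
            # Starts strictly after the previous chunk ends: correctly ordered.
            continue
        if cur == prev:
            # Same time range. A real check would also compare the chunk data.
            duplicates += 1
            continue
        # Partial overlap, or a chunk that goes back in time.
        out_of_order += 1
    return out_of_order, duplicates
```

Under this reading, a series is reported as unhealthy as soon as `out_of_order` is greater than zero, regardless of whether the block overlaps with any other block.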

@alvinlin123
Contributor

@pracucci and @pstibrany, for this issue, do you think it would be an improvement to change the compactor so that it does not halt the whole compaction process when a level 1 block is bad, given that there are replicas of that block?
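
As a purely hypothetical illustration of that proposal (this is not how the compactor is implemented, and all names are invented), the idea amounts to partitioning the blocks by health before compaction instead of aborting on the first unhealthy one:

```python
from typing import Callable, Iterable, List, Tuple


def partition_blocks(
    block_ids: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Split blocks into (to_compact, skipped) instead of failing the whole run.

    `is_healthy` stands in for whatever index health check is used; it is a
    placeholder for this sketch.
    """
    to_compact: List[str] = []
    skipped: List[str] = []
    for block_id in block_ids:
        if is_healthy(block_id):
            to_compact.append(block_id)
        else:
            # Remember the block so it can be logged or marked, but carry on.
            skipped.append(block_id)
    return to_compact, skipped
```

Whether skipping is safe depends on the replicas really covering the same data, which this sketch does not verify.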

@bubu11e

bubu11e commented Nov 19, 2021

Hi,

We also encountered this issue, with the following message on our compactor:

Nov 18 16:29:56 cortex-compactor-1 cortex[8363]: level=error ts=2021-11-18T16:29:56.584539428Z caller=compactor.go:531 component=compactor msg="failed to compact user blocks" user=fake err="compaction: group 0@5679675083797525161: block with not healthy index found /var/lib/cortex/data/compact/0@5679675083797525161/01FM53C7H2SR3N111QZMA3TK8P; Compaction level 1; Labels: map[org_id:fake]: 13/20459573 series have an average of 1.000 out-of-order chunks: 1.538 of these are exact duplicates (in terms of data and time range)"

If it is of any help, you can find the contents of the unhealthy block here: https://dl.plik.ovh/file/snWjOJ69TkrmdcUN/D8sbecutaDZBVAsN/01FM53C7H2SR3N111QZMA3TK8P.tar.gz

Regards,

Julien.

@friedrichg
Member

A better workaround for this issue is to mark the block as no-compact.

$ cat thanos.yml
type: S3
config:
  bucket: my-bucket
  endpoint: ...
prefix: tenant-id
$ thanos tools bucket mark --id=$BLOCK --marker=no-compact-mark.json --objstore.config-file=thanos.yml --details=buggy
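
If several blocks need marking, the same command can be generated per block. The following Python sketch only assembles the argument list using the flags shown above; the function name and default values are hypothetical, and actually running the result requires the thanos binary and a valid thanos.yml.

```python
from typing import List


def mark_no_compact_cmd(
    block_id: str,
    config_file: str = "thanos.yml",
    details: str = "buggy",
) -> List[str]:
    """Build the argv for marking one block as no-compact.

    Only the flags from the command quoted above are used.
    """
    return [
        "thanos", "tools", "bucket", "mark",
        f"--id={block_id}",
        "--marker=no-compact-mark.json",
        f"--objstore.config-file={config_file}",
        f"--details={details}",
    ]
```

Each returned list can then be passed to `subprocess.run(cmd, check=True)`, one block at a time.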
