
Failure to compact due to maximum index size 64 GiB #4705

Open

thejosephstevens opened this issue Apr 5, 2022 · 9 comments · Fixed by #4843 or #4882
Labels
keepalive Skipped by stale bot

Comments

@thejosephstevens

Describe the bug
Over the weekend we significantly expanded one of our clusters, pushing ~153M timeseries to our Cortex 1.11.1 cluster in a day.

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex 1.11.1 in single-tenant mode as microservices
  2. Push ~153M timeseries to it
  3. See the Compactor get to Stage 3 compaction and then fail with an error like this
level=error ts=2022-04-04T16:11:41.167458322Z caller=compactor.go:537 component=compactor msg="failed to compact user blocks" user=fake err="compaction: group 0@5679675083797525161: compact blocks [data/compact/0@5679675083797525161/01FZNK9ZNWHJWRXB2YNHZ9W8H9 data/compact/0@5679675083797525161/01FZPV9DZ1PGV0HHV9007P8ZWM]: \"data/compact/0@5679675083797525161/01FZTCHWRJHFN419FJPDM0MQNW.tmp-for-creation/index\" exceeding max size of 64GiB"

The two blocks referenced by this error are 12-hour blocks at level-3 compaction, each with an index of ~38 GiB (summing to ~76 GiB, which exceeds the 64 GiB limit).

Expected behavior
There should be a way of skipping this, forcing compaction, sharding, or something.

There's an upstream Thanos patch here which allows skipping compaction, but it appears not to be used by Cortex today (found by @alvinlin123 in the Cortex Slack). There's also a Thanos change here which would automatically skip compaction if the block is too large.

Mimir appears to have the ability to get past this by sharding during compaction so multiple blocks are produced per day, each of which can have a smaller index.

Environment:

  • Infrastructure: Kubernetes (GKE)
  • Deployment tool: Jinja (derived from the cortex-jsonnet manifest)

Storage Engine

  • [x] Blocks
  • [ ] Chunks

Additional Context

@alvinlin123
Contributor

alvinlin123 commented Apr 5, 2022

There is a related Prometheus issue to support index files bigger than 64 GiB.

However, I think we shouldn't wait for that; we should have Cortex skip compaction for blocks with large indexes.

@alvinlin123
Contributor

alvinlin123 commented Apr 6, 2022

After PR #4707, I still need to implement the part that auto-skips compaction for blocks with a humongous index.

@alvinlin123
Contributor

Hmm, looks like I can't just use Thanos' largeTotalIndexSizeFilter for Cortex's ShuffleShardingPlanner, because ShuffleShardingPlanner is coupled to the non-exported tsdbBasedPlanner struct.

@alvinlin123
Contributor

There's another relevant Thanos issue, thanos-io/thanos#3068, tracking the sharding work.

@stale

stale bot commented Aug 12, 2022

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 12, 2022
@alvinlin123 alvinlin123 added keepalive Skipped by stale bot and removed stale labels Aug 12, 2022
@thejosephstevens
Author

For what it's worth, we migrated to Mimir and no longer have issues with this. I think this still makes sense to implement in Cortex since it's a pretty brutal issue to hit (and effectively puts a hard cap on max tenant size), but it doesn't need to be kept open for us. If you'd like to close it for clean-up, feel free.

@jeromeinsf
Contributor

@thejosephstevens we are looking on how to address this in Cortex.

@yeya24
Contributor

yeya24 commented Sep 28, 2022

@alvinlin123 I don't think we can close this issue, since only the proposal has been merged. Please reopen it.

@alvinlin123
Contributor

@yeya24 you are right; closing this issue was a mistake. Thanks!

I still need to learn to pay attention, when merging a PR, to which issues it may incorrectly close :)
