Description
Describe the bug
I'm running two Cortex clusters with identical configuration; call them prod and nonprod.
Nonprod looks fine.
In prod the compactor seems unable to compact blocks properly, which leaves prod with ~11,000 blocks while nonprod has ~600 (the data volume on prod is not 20x larger, so this is unexpected).
Having 11k blocks causes problems for the store gateway pods, which take a long time to load blocks; until all blocks are loaded, the cluster does not work well.
Why am I assuming that compaction does not work in prod? After a successful compaction I would expect log entries such as "compacted blocks" and "marking compacted block for deletion". There are none, only endless repeats of the same entries (see the logs shared below). I'm also running various dashboards, e.g. https://github.com/monitoring-mixins/website/blob/master/assets/cortex/dashboards/cortex-compactor-resources.json, which shows literally no compacted blocks for prod (and some for nonprod).
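For anyone who wants to repeat the log check described above, here is a minimal sketch of how I would count the expected success messages in a compactor log. It assumes the two substrings quoted in this report appear verbatim in the log lines; the function name `count_success_markers` is made up for illustration and is not part of Cortex.

```python
from collections import Counter
from typing import Iterable

# Substrings quoted in this report as signs of a successful compaction.
# Assumption: they appear verbatim (case-insensitively) in compactor log lines.
SUCCESS_MARKERS = (
    "compacted blocks",
    "marking compacted block for deletion",
)


def count_success_markers(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each success marker."""
    # Start every marker at zero so missing markers are reported explicitly.
    counts = Counter({marker: 0 for marker in SUCCESS_MARKERS})
    for line in lines:
        lowered = line.lower()
        for marker in SUCCESS_MARKERS:
            if marker in lowered:
                counts[marker] += 1
    return counts
```

Passing it any iterable of log lines (for example an open file with the saved compactor output) returns a count per marker; in my prod logs both counts would be zero.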
I am aware of at least a couple of issues that could be causing the compactor not to work (#4453, #3569), but I'd very much welcome any hints that would let me unblock compaction, as the current number of blocks makes the cluster prone to not working properly (which is obviously not great for production usage).
Expected behavior
Compaction works, and the number of blocks in prod is not ~20x the number in nonprod (more like ~3-4x at most).
Environment:
K8s on GKE v1.21, deployed with the official Cortex Helm chart v1.4.0
Storage Engine
Blocks
Additional Context
Two sets of logs are here: