Add shuffle-sharding for the compactor #4433
Conversation
Seems plausible. Will review again after the preceding PRs are merged.
@ac1214 Any plans to resume progress on this amazing work?

Thanks for checking/testing out these changes @edma2! I haven't had much time to work on these changes, but I believe that @roystchiang might be taking over this work.
```diff
-	if !c.compactorCfg.ShardingEnabled {
+	// Always owned if sharding is disabled or if using shuffle-sharding as shard ownership
+	// is determined by the shuffle sharding grouper.
+	if !c.compactorCfg.ShardingEnabled || c.compactorCfg.ShardingStrategy == util.ShardingStrategyShuffle {
 		return true, nil
```
Every compactor now "owns" all the users, even if it doesn't actually compact any blocks for most users. One particular impact I saw is a huge growth of metadata syncs. Instead, maybe we can compute `ownUser` based on the shuffle shard of a tenant.
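A minimal sketch of that idea, assuming a hypothetical `ownUserViaShuffleShard` helper and a toy hash-based shuffle shard rather than the ring's real implementation: a compactor only owns a tenant if its instance ID falls inside the tenant's shuffle shard.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"sort"
)

// shuffleShard deterministically picks shardSize compactors for a tenant.
// This is a toy stand-in for the ring's real shuffle-sharding logic.
func shuffleShard(tenantID string, instances []string, shardSize int) []string {
	shard := append([]string(nil), instances...)
	sort.Strings(shard)
	if shardSize <= 0 || shardSize >= len(shard) {
		return shard // shard size 0 means "use all compactors"
	}
	h := fnv.New64a()
	h.Write([]byte(tenantID))
	rnd := rand.New(rand.NewSource(int64(h.Sum64())))
	rnd.Shuffle(len(shard), func(i, j int) { shard[i], shard[j] = shard[j], shard[i] })
	return shard[:shardSize]
}

// ownUserViaShuffleShard returns true only when this compactor is part of the
// tenant's shuffle shard, instead of unconditionally returning true.
func ownUserViaShuffleShard(tenantID, instanceID string, instances []string, shardSize int) bool {
	for _, id := range shuffleShard(tenantID, instances, shardSize) {
		if id == instanceID {
			return true
		}
	}
	return false
}

func main() {
	compactors := []string{"compactor-0", "compactor-1", "compactor-2", "compactor-3", "compactor-4", "compactor-5"}
	fmt.Println(ownUserViaShuffleShard("tenant-a", "compactor-2", compactors, 3))
}
```

With an ownership check like this, only the compactors in a tenant's shard sync that tenant's metadata, which is what cuts the metadata-sync growth described above.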
Quick fix that seems to work: b0c2c2a
Hey @edma2, thanks for taking a look at the compactor PR. We are actively testing this branch in our beta environment, and I'm glad to hear that it is working for you.
We have a similar fix to your commit, and I can confirm that we also saw an improvement in the metadata syncs.
However, on the tenant clean-up side, we were running into issues where the deleted tenant directory was left dangling: while compactor-A is deleting the deletion markers, compactor-B is also trying to sync the data and re-uploads the block index. We currently provide an override for the tenant shard size on the clean-up path so that only one compactor owns the cleanup for a given tenant.
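A rough sketch of that clean-up override, building on the hypothetical `shuffleShard` helper from the previous sketch (the actual mechanism in this branch may differ): the shard size is forced to 1 on the clean-up path, so exactly one compactor deletes a tenant's markers and bucket index.

```go
// ownsCleanup decides which single compactor cleans up a deleted tenant.
// Regardless of the tenant's configured compactor shard size, the clean-up
// path uses a shard size of 1, so only one compactor removes the deletion
// markers and bucket index and no other compactor re-uploads them.
func ownsCleanup(tenantID, instanceID string, instances []string) bool {
	shard := shuffleShard(tenantID, instances, 1)
	return len(shard) == 1 && shard[0] == instanceID
}
```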
Thanks for the tip on the cleanup side. Makes sense that you don't want multiple compactors repeating the same work and potentially conflicting.
@roystchiang with this PR, have you started seeing any errors that look like
We've been running this change for a bigger workspace, but we have not seen this error yet. Are you able to provide more detail? Is it a level-1 -> level-2 compaction? And what level blocks are these?
Currently, my theory is that this happens in the following situation: for a vertical compaction group, e.g. (10:00 - 12:00), one ingester has not yet uploaded its block (for some reason). Compactor B tries to compact [B1, A2] within the (00:00 - 12:00) compaction group and fails, since both include A1 as a source. I don't think this issue would happen if only one compactor was running for a tenant.
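A toy illustration of that theory (not the planner's actual check; the block IDs other than A1/B1/A2 and the helper are made up): if two blocks in the same compaction group were both built from the same source block, because two compactors compacted overlapping groups for the tenant, the second compaction conflicts with the first.

```go
package main

import "fmt"

// duplicatedSources returns the source blocks that appear in more than one
// block of the same compaction group. In the scenario above, compacting
// [B1, A2] fails because A1 is listed as a source of both.
func duplicatedSources(groupSources map[string][]string) []string {
	firstSeenIn := map[string]string{} // source block -> block that listed it first
	var dups []string
	for block, sources := range groupSources {
		for _, src := range sources {
			if other, ok := firstSeenIn[src]; ok && other != block {
				dups = append(dups, src)
				continue
			}
			firstSeenIn[src] = block
		}
	}
	return dups
}

func main() {
	// Hypothetical source sets for the two level-2 blocks in the group.
	groupSources := map[string][]string{
		"B1": {"A1", "C1"},
		"A2": {"A1", "C2"},
	}
	fmt.Println(duplicatedSources(groupSources)) // [A1]
}
```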
I've been successfully testing this PR on a cluster with a single tenant that has 200M active series. I'm running 6 compactors with the following configuration:

```yaml
compactor:
  block_deletion_marks_migration_enabled: false
  block_sync_concurrency: 10
  compaction_interval: 30m0s
  data_dir: /var/lib/cortex/data
  meta_sync_concurrency: 10
  sharding_enabled: true
  sharding_ring:
    instance_interface_names:
    - eth1
    kvstore:
      prefix: cortex/collectors/
  sharding_strategy: shuffle-sharding
limits:
  compactor_tenant_shard_size: 6
target: compactor
```

Here are a couple of messages from the logs which raise concern.
The last one is more concerning; I've seen it 20 times, only on 1 of my 6 compactors.
@edma2, what you suggest makes sense, and it can definitely happen. How often does this issue happen for you? Let me see if there's a way to avoid this. With a coordinator for compactors, we could block the compaction of A2 until B1 is done; however, we don't have this right now. Thanos played around with this idea in thanos-io/thanos#4458, and it would be nice to unify this logic for both Thanos and Cortex. @wilfriedroset, for
The error message I got was during a regular compaction.
Not very often; I'd say a corrupt block is created less than once a week.
Could we lock block A1 at this point, so it would not be used as a source for any other compaction?
The only problem with locks, of course, is what to do if the compactor dies: we'd need a lock timeout, and the compactor would need to keep refreshing the lock while it works. @edma2, could you prove that that's what happened in your case? Maybe by looking at the timestamps of when the source blocks got updated on S3?
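A rough sketch of that lock idea, purely illustrative (nothing like this exists in the compactor today; the names and the in-memory table are made up): the lock carries an expiry, and the owning compactor keeps refreshing it while it compacts, so a lock left behind by a dead compactor eventually times out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// blockLock is an in-memory stand-in for a lock marker that would live next to
// the block in object storage.
type blockLock struct {
	owner     string
	expiresAt time.Time
}

type lockTable struct {
	mu    sync.Mutex
	locks map[string]blockLock // block ID -> current lock
}

// tryLock takes the lock on a block unless another compactor holds an
// unexpired lock on it; re-acquiring your own lock extends the expiry.
func (t *lockTable) tryLock(blockID, owner string, ttl time.Duration) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if l, ok := t.locks[blockID]; ok && l.owner != owner && time.Now().Before(l.expiresAt) {
		return false
	}
	t.locks[blockID] = blockLock{owner: owner, expiresAt: time.Now().Add(ttl)}
	return true
}

func main() {
	t := &lockTable{locks: map[string]blockLock{}}
	fmt.Println(t.tryLock("A1", "compactor-0", 30*time.Second)) // true: lock acquired
	fmt.Println(t.tryLock("A1", "compactor-1", 30*time.Second)) // false: A1 is still locked
	fmt.Println(t.tryLock("A1", "compactor-0", 30*time.Second)) // true: owner heartbeat extends the lock
}
```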
My original theory was based on what I saw in the compactor logs. I'll look at this again and see if I can find any recent compactor logs that tell this story.
We tested a few blocks starting from level-1 compaction (~1TB) on a new bucket. To test it, I essentially ran the same config as @wilfriedroset. Couldn't find any major issues; it seems to have compacted just fine.
Extra: took this PR, rebased it against current master, and then built the image.
Update: compacted 10TiB this way, zero issues.
I've been running this for some time and we did not see the problem happening. We could skip the problematic block using #4707. We could also propose a change in Thanos to skip those blocks, similar to what was introduced in thanos-io/thanos#4469.
What this PR does:
Adds shuffle-sharding for the plans generated by the grouper. Depends on #4432; once #4432 is merged, the diff for this PR will be reduced.
Implements the Parallel Compaction by Time Interval proposal.
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]