Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compact: Update filtered blocks list before second downsample pass #7492

Merged
merged 4 commits into from
Jul 13, 2024

Conversation

tghartland
Copy link
Contributor

@tghartland tghartland commented Jun 27, 2024

Fixes #7488

If the second downsampling pass is given the same filteredMetas list as the first pass, it will create duplicates of blocks created in the first pass.

It will also not be able to do further downsampling e.g 5m->1h using blocks created in the first pass, as it will not be aware of them.

The metadata was already being synced before the second pass, but not updated into the filteredMetas list.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Re-uses the same block filtering logic from the first pass, to regenerate the list of filtered blocks from freshly synced metadata before the second pass.

I haven't made any changes regarding the metrics logic that takes place after the first filtering:

for _, meta := range filteredMetas {
resolutionLabel := meta.Thanos.ResolutionString()
downsampleMetrics.downsamples.WithLabelValues(resolutionLabel)
downsampleMetrics.downsampleFailures.WithLabelValues(resolutionLabel)
}

as I'm not sure what exactly this is doing. So there may be some downsamples that happen in the second pass that aren't counted in the metrics, if that's what it's doing. Let's keep that separate to this fix.

Verification

Tested against a bucket that was creating duplicate downsampled blocks on v0.35.1.
Verified in compactor logs that no duplicate block is created on the second pass: #7488 (comment)

If the second downsampling pass is given the same filteredMetas
list as the first pass, it will create duplicates of blocks
created in the first pass.

It will also not be able to do further downsampling e.g 5m->1h
using blocks created in the first pass, as it will not be aware
of them.

The metadata was already being synced before the second pass,
but not updated into the filteredMetas list.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>
Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>
yeya24
yeya24 previously approved these changes Jun 28, 2024
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I think the fix makes sense to me.

@tghartland
Copy link
Contributor Author

For this failure on the compactor tests:
https://github.com/thanos-io/thanos/actions/runs/9699030328/job/26793642104?pr=7492#step:5:2050
It is reaching a value of 18 blocks cleaned, and the test is looking for 19.

It was changed in the test from 18 to 19 here 2ed48f7 one day after the change that introduced the filter code which ends up duplicating a block 68bef3f.

The reasoning at the time was that the hardcoded test value might have been taken from a run of the CI that didn't report the full number in the metric, but looking at it now this +1 has to be from a block being cleaned up that was a duplicate from the downsampling.

The value was increased in 2ed48f7 to fix the test,
with the reasoning that the hardcoded value must
have been taken from a run of the CI that didn't
reach the max value due to CI worker lag.

More likely the real reason is that commit 68bef3f
the day before had caused blocks to be duplicated
during downsampling.

The duplicate block is immediately marked for deletion,
causing an extra +1 in the number of blocks cleaned.

Subtracting one from the value again now that the
block duplication issue is fixed.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>
Combined with the previous commit this effectively reverts
all of 2ed48f7, in which two assertions were changed to
(unknowingly) account for a bug which had just been
introduced in the downsampling code, causing duplicate blocks.

This assertion change I am less sure on the reasoning for,
but after running through the e2e tests several times locally,
it is consistent that the only downsampling happens in the
"compact-working" step, and so all other steps would report 0
for their total downsamples metric.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>
@tghartland tghartland force-pushed the 7488-compactor-fix branch from d4e0cd5 to bdcf63b Compare June 28, 2024 16:59
@tghartland
Copy link
Contributor Author

@yeya24 I made some further changes to fix the e2e compaction tests, please could you take another look when you have the time?

Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Sorry for the lag. Let's merge it

@yeya24 yeya24 merged commit 35c0dbe into thanos-io:main Jul 13, 2024
20 checks passed
jnyi pushed a commit to jnyi/thanos that referenced this pull request Oct 17, 2024
…hanos-io#7492)

* compact: Update filtered blocks list before second downsample pass

If the second downsampling pass is given the same filteredMetas
list as the first pass, it will create duplicates of blocks
created in the first pass.

It will also not be able to do further downsampling e.g 5m->1h
using blocks created in the first pass, as it will not be aware
of them.

The metadata was already being synced before the second pass,
but not updated into the filteredMetas list.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>

* Update changelog

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>

* e2e/compact: Fix number of blocks cleaned assertion

The value was increased in 2ed48f7 to fix the test,
with the reasoning that the hardcoded value must
have been taken from a run of the CI that didn't
reach the max value due to CI worker lag.

More likely the real reason is that commit 68bef3f
the day before had caused blocks to be duplicated
during downsampling.

The duplicate block is immediately marked for deletion,
causing an extra +1 in the number of blocks cleaned.

Subtracting one from the value again now that the
block duplication issue is fixed.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>

* e2e/compact: Revert change to downsample count assertion

Combined with the previous commit this effectively reverts
all of 2ed48f7, in which two assertions were changed to
(unknowingly) account for a bug which had just been
introduced in the downsampling code, causing duplicate blocks.

This assertion change I am less sure on the reasoning for,
but after running through the e2e tests several times locally,
it is consistent that the only downsampling happens in the
"compact-working" step, and so all other steps would report 0
for their total downsamples metric.

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>

---------

Signed-off-by: Thomas Hartland <thomas.hartland@diamond.ac.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compactor: Downsample second pass creates duplicate blocks
2 participants