[bloom-compactor] Fix: blooms returning same checksum with different fp/ts params. #11720

poyzannur · 2024-01-19T16:54:39Z

What this PR does / why we need it:
We noticed that blooms created for different fingerprint(FPs) ranges and timestamps(Ts) were returning same checksums.
Block building essentially creates blooms and indexes. Previously we were calculating checksums based on blooms. However they do not contain information about FP and Ts. Moving checksum calculation to indexes includes that information, and results in different checksums as expected.

Which issue(s) this PR fixes:
Fixes https://github.com/grafana/loki-private/issues/899

Special notes for your reviewer:
The tests assume the chunkRefs in series are the same, for simplicity I kept this as is. But tested locally with different chunks refs too.

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
CHANGELOG.md updated
- If the change is worth mentioning in the release notes, add add-to-release-notes label
Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

owen-d · 2024-01-19T18:35:18Z

I feel like this misses the mark a bit. Seeing the same checksums after summing the content (blooms) and migrating to checksumming the metadata (index) just seems like we had a bug in the previous checksum code, not that this approach is more attractive. I think checksumming on content|index|both would reasonably work (both is probably better b/c it's all technically "content"), but logically we shouldn't be seeing the same checksums when summing the blooms if their metadata is different (outside of statistically unlikely collisions).

Does that make sense?

poyzannur · 2024-01-21T19:30:21Z

I feel like this misses the mark a bit. Seeing the same checksums after summing the content (blooms) and migrating to checksumming the metadata (index) just seems like we had a bug in the previous checksum code, not that this approach is more attractive. I think checksumming on content|index|both would reasonably work (both is probably better b/c it's all technically "content"), but logically we shouldn't be seeing the same checksums when summing the blooms if their metadata is different (outside of statistically unlikely collisions).

Does that make sense?

It sure does make sense. I tried to communicate in the last meeting that this was a cheap work around to unblock the team. Bc I could not find a bug in the checksum code, hence i offered this. I'll spend more time looking at the checksum code. But I don't think it harms to merge this while keeping the bug issue open.
how do you envision implementing checksum over both without adding overhead?

owen-d · 2024-01-22T16:43:03Z

Thanks for addt'l context that this was just unblocking our current approach. A few bits in no particular order:

Checksumming based on the index should be fine, as long as we also include the block schema as part of the input to checksumming code (so different algorithm parameters built against the same inputs produce different checksummed blocks). One benefit of checksumming everything is that we remove the need to think about minutiae like this. Instead, if the block bytes change, so will the checksum. Small benefit now, but can help catch some nasty bugs down the line.
I'm worried that changing this now, while helpful & expedient in the short term, is covering up issues we don't yet understand.
Performance wise, this all has a cost, but I don't expect it to be an issue (this is something we can measure/re-evaluate later).

Let's spend some more time uncovering why we're seeing identical checksums before moving to this approach.

poyzannur · 2024-01-23T15:21:50Z

@paul1r and I were debugging this issue and got to the bottom of the why. We calculate the checksum with BloomPageHeader, which does not contain any information about actual bloom data. All fields can turn out to be the same for same number of pages for the same number of series in a bloom.

We now consider checksumming over both index and blooms OR adding a hash to the BloomPageHeader per page which may give some variance on the bloom data per page. I'm closing this PR in favour of these solutions.

poyzannur added 3 commits January 18, 2024 16:29

Fix: bloom builder returning same checksum for different blocks

8949927

add more tests

69488bb

check for no err

a1d2d37

poyzannur requested a review from a team as a code owner January 19, 2024 16:54

pull-request-size bot added the size/L label Jan 19, 2024

poyzannur added 2 commits January 19, 2024 17:06

Resolve merge conflics

4034b23

Refactor tests, many thanks to paul1r

fa4cff0

pull-request-size bot added size/M and removed size/L labels Jan 19, 2024

poyzannur closed this Jan 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bloom-compactor] Fix: blooms returning same checksum with different fp/ts params. #11720

[bloom-compactor] Fix: blooms returning same checksum with different fp/ts params. #11720

poyzannur commented Jan 19, 2024

owen-d commented Jan 19, 2024 •

edited

Loading

poyzannur commented Jan 21, 2024

owen-d commented Jan 22, 2024

poyzannur commented Jan 23, 2024 •

edited

Loading

[bloom-compactor] Fix: blooms returning same checksum with different fp/ts params. #11720

[bloom-compactor] Fix: blooms returning same checksum with different fp/ts params. #11720

Conversation

poyzannur commented Jan 19, 2024

owen-d commented Jan 19, 2024 • edited Loading

poyzannur commented Jan 21, 2024

owen-d commented Jan 22, 2024

poyzannur commented Jan 23, 2024 • edited Loading

owen-d commented Jan 19, 2024 •

edited

Loading

poyzannur commented Jan 23, 2024 •

edited

Loading