Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bloom-compactor] Fix: blooms returning same checksum with different fp/ts params. #11720

Closed
wants to merge 5 commits into from

Conversation

poyzannur
Copy link
Contributor

What this PR does / why we need it:
We noticed that blooms created for different fingerprint(FPs) ranges and timestamps(Ts) were returning same checksums.
Block building essentially creates blooms and indexes. Previously we were calculating checksums based on blooms. However they do not contain information about FP and Ts. Moving checksum calculation to indexes includes that information, and results in different checksums as expected.

Which issue(s) this PR fixes:
Fixes https://github.com/grafana/loki-private/issues/899

Special notes for your reviewer:
The tests assume the chunkRefs in series are the same, for simplicity I kept this as is. But tested locally with different chunks refs too.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@poyzannur poyzannur requested a review from a team as a code owner January 19, 2024 16:54
@pull-request-size pull-request-size bot added size/M and removed size/L labels Jan 19, 2024
@owen-d
Copy link
Member

owen-d commented Jan 19, 2024

I feel like this misses the mark a bit. Seeing the same checksums after summing the content (blooms) and migrating to checksumming the metadata (index) just seems like we had a bug in the previous checksum code, not that this approach is more attractive. I think checksumming on content|index|both would reasonably work (both is probably better b/c it's all technically "content"), but logically we shouldn't be seeing the same checksums when summing the blooms if their metadata is different (outside of statistically unlikely collisions).

Does that make sense?

@poyzannur
Copy link
Contributor Author

I feel like this misses the mark a bit. Seeing the same checksums after summing the content (blooms) and migrating to checksumming the metadata (index) just seems like we had a bug in the previous checksum code, not that this approach is more attractive. I think checksumming on content|index|both would reasonably work (both is probably better b/c it's all technically "content"), but logically we shouldn't be seeing the same checksums when summing the blooms if their metadata is different (outside of statistically unlikely collisions).

Does that make sense?

It sure does make sense. I tried to communicate in the last meeting that this was a cheap work around to unblock the team. Bc I could not find a bug in the checksum code, hence i offered this. I'll spend more time looking at the checksum code. But I don't think it harms to merge this while keeping the bug issue open.
how do you envision implementing checksum over both without adding overhead?

@owen-d
Copy link
Member

owen-d commented Jan 22, 2024

Thanks for addt'l context that this was just unblocking our current approach. A few bits in no particular order:

  • Checksumming based on the index should be fine, as long as we also include the block schema as part of the input to checksumming code (so different algorithm parameters built against the same inputs produce different checksummed blocks). One benefit of checksumming everything is that we remove the need to think about minutiae like this. Instead, if the block bytes change, so will the checksum. Small benefit now, but can help catch some nasty bugs down the line.
  • I'm worried that changing this now, while helpful & expedient in the short term, is covering up issues we don't yet understand.
  • Performance wise, this all has a cost, but I don't expect it to be an issue (this is something we can measure/re-evaluate later).

Let's spend some more time uncovering why we're seeing identical checksums before moving to this approach.

@poyzannur
Copy link
Contributor Author

poyzannur commented Jan 23, 2024

@paul1r and I were debugging this issue and got to the bottom of the why. We calculate the checksum with BloomPageHeader, which does not contain any information about actual bloom data. All fields can turn out to be the same for same number of pages for the same number of series in a bloom.

We now consider checksumming over both index and blooms OR adding a hash to the BloomPageHeader per page which may give some variance on the bloom data per page. I'm closing this PR in favour of these solutions.

@poyzannur poyzannur closed this Jan 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants