
kafka replay speed: add alert for when we miss records in Kafka #9921

Merged: 11 commits into main on Nov 19, 2024

Conversation

dimitarvdimitrov
Contributor

What this PR does

Adds metrics and an alert to detect when the ingester misses records while consuming from Kafka, which would indicate a bug.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

Collaborator

@pracucci pracucci left a comment


Good job. Can you add an assertion to existing unit tests to ensure the metric is always 0 at the end of each test? Should be quick to do.

Base automatically changed from dimitar/ingest/remove-fetchWant-trimming to main November 17, 2024 18:14
I realized that trimming `fetchWant`s can end up discarding offsets in extreme circumstances.

### How it works

If the `fetchWant` is so big that its estimated size would exceed 2 GiB, then we trim it by reducing its end offset. The idea is that the next `fetchWant` will pick up from where this one left off.
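The trimming described above can be sketched like this (a minimal, self-contained sketch; the struct and field names are hypothetical stand-ins, not the real `fetchWant` from `pkg/storage/ingest`):

```go
package main

import "fmt"

// fetchWant is a simplified stand-in for the offset range a single fetch
// should cover. Field names here are hypothetical.
type fetchWant struct {
	startOffset int64 // inclusive
	endOffset   int64 // exclusive
}

const maxFetchBytes = int64(2) << 30 // the 2 GiB cap

// trim shrinks the end offset so the estimated fetch size stays under the
// cap. No offsets are lost as long as the next fetchWant starts at the new,
// trimmed end offset.
func trim(w fetchWant, bytesPerRecord int64) fetchWant {
	records := w.endOffset - w.startOffset
	if records*bytesPerRecord <= maxFetchBytes {
		return w
	}
	w.endOffset = w.startOffset + maxFetchBytes/bytesPerRecord
	return w
}

func main() {
	// 10M records at ~1 KiB each is far over 2 GiB, so the want is trimmed.
	w := trim(fetchWant{startOffset: 0, endOffset: 10_000_000}, 1024)
	fmt.Println(w.endOffset) // 2097152: 2 GiB / 1 KiB per record
}
```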

### How it can break

We also trim the `fetchWant` in `UpdateBytesPerRecord`. `UpdateBytesPerRecord` can be invoked in `concurrentFetchers.run` after the `fetchWant` has been dispatched. In that case the next `fetchWant` would already have been calculated from the untrimmed end offset, and we would end up with a gap.
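The failure mode amounts to simple offset arithmetic. A sketch with hypothetical numbers (not taken from a real incident):

```go
package main

import "fmt"

func main() {
	// The dispatched fetchWant originally covered offsets [0, 3_000_000),
	// and the next fetchWant was already derived from that untrimmed end.
	nextStart := int64(3_000_000)

	// Later, UpdateBytesPerRecord raises the size estimate and trims the
	// already-dispatched want down to stay under the 2 GiB cap.
	trimmedEnd := int64(2_097_152)

	// Offsets in [trimmedEnd, nextStart) are fetched by neither want: a gap.
	fmt.Println(nextStart - trimmedEnd) // 902848 missed offsets
}
```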

### Did it break?

It's hard to tell, but it's very unlikely. To reach 2 GiB, the bytes-per-record estimate would have needed to be around 2 MiB. While such large records are possible, they should be rare, and our rolling-average estimate of record size shouldn't reach that.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/ingest/detect-gaps-when-consuming branch from aa30f73 to 90aaeeb Compare November 18, 2024 09:45
@dimitarvdimitrov dimitarvdimitrov marked this pull request as ready for review November 18, 2024 09:45
@dimitarvdimitrov dimitarvdimitrov requested review from tacole02 and a team as code owners November 18, 2024 09:45
Contributor

@tacole02 tacole02 left a comment


Looks good! A few minor questions/suggestions.

docs/sources/mimir/manage/mimir-runbooks/_index.md
- Ingester reads records from Kafka, and processes them sequentially. It keeps track of the offset of the last record it processed.
- Upon fetching the next batch of records, it checks if the first available record has an offset one greater than the last processed offset. If the first available offset is larger than that, then the ingester has missed some records.
- Kafka doesn't guarantee sequential offsets. If a record has been manually deleted from Kafka or the records have been produced in a transaction and the transaction was aborted, then there may be a gap.
- Mimir doesn't produce in transactions and does not delete records.
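The gap check in the bullets above boils down to comparing the first fetched offset against the last processed one. A minimal sketch (the helper name is hypothetical; the real check lives in `pkg/storage/ingest/fetcher.go`):

```go
package main

import "fmt"

// missedRecords returns how many offsets were skipped between the last
// processed record and the first record of the next fetched batch.
func missedRecords(lastProcessedOffset, firstFetchedOffset int64) int64 {
	expected := lastProcessedOffset + 1
	if firstFetchedOffset <= expected {
		return 0 // contiguous, or an overlap that is handled elsewhere
	}
	return firstFetchedOffset - expected
}

func main() {
	fmt.Println(missedRecords(41, 42)) // 0: offsets are contiguous
	fmt.Println(missedRecords(41, 45)) // 3: offsets 42, 43, 44 were missed
}
```

Because Mimir neither produces in transactions nor deletes records, any nonzero result points at a bug rather than a legitimate Kafka offset gap.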
Contributor


"Mimir doesn't produce in transactions" reads unclear to me. Is the "in" supposed to be here?

Contributor Author


Yes. In Kafka you can create a transaction and produce records within its context.

docs/sources/mimir/manage/mimir-runbooks/_index.md
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
@@ -1149,6 +1151,15 @@ func createConcurrentFetchers(ctx context.Context, t *testing.T, client *kgo.Cli
reg := prometheus.NewPedanticRegistry()
metrics := newReaderMetrics(partition, reg, noopReaderMetricsSource{})

t.Cleanup(func() {
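The truncated `t.Cleanup` in the diff above implements the reviewer's suggestion: assert at teardown that the missed-records metric is still zero. A stdlib-only sketch of the idea, with a plain `int64` standing in for the Prometheus counter (names here are hypothetical):

```go
package main

import "fmt"

// assertNoMissedRecords mimics the suggested cleanup check: at the end of
// every test the missed-records counter must still be zero.
func assertNoMissedRecords(missedRecordsTotal int64) error {
	if missedRecordsTotal != 0 {
		return fmt.Errorf("fetcher missed %d records during the test", missedRecordsTotal)
	}
	return nil
}

func main() {
	// In the real test this runs inside t.Cleanup(func() { ... }) so it
	// fires automatically when each test using createConcurrentFetchers ends.
	if err := assertNoMissedRecords(0); err != nil {
		panic(err)
	}
	fmt.Println("ok")
}
```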
Collaborator


Nice use of cleanup!

Collaborator

@pracucci pracucci left a comment


Nice job!

pkg/storage/ingest/fetcher.go
Co-authored-by: Marco Pracucci <marco@pracucci.com>
@dimitarvdimitrov dimitarvdimitrov enabled auto-merge (squash) November 19, 2024 09:03
@dimitarvdimitrov dimitarvdimitrov merged commit dc3ddfa into main Nov 19, 2024
31 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/ingest/detect-gaps-when-consuming branch November 19, 2024 09:16
3 participants