
kafka replay speed: add alert for when we miss records in Kafka #9921

Merged: 11 commits into main on Nov 19, 2024

Conversation

dimitarvdimitrov
Contributor

What this PR does

Adds metrics and an alert to detect when the ingester misses records while consuming from Kafka, which would indicate a bug.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

Collaborator

@pracucci pracucci left a comment


Good job. Can you add an assertion to existing unit tests to ensure the metric is always 0 at the end of each test? Should be quick to do.

Base automatically changed from dimitar/ingest/remove-fetchWant-trimming to main November 17, 2024 18:14
I realized that trimming `fetchWant`s can end up discarding offsets in extreme circumstances.

### How it works

If the `fetchWant` is so big that its estimated size would exceed 2 GiB, then we trim it by reducing its end offset. The idea is that the next `fetchWant` will pick up from where this one left off.
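The trimming described above can be sketched like this (a minimal, self-contained sketch; the struct and field names are hypothetical stand-ins, not the real `fetchWant` from `pkg/storage/ingest`):

```go
package main

import "fmt"

// fetchWant is a simplified stand-in for the offset range a single fetch
// should cover. Field names here are hypothetical.
type fetchWant struct {
	startOffset int64 // inclusive
	endOffset   int64 // exclusive
}

const maxFetchBytes = int64(2) << 30 // the 2 GiB cap

// trim shrinks the end offset so the estimated fetch size stays under the
// cap. No offsets are lost as long as the next fetchWant starts at the new,
// trimmed end offset.
func trim(w fetchWant, bytesPerRecord int64) fetchWant {
	records := w.endOffset - w.startOffset
	if records*bytesPerRecord <= maxFetchBytes {
		return w
	}
	w.endOffset = w.startOffset + maxFetchBytes/bytesPerRecord
	return w
}

func main() {
	// 10M records at ~1 KiB each is far over 2 GiB, so the want is trimmed.
	w := trim(fetchWant{startOffset: 0, endOffset: 10_000_000}, 1024)
	fmt.Println(w.endOffset) // 2097152: 2 GiB / 1 KiB per record
}
```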

### How it can break

We also trim the `fetchWant` in `UpdateBytesPerRecord`. `UpdateBytesPerRecord` can be invoked in `concurrentFetchers.run` after the `fetchWant` has been dispatched. In that case the next `fetchWant` would already have been calculated from the untrimmed end offset, and we would end up with a gap.
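The failure mode amounts to simple offset arithmetic. A sketch with hypothetical numbers (not taken from a real incident):

```go
package main

import "fmt"

func main() {
	// The dispatched fetchWant originally covered offsets [0, 3_000_000),
	// and the next fetchWant was already derived from that untrimmed end.
	nextStart := int64(3_000_000)

	// Later, UpdateBytesPerRecord raises the size estimate and trims the
	// already-dispatched want down to stay under the 2 GiB cap.
	trimmedEnd := int64(2_097_152)

	// Offsets in [trimmedEnd, nextStart) are fetched by neither want: a gap.
	fmt.Println(nextStart - trimmedEnd) // 902848 missed offsets
}
```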

### Did it break?

It's hard to tell, but it's very unlikely. To reach 2 GiB, the bytes-per-record estimate would have needed to be around 2 MiB. While such large records are possible, they should be rare, and our rolling-average estimate of record size shouldn't reach that.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@dimitarvdimitrov dimitarvdimitrov force-pushed the dimitar/ingest/detect-gaps-when-consuming branch from aa30f73 to 90aaeeb Compare November 18, 2024 09:45
@dimitarvdimitrov dimitarvdimitrov marked this pull request as ready for review November 18, 2024 09:45
@dimitarvdimitrov dimitarvdimitrov requested review from tacole02 and a team as code owners November 18, 2024 09:45
Contributor

@tacole02 tacole02 left a comment


Looks good! A few minor questions/suggestions.

docs/sources/mimir/manage/mimir-runbooks/_index.md
- Ingester reads records from Kafka, and processes them sequentially. It keeps track of the offset of the last record it processed.
- Upon fetching the next batch of records, it checks if the first available record has an offset one greater than the last processed offset. If the first available offset is larger than that, then the ingester has missed some records.
- Kafka doesn't guarantee sequential offsets. If a record has been manually deleted from Kafka or the records have been produced in a transaction and the transaction was aborted, then there may be a gap.
- Mimir doesn't produce in transactions and does not delete records.
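The gap check in the bullets above boils down to comparing the first fetched offset against the last processed one. A minimal sketch (the helper name is hypothetical; the real check lives in `pkg/storage/ingest/fetcher.go`):

```go
package main

import "fmt"

// missedRecords returns how many offsets were skipped between the last
// processed record and the first record of the next fetched batch.
func missedRecords(lastProcessedOffset, firstFetchedOffset int64) int64 {
	expected := lastProcessedOffset + 1
	if firstFetchedOffset <= expected {
		return 0 // contiguous, or an overlap that is handled elsewhere
	}
	return firstFetchedOffset - expected
}

func main() {
	fmt.Println(missedRecords(41, 42)) // 0: offsets are contiguous
	fmt.Println(missedRecords(41, 45)) // 3: offsets 42, 43, 44 were missed
}
```

Because Mimir neither produces in transactions nor deletes records, any nonzero result points at a bug rather than a legitimate Kafka offset gap.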
Contributor


"Mimir doesn't produce in transactions" reads unclear to me. Is the "in" supposed to be here?

Contributor Author


Yes. In Kafka you can create a transaction and produce records within its context.

docs/sources/mimir/manage/mimir-runbooks/_index.md
Co-authored-by: Taylor C <41653732+tacole02@users.noreply.github.com>
@@ -1149,6 +1151,15 @@ func createConcurrentFetchers(ctx context.Context, t *testing.T, client *kgo.Cli
reg := prometheus.NewPedanticRegistry()
metrics := newReaderMetrics(partition, reg, noopReaderMetricsSource{})

t.Cleanup(func() {
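The truncated `t.Cleanup` in the diff above implements the reviewer's suggestion: assert at teardown that the missed-records metric is still zero. A stdlib-only sketch of the idea, with a plain `int64` standing in for the Prometheus counter (names here are hypothetical):

```go
package main

import "fmt"

// assertNoMissedRecords mimics the suggested cleanup check: at the end of
// every test the missed-records counter must still be zero.
func assertNoMissedRecords(missedRecordsTotal int64) error {
	if missedRecordsTotal != 0 {
		return fmt.Errorf("fetcher missed %d records during the test", missedRecordsTotal)
	}
	return nil
}

func main() {
	// In the real test this runs inside t.Cleanup(func() { ... }) so it
	// fires automatically when each test using createConcurrentFetchers ends.
	if err := assertNoMissedRecords(0); err != nil {
		panic(err)
	}
	fmt.Println("ok")
}
```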
Collaborator


Nice use of cleanup!

Collaborator

@pracucci pracucci left a comment


Nice job!

pkg/storage/ingest/fetcher.go
Co-authored-by: Marco Pracucci <marco@pracucci.com>
@dimitarvdimitrov dimitarvdimitrov enabled auto-merge (squash) November 19, 2024 09:03
@dimitarvdimitrov dimitarvdimitrov merged commit dc3ddfa into main Nov 19, 2024
31 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/ingest/detect-gaps-when-consuming branch November 19, 2024 09:16
3 participants