memberlist: Log fast-join failures at `info` instead of `debug` #585

56quarters · 2024-09-20T17:55:48Z

What this PR does:

Not being able to fast-join via contacting a node is suspicious and might indicate a problem. Logging at info makes this easier to troubleshoot.

Which issue(s) this PR fixes:

N/A

Checklist

- Tests updated
- `CHANGELOG.md` updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]


pracucci

Thanks!


memberlist: Log fast-join failures at info instead of debug

f4115dd

Not being able to fast-join via contacting a node is suspicious and might indicate a problem. Logging at info makes this easier to troubleshoot. Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

56quarters force-pushed the 56quarters/memberlist-logging branch from fc7d9f0 to f4115dd on September 20, 2024 17:56

56quarters marked this pull request as ready for review September 20, 2024 18:03

pr00se approved these changes Sep 20, 2024


56quarters merged commit 560bb26 into main Sep 20, 2024
5 checks passed

56quarters deleted the 56quarters/memberlist-logging branch September 20, 2024 18:38

pracucci reviewed Sep 20, 2024


56quarters added a commit to grafana/mimir that referenced this pull request Sep 20, 2024

Update to latest commit of dskit main

ab78be0

Specifically pulls in grafana/dskit#585 Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>

56quarters mentioned this pull request Sep 20, 2024

Update to latest commit of dskit main grafana/mimir#9356

Merged

4 tasks

56quarters added a commit to grafana/mimir that referenced this pull request Sep 20, 2024

Update to latest commit of dskit main (#9356)

6be86e8

Specifically pulls in grafana/dskit#585 Signed-off-by: Nick Pillitteri <nick.pillitteri@grafana.com>
