ingest storage: proper shutdown of partitionCommitter #9436

Merged

dimitarvdimitrov
Contributor

What this PR does

Previously, the `partitionCommitter` was shut down by the services manager as soon as the service context was cancelled, which means it shut down in parallel with the `PartitionReader`. The race occurs when the `partitionCommitter` has already shut down while the `PartitionReader` is still processing records: when the `PartitionReader` then calls `enqueueCommit`, the call sets the atomic but the commit is never sent to Kafka.

As a result, we may not always persist the latest commit to Kafka on shutdown.
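
The ordering problem described above can be illustrated with a minimal, single-goroutine sketch. Everything in it is hypothetical and simplified for illustration: `fakeKafka`, `committer`, its `stop` method and the `shutdown` helper are not the real Mimir types, and the real `partitionCommitter` runs as a service rather than being stopped with a direct call. The only thing it tries to show is that an `enqueueCommit` arriving after the committer's final flush updates the atomic and nothing else.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fakeKafka is a hypothetical stand-in for the broker. It only remembers
// the last offset that was actually committed.
type fakeKafka struct{ committed int64 }

// committer is a hypothetical, simplified stand-in for partitionCommitter.
type committer struct {
	toCommit atomic.Int64
	kafka    *fakeKafka
	stopped  bool
}

// enqueueCommit only records the offset. Something else has to flush it.
func (c *committer) enqueueCommit(offset int64) { c.toCommit.Store(offset) }

// stop flushes the latest enqueued offset once. After that, nothing is
// flushed any more.
func (c *committer) stop() {
	if c.stopped {
		return
	}
	c.kafka.committed = c.toCommit.Load()
	c.stopped = true
}

// shutdown simulates one shutdown and returns the offset that ended up in
// Kafka. The reader has already enqueued offset 1 and enqueues offset 2 for
// the record it is still processing.
func shutdown(committerStopsFirst bool) int64 {
	k := &fakeKafka{committed: -1}
	c := &committer{kafka: k}
	c.enqueueCommit(1)

	if committerStopsFirst {
		c.stop()           // committer flushes offset 1 and exits
		c.enqueueCommit(2) // atomic is set, but nothing sends it to Kafka
	} else {
		c.enqueueCommit(2) // reader finishes its last record first
		c.stop()           // final flush sees offset 2
	}
	return k.committed
}

func main() {
	fmt.Println(shutdown(true))  // prints 1: the latest commit is lost
	fmt.Println(shutdown(false)) // prints 2: the latest commit is persisted
}
```

Stopping the committer only after the reader has finished corresponds to the second branch, which is the ordering this PR aims for.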

Which issue(s) this PR fixes or relates to

Fixes #8697

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

Signed-off-by: Dimitar Dimitrov <dimitar.dimitrov@grafana.com>
@dimitarvdimitrov dimitarvdimitrov requested a review from a team as a code owner September 26, 2024 17:14
Contributor

@gotjosh gotjosh left a comment

LGTM

- err = services.StartManagerAndAwaitHealthy(ctx, r.dependencies)
+ // Use context.Background() because we want to stop all dependencies when the PartitionReader stops
+ // instead of stopping them when ctx is cancelled and while the PartitionReader is still running.
+ err = services.StartManagerAndAwaitHealthy(context.Background(), r.dependencies)
Contributor Author

The tradeoff here is that if we get interrupted during startup, we won't respect that and will wait for the full startup to finish.

Not sure if I should try to solve that (maybe with something like context.AfterFunc()).
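
For reference, one way the `context.AfterFunc` idea could look is sketched below. This is not what the PR does. `startInterruptibly` and `demo` are hypothetical names, and the `start` callback stands in for whatever starts the dependencies. The sketch assumes the goal is a context that follows `ctx` only while startup is in progress and is detached from it afterwards.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// startInterruptibly (hypothetical helper) runs start with a context that is
// independent of ctx, except while start is still in progress.
func startInterruptibly(ctx context.Context, start func(context.Context) error) (context.Context, error) {
	depsCtx, cancelDeps := context.WithCancel(context.Background())

	// While startup is in progress, cancelling ctx also cancels depsCtx.
	detach := context.AfterFunc(ctx, cancelDeps)

	err := start(depsCtx)

	// Startup is over: from now on, cancelling ctx must not touch depsCtx.
	detach()
	if err != nil {
		cancelDeps() // release the context if startup failed
	}
	return depsCtx, err
}

// demo exercises both paths and reports whether each behaved as intended.
func demo() (interrupted, detached bool) {
	// Path 1: ctx is cancelled, and startup never completes on its own.
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	_, err := startInterruptibly(ctx, func(c context.Context) error {
		<-c.Done()
		return c.Err()
	})
	interrupted = errors.Is(err, context.Canceled)

	// Path 2: startup succeeds, then ctx is cancelled afterwards.
	ctx, cancel = context.WithCancel(context.Background())
	depsCtx, err := startInterruptibly(ctx, func(context.Context) error { return nil })
	cancel()
	detached = err == nil && depsCtx.Err() == nil
	return interrupted, detached
}

func main() {
	fmt.Println(demo()) // prints: true true
}
```

`context.AfterFunc` requires Go 1.21 or later. The sketch leaves a small window between `start` returning and `detach()` in which a cancellation of `ctx` would still cancel `depsCtx`; a real implementation would have to decide how to treat that case.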

Collaborator

If the PartitionReader gets terminated (context canceled) while running PartitionReader.start(), do we even call stopDependencies() at all? If I remember correctly, PartitionReader.stop() gets called only if PartitionReader.start() has completed successfully, that is, if it has returned nil.

Contributor Author

we take care of it a few lines above

	// Stop dependencies if the start() fails.
	defer func() {
		if returnErr != nil {
			_ = r.stopDependencies()
		}
	}()
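
The snippet above relies on `returnErr` being a named return value, so the deferred function can see the error that `start()` is about to return. A minimal, self-contained sketch of that pattern follows. The `reader` type, its `depsStopped` field and the `fail` parameter are hypothetical and exist only to make the example runnable.

```go
package main

import (
	"errors"
	"fmt"
)

// reader is a hypothetical stand-in for PartitionReader.
type reader struct{ depsStopped bool }

// stopDependencies records that cleanup ran.
func (r *reader) stopDependencies() error {
	r.depsStopped = true
	return nil
}

// start uses a named return value so the deferred function can inspect the
// error being returned and clean up only on failure.
func (r *reader) start(fail bool) (returnErr error) {
	// Stop dependencies if the start() fails.
	defer func() {
		if returnErr != nil {
			_ = r.stopDependencies()
		}
	}()

	if fail {
		return errors.New("startup interrupted")
	}
	return nil
}

func main() {
	failed, ok := &reader{}, &reader{}
	_ = failed.start(true)
	_ = ok.start(false)
	fmt.Println(failed.depsStopped, ok.depsStopped) // prints: true false
}
```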

Collaborator

Oh right, I missed it. LGTM then.

@pracucci pracucci self-requested a review September 27, 2024 10:38
@gotjosh
Contributor

gotjosh commented Sep 30, 2024

@pracucci should we wait for you to review this or can we proceed with the merge?

@dimitarvdimitrov dimitarvdimitrov merged commit ab91e9d into main Oct 1, 2024
29 checks passed
@dimitarvdimitrov dimitarvdimitrov deleted the dimitar/ingester/fix-TestPartitionReader_Commit branch October 1, 2024 18:38
Development

Successfully merging this pull request may close these issues.

Flaky TestPartitionReader_Commit
3 participants