kv: detect context cancellation in limitBulkIOWrite, avoid log spam #73279
This commit adds logic to propagate context cancellation in `limitBulkIOWrite`. This function is used in two places: 1) when ingesting SSTs, and 2) when receiving a snapshot. The first case uses the Raft scheduler goroutine's context, so it never gets cancelled. The second case uses the context of the sender of a Raft snapshot, so it can get cancelled.

In customer clusters, we were seeing Raft snapshots hit their deadline and begin spamming `error rate limiting bulk io write: context deadline exceeded` error messages. This was bad for two reasons. First, it was very noisy. Second, it meant that a Raft snapshot that was no longer going to succeed was still writing out full SSTs while holding on to the `snapshotApplySem`. This contributed to the snapshot starvation we saw in the issue.

With this commit, `limitBulkIOWrite` will eagerly detect context cancellation and will propagate the cancellation up to the caller, allowing the caller to quickly release resources.

Release notes (bug fix): Raft snapshots now detect timeouts earlier and avoid spamming the logs with `context deadline exceeded` errors.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @erikgrinaker)
Nice catch, thanks!
TFTRs!
bors r+
Build succeeded