Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: detect context cancellation in limitBulkIOWrite, avoid log spam #73279

Merged

Conversation

nvanbenschoten
Copy link
Member

This commit adds logic to propagate context cancellation in limitBulkIOWrite.
This function is used in two places, 1) when ingesting ssts, and 2) when
receiving a snapshot. The first case uses the Raft scheduler goroutine's
context, so it never gets cancelled. The second case uses the context of the
sender of a Raft snapshot, so it can get cancelled.

In customer clusters, we were seeing Raft snapshots hit their deadline and begin
spamming error rate limiting bulk io write: context deadline exceeded errors
messages. This was bad for two reasons. First, it was very noisy. Second, it
meant that a Raft snapshot that was no longer going to succeed was still writing
out full SSTs while holding on to the snapshotApplySem. This contributed to
the snapshot starvation we saw in the issue.

With this commit, limitBulkIOWrite will eagerly detect context cancellation
and will propagate the cancellation up to the caller, allowing the caller to
quickly release resources.

Release notes (bug fix): Raft snapshots now detect timeouts earlier and avoid
spamming the logs with context deadline exceeded errors.

This commit adds logic to propagate context cancellation in `limitBulkIOWrite`.
This function is used in two places, 1) when ingesting ssts, and 2) when
receiving a snapshot. The first case uses the Raft scheduler goroutine's
context, so it never gets cancelled. The second case uses the context of the
sender of a Raft snapshot, so it can get cancelled.

In customer clusters, we were seeing Raft snapshots hit their deadline and begin
spamming `error rate limiting bulk io write: context deadline exceeded` errors
messages. This was bad for two reasons. First, it was very noisy. Second, it
meant that a Raft snapshot that was no longer going to succeed was still writing
out full SSTs while holding on to the `snapshotApplySem`. This contributed to
the snapshot starvation we saw in the issue.

With this commit, `limitBulkIOWrite` will eagerly detect context cancellation
and will propagate the cancellation up to the caller, allowing the caller to
quickly release resources.

Release notes (bug fix): Raft snapshots now detect timeouts earlier and avoid
spamming the logs with `context deadline exceeded` errors.
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @erikgrinaker)

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks!

@nvanbenschoten
Copy link
Member Author

TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Nov 30, 2021

Build succeeded:

@craig craig bot merged commit 2e038a1 into cockroachdb:master Nov 30, 2021
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/limitBulkIOWriteCtx branch December 6, 2021 03:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants