kvcoord: Replace rangefeed catchup semaphore with rate limiter #110919

Merged: 1 commit, Sep 21, 2023

Conversation

@miretskiy (Contributor) commented Sep 19, 2023

The catchup scan limit was added in #77725 in an attempt to restrict
a single rangefeed from consuming all catchup scan slots on the
KV side. This client-side limit has been largely ineffective.

More recently, this semaphore was co-opted in #109346 to pace the
rate of goroutine creation on the client. This functionality is
important, but the implementation is not very good.

Namely, we are interested in controlling the rate at which the client
starts new catchup scans (and thus the rate of goroutine creation).
This change replaces the old implementation with a rate-limit-based
approach. The rate limit is configured via the
`kv.rangefeed.client.stream_startup_rate` setting, which caps the
number of newly established rangefeed streams per second (default 100).

Closes #110439
Epic: CRDB-26372

Release note: None
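
A minimal sketch of the before/after shapes described above, using golang.org/x/time/rate as a stand-in for the PR's internal plumbing (the helper names and the semaphore size are illustrative assumptions, not the actual kvcoord code):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Before (shape): a semaphore caps how many catchup scans run at
	// once, but does nothing to pace how fast new goroutines are spawned.
	sem := make(chan struct{}, 8)
	startWithSemaphore := func(run func()) {
		sem <- struct{}{} // blocks only once 8 scans are in flight
		go func() {
			defer func() { <-sem }()
			run()
		}()
	}

	// After (shape): a token-bucket rate limiter paces goroutine
	// creation itself, per kv.rangefeed.client.stream_startup_rate
	// (default 100 per second).
	lim := rate.NewLimiter(rate.Limit(100), 100)
	startWithRateLimit := func(ctx context.Context, run func()) error {
		if err := lim.Wait(ctx); err != nil {
			return err
		}
		go run()
		return nil
	}

	startWithSemaphore(func() { time.Sleep(time.Millisecond) })
	if err := startWithRateLimit(context.Background(), func() {}); err != nil {
		panic(err)
	}
	fmt.Println("ok")
}
```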

@miretskiy miretskiy requested review from erikgrinaker and a team September 19, 2023 17:36
@miretskiy miretskiy requested a review from a team as a code owner September 19, 2023 17:36
@cockroach-teamcity (Member) commented: This change is Reviewable

@miretskiy miretskiy force-pushed the catchup branch 3 times, most recently from cf5c211 to 5c38983 on September 19, 2023 19:08
@miretskiy miretskiy requested a review from a team as a code owner September 19, 2023 19:08
@miretskiy miretskiy requested review from rachitgsrivastava and DarrylWong and removed request for a team September 19, 2023 19:08
"kv.rangefeed.catchup_scan_concurrency",
"number of catchup scans that a single rangefeed can execute concurrently; 0 implies unlimited",
8,
"kv.rangefeed.catchup_scan_duration_estimate",
@erikgrinaker (Contributor) commented:
Instead of framing this in terms of a catchup scan estimate, I'd just frame it directly as the rangefeed client rate. It's hard to come up with a general catchup scan estimate, because it depends on a lot of factors, and will often vary even between different rangefeeds on the same cluster.

if err != nil {
	return err
}
metrics.RangefeedCatchupRanges.Inc(1)
// Cleanup closure, invoked when the catchup scan completes: decrement
// the metric and release the acquired resources.
a.catchupRes = func() {
	metrics.RangefeedCatchupRanges.Dec(1)
	res.Release()
	alloc.Return()
}
@erikgrinaker (Contributor) commented:
Why do we return this? This subverts the rate limiting, since we're saying this request failed and should not count towards the rate limit.

var catchupScanBurst = settings.RegisterIntSetting(
settings.TenantWritable,
"kv.rangefeed.max_catchup_scans",
"maximum number of ranges that may run concurrently; 0 implies unlimited",
@erikgrinaker (Contributor) commented:
This isn't really what it means -- this is only true if the catchup scan estimate is equal to or below the actual catchup scan duration, not if it's above.

@miretskiy (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @erikgrinaker, and @rachitgsrivastava)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 69 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Instead of framing this in terms of a catchup scan estimate, I'd just frame it directly as the rangefeed client rate. It's hard to come up with a general catchup scan estimate, because it depends on a lot of factors, and will often vary even between different rangefeeds on the same cluster.

Thoughts on this new name?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 79 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

This isn't really what it means -- this is only true if the catchup scan estimate is equal to or below the actual catchup scan duration, not if it's above.

Naming is darn hard; I wanted to avoid burst; but... you know, I think I'll use that instead. If it sounds too technical, then that's the goal.
Thoughts on this new name?


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 714 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Why do we return this? This subverts the rate limiting, since we're saying this request failed and should not count towards the rate limit.

Ahh.. I see; you're right.

@erikgrinaker (Contributor) left a comment

Reviewed 3 of 8 files at r1, 2 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, and @rachitgsrivastava)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 69 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Thoughts on this new name?

Well, this isn't really correct either.

The rate limiter is a token bucket. The rate specifies the rate at which tokens are added to the bucket. Burst specifies the size of the bucket.

If we set rate to 1 per second, with 0 burst, then we will spawn 60 per minute (1 each second at a fixed rate). If we set burst to 100 then we'll spawn 100 immediately (without a rate limit), then 1 per second after that (because we never allow the bucket to fill back up, we just immediately consume the tokens as they're added). The rate and burst are independent parameters.

The way it's set up here, we'll spawn 180 immediately, then 60 per second after that.

Do we even need or want burst here, instead of just a smooth rate?
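
For readers less familiar with token buckets, here is a small self-contained demonstration of the rate-vs-burst distinction described above, using golang.org/x/time/rate (an illustration of the semantics only, not the PR's actual machinery):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// rate = tokens added per second; burst = bucket capacity.
	// A new limiter starts with a full bucket, so with rate=1 and
	// burst=100 the first 100 acquisitions succeed immediately;
	// after that, tokens arrive at 1/sec and each Wait blocks ~1s.
	lim := rate.NewLimiter(rate.Limit(1), 100)

	start := time.Now()
	for i := 1; i <= 103; i++ {
		if err := lim.Wait(context.Background()); err != nil {
			panic(err)
		}
		if i >= 100 {
			fmt.Printf("acquisition %d after %s\n",
				i, time.Since(start).Round(100*time.Millisecond))
		}
	}
	// Prints roughly: 100 at 0s, 101 at 1s, 102 at 2s, 103 at 3s.
}
```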


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 79 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Naming is darn hard; I wanted to avoid burst; but... you know, I think I'll use that instead. If it sounds too technical, then that's the goal.
Thoughts on this new name?

Do we even need or want burst here, instead of just a smooth rate?

@erikgrinaker (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, and @rachitgsrivastava)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 79 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…


Whoops, didn't mean to post this twice, was trying to move it to the other thread.

@miretskiy (Contributor, Author) left a comment

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 69 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…
Do we even need or want burst here, instead of just a smooth rate?

pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 79 at r1 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Naming is darn hard; I wanted to avoid burst; but... you know, I think I'll use that instead. If it sounds too technical, then that's the goal.
Thoughts on this new name?

Previously, erikgrinaker (Erik Grinaker) wrote…

The way it's set up here, we'll spawn 180 immediately, then 60 per second after that.

Kinda what I wanted to do...

Do we even need or want burst here, instead of just a smooth rate?

I don't know if we need burst; Just going to set 100/sec rate, 0 burst.

@erikgrinaker (Contributor) left a comment

Kinda what I wanted to do...

Ah ok, I thought there might be a misunderstanding about how the rate limiter worked. Anyway, I don't have a good sense of what the optimal strategy is (might have performance implications), but since we want to reduce Go scheduler tail latencies it seems like we want to avoid bursts -- although I'm not sure it matters until we go past O(1000) goroutines anyway, and I guess the burst doesn't really make much of a difference either way in that case. This is simpler though.

Thanks!

Reviewed 1 of 8 files at r1, 4 of 4 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, and @rachitgsrivastava)


pkg/kv/kvclient/kvcoord/dist_sender_rangefeed.go line 70 at r4 (raw file):

	settings.TenantWritable,
	"kv.rangefeed.client.stream_startup_rate",
	"controls the rate the client will initiate new rangefeed stream for a single range; 0 implies unlimited",

nit: per-second rate
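
A hypothetical illustration of mapping the setting's value onto a limiter (the helper and its wiring are assumptions, not the PR's code). One wrinkle worth noting relative to the "0 burst" discussion above: in golang.org/x/time/rate, a burst of 0 admits no events at all, so a smooth-rate policy there still needs burst >= 1:

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// newStartupLimiter is a hypothetical helper translating the cluster
// setting's per-second value into a limiter.
func newStartupLimiter(perSecond int) *rate.Limiter {
	if perSecond == 0 {
		// "0 implies unlimited": an infinite rate never blocks.
		return rate.NewLimiter(rate.Inf, 1)
	}
	// burst == rate allows at most one second's worth of immediate
	// starts; x/time/rate requires burst >= 1 for any event to pass.
	return rate.NewLimiter(rate.Limit(perSecond), perSecond)
}

func main() {
	lim := newStartupLimiter(100) // 100/sec, the default from the PR description
	fmt.Println(lim.Limit(), lim.Burst())
}
```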

The catchup scan limit was added in cockroachdb#77725 in an attempt
to restrict a single rangefeed from consuming all catchup scan slots
on the KV side. This client-side limit has been largely ineffective.

More recently, this semaphore was co-opted in cockroachdb#109346 to
pace the rate of goroutine creation on the client. This functionality
is important, but the implementation is not very good.

Namely, we are interested in controlling the rate at which the client
starts new catchup scans (and thus the rate of goroutine creation).
This change replaces the old implementation with a rate-limit-based
approach. The rate limit is configured via the
`kv.rangefeed.client.stream_startup_rate` setting, which caps the
number of newly established rangefeed streams per second (default 100).

Closes cockroachdb#110439
Epic: CRDB-26372

Release note: None
@miretskiy (Contributor, Author) commented:

bors r+

@craig (craig bot) commented Sep 21, 2023

Build succeeded:

miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Sep 26, 2023

Previous PR cockroachdb#110919 modified rangefeed startup logic to
rely on a rate limit instead of a semaphore.

The issue exposed by the catchup scan rate limiter is that it allowed
many more catchup scans to be started (100/sec vs. 8 previously).
If the range resides on a local node, a local "bypass" RPC is created
instead (rpc/context.go), and this RPC bypass is implemented via
buffered channels.

The deadlock occurs when the client attempts to send a request to the
server while holding the lock, but blocks (because the channel is
full, i.e. we have sent 128 outstanding requests), and the server
blocks for the same reason: the client mux goroutine is blocked
attempting to acquire the lock to look up the stream for the
rangefeed message.

The fix moves the state shared by the sender and the consumer --
namely the stream map -- outside of the sender lock.

This fix adds a test that reliably fails under deadlock detection.

Fixes cockroachdb#111111
Fixes cockroachdb#111165

Release note: None
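
A runnable sketch of the deadlock shape described above, with hypothetical types standing in for the mux rangefeed machinery. The PR's actual fix moves the stream map out from under the sender lock; the sketch shows the same hazard and the general remedy of keeping blocking channel operations outside the critical section:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for the mux rangefeed pieces; buffer sizes are
// tiny to make the hazard easy to hit (the real bypass used 128).
type mux struct {
	mu       sync.Mutex
	streams  map[int64]string
	toServer chan int64 // client -> server requests (bounded)
	toClient chan int64 // server -> client events (bounded)
}

// Buggy shape: the sender holds mu across a send on a bounded channel.
// If toServer is full because the server is blocked on toClient, and
// the client mux goroutine is blocked on mu before it can drain
// toClient, the goroutines form a cycle and everything wedges.
func (m *mux) startBuggy(id int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.streams[id] = fmt.Sprintf("stream-%d", id)
	m.toServer <- id // may block while mu is held
}

// Fixed shape: shared state is updated under mu, but the potentially
// blocking send happens outside the critical section, so the mux
// goroutine can always acquire mu and keep draining.
func (m *mux) startFixed(id int64) {
	m.mu.Lock()
	m.streams[id] = fmt.Sprintf("stream-%d", id)
	m.mu.Unlock()
	m.toServer <- id
}

func main() {
	m := &mux{
		streams:  map[int64]string{},
		toServer: make(chan int64, 2),
		toClient: make(chan int64, 2),
	}
	go func() { // "server": echoes each request back as an event
		for id := range m.toServer {
			m.toClient <- id
		}
		close(m.toClient)
	}()
	done := make(chan struct{})
	go func() { // client mux: needs mu to look up the stream per event
		for id := range m.toClient {
			m.mu.Lock()
			_ = m.streams[id]
			m.mu.Unlock()
		}
		close(done)
	}()
	for i := int64(0); i < 1000; i++ {
		m.startFixed(i) // startBuggy here can wedge under load
	}
	close(m.toServer)
	<-done
	fmt.Println("no deadlock")
}
```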
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Sep 27, 2023
adityamaru pushed a commit to adityamaru/cockroach that referenced this pull request Sep 27, 2023
craig bot pushed a commit that referenced this pull request Sep 27, 2023
111312: kvcoord: Fix mux rangefeed startup deadlock r=miretskiy a=miretskiy


Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
THardy98 pushed a commit to THardy98/cockroach that referenced this pull request Oct 6, 2023
miretskiy pushed a commit to miretskiy/cockroach that referenced this pull request Nov 7, 2023

Re-introduce the catchup scan semaphore limit, removed by
cockroachdb#110919, for regular rangefeed. This hard limit on the
number of catchup scans is necessary to avoid OOMs when handling
rangefeeds with a large fan-in factor that execute over many
non-local ranges.

Fixes cockroachdb#113489

Release note: None
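
A sketch of how the two mechanisms compose after this change, under the assumption of a plain channel semaphore plus golang.org/x/time/rate (sizes and rates are illustrative): the semaphore bounds memory by capping concurrent catchup scans, while the limiter bounds scheduler pressure by pacing goroutine startup.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	sem := make(chan struct{}, 8)                // hard cap on in-flight scans (memory)
	lim := rate.NewLimiter(rate.Limit(100), 100) // pace goroutine startup (scheduler)

	var wg sync.WaitGroup
	for i := 0; i < 32; i++ {
		sem <- struct{}{} // blocks once 8 scans are in flight
		if err := lim.Wait(context.Background()); err != nil {
			panic(err)
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer func() { <-sem }()
			time.Sleep(10 * time.Millisecond) // stand-in for a catchup scan
		}()
	}
	wg.Wait()
	fmt.Println("all catchup scans finished")
}
```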
craig bot pushed a commit that referenced this pull request Nov 8, 2023
113966: kvcoord: Reintroduce catchup scan semaphore for regular rangefeed r=miretskiy a=miretskiy

114000: colfetcher: disable metamorphic randomization for direct scans r=yuzefovich a=yuzefovich

This commit makes it so that we no longer - for now - use metamorphic randomization for the default value of the `sql.distsql.direct_columnar_scans.enabled` cluster setting, which controls whether direct columnar scans (aka "KV projection pushdown") are enabled. It appears that we might be missing some memory accounting in the local fast path of this feature, and some backup-related roachtests run into OOMs with binaries that have assertions enabled. Disabling this metamorphization for now silences failures from this now-known issue.

Informs: #113816

Epic: None

Release note: None

114026: kvnemesis: bump default steps to 100 r=erikgrinaker a=erikgrinaker

50 steps is usually too small to trigger interesting behaviors. Bump it to 100, which is still small enough to be easily debuggable.

The nightlies already run with 1000 steps.

Epic: none
Release note: None

Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this pull request Nov 8, 2023