changefeedccl: fix panics when min_checkpoint_frequency is 0 #125317

wenyihu6 · 2024-06-07T16:20:46Z

Previously, we introduced the cluster setting changefeed.aggregator.flush_jitter
to apply some jitter to the aggregator flush frequency. This was done to reduce
the load on the coordinator by spreading out aggregator flush times. However,
that PR did not account for the case where min_checkpoint_frequency is set to
zero, which could cause panics due to rand.Int63(0). This patch fixes the issue
by adding a check for when the frequency is zero, ensuring that jitter is not
applied in such cases.

Fixes: #125312

Release note (bug fix): Fixed a bug in v24.1, v23.2, and v23.1 where using
changefeed.aggregator.flush_jitter with min_checkpoint_frequency set to zero
could cause panics.

cockroach-teamcity · 2024-06-07T16:20:56Z

This change is

andyyang890

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @wenyihu6)

pkg/ccl/changefeedccl/changefeed_processors.go line 644 at r1 (raw file):

func nextFlushWithJitter(s timeutil.TimeSource, d time.Duration, j float64) time.Time {
	if j < 0 || d < 0 {
		log.Fatalf(context.Background(), "invalid jitter value: %f, duration: %s", j, d)

Should this be at the level of a log.Fatalf? Would a log.Warningf be sufficient?

wenyihu6 · 2024-06-07T17:48:08Z

pkg/ccl/changefeedccl/changefeed_processors.go line 644 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

Should this be at the level of a log.Fatalf? Would a log.Warningf be sufficient?

I'm okay with either. Changed to Warningf.

andyyang890

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @rharding6373 and @wenyihu6)

pkg/ccl/changefeedccl/changefeed_processors.go line 644 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I'm okay with either. Changed to Warningf.

On second thought, since this is for catching developer error, I think you probably want to return an error instead, i.e. a errors.AssertionFailedf. The user isn't going to be able to do very much with a log message.

rharding6373

Thanks for the quick fix!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)

pkg/ccl/changefeedccl/changefeed_processors.go line 644 at r1 (raw file):

Previously, andyyang890 (Andy Yang) wrote…

On second thought, since this is for catching developer error, I think you probably want to return an error instead, i.e. a errors.AssertionFailedf. The user isn't going to be able to do very much with a log message.

This seems like it's going to be very noisy, since this is based on cluster settings/options that aren't going to change often. But also the way we currently use this function this should never happen.

I'm with Andy, this should surface as an unexpected error.

pkg/ccl/changefeedccl/changefeed_test.go line 7056 at r2 (raw file):

		100 * time.Millisecond,
	} {
		t.Run(fmt.Sprintf("Disabled_FlushFrequency=%s", flushFrequency), func(t *testing.T) {

optional super minor nit: most tests in this file use lower case test names

wenyihu6 · 2024-06-10T12:19:49Z

pkg/ccl/changefeedccl/changefeed_processors.go line 832 at r3 (raw file):

	if checkpointFrontier {
		defer func() {
			ca.nextHighWaterFlush, returnErr = nextFlushWithJitter(

I'm not sure if this is a common way to deal with changing return errors in a defer statement. Is there a cleaner way to handle this?

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @rharding6373)

pkg/ccl/changefeedccl/changefeed_test.go line 7056 at r2 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

optional super minor nit: most tests in this file use lower case test names

Done.

rharding6373

Seems like it's really close!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)

pkg/ccl/changefeedccl/changefeed_processors.go line 644 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Hm it is a bit hard to return as an error since the function call is wrapped in a defer function. Do you see an easier way to do this?

You could do it with a named return value for the error in the parent function and errors.CombineErrors. Here's one such example in the code base:

cockroach/pkg/workload/rand/rand.go

Line 182 in 4392d13

defer func() { retErr = errors.CombineErrors(retErr, rows.Close()) }()

So this would look something like:

func nextFlushWithJitter(s timeutil.TimeSource, d time.Duration, j float64) (time.Time, error) { ... }
...
func (ca *changeAggregator) noteResolvedSpan(resolved jobspb.ResolvedSpan) (err error) {
...
defer func() { 
  t, retErr := nextFlushWithJitter(...)
  err = errors.CombineErrors(err, retErr) }()
...

Please try to add test coverage for this new code path (though it definitely seems like we should never hit it given current guardrails on the inputs).

pkg/ccl/changefeedccl/changefeed_processors.go line 832 at r3 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I'm not sure if this is a common way to deal with changing errors in a defer statement. Is there a cleaner way to handle this?

See other comment.

pkg/ccl/changefeedccl/changefeed_test.go line 7056 at r2 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Done.

Nice

wenyihu6

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andyyang890)

pkg/ccl/changefeedccl/changefeed_processors.go line 644 at r1 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

You could do it with a named return value for the error in the parent function and errors.CombineErrors. Here's one such example in the code base:

cockroach/pkg/workload/rand/rand.go

Line 182 in 4392d13

defer func() { retErr = errors.CombineErrors(retErr, rows.Close()) }()

So this would look something like:
func nextFlushWithJitter(s timeutil.TimeSource, d time.Duration, j float64) (time.Time, error) { ... }
...
func (ca *changeAggregator) noteResolvedSpan(resolved jobspb.ResolvedSpan) (err error) {
...
defer func() { 
  t, retErr := nextFlushWithJitter(...)
  err = errors.CombineErrors(err, retErr) }()
...
Please try to add test coverage for this new code path (though it definitely seems like we should never hit it given current guardrails on the inputs).

Done. I added a few more test cases for error. Are there any other test cases you would want me to add? On a side note, is this how we typically handle developer errors? I have a slight preference for using log.Fatal since we should never reach this state, and error handling here makes the code less clean. But I also understand that this is not a serious issue and we may want to be more defensive.

rharding6373

Nice work!

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andyyang890 and @wenyihu6)

pkg/ccl/changefeedccl/changefeed_processors.go line 644 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Done. I added a few more test cases for error. Are there any other test cases you would want me to add? On a side note, is this how we typically handle developer errors? I have a slight preference for using log.Fatal since we should never reach this state, and error handling here makes the code less clean. But I also understand that this is not a serious issue and we may want to be more defensive.

Ideally when the user provides a bad input, we provide immediate feedback that their operation failed and why. I think we already do that for the user-provided input that is ultimately used in all production calls to this code.

We should generally try to handle errors gracefully and not with panics. The exception is that if you like you can add a test-only runtime assertion here, following a pattern like those described here: #94986

pkg/ccl/changefeedccl/changefeed_test.go line 7138 at r4 (raw file):

				if tc.expectedErr {
					require.Error(t, err)
				}

else { require.NoError }

Previously, we introduced the cluster setting changefeed.aggregator.flush_jitter to apply some jitter to the aggregator flush frequency. This was done to reduce the load on the coordinator by spreading out aggregator flush times. However, that PR did not account for the case where min_checkpoint_frequency is set to zero, which could cause panics due to rand.Int63(0). This patch fixes the issue by adding a check for when the frequency is zero, ensuring that jitter is not applied in such cases. Fixes: cockroachdb#125312 Release note (bug fix): Fixed a bug in v24.1, v23.2, and v23.1 where using changefeed.aggregator.flush_jitter with min_checkpoint_frequency set to zero could cause panics.

wenyihu6 · 2024-06-10T22:24:54Z

Thanks for the quick review!

bors r=rharding6373

craig · 2024-06-10T23:07:27Z

Build succeeded:

blathers-crl · 2024-06-10T23:07:39Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from f28b826 to blathers/backport-release-23.1-125317: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.1.x failed. See errors above.

error creating merge commit from f28b826 to blathers/backport-release-23.2-125317: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

wenyihu6 added backport-23.1.x Flags PRs that need to be backported to 23.1 backport-23.2.x Flags PRs that need to be backported to 23.2. backport-24.1.x Flags PRs that need to be backported to 24.1. labels Jun 7, 2024

wenyihu6 self-assigned this Jun 7, 2024

wenyihu6 force-pushed the panicfix branch 3 times, most recently from 864728c to 386b66a Compare June 7, 2024 16:31

wenyihu6 marked this pull request as ready for review June 7, 2024 16:37

wenyihu6 requested a review from a team as a code owner June 7, 2024 16:37

wenyihu6 requested review from rharding6373 and removed request for a team June 7, 2024 16:37

andyyang890 reviewed Jun 7, 2024

View reviewed changes

wenyihu6 force-pushed the panicfix branch from 386b66a to 7344a22 Compare June 7, 2024 17:47

andyyang890 reviewed Jun 7, 2024

View reviewed changes

rharding6373 requested a review from andyyang890 June 7, 2024 19:46

rharding6373 reviewed Jun 7, 2024

View reviewed changes

This comment was marked as outdated.

Sign in to view

wenyihu6 commented Jun 10, 2024

View reviewed changes

wenyihu6 force-pushed the panicfix branch from 4ad7a83 to f3e4a10 Compare June 10, 2024 12:58

wenyihu6 requested a review from rharding6373 June 10, 2024 14:25

wenyihu6 force-pushed the panicfix branch 2 times, most recently from 10ee3e9 to cf2b4b5 Compare June 10, 2024 14:28

rharding6373 reviewed Jun 10, 2024

View reviewed changes

wenyihu6 commented Jun 10, 2024

View reviewed changes

wenyihu6 requested a review from rharding6373 June 10, 2024 19:28

rharding6373 approved these changes Jun 10, 2024

View reviewed changes

wenyihu6 force-pushed the panicfix branch from cf2b4b5 to f28b826 Compare June 10, 2024 21:02

craig bot merged commit 36905d3 into cockroachdb:master Jun 10, 2024
22 checks passed

blathers-crl bot mentioned this pull request Jun 10, 2024

release-24.1: changefeedccl: fix panics when min_checkpoint_frequency is 0 #125459

Merged

This was referenced Jun 11, 2024

release-23.2: changefeedccl: fix panics when min_checkpoint_frequency is 0 #125469

Merged

release-23.1: changefeedccl: fix panics when min_checkpoint_frequency is 0 #125495

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

changefeedccl: fix panics when min_checkpoint_frequency is 0 #125317

changefeedccl: fix panics when min_checkpoint_frequency is 0 #125317

wenyihu6 commented Jun 7, 2024

cockroach-teamcity commented Jun 7, 2024

andyyang890 left a comment

wenyihu6 commented Jun 7, 2024

andyyang890 left a comment

rharding6373 left a comment

This comment was marked as outdated.

wenyihu6 commented Jun 10, 2024 •

edited

Loading

wenyihu6 left a comment

rharding6373 left a comment

wenyihu6 left a comment

rharding6373 left a comment

wenyihu6 commented Jun 10, 2024

craig bot commented Jun 10, 2024

blathers-crl bot commented Jun 10, 2024

changefeedccl: fix panics when min_checkpoint_frequency is 0 #125317

changefeedccl: fix panics when min_checkpoint_frequency is 0 #125317

Conversation

wenyihu6 commented Jun 7, 2024

cockroach-teamcity commented Jun 7, 2024

andyyang890 left a comment

Choose a reason for hiding this comment

wenyihu6 commented Jun 7, 2024

andyyang890 left a comment

Choose a reason for hiding this comment

rharding6373 left a comment

Choose a reason for hiding this comment

This comment was marked as outdated.

wenyihu6 commented Jun 10, 2024 • edited Loading

wenyihu6 left a comment

Choose a reason for hiding this comment

rharding6373 left a comment

Choose a reason for hiding this comment

wenyihu6 left a comment

Choose a reason for hiding this comment

rharding6373 left a comment

Choose a reason for hiding this comment

wenyihu6 commented Jun 10, 2024

craig bot commented Jun 10, 2024

blathers-crl bot commented Jun 10, 2024

wenyihu6 commented Jun 10, 2024 •

edited

Loading