Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: support the updated and mvcc with parquet format #104407

Merged
merged 2 commits into from
Jun 7, 2023

Conversation

jayshrivastava
Copy link
Contributor

@jayshrivastava jayshrivastava commented Jun 6, 2023

changefeedccl: use buildutil.CrdbTestBuild for parquet testing

Parquet testing requires that extra metadata be written to parquet files
so tests can create CDC rows from the raw data. Previously, the production
parquet code relied on a testing knob to be passed to write this extra
metadata. This is problematic as not all tests would pass the testing knob,
making it so that we could not randomly use parquet in those tests for
metamorphic testing purposes. With this change, the parquet production code
uses buildutil.CrdbTestBuild, which is a global flag enabled in tests.
Now, metamorphic parquet testing can be applied to more tests.

Epic: None
Release note: None


changefeedccl: support the updated and mvcc timestamp options with parquet format

Previously, using the parquet format in changefeeds did not support
using the mvcc or updated options. This change adds support for using
these options with parquet.

Epic: None
Release note: None


do not merge: force parquet when possible during tests

This change will not be merged. When it is possible for tests
to use parquet, this change forces them to use parquet.

This change also forces tests for updated timestamps and mvcc
timestamps to use the cloudstorage sink + parquet format.

Release note: None
Epic: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jayshrivastava jayshrivastava force-pushed the parquet-options-v2 branch 3 times, most recently from 2854200 to ac73fea Compare June 6, 2023 14:48
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/changefeedccl/changefeedbase/options.go line 446 at r3 (raw file):

//
// TODO(#103129): add support for some of these
var ParquetFormatUnsupportedOptions OptionsSet = makeStringSet(OptEndTime, OptDiff, OptKeyInValue, OptTopicInValue)

I realized topic in value is not explicitly supported or unsupported, so I made it explicitly unsupported for now.

@jayshrivastava jayshrivastava changed the title Parquet options v2 changefeedccl: support the updated and mvcc timestamp options with parquet format Jun 6, 2023
@jayshrivastava jayshrivastava changed the title changefeedccl: support the updated and mvcc timestamp options with parquet format changefeedccl: support the updated and mvcc with parquet format Jun 6, 2023
@jayshrivastava jayshrivastava force-pushed the parquet-options-v2 branch 2 times, most recently from aa1a486 to 1332113 Compare June 6, 2023 18:40
@jayshrivastava jayshrivastava marked this pull request as ready for review June 6, 2023 18:41
@jayshrivastava jayshrivastava requested a review from a team as a code owner June 6, 2023 18:41
@jayshrivastava jayshrivastava requested review from shermanCRL and removed request for a team June 6, 2023 18:41
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do not merge?

Should the release note be specified (for the second commit)?

Reviewed 2 of 8 files at r1, 7 of 8 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava and @shermanCRL)


pkg/ccl/changefeedccl/parquet.go line 292 at r6 (raw file):

func init() {
	if buildutil.CrdbTestBuild {
		includeParquestTestMetadata = true

I would get rid of this local variable altogether... You can just inline buildutil.CrdbTestBuild...
Or... if you prefer, add a helper:

func includePartquetMetadataForTests() bool {
  return buildutil.CrdbTestBuild
}

or just includeParquetMetadataForTests = buildutil.CrdbTestBuild

Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the release note be specified (for the second commit)?

I spoke to @kathancox. It's easier just to write one release note at the end. I'll link the relevant PRs when I write that release note.

why do not merge?

A bit of a long story. If tests don't specify a format, defaulting to parquet is a bit confusing. Might as well use default for production changefeeds, which is JSON. Speaking from experience, it's confusing when tests do spooky things that differ to what happens in prod.

We also don't want to permanently alter the changefeed updated and mvcc timestamps tests to use parquet + cloud storage. We want to keep the randomized sinks. I thought about copying those two tests into parquet_test.go with parquet specified, but that's unnecessary code duplication IMO. We can continue to randomly use parquet in testing.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @shermanCRL)


pkg/ccl/changefeedccl/parquet.go line 292 at r6 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I would get rid of this local variable altogether... You can just inline buildutil.CrdbTestBuild...
Or... if you prefer, add a helper:

func includePartquetMetadataForTests() bool {
  return buildutil.CrdbTestBuild
}

or just includeParquetMetadataForTests = buildutil.CrdbTestBuild

ACK. I'll make a change.

Parquet testing requires that extra metadata be written to parquet files
so tests can create CDC rows from the raw data. Previously, the production
parquet code relied on a testing knob to be passed to write this extra
metadata. This is problematic as not all tests would pass the testing knob,
making it so that we could not randomly use parquet in those tests for
metamorphic testing purposes. With this change, the parquet production code
uses `buildutil.CrdbTestBuild`, which is a global flag enabled in tests.
Now, metamorphic parquet testing can be applied to more tests.

Epic: None
Release note: None
…rquet format

Previously, using the parquet format in changefeeds did not support
using the `mvcc` or `updated` options. This change adds support for using
these options with parquet.

Epic: None
Release note: None
@jayshrivastava
Copy link
Contributor Author

bors r=miretskiy

@craig
Copy link
Contributor

craig bot commented Jun 7, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants