Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: use new parquet library #103293

Merged
merged 2 commits into from
May 18, 2023

Conversation

jayshrivastava
Copy link
Contributor

@jayshrivastava jayshrivastava commented May 15, 2023

This change updates changefeeds to use the new parquet library
added in pkg/util/parquet when using format=parquet.

Informs: #99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jayshrivastava jayshrivastava force-pushed the parquet-cdc branch 7 times, most recently from 16142f4 to d926bd5 Compare May 15, 2023 20:06
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/ccl/changefeedccl/changefeedbase/options.go line 1061 at r2 (raw file):

	// Right now parquet does not support any of these options
	if s.m[OptFormat] == string(OptFormatParquet) {
		if isPredicateChangefeed {

Enabling parquet + predicate changefeeds + diff was added in #94653, but there's not much of an explanation for why. If you look at the parquetPossible randomized logic in pkg/ccl/changefeedccl/testfeed_test.go, it looks like we disabled testing the parquet format with diff. I've decided to leave diff unsupported until we add it explicitly with proper testing in #103129.

@jayshrivastava jayshrivastava marked this pull request as ready for review May 15, 2023 20:12
@jayshrivastava jayshrivastava requested a review from a team as a code owner May 15, 2023 20:12
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to address the fact that tuples are not supported; this breaks cdc queries, and cdc_prev tuple.
cdc_prev also turns on "diff" option.

if includeParquetTestMetadata {
if opts, err = addCDCTestMetadata(row, opts); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can get rid of includeParquetTestMetadata altogether? LIke at the call site (testfeed?) if we decide to use parquet, then.. just explicitly specify additional metadata?

if typ.ArrayContents().Family() == types.ArrayFamily {
return result, pgerror.Newf(pgcode.FeatureNotSupported,
"parquet writer does not support nested arrays")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not?

Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay to merge. Supporting tuples (needed for cdc_prev) is on the list of items to complete before making the new library public #99028. I'm working on those changes concurrently to this PR and plan to merge them next. There are some issues with the old library, so my changes are strictly better than what is in the main branch IMO:

The old library "supports" cdc_prev, tuples, and diff, but they don't work. You can create a changefeed ex. create changefeed into 'nodelocal://0/baz' with format='parquet', diff as select *, cdc_prev from foo;. The cdc_prev tuple causes an error, because tuples are not supported. If you pass diff, parquet does not emit the diff. The reason for this surprising behavior is that we disable parquet in tests that use diff, but enable parquet in production with diff when using cdc queries. My change explicitly disallows these cases in production and testing. I will incrementally enable them as I add more features.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB and @miretskiy)


pkg/ccl/changefeedccl/parquet.go line 67 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I wonder if we can get rid of includeParquetTestMetadata altogether? LIke at the call site (testfeed?) if we decide to use parquet, then.. just explicitly specify additional metadata?

Can you please clarify this? The metadata contains information from the rows (the keys and non keys) which are used in testing ex. assertPayloads([]string{`[key] -> {"a": 1}`}). I think this function is the best place to get that information easily.


pkg/util/parquet/schema.go line 350 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

why not?

A few reasons.

If I understand correctly, we don't support multidimensional arrays, even in projections, so we should never see them. @HonoreDB also tried a few cases with me, but we couldn't make a nested array.

select array[[1,2,3],[4,5,6]] as col;                                                                                                                                                                       
ERROR: unimplemented: arrays cannot have arrays as element type
SQLSTATE: 0A000
HINT: You have attempted to use a feature that is not yet implemented.
See: https://go.crdb.dev/issue-v/32552/v23.1

The code I wrote definitely does not support nested arrays, it assumes arrays are arrays of scalar, so I just made the assertion explicit.

I'm fairly confident arbitrary nesting is not supported in parquet because I could not find anything about it in the docs/spec or by searching online. At best, there are some hacks (https://medium.com/hadoop-noob/recursive-avro-schema-for-parquet-a87c84b6aca1), but I don't feel confident relying on them. Chatgpt does not recommend it either :)

@miretskiy miretskiy requested a review from HonoreDB May 16, 2023 15:02
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay; just for my info: what kind of error?
If you don't have explicit issue, could you file one to get cdc_prev/diff/parquet working in a sane way?

Reviewed 1 of 13 files at r1, 1 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB and @jayshrivastava)


pkg/ccl/changefeedccl/parquet.go line 114 at r3 (raw file):

// addCDCTestMetadata appends options to the provided options to configure the
// parquet writer to write metadata required by cdc test feed factories.
func addCDCTestMetadata(row cdcevent.Row, opts []parquet.Option) ([]parquet.Option, error) {

nit: perhaps addParquetTestMetadata?


pkg/ccl/changefeedccl/parquet.go line 67 at r4 (raw file):

Previously, jayshrivastava (Jayant) wrote…

Can you please clarify this? The metadata contains information from the rows (the keys and non keys) which are used in testing ex. assertPayloads([]string{`[key] -> {"a": 1}`}). I think this function is the best place to get that information easily.

I was just curious if we can explicitly pass an argument indicating whether or not we should include test metadata.
Perhaps rely on some knobs -- like if changefeed knob is non-nil, then this is test code, and so, add this metdata.
I understand why we need this metadata -- just not sure we need this global boolean flag to achieve this.
Feel free to ignore if too much hassle.


pkg/ccl/changefeedccl/changefeedbase/options.go line 1061 at r2 (raw file):

Previously, jayshrivastava (Jayant) wrote…

Enabling parquet + predicate changefeeds + diff was added in #94653, but there's not much of an explanation for why. If you look at the parquetPossible randomized logic in pkg/ccl/changefeedccl/testfeed_test.go, it looks like we disabled testing the parquet format with diff. I've decided to leave diff unsupported until we add it explicitly with proper testing in #103129.

As I recall it, it was made purely to make tests work -- we randomly pick parquet sink sometimes.


pkg/util/parquet/schema.go line 350 at r4 (raw file):

Previously, jayshrivastava (Jayant) wrote…

A few reasons.

If I understand correctly, we don't support multidimensional arrays, even in projections, so we should never see them. @HonoreDB also tried a few cases with me, but we couldn't make a nested array.

select array[[1,2,3],[4,5,6]] as col;                                                                                                                                                                       
ERROR: unimplemented: arrays cannot have arrays as element type
SQLSTATE: 0A000
HINT: You have attempted to use a feature that is not yet implemented.
See: https://go.crdb.dev/issue-v/32552/v23.1

The code I wrote definitely does not support nested arrays, it assumes arrays are arrays of scalar, so I just made the assertion explicit.

I'm fairly confident arbitrary nesting is not supported in parquet because I could not find anything about it in the docs/spec or by searching online. At best, there are some hacks (https://medium.com/hadoop-noob/recursive-avro-schema-for-parquet-a87c84b6aca1), but I don't feel confident relying on them. Chatgpt does not recommend it either :)

Ack. Thanks for the explanation. PostgreSQL supports multi dimensional arrays, and I think it's a matter of time before we do too.

I wonder if we should have a fall back mechanism i this library: if unsure -- convert to JSON. Every type must be convertable to json.

At any array, could you add a bit of a commentary here, and link to
https://go.crdb.dev/issue-v/32552/v23.1


pkg/ccl/changefeedccl/testfeed_test.go line 1347 at r3 (raw file):

		}

		keyBuf := bytes.Buffer{}

var keyBuf bytes.buffer?

This change updates changefeeds to use the new parquet library
added in `pkg/util/parquet` when using `format=parquet`.

Informs: cockroachdb#99028
Epic: https://cockroachlabs.atlassian.net/browse/CRDB-15071
Release note: None
Copy link
Contributor Author

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error is transient error: parquet export does not support the TupleFamily type yet. At least it's a sane error. With this change, you won't even be able to create the changefeed. I'm working on adding tuples now - #103589.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB and @miretskiy)


pkg/ccl/changefeedccl/parquet.go line 114 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

nit: perhaps addParquetTestMetadata?

Done.


pkg/ccl/changefeedccl/parquet.go line 67 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I was just curious if we can explicitly pass an argument indicating whether or not we should include test metadata.
Perhaps rely on some knobs -- like if changefeed knob is non-nil, then this is test code, and so, add this metdata.
I understand why we need this metadata -- just not sure we need this global boolean flag to achieve this.
Feel free to ignore if too much hassle.

Done. I made it a testing knob.


pkg/util/parquet/schema.go line 350 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Ack. Thanks for the explanation. PostgreSQL supports multi dimensional arrays, and I think it's a matter of time before we do too.

I wonder if we should have a fall back mechanism i this library: if unsure -- convert to JSON. Every type must be convertable to json.

At any array, could you add a bit of a commentary here, and link to
https://go.crdb.dev/issue-v/32552/v23.1

Ack. Will implement this as discussed IRL. Working on adding tuple support as well.


pkg/ccl/changefeedccl/testfeed_test.go line 1347 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

var keyBuf bytes.buffer?

Done.

@jayshrivastava
Copy link
Contributor Author

bors r=miretskiy

@craig
Copy link
Contributor

craig bot commented May 18, 2023

Build succeeded:

@craig craig bot merged commit 30f3d20 into cockroachdb:master May 18, 2023
@jayshrivastava jayshrivastava mentioned this pull request May 25, 2023
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants